1 Network Working Group E. Burger
2 Internet Draft SnowShore Networks, Inc.
3 Document: draft-burger-speechsc-reqts-00.txt D. Oran
4 Category: Informational Cisco Systems, Inc.
5 Expires August 2002 June 13, 2002
7 Requirements for Distributed Control of ASR, SV and TTS Resources
9 Status of this Memo
11 This document is an Internet-Draft and is in full conformance with
12 all provisions of Section 10 of RFC2026 [1].
14 Internet-Drafts are working documents of the Internet Engineering
15 Task Force (IETF), its areas, and its working groups. Note that
16 other groups may also distribute working documents as Internet-
17 Drafts. Internet-Drafts are draft documents valid for a maximum of
18 six months and may be updated, replaced, or obsoleted by other
19 documents at any time. It is inappropriate to use Internet-Drafts
20 as reference material or to cite them other than as "work in
21 progress."
23 The list of current Internet-Drafts can be accessed at
24 http://www.ietf.org/ietf/1id-abstracts.txt
26 The list of Internet-Draft Shadow Directories can be accessed at
27 http://www.ietf.org/shadow.html.
29 1. Abstract
31 This document outlines the needs and requirements for a protocol to
32 control distributed speech processing of audio streams. By speech
33 processing, this document specifically means automatic speech
34 recognition, speaker verification and text-to-speech. Other IETF
35 protocols, such as SIP and RTSP, address rendezvous and control for
36 generalized media streams. However, speech processing presents
37 additional requirements that none of the extant IETF protocols
38 address.
40 Discussion of this and related documents is on the MRCP list. To
41 subscribe, send the message "subscribe mrcp" to
42 majordomo@snowshore.com. The public archive is at
43 http://flyingfox.snowshore.com/mrcp_archive/maillist.html.
45 NOTE: This mailing list will be superseded by an official working
46 group mailing list, cats@ietf.org, once the WG is formally
47 chartered.
49 Distributed Media Control Requirements February 2002
51 2. Conventions used in this document
53 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
54 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
55 this document are to be interpreted as described in RFC-2119 [2].
57 FORMATTING NOTE: Notes, such as this one, provide additional,
58 nonessential information that the reader may skip without missing
59 anything essential. The primary purpose of these non-essential
60 notes is to convey information about the rationale of this document,
61 or to place this document in the proper historical or evolutionary
62 context. Readers whose sole purpose is to construct a conformant
63 implementation may skip such information. However, it may be of use
64 to those who wish to understand why we made certain design choices.
66 OPEN ISSUES: This document highlights questions that are, as yet,
67 undecided as "OPEN ISSUES".
69 3. Introduction
71 There are multiple IETF protocols for establishment and termination
72 of media sessions (SIP[3]), low-level media control (MGCP[4] and
73 MEGACO[5]), and media record and playback (RTSP[6]). This document
74 focuses on requirements for one or more protocols to support the
75 control of network elements that perform Automatic Speech
76 Recognition (ASR), Speaker Verification (SV), and the rendering of
77 text into audio, also known as Text-to-Speech (TTS). Many multimedia
78 applications can benefit from having ASR and TTS processing
79 available as a distributed network resource. This requirements
80 document limits its focus to the distributed control of ASR, SV,
81 and TTS servers.
83 To date, there are a number of proprietary ASR and TTS APIs, as
84 well as two IETF drafts that address this problem [7] [8]. However,
85 there are serious deficiencies in the existing drafts. In
86 particular, they mix the semantics of existing protocols yet are
87 close enough to other protocols as to be confusing to the
88 implementer.
90 This document sets forth requirements for protocols to support
91 distributed speech processing of audio streams.
93 For simplicity, and to remove confusion with existing protocol
94 proposals, this document presents the requirements as being for a
95 "new protocol" that addresses the distributed control of speech
96 resources. It refers to such a protocol as "SRCP", for Speech
97 Resource Control Protocol.
99 4. SRCP Framework
101 The following is the SRCP framework for speech processing.
105                     +-------------+
106                     | Application |
107                     |   Server    |
108                     +-------------+
109       SIP or whatever /
110                      /
111      +------------+ /                       +--------+
112      |   Media    |/          SRCP          |  ASR   |
113      | Processing |-------------------------| and/or |
114  RTP |   Entity   |           RTP           |  TTS   |
115 =====|            |=========================| Server |
116      +------------+                         +--------+
118 The "Media Processing Entity" is a network element that processes
119 media. The "Application Server" is a network element that instructs
120 the Media Processing Entity on what transformations to make to the
121 media stream. The "ASR and/or TTS Server" is a network element that
122 either generates an RTP stream based on text input (TTS) or returns
123 speech recognition results in response to an RTP stream as input
124 (ASR). The Media Processing Entity controls the ASR or TTS Server
125 using SRCP as a control protocol.
127 Physical embodiments of the entities can reside in one physical
128 instance per entity, or some combination of entities. For example,
129 a VoiceXML [9] Gateway may combine the ASR and TTS functions on the
130 same platform as the Media Processing Entity. Note that VoiceXML
131 Gateways themselves are outside the scope of this protocol.
133 Likewise, one can combine the Application Server and Media
134 Processing Entity, as would be the case in an interactive voice
135 response (IVR) platform.
137 One can also decompose the Media Processing Entity into an entity
138 that controls media endpoints and entities that process media
139 directly. Such would be the case with a decomposed gateway using
140 MGCP or MEGACO. However, this decomposition is again orthogonal to
141 the scope of SRCP.
143 5. General Requirements
145 5.1. Reuse Existing Protocols
147 To the extent feasible, the SRCP framework SHOULD use existing
148 protocols.
150 5.2. Maintain Existing Protocol Integrity
152 In meeting requirement 5.1, the SRCP framework MUST NOT redefine the
153 semantics of an existing protocol.
157 Said differently, we will not break existing protocols or cause
158 backward compatibility problems.
160 5.3. Avoid Duplicating Existing Protocols
162 To the extent feasible, SRCP SHOULD NOT duplicate the functionality
163 of existing protocols. For example, SIP with msuri [10] and RTSP
164 already define how to request playback of audio.
166 The focus of SRCP is new functionality not addressed by existing
167 protocols or extending existing protocols within the strictures of
168 requirement 5.2.
170 5.4. Explicit invocation of services
172 The SRCP framework MUST be compliant with the IAB OPES[11]
173 framework. The applicability of the SRCP protocol will therefore be
174 specified as occurring between clients and servers at least one of
175 which is operating directly on behalf of the user requesting the
176 service.
178 5.5. Server Location and Load Balancing
180 To the extent feasible, the SRCP framework SHOULD exploit existing
181 schemes for performing service location and load balancing, such as
182 the Service Location Protocol[12] or DNS SRV records[13]. Where such
183 facilities are not deemed adequate, the SRCP framework MAY define
184 additional load balancing techniques.
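NOTE: The following non-normative Python sketch illustrates how a
client might order candidate servers obtained from DNS SRV records
[13], following the priority and weight rules of RFC 2782. The host
names, port numbers, and the names SrvRecord and order_srv_records
are assumptions made for illustration. This document does not define
an SRV service name for SRCP, and the sketch performs no DNS lookup.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower values are tried first
    weight: int    # relative share of selections within one priority
    port: int
    target: str


def order_srv_records(records: List[SrvRecord],
                      rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return the records in the order a client should try them."""
    if rng is None:
        rng = random.Random()
    ordered: List[SrvRecord] = []
    for priority in sorted({r.priority for r in records}):
        # Records with weight 0 go first so they keep a small chance
        # of being selected, as RFC 2782 describes.
        pool = sorted((r for r in records if r.priority == priority),
                      key=lambda r: r.weight != 0)
        while pool:
            total = sum(r.weight for r in pool)
            pick = rng.randint(0, total)
            running = 0
            for index, record in enumerate(pool):
                running += record.weight
                if running >= pick:
                    ordered.append(pool.pop(index))
                    break
    return ordered


# Hypothetical records: two primary servers and one backup.
candidates = [
    SrvRecord(20, 0, 5000, "backup.example.com"),
    SrvRecord(10, 60, 5000, "asr1.example.com"),
    SrvRecord(10, 40, 5000, "asr2.example.com"),
]
attempt_order = order_srv_records(candidates, random.Random(1))
```

The backup server is always tried last because its priority value is
higher. The relative order of the two primary servers varies from
call to call in proportion to their weights, which is how SRV records
spread load.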
186 6. TTS Requirements
188 The SRCP framework MUST allow a Media Processing Entity, using a
189 control protocol, to request the TTS Server to play back text as
190 voice in an RTP stream.
192 The TTS Server MUST support the reading of plain text. For reading
193 plain text, the language and voicing are local matters.
195 The TTS Server SHOULD support the reading of SSML [14] text.
197 OPEN ISSUE: Should the TTS Server infer the text is SSML by
198 detecting a legal SSML document, or must the protocol tell the TTS
199 Server the document type?
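NOTE: As a non-normative illustration of the first alternative in
the open issue above, the following Python sketch infers the document
type by checking whether the text parses as XML with a root element
named "speak", the root element of SSML [14]. It tests
well-formedness only, not validity against the SSML specification,
and the function name looks_like_ssml is hypothetical.

```python
import xml.etree.ElementTree as ET


def looks_like_ssml(text: str) -> bool:
    """Guess whether text is an SSML document rather than plain text."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, so treat it as plain text.
        return False
    # ElementTree reports namespaced tags as "{namespace}local-name".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "speak"


ssml_sample = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">'
    "Hello, world.</speak>"
)
plain_sample = "Hello, world."
```

Under this approach, plain text that happens to be well-formed SSML
is read as markup rather than spoken literally. That ambiguity is one
argument for having the protocol state the document type explicitly.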
201 The TTS Server MUST accept text over the SRCP connection for reading
202 over the RTP connection. The server MUST accept text either "by
203 value" (embedded in the protocol) or "by reference" (by
204 de-referencing a URI embedded in the protocol).
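NOTE: The following non-normative Python sketch shows one way a
server might model content that arrives either by value or by
reference. The names TextSource and resolve_text, the field names,
and the example URI are assumptions for illustration. The caller
supplies the fetch function, so the sketch makes no network request
and does not prescribe how a URI is de-referenced.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class TextSource:
    """Text to be spoken, carried inline or named by a URI."""
    inline_text: Optional[str] = None  # "by value"
    uri: Optional[str] = None          # "by reference"

    def __post_init__(self) -> None:
        # Exactly one of the two forms must be present.
        if (self.inline_text is None) == (self.uri is None):
            raise ValueError("set exactly one of inline_text or uri")


def resolve_text(source: TextSource, fetch: Callable[[str], str]) -> str:
    """Return the text to speak, de-referencing the URI if needed."""
    if source.inline_text is not None:
        return source.inline_text
    return fetch(source.uri)


# A stand-in for a real URI fetcher, backed by a dictionary.
fake_store = {"http://example.com/prompt.txt": "Welcome."}
by_value = resolve_text(TextSource(inline_text="Hello."), fake_store.__getitem__)
by_reference = resolve_text(
    TextSource(uri="http://example.com/prompt.txt"), fake_store.__getitem__
)
```

The same shape could carry grammars for the ASR Server, since the
ASR requirements use the same "by value" and "by reference" wording.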
206 OPEN ISSUE: Should we allow (or require) the TTS Server to use long-
207 lived control channels?
210 The TTS Server SHOULD support, and the SRCP framework MUST support
211 the specification of, "VCR Controls", such as skip forward, skip
212 backward, play faster, and play slower.
214 OPEN ISSUE: Should we allow for session parameters, like prosody and
215 voicing, as is specified for MRCP over RTSP [7]?
217 OPEN ISSUE: Should we allow for speech markers, as is specified for
218 MRCP over RTSP [7]?
220 7. ASR Requirements
222 The SRCP framework MUST allow a Media Processing Entity to request
223 the ASR Server to perform automatic speech recognition on an RTP
224 stream, returning the results over SRCP.
226 The ASR Server MUST support the XML specification for speech
227 recognition [15].
229 The ASR Server MUST accept grammar specifications either "by value"
230 (embedded in the protocol) or "by reference" (by de-referencing a
231 URI embedded in the protocol).
233 OPEN ISSUE: Should we allow the ASR Server to support alternative
234 grammar formats? If so, we need mechanisms to specify what format
235 the grammar is in, capability discovery, and handling unsupported
236 grammars.
238 OPEN ISSUE: Is there a need for all of the parameters specified for
239 MRCP over RTSP [7]? Most of them are part of the W3C speech
240 recognition grammar.
242 The ASR Server SHOULD support a method for capturing the input media
243 stream for later analysis and tuning of the ASR engine.
244 The ASR Server SHOULD support sharing grammars across sessions.
245 This supports applications whose grammars are too large to load
246 dynamically. An example is a city-country grammar for a weather
247 service.
249 8. Speaker Verification Requirements
251 The SRCP framework MUST allow a Media Processing Entity to request
252 the SV Server to perform speaker verification on an RTP stream,
253 returning the results over SRCP.
255 The SV Server MUST accept grammar specifications either "by
256 value" (embedded in the protocol) or "by reference" (by
257 de-referencing a URI embedded in the protocol).
259 The SRCP framework MUST accommodate an identifier for each
260 verification resource and permit control of that resource by ID,
261 because voiceprint format and contents are vendor-specific.
264 The SRCP framework MUST work with SV servers which maintain state to
265 handle multi-utterance verification.
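NOTE: The following non-normative Python sketch illustrates the kind
of state an SV server might keep to support multi-utterance
verification against a voiceprint named by an opaque identifier. The
class names, the score averaging, the thresholds, and the
identifiers are all assumptions for illustration. Real engines use
vendor-specific scoring, and this document does not specify any.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VerificationSession:
    voiceprint_id: str  # opaque, vendor-specific resource ID
    scores: List[float] = field(default_factory=list)


class VerificationServer:
    def __init__(self, accept_threshold: float = 0.8,
                 min_utterances: int = 2) -> None:
        self.accept_threshold = accept_threshold
        self.min_utterances = min_utterances
        self._sessions: Dict[str, VerificationSession] = {}

    def start(self, session_id: str, voiceprint_id: str) -> None:
        """Begin verifying a speaker against the named voiceprint."""
        self._sessions[session_id] = VerificationSession(voiceprint_id)

    def add_utterance_score(self, session_id: str,
                            score: float) -> Optional[bool]:
        """Record one utterance score.

        Returns None while more utterances are needed, otherwise
        True (accepted) or False (rejected).
        """
        session = self._sessions[session_id]
        session.scores.append(score)
        if len(session.scores) < self.min_utterances:
            return None
        mean = sum(session.scores) / len(session.scores)
        return mean >= self.accept_threshold


server = VerificationServer(accept_threshold=0.75, min_utterances=2)
server.start("session-1", "voiceprint-42")
first = server.add_utterance_score("session-1", 0.5)
second = server.add_utterance_score("session-1", 1.0)
```

The first call returns no decision because the server is still
accumulating evidence. A control protocol therefore needs to tie
successive utterances to the same server-side session.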
267 The SV Server SHOULD support a method for capturing the input media
268 stream for later analysis and tuning of the SV engine.
270 9. Dual-Mode Requirements
272 One very important requirement for an interactive speech-driven
273 system is that user perception of the quality of the interaction
274 depends strongly on the ability of the user to interrupt a prompt or
275 rendered TTS with speech. Interrupting, or barging, the speech
276 output requires more than energy detection from the user's
277 direction. Many advanced systems halt the media towards the user by
278 employing the ASR engine to decide if an utterance is likely to be
279 real speech, as opposed to a cough, for example.
281 To achieve low latency between utterance detection and halting of
282 playback, many implementations combine the speaking and ASR
283 functions. The SRCP framework MUST support such dual-mode
284 implementations.
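NOTE: The following non-normative Python sketch illustrates the
barge-in decision described above: energy detection alone does not
halt playback, and the prompt stops only when a speech classifier
also accepts the audio. The class name BargeInController, the energy
threshold, and the is_speech callback (a stand-in for the ASR
engine's judgment) are assumptions for illustration.

```python
from typing import Callable, Sequence


def mean_energy(samples: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples."""
    return sum(s * s for s in samples) / len(samples)


class BargeInController:
    def __init__(self, is_speech: Callable[[Sequence[int]], bool],
                 energy_threshold: float = 1.0e6) -> None:
        self.is_speech = is_speech
        self.energy_threshold = energy_threshold
        self.playing = True

    def on_frame(self, samples: Sequence[int]) -> bool:
        """Process one frame of caller audio.

        Returns True if prompt playback should continue.
        """
        if not self.playing or not samples:
            return self.playing
        if mean_energy(samples) < self.energy_threshold:
            return True  # too quiet to be an interruption
        if self.is_speech(samples):
            self.playing = False  # real speech: halt the prompt
        return self.playing


# Toy frames: the "classifier" below accepts only the speech frame.
quiet_frame = [10] * 8
cough_frame = [3000] * 8
speech_frame = [3000, -3000] * 4

controller = BargeInController(is_speech=lambda f: list(f) == speech_frame)
results = [controller.on_frame(f)
           for f in (quiet_frame, cough_frame, speech_frame, quiet_frame)]
```

Running the classifier in the same element that plays the prompt
avoids a network round trip between detection and halting, which is
the latency argument for dual-mode implementations.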
286 10. Thoughts to Date (non-normative)
288 The protocol assumes RTP carriage of media. Assuming session-
289 oriented media transport, the protocol will use SDP to describe the
290 session.
292 The working group will not be investigating distributed speech
293 recognition (DSR), as exemplified by the ETSI Aurora project. The
294 working group will not be recreating functionality available in
295 other protocols, such as SIP or SDP.
297 TTS looks very much like playing back a file. Extending RTSP looks
298 promising for when one requires VCR controls or markers in the text
299 to be spoken. When one does not require VCR controls, SIP in a
300 framework such as Network Announcements [16] works directly without
301 modification.
303 ASR has an entirely different set of characteristics. For barge-in
304 support, ASR requires real-time return of intermediate results.
305 Barring the discovery of a good reuse model for an existing
306 protocol, this will most likely become the focus of SRCP.
308 11. Security Considerations
310 Protocols relating to speech processing must take security into
311 account. This is particularly important as popular uses for TTS
312 include reading financial information. Likewise, popular uses for
313 ASR include executing financial transactions and shopping.
317 We envision that rather than providing application-specific security
318 mechanisms in SRCP itself, the resulting protocol will employ
319 security machinery of either containing protocols or the transport
320 on which it runs. For example, we will consider solutions such as
321 using TLS for securing the control channel, and SRTP for securing
322 the media channel.
324 12. References
326 1 Bradner, S., "The Internet Standards Process -- Revision 3", BCP
327 9, RFC 2026, October 1996.
329 2 Bradner, S., "Key words for use in RFCs to Indicate Requirement
330 Levels", BCP 14, RFC 2119, March 1997
332 3 Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J.,
333 "SIP: Session Initiation Protocol", RFC 2543, March 1999
335 4 Arango, M., Dugan, A., Elliott, I., Huitema, C., and Pickett, S.,
336 "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 2705,
337 October 1999
339 5 Cuervo, F., Greene, N., Rayhan, A., Huitema, C., Rosen, B., and
340 Segers, J., "Megaco Protocol Version 1.0", RFC 3015, November 2000
342 6 Schulzrinne, H., Rao, A., and Lanphier, R., "Real Time Streaming
343 Protocol (RTSP)", RFC 2326, April 1998
345 7 Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource
346 Control Protocol", draft-shanmugham-mrcp-01.txt, November 2001,
347 work in progress
349 8 Robinson, F., Marquette, B., and R. Hernandez, "Using Media
350 Resource Control Protocol with SIP", draft-robinson-mrcp-sip-
351 00.txt, September 2001, work in progress
353 9 World Wide Web Consortium, "Voice Extensible Markup Language
354 (VoiceXML) Version 2.0", W3C Working Draft, October 2001, work
356 in progress
358 10 Van Dyke, J. and Burger, E., "SIP URI Conventions for Media
359 Servers", draft-burger-sipping-msuri-01, July 2001, work in
360 progress (expired)
362 11 Floyd, S., Daigle, L., "IAB Architectural and Policy
363 Considerations for Open Pluggable Edge Services", RFC 3238,
364 January 2002.
368 12 Guttman, E., Perkins, C., Veizades, J., Day, M., "Service
369 Location Protocol, Version 2", RFC 2608, June 1999.
371 13 Gulbrandsen, A., Vixie, P., Esibov, L., "A DNS RR for specifying
372 the location of services (DNS SRV)", RFC 2782, February 2000.
374 14 World Wide Web Consortium, "Speech Synthesis Markup Language
375 Specification for the Speech Interface Framework", W3C Working
376 Draft, January 2001, work in progress
379 15 World Wide Web Consortium, "Speech Recognition Grammar
380 Specification for the W3C Speech Interface Framework", W3C
381 Working Draft, August 2001, work in progress
384 16 O'Connor, W., Burger, E., "Network Announcements with SIP",
385 draft-ietf-sipping-netann-01.txt, November 2001, work in progress
387 13. Acknowledgments
389 Brian Eberman came up with the new name. It is catchy and describes
390 what we are working on.
392 14. Authors' Addresses
394 Eric W. Burger
395 SnowShore Networks, Inc.
396 Chelmsford, MA
397 USA
398 Email: eburger@snowshore.com
400 David R. Oran
401 Cisco Systems, Inc.
402 Acton, MA
403 USA
404 Email: oran@cisco.com
406 15. Change Log
408 From version draft-burger-mrcp-reqts-00 to version draft-burger-
409 speechsc-reqts-00:
410 - draft name changed per area director advice
411 - added speaker verification to the areas addressed, including
412 speaker verification requirements, per Dan Burnet's
413 presentation at the Minneapolis BoF (see minutes).
417 - based on mailing list discussion, added requirement to handle
418 both "by value" and "by reference" data. This applies both to
419 text to be played out by TTS and to grammar(s) to be applied to ASR.
420 - Based on discussion at the BoF in Minneapolis, added a
421 requirement concerning the use of load balancing schemes,
422 including those based on SRVLOC, SRV.
423 - Added a requirement for OPES compliance, per a discussion
424 with Sally Floyd as IAB observer for the BoF.
428 Full Copyright Statement
430 Copyright (C) The Internet Society (2002). All Rights Reserved.
432 This document and translations of it may be copied and furnished to
433 others, and derivative works that comment on or otherwise explain it
434 or assist in its implementation may be prepared, copied, published
435 and distributed, in whole or in part, without restriction of any
436 kind, provided that the above copyright notice and this paragraph are
437 included on all such copies and derivative works. However, this
438 document itself may not be modified in any way, such as by removing
439 the copyright notice or references to the Internet Society or other
440 Internet organizations, except as needed for the purpose of
441 developing Internet standards in which case the procedures for
442 copyrights defined in the Internet Standards process must be
443 followed, or as required to translate it into languages other than
444 English.
446 The limited permissions granted above are perpetual and will not be
447 revoked by the Internet Society or its successors or assigns. This
448 document and the information contained herein is provided on an "AS
449 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
450 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
451 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
452 NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
453 OR FITNESS FOR A PARTICULAR PURPOSE.
455 Acknowledgement
457 The Internet Society currently provides funding for the RFC Editor
458 function.