idnits 2.17.1 

draft-burger-speechsc-reqts-00.txt:
  ** The Abstract section seems to be numbered


  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 13, 2002) is 7985 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Missing reference section? '1' on line 13 looks like a reference

  -- Missing reference section? '2' on line 56 looks like a reference

  -- Missing reference section? '3' on line 72 looks like a reference

  -- Missing reference section? '4' on line 72 looks like a reference

  -- Missing reference section? '5' on line 73 looks like a reference

  -- Missing reference section? '6' on line 73 looks like a reference

  -- Missing reference section? '7' on line 239 looks like a reference

  -- Missing reference section? '8' on line 84 looks like a reference

  -- Missing reference section? '9' on line 129 looks like a reference

  -- Missing reference section? '10' on line 163 looks like a reference

  -- Missing reference section? '11' on line 172 looks like a reference

  -- Missing reference section? '12' on line 182 looks like a reference

  -- Missing reference section? '13' on line 182 looks like a reference

  -- Missing reference section? '14' on line 195 looks like a reference

  -- Missing reference section? '15' on line 227 looks like a reference

  -- Missing reference section? '16' on line 300 looks like a reference


     Summary: 5 errors (**), 0 flaws (~~), 1 warning (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Network Working Group                                        E. Burger
2	Internet Draft                                SnowShore Networks, Inc.
3	Document: draft-burger-speechsc-reqts-00.txt                   D. Oran
4	Category: Informational                            Cisco Systems, Inc.
5	Expires August 2002                                      June 13, 2002

7	   Requirements for Distributed Control of ASR, SV and TTS Resources

9	Status of this Memo

11	   This document is an Internet-Draft and is in full conformance with
12	   all provisions of Section 10 of RFC2026 [1].

14	   Internet-Drafts are working documents of the Internet Engineering
15	   Task Force (IETF), its areas, and its working groups. Note that
16	   other groups may also distribute working documents as Internet-
17	   Drafts. Internet-Drafts are draft documents valid for a maximum of
18	   six months and may be updated, replaced, or obsoleted by other
19	   documents at any time. It is inappropriate to use Internet-Drafts
20	   as reference material or to cite them other than as "work in
21	   progress."

23	   The list of current Internet-Drafts can be accessed at
24	   http://www.ietf.org/ietf/1id-abstracts.txt

26	   The list of Internet-Draft Shadow Directories can be accessed at
27	   http://www.ietf.org/shadow.html.

29	1. Abstract

31	   This document outlines the needs and requirements for a protocol to
32	   control distributed speech processing of audio streams.  By speech
33	   processing, this document specifically means automatic speech
34	   recognition, speaker verification and text-to-speech.  Other IETF
35	   protocols, such as SIP and RTSP, address rendezvous and control for
36	   generalized media streams.  However, speech processing presents
37	   additional requirements that none of the extant IETF protocols
38	   address.

40	   Discussion of this and related documents is on the MRCP list.  To
41	   subscribe, send the message "subscribe mrcp" to
42	   majordomo@snowshore.com.  The public archive is at
43	   http://flyingfox.snowshore.com/mrcp_archive/maillist.html.

45	   NOTE: This mailing list will be superseded by an official working
46	   group mailing list, cats@ietf.org, once the WG is formally
47	   chartered.

49	                Distributed Media Control Requirements  February 2002

51	2. Conventions used in this document

53	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
54	   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in
55	   this document are to be interpreted as described in RFC-2119 [2].

57	   FORMATTING NOTE: Notes, such as this one, provide additional,
58	   nonessential information that the reader may skip without missing
59	   anything essential.  The primary purpose of these non-essential
60	   notes is to convey information about the rationale of this document,
61	   or to place this document in the proper historical or evolutionary
62	   context.  Readers whose sole purpose is to construct a conformant
63	   implementation may skip such information.  However, it may be of use
64	   to those who wish to understand why we made certain design choices.

66	   OPEN ISSUES: This document highlights questions that are, as yet,
67	   undecided as "OPEN ISSUES".

69	3. Introduction

71	   There are multiple IETF protocols for establishment and termination
72	   of media sessions (SIP[3]), low-level media control (MGCP[4] and
73	   MEGACO[5]), and media record and playback (RTSP[6]). This document
74	   focuses on requirements for one or more protocols to support the
75	   control of network elements that perform Automated Speech
76	   Recognition (ASR), speaker verification (SV), and rendering text
77	   into audio, a.k.a. Text-to-Speech (TTS). Many multimedia
78	   applications can benefit from having automatic speech recognition
79	   (ASR) and text-to-speech (TTS) processing available as a
80	   distributed network resource.  This requirements document limits
81	   its focus to the distributed control of ASR, SV and TTS servers.

83	   To date, there are a number of proprietary ASR and TTS APIs, as
84	   well as two IETF drafts that address this problem [7] [8].  However,
85	   there are serious deficiencies in the existing drafts.  In
86	   particular, they mix the semantics of existing protocols yet are
87	   close enough to other protocols as to be confusing to the
88	   implementer.

90	   This document sets forth requirements for protocols to support
91	   distributed speech processing of audio streams.

93	   For simplicity, and to remove confusion with existing protocol
94	   proposals, this document presents the requirements as being for a
95	   "new protocol" that addresses the distributed control of speech
96	   resources.  It refers to such a protocol as "SRCP", for Speech
97	   Resource Control Protocol.

99	4. SRCP Framework

101	   The following is the SRCP framework for speech processing.

105	                       +-------------+
106	                       | Application |
107	                       |   Server    |
108	                       +-------------+
109	         SIP or whatever /
110	                        /
111	        +------------+ /                       +--------+
112	        |   Media    |/          SRCP          |  ASR   |
113	        | Processing |-------------------------| and/or |
114	    RTP |   Entity   |           RTP           |  TTS   |
115	   =====|            |=========================| Server |
116	        +------------+                         +--------+

118	   The "Media Processing Entity" is a network element that processes
119	   media.  The "Application Server" is a network element that instructs
120	   the Media Processing Entity on what transformations to make to the
121	   media stream.  The "ASR and/or TTS Server" is a network element that
122	   either generates an RTP stream based on text input (TTS) or returns
123	   speech recognition results in response to an RTP stream as input
124	   (ASR).  The Media Processing Entity controls the ASR or TTS Server
125	   using SRCP as a control protocol.

127	   Physical embodiments of the entities can reside in one physical
128	   instance per entity, or some combination of entities.  For example,
129	   a VoiceXML [9] Gateway may combine the ASR and TTS functions on the
130	   same platform as the Media Processing Entity. Note that VoiceXML
131	   Gateways themselves are outside the scope of this protocol.

133	   Likewise, one can combine the Application Server and Media
134	   Processing Entity, as would be the case in an interactive voice
135	   response (IVR) platform.

137	   One can also decompose the Media Processing Entity into an entity
138	   that controls media endpoints and entities that process media
139	   directly.  Such would be the case with a decomposed gateway using
140	   MGCP or MEGACO. However, this decomposition is again orthogonal to
141	   the scope of SRCP.

143	5. General Requirements

145	5.1. Reuse Existing Protocols

147	   To the extent feasible, the SRCP framework SHOULD use existing
148	   protocols.

150	5.2. Maintain Existing Protocol Integrity

152	   In meeting requirement 5.1, the SRCP framework MUST NOT redefine the
153	   semantics of an existing protocol.

157	   Said differently, we will not break existing protocols or cause
158	   backward compatibility problems.

160	5.3. Avoid Duplicating Existing Protocols

162	   To the extent feasible, SRCP SHOULD NOT duplicate the functionality
163	   of existing protocols.  For example, SIP with msuri [10] and RTSP
164	   already define how to request playback of audio.

166	   The focus of SRCP is new functionality not addressed by existing
167	   protocols or extending existing protocols within the strictures of
168	   requirement 5.2.

170	5.4. Explicit invocation of services

172	   The SRCP framework MUST be compliant with the IAB OPES[11]
173	   framework. The applicability of the SRCP protocol will therefore be
174	   specified as occurring between clients and servers at least one of
175	   which is operating directly on behalf of the user requesting the
176	   service.

178	5.5. Server Location and Load Balancing

180	   To the extent feasible, the SRCP framework SHOULD exploit existing
181	   schemes for performing service location and load balancing, such as
182	   the Service Location Protocol[12] or DNS SRV records[13]. Where such
183	   facilities are not deemed adequate, the SRCP framework MAY define
184	   additional load balancing techniques.
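
   NOTE: As a non-normative illustration of reusing DNS SRV records
   for server selection, the sketch below orders candidate servers in
   the way RFC 2782 describes: lowest priority value first, then a
   weighted random choice within each priority.  The SrvRecord type,
   the function name, and the example.com host names are assumptions
   made for this example.  This document does not define an SRV
   service name for SRCP, and the sketch performs no DNS lookup.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    """Fields of one SRV record (hypothetical container for this sketch)."""
    priority: int
    weight: int
    port: int
    target: str


def order_srv_records(records, rng=None):
    """Return the records in the order a client should try them."""
    rng = rng or random.Random()
    ordered = []
    # Lower priority values are tried first.
    for priority in sorted({r.priority for r in records}):
        # Weight-0 records go first so they keep a small chance of selection.
        group = sorted((r for r in records if r.priority == priority),
                       key=lambda r: r.weight != 0)
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)
            running = 0
            # Select the first record whose running weight sum reaches pick.
            for index, record in enumerate(group):
                running += record.weight
                if running >= pick:
                    ordered.append(group.pop(index))
                    break
    return ordered


if __name__ == "__main__":
    candidates = [
        SrvRecord(20, 0, 5000, "backup.example.com"),
        SrvRecord(10, 60, 5000, "asr1.example.com"),
        SrvRecord(10, 40, 5000, "asr2.example.com"),
    ]
    for record in order_srv_records(candidates):
        print(record.priority, record.target, record.port)
```

   With these inputs the two priority-10 servers are always tried
   before the priority-20 backup; their relative order varies from
   run to run according to the weights.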

186	6. TTS Requirements

188	   The SRCP framework MUST allow a Media Processing Entity, using a
189	   control protocol, to request the TTS Server to play back text as
190	   voice in an RTP stream.

192	   The TTS Server MUST support the reading of plain text.  For reading
193	   plain text, the language and voicing are local matters.

195	   The TTS Server SHOULD support the reading of SSML [14] text.

197	   OPEN ISSUE: Should the TTS Server infer the text is SSML by
198	   detecting a legal SSML document, or must the protocol tell the TTS
199	   Server the document type?
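
   NOTE: To make the trade-off in this open issue concrete, the
   sketch below shows what inference could look like: treat the text
   as SSML only if it parses as well-formed XML whose root element is
   named "speak", and let an explicit type supplied by the protocol
   override the guess.  The function names and the
   "application/ssml+xml" and "text/plain" labels are assumptions for
   illustration.  The check tests well-formedness only, not validity
   against the SSML specification.

```python
import xml.etree.ElementTree as ET


def looks_like_ssml(text):
    """Heuristic: well-formed XML whose root element is <speak>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Strip an optional "{namespace}" prefix from the root tag name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == "speak"


def resolve_document_type(text, declared_type=None):
    """Prefer a type declared by the protocol; otherwise guess from content."""
    if declared_type is not None:
        return declared_type
    return "application/ssml+xml" if looks_like_ssml(text) else "text/plain"
```

   Inference of this kind cannot tell whether a caller wants a
   well-formed <speak> document read aloud literally, which is one
   argument for having the protocol carry the document type.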

201	   The TTS Server MUST accept text over the SRCP connection for reading
202	   over the RTP connection. The server MUST accept text either "by
203	   value" (embedded in the protocol), or "by reference" (by de-
204	   referencing a URI embedded in the protocol).
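
   NOTE: One possible shape for this requirement, given as a
   non-normative sketch: a content item carries either inline text or
   a URI, never both, and the server dereferences the URI only in the
   second case.  The Content class, its field names, and the fetch
   callback are hypothetical; how SRCP would encode either form is not
   decided here.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Content:
    """Text to speak, supplied by value (inline) or by reference (uri)."""
    inline: Optional[str] = None
    uri: Optional[str] = None

    def __post_init__(self):
        # Exactly one of the two forms must be present.
        if (self.inline is None) == (self.uri is None):
            raise ValueError("supply exactly one of inline or uri")
        if self.uri is not None and not urlparse(self.uri).scheme:
            raise ValueError("uri must include a scheme")


def resolve(content: Content, fetch: Callable[[str], str]) -> str:
    """Return the text, calling fetch(uri) only for by-reference content."""
    if content.inline is not None:
        return content.inline
    return fetch(content.uri)
```

   The same shape would fit the grammar requirement in the ASR
   section, with the grammar document in place of the text.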

206	   OPEN ISSUE: Should we allow (or require) the TTS Server to use long-
207	   lived control channels?

210	   The TTS Server SHOULD support, and the SRCP framework MUST support
211	   the specification of, "VCR Controls", such as skip forward, skip
212	   backward, play faster, and play slower.

214	   OPEN ISSUE: Should we allow for session parameters, like prosody and
215	   voicing, as is specified for MRCP over RTSP [7]?

217	   OPEN ISSUE: Should we allow for speech markers, as is specified for
218	   MRCP over RTSP [7]?

220	7. ASR Requirements

222	   The SRCP framework MUST allow a Media Processing Entity to request
223	   the ASR Server to perform automatic speech recognition on an RTP
224	   stream, returning the results over SRCP.

226	   The ASR Server MUST support the XML specification for speech
227	   recognition [15].

229	   The ASR Server MUST accept grammar specifications either "by value"
230	   (embedded in the protocol), or "by reference" (by de-referencing a
231	   URI embedded in the protocol).

233	   OPEN ISSUE: Should we allow the ASR Server to support alternative
234	   grammar formats?  If so, we need mechanisms to specify what format
235	   the grammar is in, capability discovery, and handling unsupported
236	   grammars.

238	   OPEN ISSUE: Is there a need for all of the parameters specified for
239	   MRCP over RTSP [7]?  Most of them are part of the W3C speech
240	   recognition grammar.

242	   The ASR Server SHOULD support a method for capturing the input media
243	   stream for later analysis and tuning of the ASR engine.

244	   The ASR Server SHOULD support sharing grammars across sessions.
245	   This supports applications with large grammars that are impractical
246	   to load dynamically.  An example is a city-country grammar for a
247	   weather service.
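
   NOTE: A non-normative sketch of grammar sharing: the server
   compiles a grammar the first time any session references it and
   hands the same compiled object to later sessions.  The GrammarCache
   class and the load and compile callbacks are hypothetical stand-ins
   for engine-specific code; eviction and grammar updates are left
   out.

```python
import threading


class GrammarCache:
    """Compile each grammar once and share the result across sessions."""

    def __init__(self, load, compile_grammar):
        self._load = load                # uri -> grammar source text
        self._compile = compile_grammar  # source text -> compiled grammar
        self._compiled = {}
        self._lock = threading.Lock()
        self.compilations = 0

    def get(self, uri):
        # The lock keeps concurrent sessions from compiling a grammar twice.
        with self._lock:
            if uri not in self._compiled:
                self._compiled[uri] = self._compile(self._load(uri))
                self.compilations += 1
            return self._compiled[uri]
```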

249	8. Speaker Verification Requirements

251	   The SRCP framework MUST allow a Media Processing Entity to request
252	   the SV Server to perform speaker verification on an RTP stream,
253	   returning the results over SRCP.

255	   The SV Server MUST accept grammar specifications either "by value"
256	   (embedded in the protocol), or "by reference" (by de-referencing a
257	   URI embedded in the protocol).

259	   The SRCP framework MUST accommodate an identifier for each
260	   verification resource and permit control of that resource by ID,
261	   because voiceprint format and contents are vendor specific.

264	   The SRCP framework MUST work with SV servers which maintain state to
265	   handle multi-utterance verification.
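
   NOTE: The sketch below illustrates, non-normatively, the kind of
   server-side state that multi-utterance verification implies: a
   session is bound to an opaque voiceprint identifier and accumulates
   per-utterance scores until it can decide.  The class name, the
   score range, the threshold, the averaging rule, and the result
   strings are all assumptions; real engines define their own scoring.

```python
from dataclasses import dataclass, field


@dataclass
class VerificationSession:
    """State kept across utterances for one verification attempt."""
    voiceprint_id: str            # opaque, vendor-specific resource ID
    accept_threshold: float = 0.8
    min_utterances: int = 2
    scores: list = field(default_factory=list)

    def add_utterance(self, score):
        """Record the engine's score for one utterance; return the status."""
        self.scores.append(score)
        return self.decision()

    def decision(self):
        if len(self.scores) < self.min_utterances:
            return "need-more-speech"
        mean = sum(self.scores) / len(self.scores)
        return "accepted" if mean >= self.accept_threshold else "rejected"
```

   Because the client refers to the voiceprint only by ID, nothing in
   this state exposes the vendor-specific voiceprint contents.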

267	   The SV Server SHOULD support a method for capturing the input media
268	   stream for later analysis and tuning of the SV engine.

270	9. Dual-Mode Requirements

272	   One very important requirement for an interactive speech-driven
273	   system is that user perception of the quality of the interaction
274	   depends strongly on the ability of the user to interrupt a prompt or
275	   rendered TTS with speech.  Interrupting, or barging, the speech
276	   output requires more than energy detection from the user's
277	   direction.  Many advanced systems halt the media towards the user by
278	   employing the ASR engine to decide if an utterance is likely to be
279	   real speech, as opposed to a cough, for example.

281	   To achieve low latency between utterance detection and halting of
282	   playback, many implementations combine the speaking and ASR
283	   functions.  The SRCP framework MUST support such dual-mode
284	   implementations.
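
   NOTE: A non-normative sketch of the barge-in decision described
   above: energy detection alone does not halt playback; the prompt
   stops only when the recognizer also judges the audio to be speech.
   The DualModeSession class, the energy threshold, and the is_speech
   callback are hypothetical; a real engine would supply its own
   speech detector.

```python
class DualModeSession:
    """Combined playback and recognition state for one call."""

    def __init__(self, is_speech, energy_threshold=0.1):
        self._is_speech = is_speech          # frame -> bool, from the ASR engine
        self._energy_threshold = energy_threshold
        self.playing = True

    def on_audio_frame(self, frame_energy, frame):
        """Return True if this frame caused playback to halt."""
        if not self.playing:
            return False
        # Cheap gate first: ignore frames that are too quiet to matter.
        if frame_energy < self._energy_threshold:
            return False
        # Loud audio halts the prompt only if it is likely real speech.
        if self._is_speech(frame):
            self.playing = False
            return True
        return False
```

   Keeping both checks in one process is what gives the low latency
   between detection and halting that motivates dual-mode servers.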

286	10. Thoughts to Date (non-normative)

288	   The protocol assumes RTP carriage of media. Assuming session-
289	   oriented media transport, the protocol will use SDP to describe the
290	   session.

292	   The working group will not be investigating distributed speech
293	   recognition (DSR), as exemplified by the ETSI Aurora project.  The
294	   working group will not be recreating functionality available in
295	   other protocols, such as SIP or SDP.

297	   TTS looks very much like playing back a file.  Extending RTSP looks
298	   promising for when one requires VCR controls or markers in the text
299	   to be spoken.  When one does not require VCR controls, SIP in a
300	   framework such as Network Announcements [16] works directly without
301	   modification.

303	   ASR has an entirely different set of characteristics.  For barge-in
304	   support, ASR requires real-time return of intermediate results.
305	   Barring the discovery of a good reuse model for an existing
306	   protocol, this will most likely become the focus of SRCP.

308	11. Security Considerations

310	   Protocols relating to speech processing must take security into
311	   account.  This is particularly important as popular uses for TTS
312	   include reading financial information.  Likewise, popular uses for
313	   ASR include executing financial transactions and shopping.

317	   We envision that rather than providing application-specific security
318	   mechanisms in SRCP itself, the resulting protocol will employ
319	   security machinery of either containing protocols or the transport
320	   on which it runs.  For example, we will consider solutions such as
321	   using TLS for securing the control channel, and SRTP for securing
322	   the media channel.
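
   NOTE: As a non-normative example of relying on transport security
   for the control channel, the sketch below builds a client-side TLS
   configuration with Python's standard ssl module.  The function
   name, the optional CA file parameter, and the choice of TLS 1.2 as
   a floor are assumptions for illustration; this document mandates
   none of them, and SRTP for the media channel is not shown.

```python
import ssl


def make_control_channel_context(ca_file=None):
    """Return a client TLS context that verifies the server certificate."""
    # The default context enables certificate and host name verification.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2 (an assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

   A client would then wrap its control connection with
   context.wrap_socket(sock, server_hostname=host) before sending any
   protocol messages.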

324	12. References

326	   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
327	      9, RFC 2026, October 1996.

329	   2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
330	      Levels", BCP 14, RFC 2119, March 1997.

332	   3  Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J.,
333	      "SIP: Session Initiation Protocol", RFC 2543, March 1999.

335	   4  Arango, M., Dugan, A., Elliott, I., Huitema, C., and Pickett, S.,
336	      "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 2705,
337	      October 1999.

339	   5  Cuervo, F., Greene, N., Rayhan, A., Huitema, C., Rosen, B., and
340	      Segers, J., "Megaco Protocol Version 1.0", RFC 3015, November 2000.

342	   6  Schulzrinne, H., Rao, A., and Lanphier, R., "Real Time Streaming
343	      Protocol (RTSP)", RFC 2326, April 1998.

345	   7  Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource
346	      Control Protocol", draft-shanmugham-mrcp-01.txt, November 2001,
347	      work in progress.

349	   8  Robinson, F., Marquette, B., and R. Hernandez, "Using Media
350	      Resource Control Protocol with SIP", draft-robinson-mrcp-sip-
351	      00.txt, September 2001, work in progress.

353	   9  World Wide Web Consortium, "Voice Extensible Markup Language
354	      (VoiceXML) Version 2.0", W3C Working Draft,
355	      <http://www.w3.org/TR/2001/WD-voicexml20-20011023/>,
356	      October 2001, work in progress.

358	   10 Van Dyke, J. and Burger, E., "SIP URI Conventions for Media
359	      Servers", draft-burger-sipping-msuri-01, July 2001, work in
360	      progress (expired).

362	   11 Floyd, S., Daigle, L., "IAB Architectural and Policy
363	      Considerations for Open Pluggable Edge Services", RFC 3238,
364	      January 2002.

368	   12 Guttman, E., Perkins, C., Veizades, J., Day, M., "Service
369	      Location Protocol, Version 2", RFC 2608, June 1999.

371	   13 Gulbrandsen, A., Vixie, P., Esibov, L., "A DNS RR for specifying
372	      the location of services (DNS SRV)", RFC 2782, February 2000.

374	   14 World Wide Web Consortium, "Speech Synthesis Markup Language
375	      Specification for the Speech Interface Framework", W3C Working
376	      Draft, <http://www.w3.org/TR/speech-synthesis>, January 2001,
377	      work in progress.

379	   15 World Wide Web Consortium, "Speech Recognition Grammar
380	      Specification for the W3C Speech Interface Framework", W3C
381	      Working Draft, <http://www.w3.org/TR/speech-grammar/>, August
382	      2001, work in progress.

384	   16 O'Connor, W., Burger, E., "Network Announcements with SIP",
385	      draft-ietf-sipping-netann-01.txt, November 2001, work in progress.

387	13. Acknowledgments

389	   Brian Eberman came up with the new name.  It is catchy and describes
390	   what we are working on.

392	14. Authors' Addresses

394	   Eric W. Burger
395	   SnowShore Networks, Inc.
396	   Chelmsford, MA
397	   USA
398	   Email: eburger@snowshore.com

400	   David R. Oran
401	   Cisco Systems, Inc.
402	   Acton, MA
403	   USA
404	   Email: oran@cisco.com

406	15. Change Log

408	   From version draft-burger-mrcp-reqts-00 to version draft-burger-
409	   speechsc-reqts-00:
410	        - draft name changed per area director advice
411	        - added speaker verification to the areas addressed, including
412	          speaker verification requirements, per Dan Burnet?s
413	          presentation at the Minneapolis BoF (see minutes).

417	        - based on mailing list discussion, added requirement to handle
418	          both "by value" and "by reference" data. This is both for TTS
419	          to be played out and grammar(s) to be applied to ASR.
420	        - Based on discussion at the BoF in Minneapolis, added a
421	          requirement concerning the use of load balancing schemes,
422	          including those based on SRVLOC, SRV.
423	        - Added a requirement for OPES compliance, per a discussion
424	          with Sally Floyd as IAB observer for the BoF.

428	Full Copyright Statement

430	   Copyright (C) The Internet Society (2002).  All Rights Reserved.

432	   This document and translations of it may be copied and furnished to
433	   others, and derivative works that comment on or otherwise explain it
434	   or assist in its implementation may be prepared, copied, published
435	   and distributed, in whole or in part, without restriction of any
436	   kind, provided that the above copyright notice and this paragraph are
437	   included on all such copies and derivative works.  However, this
438	   document itself may not be modified in any way, such as by removing
439	   the copyright notice or references to the Internet Society or other
440	   Internet organizations, except as needed for the purpose of
441	   developing Internet standards in which case the procedures for
442	   copyrights defined in the Internet Standards process must be
443	   followed, or as required to translate it into languages other than
444	   English.

446	   The limited permissions granted above are perpetual and will not be
447	   revoked by the Internet Society or its successors or assigns.  This
448	   document and the information contained herein is provided on an "AS
449	   IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
450	   FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
451	   LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
452	   NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
453	   OR FITNESS FOR A PARTICULAR PURPOSE.

455	Acknowledgement

457	   The Internet Society currently provides funding for the RFC Editor
458	   function.