idnits 2.17.1

draft-burger-mrcp-reqts-00.txt:

  ** The Abstract section seems to be numbered

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == There are 8 instances of lines with non-ascii characters in the document.

  == The page length should not exceed 58 lines per page, but there was 7
     longer pages, the longest (page 2) being 59 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 19, 2002) is 8073 days in the past. Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Missing reference section? '1' on line 14 looks like a reference

  -- Missing reference section? '2' on line 50 looks like a reference

  -- Missing reference section? '3' on line 69 looks like a reference

  -- Missing reference section? '4' on line 69 looks like a reference

  -- Missing reference section? '5' on line 70 looks like a reference

  -- Missing reference section? '6' on line 70 looks like a reference

  -- Missing reference section? '7' on line 221 looks like a reference

  -- Missing reference section? '8' on line 81 looks like a reference

  -- Missing reference section? '9' on line 127 looks like a reference

  -- Missing reference section? '10' on line 161 looks like a reference

  -- Missing reference section? '11' on line 177 looks like a reference

  -- Missing reference section? '12' on line 213 looks like a reference

  -- Missing reference section? '13' on line 261 looks like a reference

  Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--).

  Run idnits with the --verbose option for more detailed information about
  the items above.

--------------------------------------------------------------------------------

2 Network Working Group                                        E. Burger
3 Internet Draft                                 SnowShore Networks, Inc.
4 Document: draft-burger-mrcp-reqts-00.txt                       D. Oran
5 Category: Informational                            Cisco Systems, Inc.
6 Expires August 2002                                  February 19, 2002

8 Requirements for Distributed Control of ASR and TTS Resources

10 Status of this Memo

12 This document is an Internet-Draft and is in full conformance with
13 all provisions of Section 10 of RFC2026 [1].
15 Internet-Drafts are working documents of the Internet Engineering
16 Task Force (IETF), its areas, and its working groups. Note that
17 other groups may also distribute working documents as Internet-
18 Drafts. Internet-Drafts are draft documents valid for a maximum of
19 six months and may be updated, replaced, or obsoleted by other
20 documents at any time. It is inappropriate to use Internet-Drafts
21 as reference material or to cite them other than as "work in
22 progress."

24 The list of current Internet-Drafts can be accessed at
25 http://www.ietf.org/ietf/1id-abstracts.txt

27 The list of Internet-Draft Shadow Directories can be accessed at
28 http://www.ietf.org/shadow.html.

30 1. Abstract

32 This document outlines the needs and requirements for a protocol to
33 control distributed speech processing of audio streams. By speech
34 processing, this document specifically means automatic speech
35 recognition and text-to-speech. Other IETF protocols, such as SIP
36 and RTSP, address rendezvous and control for generalized media
37 streams. However, speech processing presents additional
38 requirements that none of the extant IETF protocols address.

40 Discussion of this and related documents is on the MRCP list. To
41 subscribe, send the message "subscribe mrcp" to
42 majordomo@snowshore.com. The public archive is at
43 http://flyingfox.snowshore.com/mrcp_archive/maillist.html.

45 2. Conventions used in this document

47 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
48 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
49 this document are to be interpreted as described in RFC-2119 [2].

51 Burger & Oran Informational - Expires August 2002 1
52 Distributed Media Control Requirements February 2002

54 FORMATTING NOTE: Notes, such as this one, provide additional,
55 nonessential information that the reader may skip without missing
56 anything essential.
The primary purpose of these non-essential
57 notes is to convey information about the rationale of this document,
58 or to place this document in the proper historical or evolutionary
59 context. Readers whose sole purpose is to construct a conformant
60 implementation may skip such information. However, it may be of use
61 to those who wish to understand why we made certain design choices.

63 OPEN ISSUES: This document highlights questions that are, as yet,
64 undecided as "OPEN ISSUES".

66 3. Introduction

68 There are multiple IETF protocols for establishment and termination
69 of media sessions (SIP[3]), low-level media control (MGCP[4] and
70 megaco[5]), and media record and playback (RTSP[6]). The focus of
71 this document is requirements for one or more protocols to support
72 the control of network elements that perform Automatic Speech
73 Recognition (ASR) and rendering text into audio, a.k.a. Text-to-
74 Speech (TTS). Many multimedia applications can benefit from having
75 ASR and TTS processing available as a distributed network
76 resource. This
77 requirements document limits its focus to the distributed control of
78 ASR and TTS servers.

80 To date, there are a number of proprietary ASR and TTS APIs, as
81 well as two IETF drafts that address this problem [7] [8]. However,
82 there are serious deficiencies in the existing drafts. In
83 particular, they mix the semantics of existing protocols yet are
84 close enough to other protocols as to be confusing to the
85 implementer.

87 This document sets forth requirements for protocols to support
88 distributed speech processing of audio streams.

90 For simplicity, and to remove confusion with existing protocol
91 proposals, this document presents the requirements as being for a
92 "new protocol" that addresses the distributed control of speech
93 resources. It refers to such a protocol as "SRCP", for Speech
94 Resource Control Protocol.

96 4.
SRCP Framework

98 The following is the SRCP framework for speech processing.

103                       +-------------+
104                       | Application |
105                       |   Server    |
106                       +-------------+
107          SIP or whatever /
108                         /
109         +------------+ /                       +--------+
110         |   Media    |/          SRCP          | ASR    |
111         | Processing |-------------------------| and/or |
112     RTP |   Entity   |           RTP           | TTS    |
113    =====|            |=========================| Server |
114         +------------+                         +--------+

116 The "Media Processing Entity" is a network element that processes
117 media. The "Application Server" is a network element that instructs
118 the Media Processing Entity on what transformations to make to the
119 media stream. The "ASR and/or TTS Server" is a network element that
120 either generates an RTP stream based on text input (TTS) or returns
121 speech recognition results in response to an RTP stream as input
122 (ASR). The Media Processing Entity controls the ASR or TTS Server
123 using SRCP as a control protocol.

125 Physical embodiments of the entities can reside in one physical
126 instance per entity, or some combination of entities. For example,
127 a VoiceXML [9] Gateway may combine the ASR and TTS functions on the
128 same platform as the Media Processing Entity. Note that VoiceXML
129 Gateways themselves are outside the scope of this protocol.

131 Likewise, one can combine the Application Server and Media
132 Processing Entity, as would be the case in an interactive voice
133 response (IVR) platform.

135 One can also decompose the Media Processing Entity into an entity
136 that controls media endpoints and entities that process media
137 directly. Such would be the case with a decomposed gateway using
138 MGCP or megaco. However, this decomposition is again orthogonal to
139 the scope of SRCP.

141 5. General Requirements

143 5.1.
Reuse Existing Protocols

145 To the extent feasible, the SRCP framework SHOULD use existing
146 protocols.

148 5.2. Maintain Existing Protocol Integrity

150 In meeting requirement 5.1, the SRCP framework MUST NOT redefine the
151 semantics of an existing protocol.

153 Said differently, we will not break existing protocols.

158 5.3. Avoid Duplicating Existing Protocols

160 To the extent feasible, SRCP SHOULD NOT duplicate the functionality
161 of existing protocols. For example, SIP with msuri [10] and RTSP
162 already define how to request playback of audio.

164 The focus of SRCP is new functionality not addressed by existing
165 protocols or extending existing protocols within the strictures of
166 requirement 5.2.

168 6. TTS Requirements

170 The SRCP framework MUST allow a Media Processing Entity, using a
171 control protocol, to request the TTS Server to play back text as
172 voice in an RTP stream.

174 The TTS Server MUST support the reading of plain text. For reading
175 plain text, the language and voicing are local matters.

177 The TTS Server SHOULD support the reading of SSML [11] text.

179 OPEN ISSUE: Should the TTS Server infer the text is SSML by
180 detecting a legal SSML document, or must the protocol tell the TTS
181 Server the document type?

183 The TTS Server MUST accept text over the SRCP connection for reading
184 over the RTP connection.

186 OPEN ISSUE: Should we allow the TTS Server to retrieve text on its
187 own? That is, have SRCP pass in a URI from which the TTS Server
188 retrieves the text.

190 OPEN ISSUE: Should we allow (or require) the TTS Server to use long-
191 lived control channels?

193 The TTS Server SHOULD support, and the SRCP framework MUST support
194 the specification of, "VCR Controls", such as skip forward, skip
195 backward, play faster, and play slower.
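NOTE (illustrative only): The "VCR Controls" named above can be pictured
as operations on a playback position and a playback rate. The sketch
below is not part of any protocol proposed here. The class name, field
names, millisecond units, and rate limits are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass
class TtsPlayback:
    """Hypothetical model of one TTS playback; not defined by this document."""

    duration_ms: int       # total length of the rendered prompt (assumed unit)
    position_ms: int = 0   # current playback position
    rate: float = 1.0      # 1.0 means normal speed

    def skip(self, delta_ms: int) -> int:
        """Skip forward (positive) or backward (negative), clamped to the prompt."""
        target = self.position_ms + delta_ms
        self.position_ms = max(0, min(self.duration_ms, target))
        return self.position_ms

    def change_rate(self, factor: float, lo: float = 0.5, hi: float = 2.0) -> float:
        """Play faster (factor > 1) or slower (factor < 1) within assumed limits."""
        if factor <= 0:
            raise ValueError("factor must be positive")
        self.rate = max(lo, min(hi, self.rate * factor))
        return self.rate
```

A control protocol would carry requests equivalent to skip() and
change_rate(); how those requests are encoded on the wire is exactly
what the requirement leaves to the protocol design.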
197 OPEN ISSUE: Should we allow for session parameters, like prosody and
198 voicing, as is specified for MRCP over RTSP [7]?

200 OPEN ISSUE: Should we allow for speech markers, as is specified for
201 MRCP over RTSP [7]?

203 7. ASR Requirements

205 The SRCP framework MUST allow a Media Processing Entity to request
206 the ASR Server to perform automatic speech recognition on an RTP
207 stream, returning the results over SRCP.

212 The ASR Server MUST support the XML specification for speech
213 recognition [12].

215 OPEN ISSUE: Should we allow the ASR Server to support alternative
216 grammar formats? If so, we need mechanisms to specify what format
217 the grammar is in, capability discovery, and handling unsupported
218 grammars.

220 OPEN ISSUE: Is there a need for all of the parameters specified for
221 MRCP over RTSP [7]? Most of them are part of the W3C speech
222 recognition grammar.

224 The ASR Server SHOULD support a method for capturing the input media
225 stream for later analysis and tuning of the ASR engine.
226 The ASR Server SHOULD support sharing grammars across sessions.
227 This supports applications with large grammars that are
228 unrealistic to load dynamically. An example is a city-country
229 grammar for a weather service.

231 8. Dual-Mode Requirements

233 User perception of the quality of an interactive speech-driven
234 system depends strongly on the ability of the user to interrupt a
235 prompt or rendered TTS with speech, so supporting such interruption
236 is an important requirement. Interrupting, or barging, the speech
237 output requires more than energy detection from the user's
238 direction. Many advanced systems halt the media towards the user by
239 employing the ASR engine to decide if an utterance is likely to be
240 real speech, as opposed to a cough, for example.
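NOTE (illustrative only): The barge-in decision described above can be
sketched as a two-stage check, with energy detection as a cheap first
gate and the ASR engine's judgment as the deciding factor. The names,
the decibel threshold, and the confidence threshold below are
assumptions for illustration and do not come from any existing
implementation.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical summary of detected input audio."""

    energy_db: float          # measured input energy (assumed unit and scale)
    speech_confidence: float  # ASR engine's estimate, 0.0 to 1.0, that this is speech


def should_halt_playback(
    utterance: Utterance,
    energy_threshold_db: float = -40.0,  # assumed value
    min_confidence: float = 0.6,         # assumed value
) -> bool:
    """Return True when playback towards the user should stop."""
    # Stage 1: energy detection alone only says that *something* was heard.
    if utterance.energy_db < energy_threshold_db:
        return False
    # Stage 2: the ASR engine decides whether it is likely real speech.
    return utterance.speech_confidence >= min_confidence
```

With these assumed thresholds, a loud cough (high energy, low speech
confidence) does not interrupt the prompt, while a spoken command does.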
242 To achieve low latency between utterance detection and halting of
243 playback, many implementations combine the speaking and ASR
244 functions. The SRCP framework MUST support such dual-mode
245 implementations.

247 9. Thoughts to Date (non-normative)

249 The protocol assumes RTP carriage of media. Assuming session-
250 oriented media transport, the protocol will use SDP to describe the
251 session.

253 The working group will not be investigating distributed speech
254 recognition (DSR), as exemplified by the ETSI Aurora project. The
255 working group will not be recreating functionality available in
256 other protocols, such as SIP or SDP.

258 TTS looks very much like playing back a file. Extending RTSP looks
259 promising for when one requires VCR controls or markers in the text
260 to be spoken. When one does not require VCR controls, SIP in a
261 framework such as Network Announcements [13] works directly without
262 modification.

267 ASR has an entirely different set of characteristics. For barge-in
268 support, ASR requires real-time return of intermediate results.
269 Barring the discovery of a good reuse model for an existing
270 protocol, this will most likely become the focus of SRCP.

272 10. Security Considerations

274 Protocols relating to speech processing must take security into
275 account. This is particularly important as popular uses for TTS
276 include reading financial information. Likewise, popular uses for
277 ASR include executing financial transactions and shopping.

279 We envision that rather than providing application-specific security
280 mechanisms in SRCP itself, the resulting protocol will employ
281 security machinery of either containing protocols or the transport
282 on which it runs.
For example, we will consider solutions such as
283 using TLS for securing the control channel, and SRTP for securing
284 the media channel.

286 11. References

288 1 Bradner, S., "The Internet Standards Process -- Revision 3", BCP
289 9, RFC 2026, October 1996.

291 2 Bradner, S., "Key words for use in RFCs to Indicate Requirement
292 Levels", BCP 14, RFC 2119, March 1997.

294 3 Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J.,
295 "SIP: Session Initiation Protocol", RFC 2543, March 1999.

297 4 Arango, M., Dugan, A., Elliott, I., Huitema, C., and Pickett, S.,
298 "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 2705,
299 October 1999.

301 5 Cuervo, F., Greene, N., Rayhan, A., Huitema, C., Rosen, B., and
302 Segers, J., "Megaco Protocol Version 1.0", RFC 3015, November 2000.

304 6 Schulzrinne, H., Rao, A., and Lanphier, R., "Real Time Streaming
305 Protocol (RTSP)", RFC 2326, April 1998.

307 7 Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource
308 Control Protocol", draft-shanmugham-mrcp-01.txt, November 2001,
309 work in progress.

311 8 Robinson, F., Marquette, B., and R. Hernandez, "Using Media
312 Resource Control Protocol with SIP", draft-robinson-mrcp-sip-
313 00.txt, September 2001, work in progress.

318 9 World Wide Web Consortium, "Voice Extensible Markup Language
319 (VoiceXML) Version 2.0", W3C Working Draft,
320 ,
321 October 2001, work in progress.

323 10 Van Dyke, J.
and Burger, E., "SIP URI Conventions for Media
324 Servers", draft-burger-sipping-msuri-01, July 2001, work in
325 progress (expired).

327 11 World Wide Web Consortium, "Speech Synthesis Markup Language
328 Specification for the Speech Interface Framework", W3C Working
329 Draft, , January 2001,
330 work in progress.

332 12 World Wide Web Consortium, "Speech Recognition Grammar
333 Specification for the W3C Speech Interface Framework", W3C
334 Working Draft, , August
335 2001, work in progress.

337 13 O'Connor, W., Burger, E., "Network Announcements with SIP",
338 draft-ietf-sipping-netann-01.txt, November 2001, work in progress.

340 12. Acknowledgments

342 Brian Eberman came up with the new name. It is catchy and describes
343 what we are working on.

345 OPEN ISSUE: Choose a name!

347 13. Authors' Addresses

349 Eric W. Burger
350 SnowShore Networks, Inc.
351 Chelmsford, MA
352 USA
353 Email: eburger@snowshore.com

355 David R. Oran
356 Cisco Systems, Inc.
357 Acton, MA
358 USA
359 Email: oran@cisco.com

364 Full Copyright Statement

366 Copyright (C) The Internet Society (2002). All Rights Reserved.

368 This document and translations of it may be copied and furnished to
369 others, and derivative works that comment on or otherwise explain it
370 or assist in its implementation may be prepared, copied, published
371 and distributed, in whole or in part, without restriction of any
372 kind, provided that the above copyright notice and this paragraph are
373 included on all such copies and derivative works.
However, this
374 document itself may not be modified in any way, such as by removing
375 the copyright notice or references to the Internet Society or other
376 Internet organizations, except as needed for the purpose of
377 developing Internet standards in which case the procedures for
378 copyrights defined in the Internet Standards process must be
379 followed, or as required to translate it into languages other than
380 English.

382 The limited permissions granted above are perpetual and will not be
383 revoked by the Internet Society or its successors or assigns. This
384 document and the information contained herein is provided on an "AS
385 IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
386 FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
387 LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
388 NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
389 OR FITNESS FOR A PARTICULAR PURPOSE.

391 Acknowledgement

393 The Internet Society currently provides funding for the RFC Editor
394 function.