Network Working Group                                          E. Burger
Internet Draft                                  SnowShore Networks, Inc.
Document: draft-burger-speechsc-reqts-00.txt                     D. Oran
Category: Informational                              Cisco Systems, Inc.
Expires August 2002                                        June 13, 2002

   Requirements for Distributed Control of ASR, SV and TTS Resources

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts. Internet-Drafts are draft documents valid for a
   maximum of six months and may be updated, replaced, or obsoleted by
   other documents at any time. It is inappropriate to use
   Internet-Drafts as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document outlines the needs and requirements for a protocol to
   control distributed speech processing of audio streams. By speech
   processing, this document specifically means automatic speech
   recognition, speaker verification and text-to-speech. Other IETF
   protocols, such as SIP and RTSP, address rendezvous and control for
   generalized media streams. However, speech processing presents
   additional requirements that none of the extant IETF protocols
   address.

   Discussion of this and related documents is on the MRCP list. To
   subscribe, send the message "subscribe mrcp" to
   majordomo@snowshore.com. The public archive is at
   http://flyingfox.snowshore.com/mrcp_archive/maillist.html.

   NOTE: This mailing list will be superseded by an official working
   group mailing list, cats@ietf.org, once the WG is formally
   chartered.

Distributed Media Control Requirements                     February 2002

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119 [2].

   FORMATTING NOTE: Notes, such as this one, provide additional,
   nonessential information that the reader may skip without missing
   anything essential. The primary purpose of these non-essential
   notes is to convey information about the rationale of this document,
   or to place this document in the proper historical or evolutionary
   context. Readers whose sole purpose is to construct a conformant
   implementation may skip such information. However, it may be of use
   to those who wish to understand why we made certain design choices.

   OPEN ISSUES: This document highlights questions that are, as yet,
   undecided as "OPEN ISSUES".

3. Introduction

   There are multiple IETF protocols for establishment and termination
   of media sessions (SIP [3]), low-level media control (MGCP [4] and
   MEGACO [5]), and media record and playback (RTSP [6]). This document
   focuses on requirements for one or more protocols to support the
   control of network elements that perform automatic speech
   recognition (ASR), speaker verification (SV), and text-to-speech
   (TTS), the rendering of text into audio. Many multimedia
   applications can benefit from having ASR and TTS processing
   available as a distributed network resource. This requirements
   document limits its focus to the distributed control of ASR, SV and
   TTS servers.

   To date, there are a number of proprietary ASR and TTS APIs, as
   well as two IETF drafts that address this problem [7] [8]. However,
   the existing drafts have serious deficiencies. In particular, they
   mix the semantics of existing protocols yet are close enough to
   other protocols as to be confusing to the implementer.

   This document sets forth requirements for protocols to support
   distributed speech processing of audio streams.

   For simplicity, and to remove confusion with existing protocol
   proposals, this document presents the requirements as being for a
   "new protocol" that addresses the distributed control of speech
   resources. It refers to such a protocol as "SRCP", for Speech
   Resource Control Protocol.

4. SRCP Framework

   The following is the SRCP framework for speech processing.

                          +-------------+
                          | Application |
                          |   Server    |
                          +-------------+
              SIP or whatever /
                             /
             +------------+ /                       +--------+
             |   Media    |/         SRCP           | ASR    |
             | Processing |-------------------------| and/or |
         RTP |   Entity   |           RTP           | TTS    |
        =====|            |=========================| Server |
             +------------+                         +--------+

   The "Media Processing Entity" is a network element that processes
   media. The "Application Server" is a network element that instructs
   the Media Processing Entity on what transformations to make to the
   media stream. The "ASR and/or TTS Server" is a network element that
   either generates an RTP stream based on text input (TTS) or returns
   speech recognition results in response to an RTP stream as input
   (ASR). The Media Processing Entity controls the ASR or TTS Server
   using SRCP as a control protocol.

   Physical embodiments of the entities can reside in one physical
   instance per entity, or some combination of entities. For example,
   a VoiceXML [9] Gateway may combine the ASR and TTS functions on the
   same platform as the Media Processing Entity. Note that VoiceXML
   Gateways themselves are outside the scope of this protocol.

   Likewise, one can combine the Application Server and Media
   Processing Entity, as would be the case in an interactive voice
   response (IVR) platform.

   One can also decompose the Media Processing Entity into an entity
   that controls media endpoints and entities that process media
   directly.
   Such would be the case with a decomposed gateway using MGCP or
   MEGACO. However, this decomposition is again orthogonal to the
   scope of SRCP.

5. General Requirements

5.1. Reuse Existing Protocols

   To the extent feasible, the SRCP framework SHOULD use existing
   protocols.

5.2. Maintain Existing Protocol Integrity

   In meeting requirement 5.1, the SRCP framework MUST NOT redefine the
   semantics of an existing protocol.

   Said differently, we will not break existing protocols or cause
   backward compatibility problems.

5.3. Avoid Duplicating Existing Protocols

   To the extent feasible, SRCP SHOULD NOT duplicate the functionality
   of existing protocols. For example, SIP with msuri [10] and RTSP
   already define how to request playback of audio.

   The focus of SRCP is new functionality not addressed by existing
   protocols or extending existing protocols within the strictures of
   requirement 5.2.

5.4. Explicit Invocation of Services

   The SRCP framework MUST be compliant with the IAB OPES [11]
   framework. The applicability of the SRCP protocol will therefore be
   specified as occurring between clients and servers, at least one of
   which is operating directly on behalf of the user requesting the
   service.

5.5. Server Location and Load Balancing

   To the extent feasible, the SRCP framework SHOULD exploit existing
   schemes for performing service location and load balancing, such as
   the Service Location Protocol [12] or DNS SRV records [13]. Where
   such facilities are not deemed adequate, the SRCP framework MAY
   define additional load balancing techniques.

6. TTS Requirements

   The SRCP framework MUST allow a Media Processing Entity, using a
   control protocol, to request the TTS Server to play back text as
   voice in an RTP stream.

   The TTS Server MUST support the reading of plain text.
   For reading plain text, the language and voicing are a local matter.

   The TTS Server SHOULD support the reading of SSML [14] text.

   OPEN ISSUE: Should the TTS Server infer the text is SSML by
   detecting a legal SSML document, or must the protocol tell the TTS
   Server the document type?

   The TTS Server MUST accept text over the SRCP connection for reading
   over the RTP connection. The server MUST accept text either "by
   value" (embedded in the protocol), or "by reference" (by
   de-referencing a URI embedded in the protocol).

   OPEN ISSUE: Should we allow (or require) the TTS Server to use
   long-lived control channels?

   The TTS Server SHOULD support, and the SRCP framework MUST support
   the specification of, "VCR Controls", such as skip forward, skip
   backward, play faster, and play slower.

   OPEN ISSUE: Should we allow for session parameters, like prosody and
   voicing, as is specified for MRCP over RTSP [7]?

   OPEN ISSUE: Should we allow for speech markers, as is specified for
   MRCP over RTSP [7]?

7. ASR Requirements

   The SRCP framework MUST allow a Media Processing Entity to request
   the ASR Server to perform automatic speech recognition on an RTP
   stream, returning the results over SRCP.

   The ASR Server MUST support the XML specification for speech
   recognition [15].

   The ASR Server MUST accept grammar specifications either "by value"
   (embedded in the protocol), or "by reference" (by de-referencing a
   URI embedded in the protocol).

   OPEN ISSUE: Should we allow the ASR Server to support alternative
   grammar formats? If so, we need mechanisms to specify what format
   the grammar is in, capability discovery, and handling unsupported
   grammars.

   OPEN ISSUE: Is there a need for all of the parameters specified for
   MRCP over RTSP [7]? Most of them are part of the W3C speech
   recognition grammar.
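
   NOTE: As a rough illustration of the "by value" and "by reference"
   requirement stated above for both TTS text and ASR grammars, the
   following Python sketch models the two forms and a server-side step
   that resolves either one to the document itself. This document does
   not define any SRCP message syntax, so everything here is an
   assumption made for illustration: the names ByValue, ByReference,
   resolve and fake_fetch, the example.com URI, and the in-memory
   store that stands in for dereferencing a real URI.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class ByValue:
    """Hypothetical: text or grammar carried inline in the control message."""
    body: str


@dataclass(frozen=True)
class ByReference:
    """Hypothetical: text or grammar named by a URI the server dereferences."""
    uri: str


Content = Union[ByValue, ByReference]


def resolve(content: Content, fetch: Callable[[str], str]) -> str:
    """Return the document, calling fetch only for by-reference content."""
    if isinstance(content, ByValue):
        return content.body
    if isinstance(content, ByReference):
        return fetch(content.uri)
    raise TypeError(f"unsupported content type: {type(content).__name__}")


# Hypothetical in-memory store standing in for a real URI fetch.
_STORE = {"http://example.com/cities.grxml": "<grammar>...</grammar>"}


def fake_fetch(uri: str) -> str:
    """Look the URI up in the local store instead of using the network."""
    return _STORE[uri]


if __name__ == "__main__":
    print(resolve(ByValue("Hello, caller."), fake_fetch))
    print(resolve(ByReference("http://example.com/cities.grxml"), fake_fetch))
```

   Passing the fetch function in as a parameter keeps the choice of
   URI scheme and transport outside the sketch, which matches the
   requirement that SRCP reuse existing protocols where feasible.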

   The ASR Server SHOULD support a method for capturing the input media
   stream for later analysis and tuning of the ASR engine.

   The ASR Server SHOULD support sharing grammars across sessions.
   This supports applications with large grammars that are unrealistic
   to load dynamically. An example is a city-country grammar for a
   weather service.

8. Speaker Verification Requirements

   The SRCP framework MUST allow a Media Processing Entity to request
   the SV Server to perform speaker verification on an RTP stream,
   returning the results over SRCP.

   The SV Server MUST accept grammar specifications either "by value"
   (embedded in the protocol), or "by reference" (by de-referencing a
   URI embedded in the protocol).

   The SRCP framework MUST accommodate an identifier for each
   verification resource and permit control of that resource by ID,
   because voiceprint format and contents are vendor specific.

   The SRCP framework MUST work with SV servers that maintain state to
   handle multi-utterance verification.

   The SV Server SHOULD support a method for capturing the input media
   stream for later analysis and tuning of the SV engine.

9. Dual-Mode Requirements

   One very important requirement for an interactive speech-driven
   system is that user perception of the quality of the interaction
   depends strongly on the ability of the user to interrupt a prompt or
   rendered TTS with speech. Interrupting, or barging, the speech
   output requires more than energy detection from the user's
   direction. Many advanced systems halt the media towards the user by
   employing the ASR engine to decide if an utterance is likely to be
   real speech, as opposed to a cough, for example.

   To achieve low latency between utterance detection and halting of
   playback, many implementations combine the speaking and ASR
   functions.
   The SRCP framework MUST support such dual-mode implementations.

10. Thoughts to Date (non-normative)

   The protocol assumes RTP carriage of media. Assuming
   session-oriented media transport, the protocol will use SDP to
   describe the session.

   The working group will not be investigating distributed speech
   recognition (DSR), as exemplified by the ETSI Aurora project. The
   working group will not be recreating functionality available in
   other protocols, such as SIP or SDP.

   TTS looks very much like playing back a file. Extending RTSP looks
   promising for when one requires VCR controls or markers in the text
   to be spoken. When one does not require VCR controls, SIP in a
   framework such as Network Announcements [16] works directly without
   modification.

   ASR has an entirely different set of characteristics. For barge-in
   support, ASR requires real-time return of intermediate results.
   Barring the discovery of a good reuse model for an existing
   protocol, this will most likely become the focus of SRCP.

11. Security Considerations

   Protocols relating to speech processing must take security into
   account. This is particularly important as popular uses for TTS
   include reading financial information. Likewise, popular uses for
   ASR include executing financial transactions and shopping.

   We envision that rather than providing application-specific security
   mechanisms in SRCP itself, the resulting protocol will employ
   security machinery of either containing protocols or the transport
   on which it runs. For example, we will consider solutions such as
   using TLS for securing the control channel, and SRTP for securing
   the media channel.

12. References

   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
      9, RFC 2026, October 1996.

   2  Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", BCP 14, RFC 2119, March 1997.

   3  Handley, M., Schulzrinne, H., Schooler, E., and Rosenberg, J.,
      "SIP: Session Initiation Protocol", RFC 2543, March 1999.

   4  Arango, M., Dugan, A., Elliott, I., Huitema, C., and Pickett, S.,
      "Media Gateway Control Protocol (MGCP) Version 1.0", RFC 2705,
      October 1999.

   5  Cuervo, F., Greene, N., Rayhan, A., Huitema, C., Rosen, B., and
      Segers, J., "Megaco Protocol Version 1.0", RFC 3015, November
      2000.

   6  Schulzrinne, H., Rao, A., and Lanphier, R., "Real Time Streaming
      Protocol (RTSP)", RFC 2326, April 1998.

   7  Shanmugham, S., Monaco, P., and B. Eberman, "MRCP: Media Resource
      Control Protocol", draft-shanmugham-mrcp-01.txt, November 2001,
      work in progress.

   8  Robinson, F., Marquette, B., and R. Hernandez, "Using Media
      Resource Control Protocol with SIP",
      draft-robinson-mrcp-sip-00.txt, September 2001, work in progress.

   9  World Wide Web Consortium, "Voice Extensible Markup Language
      (VoiceXML) Version 2.0", W3C Working Draft, , October 2001, work
      in progress.

   10 Van Dyke, J. and Burger, E., "SIP URI Conventions for Media
      Servers", draft-burger-sipping-msuri-01, July 2001, work in
      progress (expired).

   11 Floyd, S., Daigle, L., "IAB Architectural and Policy
      Considerations for Open Pluggable Edge Services", RFC 3238,
      January 2002.

   12 Guttman, E., Perkins, C., Veizades, J., Day, M., "Service
      Location Protocol, Version 2", RFC 2608, June 1999.

   13 Gulbrandsen, A., Vixie, P., Esibov, L., "A DNS RR for specifying
      the location of services (DNS SRV)", RFC 2782, February 2000.

   14 World Wide Web Consortium, "Speech Synthesis Markup Language
      Specification for the Speech Interface Framework", W3C Working
      Draft, , January 2001, work in progress.

   15 World Wide Web Consortium, "Speech Recognition Grammar
      Specification for the W3C Speech Interface Framework", W3C
      Working Draft, , August 2001, work in progress.

   16 O'Connor, W., Burger, E., "Network Announcements with SIP",
      draft-ietf-sipping-netann-01.txt, November 2001, work in
      progress.

13. Acknowledgments

   Brian Eberman came up with the new name. It is catchy and describes
   what we are working on.

14. Authors' Addresses

   Eric W. Burger
   SnowShore Networks, Inc.
   Chelmsford, MA
   USA
   Email: eburger@snowshore.com

   David R. Oran
   Cisco Systems, Inc.
   Acton, MA
   USA
   Email: oran@cisco.com

15. Change Log

   From version draft-burger-mrcp-reqts-00 to version
   draft-burger-speechsc-reqts-00:
   - Draft name changed per area director advice.
   - Added speaker verification to the areas addressed, including
     speaker verification requirements, per Dan Burnet's
     presentation at the Minneapolis BoF (see minutes).
   - Based on mailing list discussion, added a requirement to handle
     both "by value" and "by reference" data. This applies both to
     TTS text to be played out and to grammar(s) to be applied to
     ASR.
   - Based on discussion at the BoF in Minneapolis, added a
     requirement concerning the use of load balancing schemes,
     including those based on SRVLOC, SRV.
   - Added a requirement for OPES compliance, per a discussion
     with Sally Floyd as IAB observer for the BoF.

Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns. This
   document and the information contained herein is provided on an "AS
   IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
   FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
   LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL
   NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
   OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   The Internet Society currently provides funding for the RFC Editor
   function.