Last Modified: 06/28/2002
The speechsc Work Group will develop protocols to support distributed media processing of audio streams. The focus of this working group is to develop protocols to support automatic speech recognition (ASR), text-to-speech (TTS), and speaker verification (SV) servers. The working group will focus only on the secure distributed control of these servers.
The working group will develop an informational RFC detailing the architecture and requirements for distributed speechsc control. In addition, the requirements document will describe the use cases driving these requirements. The working group will then examine existing media-related protocols, especially RTSP, for suitability as a protocol for carriage of speechsc server control. The working group will then propose extensions to existing protocols or the development of new protocols, as appropriate, to meet the requirements specified in the informational RFC.
The protocol will assume RTP carriage of media. Assuming session-oriented media transport, the protocol will use SDP to describe the session.
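As a rough illustration of the assumption above, the following sketch builds a minimal SDP body describing a single RTP audio stream and extracts the media line from it. The addresses, port, session name, and the helper names `parse_sdp` and `audio_media` are hypothetical and chosen only for this example; nothing here is taken from a speechsc specification.

```python
# Hypothetical SDP body for one RTP audio stream.
# All addresses, ports, and names are illustrative assumptions.
SDP_EXAMPLE = "\r\n".join([
    "v=0",
    "o=client 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=speech session (illustrative)",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",   # audio carried over RTP, payload type 0
    "a=rtpmap:0 PCMU/8000",      # payload type 0 is PCMU at 8 kHz
]) + "\r\n"


def parse_sdp(text):
    """Collect SDP 'type=value' lines into a dict of lists keyed by type letter."""
    fields = {}
    for line in text.splitlines():
        if not line:
            continue
        key, _, value = line.partition("=")
        fields.setdefault(key, []).append(value)
    return fields


def audio_media(fields):
    """Return (port, transport, payload_types) for the first audio m= line, or None."""
    for media in fields.get("m", []):
        parts = media.split()
        if parts[0] == "audio":
            return int(parts[1]), parts[2], parts[3:]
    return None


fields = parse_sdp(SDP_EXAMPLE)
print(audio_media(fields))
```

A control protocol of the kind the charter describes would carry or reference such a description while leaving the audio itself to RTP.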
The working group will not be investigating distributed speech recognition (DSR), as exemplified by the ETSI Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP. The working group will offer changes to existing protocols, with the possible exception of RTSP, to the appropriate IETF work group for consideration. This working group will explore modifications to RTSP, if required.
It is expected that we will coordinate our work in the IETF with the W3C Multimodal Interaction Work Group; the ITU-T Study Group 16 Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI Aurora STQ.
Once the current set of milestones is completed, the speechsc charter may be expanded, with IESG approval, to cover additional uses of the technology, such as the orchestration of multiple ASR/TTS/SV servers, the accommodation of additional types of servers such as simultaneous translation servers, etc.
|JUL 02||Requirements ID submitted to IESG for publication (informational)|
|DEC 02||Submit Internet Draft(s) Analyzing Existing Protocols (informational)|
|DEC 02||Submit Internet Draft Describing New Protocol (if required) (standards track)|
|MAR 03||Submit Drafts to IESG for publication|
Speech Services Control WG (speechsc)
Thursday, November 21 at 1530-1730
===================================

CHAIRS: Eric Burger <email@example.com>
        D. Oran <firstname.lastname@example.org>

SCRIBES: Mary Barnes
         Joerg Ott

AGENDA:
  Agenda Bashing
  Requirements Document Status (chairs)
  Protocol Analysis Document Open Issues
  Next Steps, Design Team Formation

Agenda Bashing
==============

Requirements Document Status (chairs)
=====================================

Requirements Document:
o Have submitted original requirements doc, under IESG review (not
  currently reflected by draft tracker).
o two references #5.

Protocol evaluation
o Considerable work needs to be done

Protocol Analysis Document Open Issues:
=======================================

Beep (Jerry Carter)
-------------------
o Jerry provided an overview of basic BEEP functionality. Intended to
  provide a common framework in which you develop other protocols.
o To use BEEP:
  - Would need to define security.
  - And specific messages that would need to be supported.
  - Open Issues:
    - Resource acquisition
    - Extensions for request grouping
    - Extensions for service location and load balancing
    - Mapping session and RTP channels
    - Mechanisms for grammar naming and storage
    - Mechanism for storing and retrieving input
    - State preservation for multiple utterance SI/SV
    - Extensions for duplexing and parallel operations
    - How would SpeechSC integrate with BEEP security?
  - High level concerns:
    . Not a lot of knowledge about spec.
    . BEEP appears to be efficient, but not sure how effective it
      might be for this.

Discussion:
Carl: RTP?
Eric: this is control protocol
Jerry: the reference to beep was the logical mapping of channels.

SIP (Rajiv Dharmadhikari)
-------------------------
o Suggests that SIP is already being used for sessions, so can also
  be used for controlling resources.
o SIP for MRCP had been previously proposed.
o Would like feedback on requirements. Proposing the use of URIs.
o Work needing to be done:
o Summary.
o Grammar sharing
o State for multiple utterances

Discussion:
o Radika: SIP for mid session control, but SIP is not for media
  control.
o How to do grammar sharing?
o How to preserve state across multiple training sessions?
o SIP not intended for mid-session control; had you considered using
  SIP for establishing the control channel and then another mechanism
  for the actual control?

RTSP (Brian Wyld) [won't be in Atlanta] (discussed by Dave Oran)
----------------------------------------------------------------
High-order bits
o RTSP is excellent semantic match for the problem domain
o Fundamental problem domain is control of media servers
o Many constructs that can be used directly (e.g. Play method)
o State machines either match or would be very similar.
o RTSP is proven and deployed.
o No "magic-bullet" for barge-in control problem. No concept of
  asynchronous notification
o May need new development, no matter what protocol.

Things needed over base RTSP
o Ways to express speechsc-specific constructs
  o Grammars
  o Session parms
  o Input capture
o Methods for recognition and SI/SV needed
o Record not a perfect semantic match
o Method of doing barge-in control

Discussion:
o Joerg: Record may not stay that long in RTSP; removed this morning.
o Jerry: Would any of the extensions violate the core aspects of RTSP.
o Dave: it's a judgment call; there may be issues with tunneling.
o Sarvi: it (tunneling) was ugly, but it worked. Didn't have to deal
  with extending RTSP. May be some roadblocks when dealing with RTSP
  itself.

MRCP (Sarvi Shanmugham)
-----------------------
Overview of MRCP:
o MRCP was designed with the specific goal of being extensible in the
  future to address SI/SV.
o Depends upon RTSP or SIP for setting up the media session. Chose to
  tunnel over RTSP rather than extending either.
o Core methods, headers are independent of being tunneled over RTSP.
o Ugly, but works
o It already supports parallel usage.
o Multiple interoperable implementations.
Issues:
o MRCP section of the compliance document needs to be updated to make
  the evaluation consistent with other sections.
o Need to create or adopt a session level protocol capable of creating
  a control pipe and a media pipe, and then extend it with MRCP
  messages (to remove need for tunneling).
o Need to add support for global or shared grammars to MRCP
o Need to add MRCP resource extensions for SI/SV.
o Consider how resources like the recognizer can be modularized and
  chained.
o MRCP doesn't handle "sub-modules"

Discussion:
o Radika: SIP and RTSP would be good. Do you want one or do you want a
  new protocol?
o Sarvi: MRCP by itself addresses the core of recognition commands.
  Looking for a protocol to establish a control session between client
  and server and be able to negotiate media pipe.
o Radika: setting up of session, if either (SIP or RTSP) is suitable,
  suggest to allow both.
o Sarvi: One could take RTSP or SIP as baseline and work from there to
  define the necessary extensions

Web Services (Stephane Maes)
----------------------------
Overview:
o Looks at the problem from a higher level.
o Speech engines can be considered web services programmed by SOAP,
  WSDL (built on top of SOAP), WSFL and discovered via UDDI.
o SOAP is bound to underlying protocol (HTTP, TCP, SIP, BEEP...) (per
  existing proposals)
o Audio sub-systems and speech engines defined by WSDL interfaces:
  o Web services programmed with WSDL
  o Combined/composed with WSFL
  o Discovered by UDDI
  o Additional events and messages via SOAP and a la WSXL
    (coordination among web services)
o Security can be provided by ws-security.

Conceptual view
o various engines are independent components
o accessible through aforementioned interface
o Each needs to be associated with parallel streams of audio, etc.

Issues:
o Web services don't have syntax and semantics for speech control;
  however, the framework was designed to control any component.
o There is no syntax and semantics associated to the control of speech
  engines (can be inspired from MRCP or other speech APIs)
o The framework can be bound to numerous transports
o Additional features are available today through tools and middleware
  offering rather than standard specs.
o The evaluation assumes these characteristics are exploited:
  o IF no change is required and only syntax and semantics must be
    defined, then it's a T. The evaluation can be done such that most
    are Ts, with P+s being satisfied.
o web services: generic framework and extensible
o no syntax and semantics predefined
o can be taken from existing syntax (e.g. MRCP)
o works with multiple transport protocols
o additional tools available through tools and middleware
o nothing needs to be changed in web services
o can satisfy all the requirements defined above

Finalization of web services involves:
o Integration of the web service framework
o Specification of syntax and semantics
o Optional selection of recommended transport product.

Discussion:
o Dave O: the architectural model shows the media being carried in
  separate RTP channels. Is there any support in web services for
  setting up and tearing down sessions. Or would this have to be
  recast in web services framework?
o Stephane: it would have to be recast inside the web services model.
o Dave O: glaring missing piece for asynchronous notification for
  barge control.
o Stephane: XML event exchange on SOAP gives you interaction
  (asynchronous), but may still have the problem with delay, race
  conditions.
o ? on RTP. How do you synchronize event timing?
o Stephane: it's the same answer. The engines are characterized by
  sink/source ports and this work would need to be done in the IETF.
o Has Web service been used at the protocol control level? Is latency
  a concern?
o Stephane: web services for controlling is being done by Parlay.
  There will likely be the same race conditions as other protocols for
  the asynchronous notification for barge control.
o Radika: Basic problem, setup and negotiation of session and this
  proposal still doesn't address how you could do that?
o Dave O: only a small subset of the requirements deal with session
  setup. The majority of the requirements are about command and
  control within a session.
o Stephane: nothing prevents using SIP for negotiation and then using
  SOAP for command and control. Definitely won't use web services for
  negotiation.
o Joerg: You've explained the framework, but there would be a lot to
  be done to get the necessary functionality. If one was developing a
  protocol from scratch, what would be the difference?
o Stephane: syntax and semantic of the programming of the engine (API)
  would be the work.
o Joerg: Is this approach a bug or a feature?

Any further questions?
o Radika: We have the solutions for everything, we just need a control
  protocol.
o Karl: About 5 years ago, did IP TV and used RTSP for control. Lots
  of issues with interacting with RTP and users. Can't start anywhere
  in an RTP stream. Many receivers need to the message to get a
  timestamp. How to deal with starts and stops? Users are used to
  mechanical things; there appear to be lots of user concerns.
o Dave O: Requirements is in the hands of the IESG; if there are
  additional comments, please provide.
o Markus: RTP synchronization; RTSP synchronization needs some work to
  get it to work properly.
o Dave O: in the requirements, there is a specific synchronization
  requirement (must almost instantaneously stop). IF there are other
  syncs needed, these need to be provided.
o SIP is intended to initiate sessions. It's not a good control
  protocol or a good transport protocol. Could perhaps use to set up
  audio in parallel.
o Sarvi: not talking about tunneling MRCP over RTSP or SIP. SIP is
  good at setting up and modifying a session (RTP pipes and
  negotiating params). If SIP were extended to solve recognition and
  TTS problems, it would complement SIP.
o SIP is fine for session associations.
  You don't need to add MRCP stuff to SIP.
o Sarvi: if we were to take MRCP and make it a protocol, need a way to
  set up a session (pipe) and negotiate a media stream. Believes MRCP
  is complementary to SIP.
o ?: SOAP sessions over SIP (Ubiquity draft).
o Peter: Importance of fast response to any user interaction
o Eric: SIP is good at setting up a session, RTSP is okay at doing
  that. It's all the commands that come after setting up the session
  that is really the issue.
o IF you need to negotiate media parms, RTSP is not good for that.

AD left to go get a projector, which had died.

Next Steps, Design Team Formation
==================================
o Need consistent analysis framework: T is right now.
o Date for completion

Discussion:
o Stephane: agrees in principle.
o Dave O: have combined 2 approaches under one umbrella.
  o You can evaluate as an existing protocol which can modify or
    extend.
    . (MRCP, RTSP, SIP)
  o Framework of choice for new protocol.
    . Web services & BEEP.
  Proposes that criteria for framework would have to be different.
  Document was not intended to cast something in stone and give
  guidance to protocol development. Evaluation helps them to look at
  tradeoffs for various starting points.
o Jerry: It is true that any protocol can be made to appear to work.
  Something like web services depends upon what messages you create.
o UDP could do anything. Same applies to BEEP or web services.
o How much needs to actually be done.
o Sarvi: In the context of framework, MRCP analysis - SIP also falls
  under framework category.
o Rajeev: Job to get right Ts, Fs, Ps; what are the things that we
  don't want done?
o Stephane: protocol vs. framework is a good one. What's the outcome
  of the evaluation. Some overlap where things are complementary.
o Dave: keep in mind end goal.
o Karl: one aspect of protocol choice; it will be implemented
o .....
o Discussion around doing further work to detail the syntax and
  protocol impacts. There is general support, but it's a lot of work.
  . Mary Barnes: MIDCOM experience
    o counting P+ and other marks
    o people arguing about which marks to get
    o not much consistency in the end
    o numbers could not really be compared
  . Really clear usage scenario(s) needed. Identifies these so that we
    have a target.
  . Worthwhile thing to do this. Upside: educates the community.
    Downside: more difficult to step back for those involved. Four out
    of five analyses will be thrown away.
  . C: Good idea. Helps to identify how much work is needed.
  . C: In principle, a good idea. But this is much work. Shouldn't
    this be part of the design process.
  . C: May educate, but may not help the evaluation. Will cause more
    time to be spent.
o Scott B: motherhood stuff is okay, but it seems like the idea of
  tossing it to design team isn't a win. Need a little bit more
  evaluation.
o Jerry: 2 tiers of problems: 1. existing protocol 2. new protocol
  (i.e. with framework). The framework ones are more work.
o Dave O: Framework vs existing protocol issue; with protocol,
  wouldn't you end up with a sub-optimal solution.
o Scott: Bob Braden; the strength of the internet, we didn't optimize
  things, but made them flexible.
o Dave O: the intent wasn't to focus on efficiency, we want to
  maximize flexibility.
o Scott: in the SIP world; a reasonable argument can't be made for
  ability to extend.
o Rajeev: vendor perspective; we have to be pragmatic; timing is
  critical and skill set is different
o Stephane: rapidly available should be a priority.
o Michael: agrees with flexibility of web services model; is there a
  liaison to that community?
o Dave O: have already established liaisons.
o Scott: Friendly reception from base control could be an issue.
  Footprint might be an issue.
o Stephane: some of these are captured in requirements. Web services
  doesn't necessarily take a big footprint. If you follow one of the
  approaches that is carried by the industry and there is significant
  work, you may have a better chance to address all those issues.
o If you look at it practically, most of the industry from which is
  being borrowed (media servers and speech recognition vendors). Will
  these vendors be ready to do web services? There are metrics
  available today.
o ....
o Stephane: lots of misconceptions about web services.
o Sarvi: requirements about server to server communication? VXML
  browser running on a gateway or a small PDA or a phone and you
  access to a TTS-type resource, thus it's not server-server, but
  rather client-server. Terminals, as well, need to be considered.
o Stephane: didn't see this restricted to server-server.
o In the end, protocol needs to be lightweight.

Going forward:
o Does group feel it can start with a protocol while finishing
  evaluation?
o One team or more than one team?

Discussion:
o Stephane: believes this could be done. Concern over evaluation.
o Scott B: One team only!

Conclusion: General hum vote support for this.

Revisit milestones:
===================
Completing document:
o Task 1: Recast the document into 2 categories (framework vs.
  protocol)
o Task 2: align ratings.
o Task 3: consistency due to multiple authors.
o Task 4: make sure each protocol section has a summary that makes
  clear the strong points and weak points.

After that (2-3 weeks), WG chairs will decide if it's ready for design
team.

Discussion:
o Scott: don't necessarily need to publish as an RFC, but could be
  useful input to design team (per experience with MIDCOM protocol
  evaluation document).
o Scott: output from a design team has no more weight than anyone
  else's opinion.
o Dave: why is doc still not in waiting state?
o Scott: still watching (and not yet looking)
o Dave: propose "gazing" state.

Conclusion: Hum vote in support of path going forward.