
2.8.20 Speech Services Control (speechsc)

NOTE: This charter is a snapshot of the 55th IETF Meeting in Atlanta, Georgia USA. It may now be out-of-date.

Last Modified: 06/28/2002

Chair(s):
David Oran
Eric Burger
Transport Area Director(s):
Scott Bradner
A. Mankin
Transport Area Advisor:
Scott Bradner
Mailing Lists:
General Discussion:
To Subscribe:
In Body: subscribe
Description of Working Group:
Many multimedia applications can benefit from having Automated Speech Recognition (ASR), Text to Speech (TTS), and Speaker Verification (SV) processing available as a distributed, network resource. To date, there are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF drafts, that address this problem. However, there are serious deficiencies in the existing drafts relating to this problem. In particular, they mix the semantics of existing protocols yet are close enough to other protocols as to be confusing to the implementer.

The speechsc Work Group will develop protocols to support distributed media processing of audio streams. The focus of this working group is to develop protocols to support ASR, TTS, and SV. The working group will only focus on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the architecture and requirements for distributed speechsc control. In addition, the requirements document will describe the use cases driving these requirements. The working group will then examine existing media-related protocols, especially RTSP, for suitability as a protocol for carriage of speechsc server control. The working group will then propose extensions to existing protocols or the development of new protocols, as appropriate, to meet the requirements specified in the informational RFC.

The protocol will assume RTP carriage of media. Assuming session-oriented media transport, the protocol will use SDP to describe the session.
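
As a rough illustration of what "use SDP to describe the session" could look like for an RTP audio stream, the sketch below assembles a minimal SDP description. The function name, host address, port, and session name are hypothetical, and PCMU (static RTP payload type 0) is only an example codec choice, not something the charter specifies.

```python
def build_sdp(session_id: int, host: str, rtp_port: int) -> str:
    """Return a minimal SDP description for a single RTP audio stream.

    All values are illustrative assumptions, not taken from the charter.
    """
    lines = [
        "v=0",                                            # protocol version
        f"o=- {session_id} {session_id} IN IP4 {host}",   # origin
        "s=speechsc example",                             # session name (hypothetical)
        f"c=IN IP4 {host}",                               # connection address
        "t=0 0",                                          # unbounded session time
        f"m=audio {rtp_port} RTP/AVP 0",                  # audio over RTP, payload type 0
        "a=rtpmap:0 PCMU/8000",                           # payload type 0 is PCMU at 8 kHz
    ]
    # SDP lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


sdp = build_sdp(1, "192.0.2.10", 49170)
print(sdp)
```

The media line is the part a control protocol would rely on to tie its commands to a particular RTP stream.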

The working group will not be investigating distributed speech recognition (DSR), as exemplified by the ETSI Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP. The working group will offer changes to existing protocols, with the possible exception of RTSP, to the appropriate IETF work group for consideration. This working group will explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the W3C Multimodal Interaction Work Group; the ITU-T Study Group 16 Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI Aurora STQ.

Once the current set of milestones is completed, the speechsc charter may be expanded, with IESG approval, to cover additional uses of the technology, such as the orchestration of multiple ASR/TTS/SV servers, the accommodation of additional types of servers such as simultaneous translation servers, etc.

Goals and Milestones:
JUL 02  Requirements ID submitted to IESG for publication (informational)
DEC 02  Submit Internet Draft(s) Analyzing Existing Protocols (informational)
DEC 02  Submit Internet Draft Describing New Protocol (if required) (standards track)
MAR 03  Submit Drafts to IESG for publication
No Current Internet-Drafts
No Request For Comments

Current Meeting Report

Speech Services Control WG (speechsc) 
Thursday, November 21 at 1530-1730  
CHAIRS:  Eric Burger
         D. Oran

SCRIBES: Mary Barnes
         Joerg Ott
Agenda Bashing 
Requirements Document Status (chairs) 
Protocol Analysis Document Open Issues  
Next Steps, Design Team Formation  
Agenda Bashing 
Requirements Document Status (chairs) 
Requirements Document:  
   o Have submitted original requirements doc, under IESG review (not 
currently reflected by draft tracker). 
   o two references #5. 
Protocol evaluation 
   o Considerable work needs to be done  
Protocol Analysis Document Open Issues: 
BEEP (Jerry Carter)
   o Jerry provided an overview of basic BEEP functionality.  It is
     intended to provide a common framework in which to develop other
     protocols.
   o To use BEEP:
        - Would need to define security.
        - And specific messages that would need to be
   o Open Issues:
        - Resource acquisition
        - Extensions for request grouping
        - Extensions for service location and load balancing
        - Mapping session and RTP channels
        - Mechanisms for grammar naming and storage
        - Mechanism for storing and retrieving input
        - State preservation for multiple utterance SI/SV
        - Extensions for duplexing and parallel operations
        - How would SpeechSC integrate with BEEP security?
   o High level concerns:
        - Not a lot of knowledge about the spec.
        - BEEP appears to be efficient, but not sure how effective it
          might be for this.
   o Carl: RTP?
   o Eric: this is a control protocol.
   o Jerry: the reference to BEEP was the logical mapping of channels.
SIP (Rajiv Dharmadhikari)
   o Suggests that SIP is already being used for sessions, so can also be
     used for controlling resources.
   o SIP for MRCP had been previously proposed.
   o Would like feedback on requirements.  Proposing the use of
   o Work needing to be done:
        - Summary.
        - Grammar sharing
        - State for multiple utterances
   o Radika: SIP for mid session control, but SIP is not for media
        - How to do grammar sharing?
        - How to preserve state across multiple training sessions?
   o SIP not intended for mid-session control; had you considered using
     SIP for establishing the control channel and then another mechanism
     for the actual control?
RTSP (Brian Wyld) [won't be in Atlanta] (discussed by Dave Oran)

High-order bits
   o RTSP is an excellent semantic match for the problem domain
        - Fundamental problem domain is control of media servers
        - Many constructs that can be used directly (e.g. Play method)
        - State machines either match or would be very similar.
   o RTSP is proven and deployed.
   o No "magic-bullet" for barge-in control problem.  No concept of
     asynchronous notification
        - May need new development, no matter what protocol.
Things needed over base RTSP
   o Ways to express speechsc-specific constructs
        - Grammars
        - Session parms
        - Input capture
   o Methods for recognition and SI/SV needed
        - Record not a perfect semantic match
   o Method of doing barge-in control
   o Joerg: Record may not stay that long in RTSP; removed this
   o Jerry: Would any of the extensions violate the core aspects of RTSP?
   o Dave: it's a judgment call; there may be issues with tunneling.
   o Sarvi: it (tunneling) was ugly, but it worked.  Didn't have to deal
     with extending RTSP.  May be some roadblocks when dealing with RTSP
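
To make the point about constructs that can be used directly more concrete, here is a minimal sketch of how an RTSP/1.0 request such as PLAY is framed as text. The helper name, URL, and session identifier are hypothetical. The RECOGNIZE request at the end is not an RTSP method; it only stands in for the kind of new method the list above says would be needed, and the Grammar header is likewise invented for illustration.

```python
def rtsp_request(method, url, cseq, session=None, extra_headers=None):
    """Format an RTSP/1.0 request line plus headers (no message body)."""
    headers = [("CSeq", str(cseq))]
    if session is not None:
        headers.append(("Session", session))
    headers.extend((extra_headers or {}).items())
    head = f"{method} {url} RTSP/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    return head + "\r\n"  # blank line ends the header section


# An existing RTSP construct: start playback of a prompt (hypothetical URL).
play = rtsp_request("PLAY", "rtsp://server.example.com/prompt", 3,
                    session="12345678")

# Hypothetical extension method and header, NOT part of base RTSP.
recognize = rtsp_request("RECOGNIZE", "rtsp://server.example.com/asr", 4,
                         session="12345678",
                         extra_headers={"Grammar": "http://example.com/g.grxml"})
print(play)
print(recognize)
```

Note that nothing here addresses barge-in: the server still has no way to notify the client asynchronously, which is the gap identified in the discussion.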
MRCP (Sarvi Shanmugham)
Overview of MRCP:
   o MRCP was designed with the specific goal of being extensible in the
     future to address SI/SV.
   o Depends upon RTSP or SIP for setting up the media session.  Chose to
     tunnel over RTSP rather than extending either.
        - Core methods, headers are independent of being tunneled over
          RTSP.
        - Ugly, but works
   o It already supports parallel usage.
   o Multiple interoperable implementations.
   o MRCP section of the compliance document needs to be updated to make
     the evaluation consistent with other sections.
   o Need to create or adopt a session level protocol capable of creating
     a control pipe and a media pipe, and then extend it with MRCP
     messages (to remove need for tunneling).
   o Need to add support for global or shared grammars to MRCP
   o Need to add MRCP resource extensions for SI/SV.
   o Consider how resources like the recognizer can be modularized and
     chained.
   o MRCP doesn't handle "sub-modules"
   o Radika: SIP and RTSP would be good.  Do you want one or do you want a
     new protocol?
   o Sarvi: MRCP by itself addresses the core of recognition commands.
     Looking for a protocol to establish a control session between client
     and server and be able to negotiate media pipe.
   o Radika: setting up of session, if either (SIP or RTSP) are suitable,
     suggest to allow both.
   o Sarvi: One could take RTSP or SIP as baseline and work from there to
     define the necessary extensions
Web Services (Stephane Maes)
   o Looks at the problem from a higher level.
   o Speech engines can be considered web services programmed by SOAP,
     WSDL (built on top of SOAP), WSFL and discovered via UDDI.
   o SOAP is bound to underlying protocol (HTTP, TCP, SIP, BEEP...) (per
     existing proposals)
   o Audio sub-systems and speech engines defined by WSDL
   o Web services programmed with WSDL
   o Combined/composed with WSFL
   o Discovered by UDDI
   o Additional events and messages via SOAP and a la WSXL (coordination
     among web services)
   o Security can be provided by WS-Security.
Conceptual view
   o various engines are independent components
   o accessible through aforementioned interface
   o Each needs to be associated with parallel streams of audio, etc.
   o Web services don't have syntax and semantics for speech control;
     however, it was designed to control any component.
   o There is no syntax and semantics associated to the control of speech
     engines (can be inspired from MRCP or other speech APIs)
   o The framework can be bound to numerous transports
   o Additional features are available today through tools and middleware
     offering rather than standard specs.
   o The evaluation assumes these characteristics are
   o If no change is required and only syntax and semantics must be
     defined, then it's a T.
The evaluation can be done such that most are Ts, with P+s being
satisfied.
   o web services: generic framework and extensible
   o no syntax and semantics predefined
        - can be taken from existing syntax (e.g. MRCP)
   o works with multiple transport protocols
   o additional tools available through tools and middleware
   o nothing needs to be changed in web services
   o can satisfy all the requirements defined above
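
As a sketch of the point that the framework exists but the speech-control syntax still has to be defined, the example below wraps a made-up recognition request in a SOAP 1.1 envelope. Only the envelope namespace is standard; the `urn:example:speechsc` namespace and the Recognize and Grammar element names are hypothetical placeholders for the syntax and semantics that would have to be specified (possibly borrowing from MRCP).

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SPEECH_NS = "urn:example:speechsc"  # hypothetical namespace, not a real schema


def build_recognize_request(grammar_uri: str) -> bytes:
    """Build a SOAP envelope carrying a hypothetical Recognize operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Everything inside the Body is the part that is not predefined.
    recognize = ET.SubElement(body, f"{{{SPEECH_NS}}}Recognize")
    grammar = ET.SubElement(recognize, f"{{{SPEECH_NS}}}Grammar")
    grammar.text = grammar_uri
    return ET.tostring(envelope, encoding="utf-8")


message = build_recognize_request("http://example.com/grammar.grxml")
print(message.decode("utf-8"))
```

The same payload could in principle be carried over any of the transports the bullets mention; the envelope itself says nothing about session setup or the RTP media streams.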
Finalization of web services involves:  
   o Integration of the web service framework 
   o Specification of syntax and semantics 
   o Optional selection of recommended transport product.  
   o Dave O: the architectural model shows the media being carried in
     separate RTP channels.  Is there any support in web services for
     setting up and tearing down sessions?  Or would this have to be
     recast in the web services framework?
   o Stephane: it would have to be recast inside the web services model.
   o Dave O: glaring missing piece for asynchronous notification for
     barge-in control.
   o Stephane: XML event exchange on SOAP gives you interaction
     (asynchronous), but may still have the problem with delay, race
     conditions.
   o ? on RTP.  How do you synchronize event timing?
   o Stephane: it's the same answer.  The engines are characterized by
     sink/source ports and this work would need to be done in the IETF.
   o Has Web service been used at the protocol control level?  Is latency
     a concern?
   o Stephane: web services for controlling is being done by Parlay.
     There will likely be the same race conditions as other protocols for
     the asynchronous notification for barge-in control.
   o Radika: Basic problem, setup and negotiation of session and this
     proposal still doesn't address how you could do that?
   o Dave O: only a small subset of the requirements deal with session
     setup.  The majority of the requirements are about command and
     control within a session.
   o Stephane: nothing prevents using SIP for negotiation and then using
     SOAP for command and control.  Definitely won't use web services for
   o Joerg: You've explained the framework, but there would be a lot to be
     done to get the necessary functionality.  If one was developing a
     protocol from scratch, what would be the
   o Stephane: syntax and semantic of the programming of the engine (API)
     would be the work.
   o Joerg: Is this approach a bug or a feature?
Any further questions?
   o Radika: We have the solutions for everything, we just need a control
     protocol.
   o Karl: About 5 years ago, did IP TV and used RTSP for control.  Lots
     of issues with interacting with RTP and users.  Can't start anywhere
     in an RTP stream.  Many receivers need to the message to get a
     timestamp.  How do you deal with starts and stops?  Users are used to
     mechanical things; there appear to be lots of user concerns.
   o Dave O: The requirements document is in the hands of the IESG; if
     there are additional comments, please provide them.
   o Markus: RTP synchronization; RTSP synchronization needs some work to
     get it to work properly.
   o Dave O: in the requirements, there is a specific synchronization
     requirement (must almost instantaneously stop).  If there are other
     syncs needed, these need to be provided.
   o SIP is intended to initiate sessions.  It's not a good control
     protocol or a good transport protocol.  Could perhaps use to set up
     audio in parallel.
   o Sarvi: not talking about tunneling MRCP over RTSP or SIP.  SIP is
     good at setting up and modifying a session (RTP pipes and negotiating
     params).  If SIP were extended to solve recognition and TTS problems,
     it would complement SIP.
   o SIP is fine for session associations.  You don't need to add MRCP
     stuff to SIP.
   o Sarvi: if we were to take MRCP and make it a protocol, need a way to
     set up a session (pipe) and negotiate a media stream.  Believes MRCP
     is complementary to SIP.
   o ?: SOAP sessions over SIP (Ubiquity draft).
   o Peter: Importance of fast response to any user
   o Eric: SIP is good at setting up a session, RTSP is okay at doing
     that.  It's all the commands that come after setting up the session
     that are really the issue.
   o If you need to negotiate media parms, RTSP is not good for
AD left to go get a projector, which had died.
Next Steps, Design Team Formation
   o Need consistent analysis framework: T is right now.
   o Date for completion
   o Stephane: agrees in principle.
   o Dave O: have combined 2 approaches under one umbrella.
        - You can evaluate as an existing protocol which can modify or
             . (MRCP, RTSP, SIP)
        - Framework of choice for new protocol.
             . Web services & BEEP.
     Proposes that criteria for framework would have to be different.
     Document was not intended to cast something in stone but to give
     guidance to protocol development.  Evaluation helps them to look at
     tradeoffs for various starting points.
   o Jerry: It is true that any protocol can be made to appear to work.
     Something like web services depends upon what messages you create.
   o UDP could do anything.  Same applies to BEEP or web services.
        - How much needs to actually be done.
   o Sarvi: In the context of framework, MRCP analysis - SIP also falls
     under framework category.
   o Rajeev: Job to get right Ts, Fs, Ps; what are the things that we
     don't want done?
   o Stephane: protocol vs. framework is a good one.  What's the outcome
     of the evaluation?  Some overlap where things are complementary.
   o Dave: keep in mind end goal.
   o Karl: one aspect of protocol choice; it will be
   o .....
   o Discussion around doing further work to detail the syntax and
     protocol impacts.  There is general support, but it's a lot of work.
   . Mary Barnes: MIDCOM experience 
        o counting P+ and other marks 
        o people arguing about which marks to get 
        o not much consistency in the end 
        o numbers could not really be compared 
   . Really clear usage scenario(s) needed.  Identify these so that we
     have a target.
   . Worthwhile thing to do this.  Upside: educates the community.
     Downside: more difficult to step back for those involved.  Four out
     of five analyses will be thrown away.
   . C: Good idea.  Helps to identify how much work is needed.
   . C: In principle, a good idea.  But this is much work.  Shouldn't this
     be part of the design process?
   . C: May educate, but may not help the evaluation.  Will cause more
     time to be spent.
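
The mark counting that Mary describes from MIDCOM can be sketched as below. The reading of the marks (T for total, P+ and P for degrees of partial, F for failed compliance) is an assumption based on MIDCOM-style evaluations, and the requirement names and marks in the example are invented for illustration only; they are not results from the analysis document.

```python
from collections import Counter

# Assumed rating scale, strongest to weakest.
RATINGS = ("T", "P+", "P", "F")


def tally(marks):
    """Count how many requirements received each rating.

    `marks` maps a requirement name to one of RATINGS.
    """
    counts = Counter(marks.values())
    unknown = set(counts) - set(RATINGS)
    if unknown:
        raise ValueError(f"unknown ratings: {sorted(unknown)}")
    return {rating: counts.get(rating, 0) for rating in RATINGS}


# Hypothetical marks for one candidate protocol (made-up data).
example = {
    "session setup": "T",
    "barge-in control": "P",
    "grammar sharing": "P+",
    "SI/SV state": "P",
}
print(tally(example))
```

Such totals are only comparable across protocols if every author applies the marks the same way, which is the inconsistency the MIDCOM experience points to.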
   o Scott B: motherhood stuff is okay, but it seems like the idea of
     tossing it to design team isn't a win.  Need a little bit more
   o Jerry: 2 tiers of problems: 1. existing protocol 2. new protocol
     (i.e. with framework).  The framework ones are more work.
   o Dave O: Framework vs existing protocol issue; with protocol,
     wouldn't you end up with a sub-optimal solution?
   o Scott: Bob Braden; the strength of the internet, we didn't optimize
     things, but made them flexible.
   o Dave O: the intent wasn't to focus on efficiency, we want to maximize
     flexibility.
   o Scott: in the SIP world; a reasonable argument can't be made for
     ability to extend.
   o Rajeev: vendor perspective; we have to be pragmatic; timing is
     critical and skill set is different
   o Stephane: rapidly available should be a priority.
   o Michael: agrees with flexibility of web services model; is there a
     liaison to that community?
   o Dave O: have already established liaisons.
   o Scott: Friendly reception from base control could be an issue.
     Footprint might be an issue.
   o Stephane: some of these are captured in requirements.  Web services
     doesn't necessarily take a big footprint.  If you follow one of the
     approaches that is carried by the industry and there is significant
     work, you may have a better chance to address all those issues.
   o If you look at it practically, most of the industry from which is
     being borrowed (media servers and speech recognition vendors).  Will
     these vendors be ready to do web services?  There are metrics
     available today.
   o ....
   o Stephane: lots of misconceptions about web services.
   o Sarvi: requirements about server to server communication?  VXML
     browser running on a gateway or a small PDA or a phone and you access
     a TTS-type resource, thus it's not server-server, but rather
     client-server.  Terminals, as well, need to be considered.
   o Stephane: didn't see this restricted to
   o In the end, protocol needs to be lightweight.
Going forward:
   o Does group feel it can start with a protocol while finishing
     evaluation?
   o One team or more than one team?
   o Stephane: believes this could be done.  Concern over
   o Scott B: One team only!
Conclusion: General hum vote support for this.
Revisit milestones:
Completing document:
   o Task 1: Recast the document into 2 categories (framework vs.
   o Task 2: align ratings.
   o Task 3: consistency due to multiple authors.
   o Task 4: make sure each protocol section has a summary that makes
     clear the strong points and weak points.
After that (2-3 weeks), WG chairs will decide if it's ready for design
   o Scott: don't necessarily need to publish as an RFC, but could be
     useful input to design team (per experience with MIDCOM protocol
     evaluation document).
   o Scott: output from a design team has no more weight than anyone
     else's opinion.
   o Dave: why is doc still not in waiting state?
   o Scott: still watching (and not yet looking)
   o Dave: propose "gazing" state.
Conclusion: Hum vote in support of path going forward.


BEEP: Overview
Media Resource Control Protocol
RTSP Protocol Evaluation
SIP Protocol Evaluation
Web Services Framework For Speechsc
The Internet Standards Process