2.7.15 Speech Services Control (speechsc)

NOTE: This charter is a snapshot of the 59th IETF Meeting in Seoul, Korea. It may now be out-of-date.

Last Modified: 2004-01-22

Chair(s):
David Oran <oran@cisco.com>
Eric Burger <eburger@snowshore.com>
Transport Area Director(s):
Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>
Transport Area Advisor:
Jon Peterson <jon.peterson@neustar.biz>
Mailing Lists:
General Discussion: speechsc@ietf.org
To Subscribe: speechsc-request@ietf.org
In Body: subscribe
Archive: www.ietf.org/mail-archive/working-groups/speechsc/current/maillist.html
Description of Working Group:
Many multimedia applications can benefit from having Automated Speech Recognition (ASR), Text to Speech (TTS), and Speaker Verification (SV) processing available as a distributed network resource. To date, there are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF drafts, that address this problem. However, the existing drafts have serious deficiencies. In particular, they mix the semantics of existing protocols yet are close enough to other protocols as to be confusing to the implementer.

The speechsc Working Group will develop protocols to support distributed media processing of audio streams. The focus of this working group is to develop protocols to support ASR, TTS, and SV. The working group will only focus on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the architecture and requirements for distributed speechsc control. In addition, the requirements document will describe the use cases driving these requirements. The working group will then examine existing media-related protocols, especially RTSP, for suitability as a protocol for carriage of speechsc server control. The working group will then propose extensions to existing protocols or the development of new protocols, as appropriate, to meet the requirements specified in the informational RFC.

The protocol will assume RTP carriage of media. Assuming session-oriented media transport, the protocol will use SDP to describe the session.

The working group will not be investigating distributed speech recognition (DSR), as exemplified by the ETSI Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP. The working group will offer changes to existing protocols, with the possible exception of RTSP, to the appropriate IETF working group for consideration. This working group will explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the W3C Multimodal Interaction Work Group; the ITU-T Study Group 16 Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI Aurora STQ.

Once the current set of milestones is completed, the speechsc charter may be expanded, with IESG approval, to cover additional uses of the technology, such as the orchestration of multiple ASR/TTS/SV servers, the accommodation of additional types of servers such as simultaneous translation servers, etc.

Goals and Milestones:
Done  Requirements ID submitted to IESG for publication (informational)
Done  Submit Internet Draft(s) Analyzing Existing Protocols (informational)
Done  Submit Internet Draft Describing New Protocol (if required) (standards track)
Oct 03  Submit Drafts to IESG for publication
Internet-Drafts:
  - draft-ietf-speechsc-reqts-05.txt
  - draft-ietf-speechsc-mrcpv2-01.txt
No Request For Comments

    Current Meeting Report

    SpeechSC Minutes 040302 17.00-18.00
    Magnus Westerlund
    Chairs Introduction and WG Status
    The WG chairs started with agenda bashing, followed by a presentation
    of the WG status. The SpeechSC requirements document has been approved
    by the IESG with some minor edits requested. However, the document was
    lost between the IESG and the RFC Editor, which has delayed
    publication. There is a milestone to request publication of any drafts
    by October 03. The current goal is to request publication no later than
    October 04.
    Separate Record Function Discussion
    The question to the WG was: Should there be an explicit recording
    function in MRCPv2? Draft version 01 does not have a pure recording
    function; recording behaviour is determined by using speech
    recognition or at least voice activity detection. There was some
    discussion of the use cases for recording. One use case mentioned for
    this behaviour is voice mail recording, where the recording is
    controlled through voice recognition. Therefore there is a desire to
    have a RECORD resource and a RECORD method that has a header indicating
    how the recorder should perform speech recognition. Another use case
    that matches this behaviour is recording for training or verification.
    The conclusion of the discussion was that there is no expressed need
    for blind recording; anyone needing this can use RTSP record. Also this
    use case should be mentioned in the protocol spec to motivate the
    Draft version 02 was made available on the mailing list and will be
    submitted when internet-drafts@ietf.org opens again. A number of open
    issues were discussed. The presentation was made by Sarvi Shanmugham.
    NAT traversal for the MRCPv2 TCP control channel setup: As long as only
    one end-point is in a private address space, it is possible to make
    things work. If both entities are in a private space, a relay will be
    needed. To get TCP to work, some signalling is needed to indicate how
    the TCP connection should be established. This is similar to the MMUSIC
    work on Co-Media.
    Do we need an INTERMEDIATE-RECOG-RESULT: The WG was asked whether there
    is any need for this functionality. Nobody expressed any desire for it.
    Unless anyone on the mailing list expresses a real need for this
    functionality, it will not be included. A mail will be sent to the
    mailing list to ask this question.
    Speech or hotword barge-in: Eric Burger asked if there is any protocol
    difference between the two. The answer was that the real issue is to
    identify which type of barge-in has happened, as there may be policies
    that accept either of the types. The conclusion was to confirm with the
    mailing list that this feature is included if a solution exists.
    Multiple instances of a header field vs. a single header field with
    multiple values: First there was some discussion of the historical
    reasons. Then it was asked whether there is any reason not to leave it
    as it is. Nobody had a real reason to change it from how SIP and HTTP
    handle headers. The list will be asked whether anyone knows of a reason
    to change it.
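    As an illustration of the two forms under discussion, the following
    minimal sketch treats repeated header lines and a single
    comma-separated header line as equivalent, which is how HTTP-style
    list headers are commonly read. It is not taken from the MRCPv2 draft:
    the field name "Example-Header" and the function name are hypothetical,
    and values containing quoted commas are deliberately not handled.

```python
def parse_header_values(lines, name):
    """Collect every value of `name`, whether the field is repeated
    on several lines or carries a comma-separated list on one line."""
    values = []
    for line in lines:
        field, sep, raw = line.partition(":")
        # Skip malformed lines and fields with a different name.
        if not sep or field.strip().lower() != name.lower():
            continue
        values.extend(v.strip() for v in raw.split(",") if v.strip())
    return values


# "Example-Header" is a hypothetical field name used only for illustration.
repeated = ["Example-Header: a", "Example-Header: b"]
combined = ["Example-Header: a, b"]

print(parse_header_values(repeated, "Example-Header"))
print(parse_header_values(combined, "Example-Header"))
```

    Under these assumptions both inputs yield the same list of values, so
    a receiver written this way does not care which form the sender chose.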
    Header field ranges for Confidence-Threshold, Speed-Vs-Accuracy, etc.
    (0.0 - 1.0 or 0 - 100): There was consensus in the room to use 0.0 - 1.0
    ranges. This will be confirmed on the mailing list.
    Proposal to specify a fixed header with a vendor identifier and a vendor
    registry for the Recognizer context block: David Oran stated that the
    IESG has concerns with vendor-specific extensions that make things fail.
    To make this work, the specification needs to ensure that the header is
    optional and can be ignored. No new error cases should be generated by
    this. The motivation was also discussed: the header allows some
    resources to work better, but they are not required to use it. A
    proposal was: The client MUST copy the header field to the next resource
    within the session. There was some discussion of making the MUST a
    SHOULD. After some more discussion of the issues, the following
    conclusion was reached in the room: Client must copy, Server must not
    barf. The server is not allowed to reject a request based on an empty or
    absent header.
    The WG should also look into whether there is an existing vendor
    registry that can be leveraged, for example with IANA.
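    The "client must copy, server must not barf" rule can be sketched as
    follows. This is only an illustration under stated assumptions: the
    header name "Vendor-Context", the "vendor;data" value layout, and the
    dictionary-based message representation are all hypothetical and are
    not defined by the draft.

```python
VENDOR_HEADER = "Vendor-Context"  # hypothetical header name


def next_request_headers(previous_response, new_headers):
    """Client side: copy the vendor header, if present, to the next request."""
    headers = dict(new_headers)
    if VENDOR_HEADER in previous_response:
        headers[VENDOR_HEADER] = previous_response[VENDOR_HEADER]
    return headers


def handle_request(headers, known_vendor="example.com"):
    """Server side: use the context when it is understood, otherwise
    ignore it. A missing, empty, or foreign header never causes an error."""
    context = headers.get(VENDOR_HEADER, "")
    vendor, _, data = context.partition(";")
    used = vendor.strip() == known_vendor and bool(data)
    return {"status": "ok", "used_context": used}


response = {VENDOR_HEADER: "example.com;opaque-data"}
request = next_request_headers(response, {"Content-Length": "0"})
print(handle_request(request))
print(handle_request({}))
```

    In this sketch a server that does not recognise the vendor identifier
    simply proceeds without the context, so no new error case arises.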
    DTMF support and RFC 2833 support: The conclusion was: If one supports
    DTMF, one MUST support RFC 2833. Confirm consensus on the mailing list.
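    For readers unfamiliar with RFC 2833, the sketch below decodes the
    four-byte telephone-event payload it defines: an 8-bit event code, an
    end bit, a reserved bit, a 6-bit volume, and a 16-bit duration in RTP
    timestamp units. The function name and the sample bytes are
    illustrative assumptions; RTP header parsing is omitted.

```python
import struct

# Event codes 0-15 map to the DTMF digits in this order (RFC 2833).
DTMF_DIGITS = "0123456789*#ABCD"


def parse_telephone_event(payload):
    """Decode the first four bytes of an RFC 2833 telephone-event payload."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags_volume, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,
        "digit": DTMF_DIGITS[event] if event < len(DTMF_DIGITS) else None,
        "end": bool(flags_volume & 0x80),   # E bit: final packet of the event
        "volume": flags_volume & 0x3F,      # power level, expressed in -dBm0
        "duration": duration,               # in RTP timestamp units
    }


# Sample bytes made up for illustration: digit 5, end bit set.
sample = bytes([5, 0x8A, 0x03, 0x20])
print(parse_telephone_event(sample))
```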
    Security support (sips? https? Digest? SRTP?): What is the minimal
    security support to implement? To help interoperability, it is normal
    to require a single solution as mandatory when specifying security
    features. The discussion was split into the different parts. For the
    MRCP channel there was consensus for MANDATORY support of TLS. For the
    media channel, the initial proposal from David Oran was: when
    transmitting media streams requiring security, one uses either SRTP or
    IPSec. Further discussion led to the observation that nowhere else in
    the system is there a requirement for IPSec. Therefore there might be a
    reason to look more closely at SRTP. A further comment was that the WG
    should contact the security area advisors early on to have them check
    the proposal, thus avoiding late surprises. Another question was: how
    does one indicate in SDP that IPSec should be used for a media stream?
    Define grammars one at a time: This proposal was supported by the room.
    PAUSE on Barge In: Is there a need to have this instead of Stop on barge
    in? There were no comments raised on this topic.
    There are two proposals for similar functionality to determine the
    available functionality: "OPTION commands for m-lines" and "SIP Callee
    capability for resource description and capability". The proposal is to
    check if the SIP Callee method can solve everything. If not, then one
    needs to look into SIP OPTIONS.
    3PCC model of connecting with the MRCPv2 server: It was proposed that 
    using Offer-Answer will solve the problem. Invite response can be the 
    initial offer.
    Specification conclusion: The specification is believed to be
    functionally complete. However, it needs review to ensure that
    everything works. The WG chairs asked Sarvi if it was possible to have a
    target of working group last call by end of April, which he