2.8.20 Speech Services Control (speechsc)

Last Modified: 2003-07-21

Chair(s):
David Oran <oran@cisco.com>
Eric Burger <eburger@snowshore.com>
Transport Area Director(s):
Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>
Transport Area Advisor:
Jon Peterson <jon.peterson@neustar.biz>
Mailing Lists:
General Discussion: speechsc@ietf.org
To Subscribe: speechsc-request@ietf.org
In Body: subscribe
Archive: www.ietf.org/mail-archive/working-groups/speechsc/current/maillist.html
Description of Working Group:
Many multimedia applications can benefit from having Automated Speech
Recognition (ASR), Text to Speech (TTS), and Speaker Verification (SV)
processing available as a distributed, network resource. To date, there
are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF
drafts, that address this problem. However, there are serious
deficiencies in the existing drafts relating to this problem. In
particular, they mix the semantics of existing protocols yet are close
enough to other protocols as to be confusing to the implementer.

The speechsc Work Group will develop protocols to support distributed
media processing of audio streams. The focus of this working group is
to develop protocols to support ASR, TTS, and SV. The working group
will only focus on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the
architecture and requirements for distributed speechsc control. In
addition, the requirements document will describe the use cases driving
these requirements. The working group will then examine existing
media-related protocols, especially RTSP, for suitability as a protocol
for carriage of speechsc server control. The working group will then
propose extensions to existing protocols or the development of new
protocols, as appropriate, to meet the requirements specified in the
informational RFC.

The protocol will assume RTP carriage of media. Assuming
session-oriented media transport, the protocol will use SDP to describe
the session.

The working group will not be investigating distributed speech
recognition (DSR), as exemplified by the ETSI Aurora project. The
working group will not be recreating functionality available in other
protocols, such as SIP or SDP. The working group will offer changes to
existing protocols, with the possible exception of RTSP, to the
appropriate IETF work group for consideration. This working group will
explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the
W3C Multimodal Interaction Work Group; the ITU-T Study Group 16 Working
Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI
Aurora STQ.

Once the current set of milestones is completed, the speechsc charter
may be expanded, with IESG approval, to cover additional uses of the
technology, such as the orchestration of multiple ASR/TTS/SV servers,
the accommodation of additional types of servers such as simultaneous
translation servers, etc.
Goals and Milestones:
Done  Requirements ID submitted to IESG for publication (informational)
Done  Submit Internet Draft(s) Analyzing Existing Protocols (informational)
Done  Submit Internet Draft Describing New Protocol (if required) (standards track)
Oct 03  Submit Drafts to IESG for publication
Internet-Drafts:
  • draft-ietf-speechsc-reqts-04.txt
  • draft-ietf-speechsc-protocol-eval-02.txt
No Request For Comments

    Current Meeting Report

    speechsc working group minutes, Wed July 16
    reported by Edwin Aoki <aoki@aol.net>
    Eric Burger and David Oran chair
    Administrivia and Agenda Bashing
    Proposed Agenda:
     Agenda Bashing           5 min
     Requirement Status       4 min
     Protocol Proposal       90 min
     Protocol Analysis       20 min
     Wrap up and next steps
    There were no objections to the agenda as proposed.
    Requirements Status - Dave Oran
    The requirements document was in the IESG for some time, and the 
    majority of comments were integrated into 
    draft-ietf-speechsc-reqts-04. The security ADs asked for a couple minor 
    changes, which will be included in an -05 draft, including a reference to 
    the risks of use of biometrics, including speaker identification and 
    speaker verification.
    After those changes, the draft will go to the RFC editor.
    Guido from the RNID had requested some changes in the wording of section 
    3.9.  Dave indicated that he'd thought that those changes were already 
    incorporated in the -04 draft; Guido thought his comments were for -04. 
    Guido will verify that his comments are still appropriate for the -04
    draft.
    speechsc Protocol Proposal - Sarvi Shanmugham (via audio link)
    The protocol proposal is now in draft form, based on the MRCP
    proposal, also now in draft form.  However, there were some issues that came
    up relating to MRCP's tunneling capability.  The draft proposes a
    SIP-based framework as a control channel to initiate sessions between 
    client and server.  The control channel will run over TCP or SCTP and will 
    not use an unreliable protocol such as UDP.
    This proposal doesn't address speaker identification or speaker
    verification.
    * The speechsc exchange is simple because it need not work around the
    unreliability of the underlying transport protocol
    * Allows for TCP/SCTP connection sharing, unlike RTSP, which requires the 
    client to open a separate connection to the server for each session.
    * Leverages MRCP - the state machine and flow are the same as MRCP, and are 
    therefore well-understood
    Most of the issues that have been raised on the list have been noted and 
    simply need to be incorporated into the next set of drafts.  Sarvi 
    presented a slide which listed the known issues, and the remainder of the
    discussion focused on these issues (and others that came up in the
    course of the discussion).  The chairs took a quick show of hands, which
    revealed that a few people have read the most recent draft.
    * Issue 1: Define SI and SV
    The author has received some responses from a few people who might be 
    interested in working on the SI and SV problem, but if there are 
    additional people who are interested, they should contact the WG chairs.
    Dan Burnett has volunteered.
    * Issue 2: Why use SIP (Bryan Wild and others)
    There was some discussion around the choice of SIP.  Morna Hirsch asked the 
    question (which Bryan Wild and others have asked on the list) why we 
    wouldn't continue with the use of RTSP and extend that instead of going all 
    the way to SIP?
    Sarvi explained that while RTSP was being used, speechsc was
    primarily using MRCP as a TCP pipe, and so it worked.  The desire was
    to move the messages to the top layer without
    requiring tunneling, and the separation of the control channel provided a 
    clean way to do this.  Additionally, going to SIP allowed for reuse of the 
    TCP pipe between client and server.
    Seeking more detail on the use of SIP for speechsc, Colin asked
    whether the proposal was a subset of SIP, or whether there would be
    parts of SIP that people would expect to work but that would not work
    in a speechsc context.
    Sarvi explained that everything one would expect for a standard RFC 
    3261-compliant UA would work; it is not a subset of SIP and there's no 
    expectation that a profile would be needed.
    The chairs took a hum on the question: "Is there consensus on using SIP as 
    the session initiation protocol for speechsc?"  The hum indicated rough 
    consensus for the statement; there was no opposition.
    The chairs then took a hum on the question of whether it would be 
    appropriate to adopt this draft as a WG item.  Again, there was no 
    opposition, but only a light hum in favor.  The chairs will take this 
    question to the list.
    * Issue 3: Multiple resources of a given type
    Dan Burnett asked for more clarification of section 3.2 on adding and
    removing resources.  Is it possible to have, for
    example, multiple ASR resources and then to be able to drop just one?  As
    long as there are only references to resource type and not to specific
    resources, it's unclear what would be dropped.
    There was some discussion around why one would want to have multiple 
    resources - for example to have multiple recognizers in parallel, but the 
    current draft does not consider having multiple resources on a single 
    Further discussion was taken to the list.
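    The ambiguity raised under Issue 3 can be illustrated with a small
    sketch.  Everything in it is hypothetical (the Session class, its
    method names, and the identifier format are not taken from the draft);
    it only contrasts dropping a resource by type with dropping it by a
    per-resource identifier.

```python
from collections import defaultdict


class Session:
    """Hypothetical session holding speech resources (illustration only)."""

    def __init__(self):
        # Maps a resource type to the identifiers of its live instances.
        self._resources = defaultdict(list)
        self._next_id = 0

    def add(self, rtype):
        """Allocate a resource of the given type and return its identifier."""
        self._next_id += 1
        rid = f"{rtype}-{self._next_id}"
        self._resources[rtype].append(rid)
        return rid

    def drop_by_type(self, rtype):
        """Drop by type only: every resource of that type goes away."""
        return self._resources.pop(rtype, [])

    def drop_by_id(self, rid):
        """Drop one specific resource, leaving others of its type intact."""
        for rids in self._resources.values():
            if rid in rids:
                rids.remove(rid)
                return True
        return False

    def active(self, rtype):
        """Return the identifiers of the live resources of one type."""
        return list(self._resources.get(rtype, []))


session = Session()
first = session.add("asr")
second = session.add("asr")
session.drop_by_id(first)      # only possible with per-resource identifiers
print(session.active("asr"))   # ['asr-2']
```

    With type-only references, drop_by_type is the only operation that can
    be expressed, and it removes both recognizers at once.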
    * Issue 4: Resource Tokens as strings
    The protocol currently defines resources by an integer number. In an XML 
    format, it costs the same (in bytes) to use strings such as "SI", "SV", or 
    even "ASR" or others.  Colin and Eric independently asked the question of 
    the extensibility of the namespace and whether strings could be used 
    instead of numbers.
    Sarvi indicated that he was open to using strings, perhaps even URIs of the 
    form channel ID@asr.
    There was some follow-up discussion on whether these strings would be
    arbitrary, negotiated strings, or fixed strings as in an IANA
    registry.  The discussion seemed to lean towards specific strings per
    resource type.
    The chairs asked for a concrete proposal to be sent to the list.
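    As a rough illustration of the fixed-string option, the sketch below
    validates an identifier of the "channel ID@type" form against a small
    registry.  The registry contents, the function name, and the choice to
    compare tokens case-insensitively are assumptions made for this
    example; no concrete proposal had been sent to the list at the time of
    the meeting.

```python
# Hypothetical fixed registry of resource-type tokens (not from the draft).
REGISTERED_TYPES = frozenset({"ASR", "TTS", "SI", "SV"})


def parse_channel(identifier):
    """Split 'channel-id@type' into (channel_id, resource_type).

    Raises ValueError if the shape is wrong or the type is unregistered.
    """
    channel_id, sep, rtype = identifier.rpartition("@")
    if not sep or not channel_id:
        raise ValueError(f"expected 'channel-id@type', got {identifier!r}")
    rtype = rtype.upper()  # assumption: tokens compare case-insensitively
    if rtype not in REGISTERED_TYPES:
        raise ValueError(f"unregistered resource type: {rtype!r}")
    return channel_id, rtype


print(parse_channel("32AECB23433801@asr"))  # ('32AECB23433801', 'ASR')
```

    A fixed registry lets a receiver reject unknown types outright, whereas
    arbitrary negotiated strings would need a per-session lookup table.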
    * Issue 5: Use of the m= line
    Neil Deason brought up the issue of how one would specify the choice of TCP 
    or SCTP given the current specs.  Two options were proposed.
    Proposal 1: One m= line, with a protocol ID of "speechsc" and where the 
    MIME type is a resource ID
    Proposal 2: One m= line with the protocol ID being the actual protocol used 
    (TCP or SCTP), MIME type of "application/speechsc" and additional 
    attributes a=resource ID <type>, a=channel ID <identifier>
    There were no comments on this and further discussion was taken to the
    list.
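    For illustration only, the two m= line proposals might be rendered
    roughly as follows.  The minutes do not record exact syntax, so the
    port value, the "application" media type in Proposal 1, the
    a=resource:/a=channel: attribute spelling, and the function names are
    all assumptions of this sketch, not text from the draft.

```python
def proposal_1(port, resource):
    """One m= line: protocol ID 'speechsc', format field is the resource ID."""
    return [f"m=application {port} speechsc {resource}"]


def proposal_2(port, transport, resource, channel):
    """One m= line naming the actual transport, plus two attributes."""
    if transport not in ("TCP", "SCTP"):
        raise ValueError("control channel runs over TCP or SCTP only")
    return [
        f"m=application {port} {transport} application/speechsc",
        f"a=resource:{resource}",
        f"a=channel:{channel}",
    ]


# Placeholder values; the port and channel identifier are made up.
print("\n".join(proposal_1(9000, "asr")))
print("\n".join(proposal_2(9000, "SCTP", "asr", "32AECB23433801")))
```

    In this rendering only the second form names TCP or SCTP in the m=
    line itself, which is the choice Neil's question was about.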
    * Comment
    Adam Roach made the comment that having content-length headers in the
    middle of the data has proven difficult to implement efficiently in other
    WGs (like SIP).  Subsequent work, for example in MSRP, has gone to more of
    a fixed-position framing for ease of parsing.  Other options
    include either an easy-to-parse byte count or well-known leader
    text (a la MIME parts).  This makes it easier to parse without having to
    pull in the entire message.
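    The byte-count option can be sketched as below.  The framing used here
    (a start line of the form "FRAME <n>" terminated by CRLF, where <n> is
    the number of bytes following that line) is invented for the example
    and is not the speechsc or MSRP format; the point is only that the
    receiver learns the message length from a fixed position before
    reading any headers.

```python
def split_frames(buffer: bytes):
    """Split a byte stream into complete messages.

    Assumed (hypothetical) framing: each message starts with the line
    'FRAME <n>' followed by CRLF, where <n> counts the bytes after that
    line.  Returns (messages, leftover); leftover is an incomplete tail
    to be retried once more data has arrived.
    """
    messages = []
    pos = 0
    while True:
        eol = buffer.find(b"\r\n", pos)
        if eol < 0:
            break  # start line not complete yet
        tag, _, count = buffer[pos:eol].partition(b" ")
        if tag != b"FRAME" or not count.isdigit():
            raise ValueError("malformed start line")
        body_start = eol + 2
        body_end = body_start + int(count)
        if body_end > len(buffer):
            break  # body not fully received yet
        messages.append(buffer[body_start:body_end])
        pos = body_end
    return messages, buffer[pos:]


stream = b"FRAME 5\r\nhelloFRAME 3\r\nabcFRAME 4\r\nxy"
done, rest = split_frames(stream)
print(done)  # [b'hello', b'abc']
print(rest)  # b'FRAME 4\r\nxy'
```

    Because the count sits on the first line, the receiver never has to
    scan header fields to find out how much more data to wait for.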
    Protocol Analysis Document - Eric Burger
    The document is complete, though it still needs some more work, 
    particularly cross-review.  A show of hands showed that 3 or 4 people had 
    read it.  So now what?  Does this document need to be published? Does it 
    need to be kept alive for the duration of the protocol?  etc.
    The AD felt that if it was interesting and/or worthwhile or could convey 
    some of the rationale for using IETF-supported protocols rather than not, 
    that it would be useful to document.
    There was some collective intuition that it would be good to have
    documented the reasons why the group moved in the direction that it did, 
    particularly because the group has made a fairly significant change in 
    direction.  As of now, however, the document is not in a publishable 
    state, and needs further work.
    Milestone Review - Eric Burger
    The group is a little ahead of schedule on the milestones as far as draft 
    submissions are concerned.  The milestones will be updated coming out of the 
    Vienna meeting.

