Speech Services Control (speechsc) Charter

2.8.16 Speech Services Control (speechsc)

NOTE: This charter is a snapshot of the 58th IETF Meeting in Minneapolis, Minnesota USA. It may now be out-of-date.

Last Modified: 2003-10-01

Chair(s):

David Oran <oran@cisco.com>
Eric Burger <eburger@snowshore.com>

Transport Area Director(s):

Allison Mankin <mankin@psg.com>
Jon Peterson <jon.peterson@neustar.biz>

Transport Area Advisor:

Jon Peterson <jon.peterson@neustar.biz>

Mailing Lists:

General Discussion: speechsc@ietf.org
To Subscribe: speechsc-request@ietf.org
In Body: subscribe
Archive: www.ietf.org/mail-archive/working-groups/speechsc/current/maillist.html

Description of Working Group:

Many multimedia applications can benefit from having Automated Speech Recognition (ASR), Text to Speech (TTS), and Speaker Verification (SV) processing available as a distributed, network resource. To date, there are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF drafts, that address this problem. However, there are serious deficiencies in the existing drafts relating to this problem. In particular, they mix the semantics of existing protocols yet are close enough to other protocols as to be confusing to the implementer.

The speechsc Work Group will develop protocols to support distributed media processing of audio streams. The focus of this working group is to develop protocols to support ASR, TTS, and SV. The working group will only focus on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the architecture and requirements for distributed speechsc control. In addition, the requirements document will describe the use cases driving these requirements. The working group will then examine existing media-related protocols, especially RTSP, for suitability as a protocol for carriage of speechsc server control. The working group will then propose extensions to existing protocols or the development of new protocols, as appropriate, to meet the requirements specified in the informational RFC.

The protocol will assume RTP carriage of media. Assuming session-oriented media transport, the protocol will use SDP to describe the session.

The working group will not be investigating distributed speech recognition (DSR), as exemplified by the ETSI Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP. The working group will offer changes to existing protocols, with the possible exception of RTSP, to the appropriate IETF work group for consideration. This working group will explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the W3C Multimodal Interaction Work Group; the ITU-T Study Group 16 Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI Aurora STQ.

Once the current set of milestones is completed, the speechsc charter may be expanded, with IESG approval, to cover additional uses of the technology, such as the orchestration of multiple ASR/TTS/SV servers, the accommodation of additional types of servers such as simultaneous translation servers, etc.

Goals and Milestones:

Done		Requirements ID submitted to IESG for publication (informational)
Done		Submit Internet Draft(s) Analyzing Existing Protocols (informational)
Done		Submit Internet Draft Describing New Protocol (if required) (standards track)
Oct 03		Submit Drafts to IESG for publication

Internet-Drafts:

- draft-ietf-speechsc-reqts-04.txt

- draft-ietf-speechsc-protocol-eval-02.txt

- draft-ietf-speechsc-mrcpv2-00.txt

No Request For Comments

Current Meeting Report

SPEECHSC Minutes


Dave: The requirements document passed review; it's in editorial review now.


Sarvi: Dan Burnett's speaker identification/verification draft is out now.  It's geared toward MRCP v1, and will be evolved into MRCP v2.


Sarvi: Open Issues:


 * Proxy support: Call flows are needed.  Currently, we're looking at 
using a relay to front-end requests.


 * When to start/stop media: The recognizer should expect the media to 
start flowing when it receives the recognize request, and shouldn't 
buffer anything it receives beforehand.


 * Recording audio:
    Two types (definitions from Dan Burnett):
      resource-related: everything the recognizer hears and/or 
everything it thinks is speech
      time-based: record some period of the conversation, 
independent of the recognition.


    It was agreed that it is outside the scope of MRCP to record the 
conversation. It is, however, desirable to have a "record" resource which 
takes audio input from the client, "puts a handle on it" and makes it 
available to the client, possibly applying some "speechish" 
operations (end pointing, etc).


 * Resource types: there is potentially a need to 
identify/classify resource for allocation (e.g. this "recognizer" can only 
recognize DTMF input, not speech, or this "TTS engine" can only play 
audio, it doesn't do synthesis).  SIP Callee capabilities will be 
investigated/discussed on the mailing list to determine whether they are 
sufficient.


 * NLSML versus EMMA: As we won't know the status of the EMMA specification until the time we publish, we'll leave a placeholder in our document until then and make a decision at that point.


 * Multiple media streams: There's a need for only one media line.


 * Multiple speak requests: Is it desirable to be able to pause an active 
speak request, execute a new speak request, and then resume the 
original request?  Yes, it's potentially useful, but it can be 
accomplished on the client side by allocating two separate TTS 
resources, pausing one, starting the other, and then resuming the first one 
when the second one finishes.
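
The client-side workaround described in the last item can be sketched as follows. This is only an illustration under stated assumptions: `TtsResource` and `interrupt_with` are hypothetical names, the class is an in-memory stand-in rather than any MRCP API, and the completion of the second request is simulated synchronously where a real client would wait for an asynchronous event.

```python
from dataclasses import dataclass, field


@dataclass
class TtsResource:
    """Hypothetical stand-in for one allocated TTS resource (not an MRCP API)."""
    name: str
    state: str = "idle"  # idle -> speaking -> paused -> speaking -> idle
    log: list = field(default_factory=list)

    def speak(self, text):
        if self.state != "idle":
            raise RuntimeError(f"{self.name} is busy ({self.state})")
        self.state = "speaking"
        self.log.append(f"speak:{text}")

    def pause(self):
        if self.state != "speaking":
            raise RuntimeError(f"{self.name} is not speaking")
        self.state = "paused"
        self.log.append("pause")

    def resume(self):
        if self.state != "paused":
            raise RuntimeError(f"{self.name} is not paused")
        self.state = "speaking"
        self.log.append("resume")

    def complete(self):
        # Simulates the resource finishing its current utterance.
        if self.state != "speaking":
            raise RuntimeError(f"{self.name} is not speaking")
        self.state = "idle"
        self.log.append("complete")


def interrupt_with(primary, secondary, text):
    """Pause `primary`, play `text` on `secondary`, then resume `primary`."""
    primary.pause()
    secondary.speak(text)
    secondary.complete()  # assumption: stands in for an asynchronous completion event
    primary.resume()


if __name__ == "__main__":
    first = TtsResource("tts-a")
    second = TtsResource("tts-b")
    first.speak("long prompt")
    interrupt_with(first, second, "urgent notice")
    print(first.log, second.log)
```

The point of the sketch is that the interleaving lives entirely in the client: each resource only ever sees its own speak, pause, and resume requests, so no nested-request support is needed on the server side.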



Dan Burnett on Speaker Identification and Verification:


 - joint proposal from Nuance and Intervoice submitted recently.  In 
addition to SI/SV, document covers:


    - speaker-enrolled grammars: use recorded audio to make a grammar; 
well-suited for voice dialing applications
    - hotword recognition: recognizer listens for hotword(s) in a 
conversation, doing nothing until it actually recognizes something (as 
opposed to timing out, throwing a "nomatch", etc)
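
The hotword behavior described above can be illustrated with a small sketch. It assumes, purely for illustration, that speech has already been turned into text utterances; a real recognizer works on audio, and `wait_for_hotword` is a hypothetical name, not part of the draft.

```python
def wait_for_hotword(utterances, hotwords):
    """Return the first hotword heard, or None if the input ends first.

    Non-matching speech is skipped silently: nothing equivalent to a
    "nomatch" or a timeout is reported along the way.
    """
    wanted = {word.lower() for word in hotwords}
    for utterance in utterances:
        for word in utterance.lower().split():
            if word in wanted:
                return word
    return None


if __name__ == "__main__":
    heard = ["hello there", "please call Operator now"]
    print(wait_for_hotword(heard, ["operator"]))
```

Returning `None` here only marks the end of the simulated input; in the behavior described in the minutes, the recognizer would simply keep listening.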


  SI/SV discussion:


    Two questions so far:


      1. Why buffering?  Can the audio from a captured recognizer session (when recognition is done with save-waveform=true) be used for verification, by passing the verification engine handle(s) to the recorded audio?  We should be able to eliminate the pause/resume methods.


      2. Is there a need for some sort of registry for returned info?  Some verifier/identifier might return gender information, or language information; common categories would be beneficial.



Milestones:


  Slightly behind schedule currently.  A draft will be submitted sometime after the next IETF meeting (March 2004?).



Jeff Kusnitz,

Slides

Agenda

Presentation 0

MRCPv2

Presentation 1

Speaker Identification and Verification

Presentation 2

MRCP Model

Presentation 3