Current Meeting Report

2.8.19 Speech Services Control (speechsc)

NOTE: This charter is a snapshot of the 54th IETF Meeting in Yokohama, Japan. It may now be out of date.

Last Modified: 06/28/2002

Chair(s):
David Oran <>
Eric Burger <>
Transport Area Director(s):
Scott Bradner <>
A. Mankin <>
Transport Area Advisor:
Scott Bradner <>
Mailing Lists:
General Discussion:
To Subscribe:
In Body: subscribe
Description of Working Group:
Many multimedia applications can benefit from having Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Speaker Verification (SV) processing available as a distributed network resource. To date, a number of proprietary ASR, TTS, and SV APIs, as well as two IETF drafts, address this problem. However, the existing drafts have serious deficiencies: in particular, they mix the semantics of existing protocols yet are close enough to those protocols to confuse implementers.

The speechsc Working Group will develop protocols to support distributed media processing of audio streams. The focus of this working group is to develop protocols to support ASR, TTS, and SV. The working group will focus only on the secure distributed control of these servers.

The working group will develop an informational RFC detailing the architecture and requirements for distributed speechsc control. In addition, the requirements document will describe the use cases driving these requirements. The working group will then examine existing media-related protocols, especially RTSP, for suitability as a protocol for carriage of speechsc server control. The working group will then propose extensions to existing protocols or the development of new protocols, as appropriate, to meet the requirements specified in the informational RFC.

The protocol will assume RTP carriage of media. Assuming session-oriented media transport, the protocol will use SDP to describe the session.
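As a sketch of the kind of SDP session description the charter assumes, the following builds a minimal SDP body for an RTP audio session. All addresses, ports, and the session name here are illustrative placeholders, not values defined by the working group.

```python
# Hypothetical SDP body for an RTP audio session of the kind the charter
# assumes. Every address, port, and identifier below is a placeholder.
sdp = "\r\n".join([
    "v=0",                          # protocol version
    "o=- 0 0 IN IP4 192.0.2.1",     # origin (example address)
    "s=speechsc media session",     # session name (placeholder)
    "c=IN IP4 192.0.2.1",           # connection address
    "t=0 0",                        # timing: unbounded session
    "m=audio 49170 RTP/AVP 0",      # audio over RTP/AVP, payload type 0
    "a=rtpmap:0 PCMU/8000",         # payload 0 = PCMU at 8 kHz
]) + "\r\n"
print(sdp)
```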

The working group will not be investigating distributed speech recognition (DSR), as exemplified by the ETSI Aurora project. The working group will not be recreating functionality available in other protocols, such as SIP or SDP. The working group will offer changes to existing protocols, with the possible exception of RTSP, to the appropriate IETF work group for consideration. This working group will explore modifications to RTSP, if required.

It is expected that we will coordinate our work in the IETF with the W3C Multimodal Interaction Working Group; the ITU-T Study Group 16 Working Party 3/16 on SG 16 Question 15/16; the 3GPP TSG SA WG1; and the ETSI Aurora STQ.

Once the current set of milestones is completed, the speechsc charter may be expanded, with IESG approval, to cover additional uses of the technology, such as the orchestration of multiple ASR/TTS/SV servers, the accommodation of additional types of servers such as simultaneous translation servers, etc.

Goals and Milestones:
JUL 02  Requirements ID submitted to IESG for publication (informational)
DEC 02  Submit Internet Draft(s) Analyzing Existing Protocols (informational)
DEC 02  Submit Internet Draft Describing New Protocol (if required) (standards track)
MAR 03  Submit Drafts to IESG for publication
No Current Internet-Drafts
No Request For Comments


Minutes - Speech Services Control WG (speechsc)
Reported by Tom Taylor

Wednesday, July 17 at 0900-1130

Chairs: Eric Burger
David Oran

0900 - Agenda Bashing/Charter Review (Chairs)

The proposed agenda was accepted.

0910 - Work Roadmap & Timeline (Chairs)

Dave Oran presented. The charts are available at


The Working Group is chartered. The name has changed from CATS to SPEECHSC. The
scope is initially limited to Automatic Speech Recognition (ASR), Text To Speech
(TTS), and Speaker Verification (SV). This will expand later after the group has
demonstrated its ability to meet deliverables. Scott Bradner
noted that there had been concern within IESG that the group is too narrowly
focused; he would be disappointed if the scope didn't expand. There should be a
strong bias toward protocol reuse. The group is to coordinate with ETSI Aurora,
ITU-T SG 16 (Question 15), W3C, and any other interested groups that emerge.

Work Items

Dave listed the milestones set by the charter. The Working Group is already late
on requirements publication, hence this is highest priority.

Timeline for Work Items

The Chairs would like to do Working Group Last Call on requirements by early
August. (Hence the meeting will focus on this.)

They would like to kick off work on protocol analysis immediately following
Working Group Last Call of requirements. (The meeting wrapup will include a
discussion of ways and means.)

0930 - Discuss requirements document (draft-burger-speechsc-reqts-00.txt)

This document, an update to draft-burger-cats-reqts-00.txt, was posted a month
ago. A small number of people generated a substantial number of postings on the
list. The Chairs wanted to take as long as necessary to cover the open issues
identified in the document and raised on the list.

Open Issues

Identified in the reqts document:

(1) Means of detection of Speech Synthesis Markup Language (SSML)
Proposed resolution: require content type header.
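A minimal sketch of such a content-type check. The media type `application/ssml+xml` used here is an assumption for illustration; the meeting required a content type header but did not fix the type itself.

```python
# Assumed media type for SSML; the WG did not fix this value at the meeting.
SSML_MEDIA_TYPE = "application/ssml+xml"

def is_ssml(content_type: str) -> bool:
    """Detect SSML by media type alone, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == SSML_MEDIA_TYPE

print(is_ssml("application/ssml+xml; charset=utf-8"))  # True
print(is_ssml("text/plain"))                           # False
```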

(2) Should control channels be long-lived?
There was only one comment on the list: allow, but do not require, long-lived control channels.
Question: does this mean requiring that the control channels be set up in advance?
The discussion distinguished long-lived vs. session-based vs. on-demand control
channel setup.
Long-lived: set up in advance.
Session-based: (note "session" is undefined).
On-demand: per utterance.
It was proposed that the protocol should support the first two, and may allow
on-demand setup. On-demand raises design issues if support is stronger than MAY.

Proposed summation: there is agreement on session and something larger than
session, but there is some question of whether smaller-than-session duration is needed.

(3) For parameters that persist across a session, allow setting on a per-session basis?
The proposal is to allow session parameters. There was no discussion.

(4) Allow for speech markers, as specified for MRCP over RTSP?
Two comments on the list: Stephane Maes stated that speech
markers are needed and must be efficient. Dan Burnett asked
whether SSML was not adequate. The proposed resolution is that SSML is a good
initial hypothesis.
Discussants noted that we have to support markers in messaging. SSML is
acceptable for now. The protocol must provide an efficient mechanism for
reporting that a marker has been sensed.

Stephane Maes noted that SSML can reference audio files. You don't know at
beginning how many files you are going to play. It was recognized that this is a
separate issue from marker. The proposed resolution was accepted.

(5) Should ASR support alternative grammar formats?
Stephane Maes said yes, we need that.
Stephane added that we need an extensibility mechanism, but not discovery.
Dan Burnett agreed.
Stephane noted that we should differentiate between capability discovery for
resource management and capability discovery for control.
Dave Oran restated the conclusion: there is a need to discover the capabilities
of a given device, but this is not necessarily part of this protocol. There was
further discussion, but Dave suggested we read RFC 2533 then revisit this
discussion. It may be a matter of incorporating that protocol within this one as
SIP has done.
Proposal: the protocol must be able to explicitly signal grammar format and
support extensibility, but we will say nothing for now on capability discovery.

(6) Is there a need for all the parameters specified for MRCP over RTSP?
List comments: Yes, we need to go beyond the W3C grammar, and also need
extensibility (Maes, Burnett).
Proposed resolution: Yes. Moreover, we need to be able to specify parameters on
per-session basis. The exact set is to be decided as part of the protocol
analysis and design phase.
At this point there was some discussion of parameter setting beyond the session
and within a session. There SHOULD be a capability to reset parameters within a
session. It was noted that processor adjustment is done per-call, hence the
protocol at least needs to allow adjustment per call. There is also some need to
transfer data between servers (e.g. on background noise). Note: session and call
are not necessarily related concepts.

A question was raised on the handling of conferences (multiple speakers).
Dave Oran suggested a protocol requirement to recognize different SSRCs in the
RTP stream. There is a problem here: a conference could have multiple speakers
associated with an SSRC.
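For reference, the SSRC occupies bytes 8-11 of the fixed RTP header, so distinguishing contributing sources by SSRC can be sketched as below. The packet built here is fabricated purely for illustration.

```python
import struct

def rtp_ssrc(packet: bytes) -> int:
    """Return the SSRC field (bytes 8-11) of an RTP packet's fixed header."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    return struct.unpack("!I", packet[8:12])[0]

# Fabricated packet: V=2 (0x80), PT=0, seq=1, timestamp=0, SSRC=0x1234ABCD
pkt = struct.pack("!BBHII", 0x80, 0, 1, 0, 0x1234ABCD)
print(hex(rtp_ssrc(pkt)))
```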

(7) The scope of the requirements should go beyond ASR, TTS, and SV/SR
(Speaker Recognition).
Proposed resolution: not for now.
Steve (e-mail address not recorded) remarked that the main market is still for
pre-recorded speech. The Chairs responded that this is a solved problem, not
something we need to work on. However, we can recognize that it will be present.
Text to express this is requested.

Stephane Maes suggested that we need some requirement for extensibility of
scope. Dave Oran asked how one would determine whether the protocol meets such a
requirement. He preferred to leave this to the design stage. Text is requested
for consideration, if this avenue is to be pursued.

The question was raised, whether DTMF is in scope. Dave Oran noted that other
mechanisms are available for handling DTMF. Eric Burger added that in ASR, DTMF
would be invisible to protocol: it would be specified in the grammar. For a DTMF
server, use another protocol such as Megaco.
There was a suggestion that one might want conversion between voice and
DTMF. Eric responded that this was an application function.

(8) Does the protocol have to cope with both parallel and serial composition of servers?
Proposal: the charter limits topology. Serial chaining involves OPES proxy issues.

There was some discussion about cases associated with wireless LAN. The Chairs'
response was that OPES issues are matters of delegation, trust, security, and
traceability. We would have to convince the IESG that these issues do not arise
or are well met in this case. It would represent a major expansion of work to
generate the required analysis. See RFC 3238 for more information.
Compromise: note this as an area of research and possible future enhancement.

(9) Does the requirement not to redo RTSP or SIP/msuri restrict the ability to
use markers and other playout options like pacing?
Proposed resolution: reword the requirement to clarify that the intent is not to
impose such a restriction.

(10) Clarify the OPES requirement.
Proposed resolution: add a reference to RFC 3238. The intent is that the client
side of the protocol will operate on behalf of one user. Stephane Maes will
supply text.

(11) Load balancing.
The Chairs noted that the current text captured the outcome of lengthy
discussions. The requirements must not preclude load balancing but also must not
require load balancing. The general feeling was that it is not a fruitful area
of effort.

(12) Must be able to control language and prosody for plain text.
Proposed: this is a matter of clarification: SSML provides the desired control.

(13) Need "full control" over TTS engine (Maes). VCR and other fine-grained
control should be lower priority (Burnett).
Dan Burnett clarified: VCR controls are audio controls, not TTS controls. It was
agreed such controls are needed, but they are not a high priority for TTS
applications. The counter-argument was that we have the analogue in text
operations: e.g. skip paragraph, go back to previous page. Stephane's point was
that real-time controls are needed, and he is not sure why we would call them
out specifically for lower priority.

This issue is one for the list to consider. We need a more detailed explication
of control requirements. Note that there is a problem of interaction with SSML.
There is the question of what kind of units to skip ahead by, for instance:
seconds, paragraphs, ...

(14) Must handle prompting, recording, possibly utterance verification,
retraining, in addition to recording for analysis (Maes).
Proposed resolution: design for extensibility, but no specific requirements in
the protocol other than for recording for now.

(15) Grammar sharing.
The Chairs proposed to adopt the Burnett phrasing of requirements:

(i) A server implementation needs to be able to store large grammars originally
provided by the client, and
(ii) we need the ability within the protocol to reference grammars already known
to the server (e.g. built in). Dave saw this as a name space issue.

The distinction between globally unique and well-known was noted, but seen as a
design issue.
The question of control of grammar use was raised. The Chairs suggested that
this is a matter of passing it only to trusted entities.
There was a suggestion that (i) is a matter of cache control.
There is the issue of who can use which grammar, but the meeting agreed that
this is outside of scope. Discovery of grammars is also out of scope.

It was agreed that the protocol must not preclude grammar sharing across

Dan Burnett is to supply text.

(16) Need to cover speaker enrollment, identification, and classification as
well as recognition as part of SV. Multiple methods are needed.
Resolution: will add this to the requirements. Dan Burnett is to provide more
text.
(17) Why a requirement on cross-utterance state?
Dan Burnett explained: he wants to make sure the implementation option remains
open. Hence his concern is that there be no requirement that cross-utterance
information be held only in the client. Stephane saw this as an example of a
number of cases where extensibility will be needed. Dave Oran suggested we need
a way to express in the protocol that some barrier has been crossed and
resynchronization is needed. Looking at it another way: we need to be able to
indicate that different transactions, not necessarily sequential, are
correlated. Stephane suggested we add to this that the specific kind of
correlation is proprietary. Following on, it is important that the server be
able to give a result and say what context it applies to.

(19) Need simultaneous performance of multiple functions on the same streams.
The meeting agreed to add the requirement but not to consider parallel
decomposition for now. (This could be happening behind the scenes due to OPES.)

Stephane wondered if we always assume the output of an engine goes back to the
issuer of the command. The Chairs' answer was "yes", on security grounds: there
are too many hacking scenarios otherwise.

It was noted that the security section needs expansion. It should distinguish
between requirements on the protocol (being put together in this document) and
requirements on the system (not to be documented).

Other agenda points

The requirements discussion took all the time available, so intervening points
of the agenda were not covered.

1115 - Wrap-up and next steps

The intent is to reissue the requirements draft by July 27. The group would aim
for Working Group Last Call by early August, with text going to the IESG by the
end of August.
Issue: do use cases go into the requirements, or will they just be used as a guide?
Stephane Maes proposed a short summary in the requirements, but mainly use them
as a guide.

Steve asked what the group would do about discovery and resource management.
Dave Oran pointed out that this is a generic problem for client-server
protocols. He suggested just leaving it to system architects. This implies that
clients and servers become limited in applicability to the discovery mechanisms
they implement.

The list is now at

Note 1: the posted IETF agenda still has this as a "CATS" BoF. We are in fact
an approved WG, and are called SPEECHSC, as we previously reported to the mailing
list.
Note 2: the mailing list will be decommissioned immediately
following this IETF. PLEASE subscribe to the mailing list as
soon as convenient.
