SIPPING                                                      J. Rosenberg
Internet-Draft                                                dynamicsoft
Expires: December 29, 2003                                  June 30, 2003


    A Framework and Requirements for Application Interaction in the
                   Session Initiation Protocol (SIP)
         draft-rosenberg-sipping-app-interaction-framework-01

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups.  Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 29, 2003.

Copyright Notice

Copyright (C) The Internet Society (2003).  All Rights Reserved.
Abstract

This document describes a framework and requirements for the
interaction between users and Session Initiation Protocol (SIP) based
applications.  By interacting with applications, users can guide the
way in which they operate.  The focus of this framework is stimulus
signaling, which allows a user agent to interact with an application
without knowledge of the semantics of that application.  Stimulus
signaling can occur to a user interface running locally with the
client, or to a remote user interface, through media streams.
Stimulus signaling encompasses a wide range of mechanisms, ranging
from clicking on hyperlinks, to pressing buttons, to traditional Dual
Tone Multi Frequency (DTMF) input.  In all cases, stimulus signaling
is supported through the use of markup languages, which play a key
role in this framework.
Table of Contents

1.     Introduction
2.     Definitions
3.     A Model for Application Interaction
3.1    Function vs. Stimulus
3.2    Real-Time vs. Non-Real Time
3.3    Client-Local vs. Client-Remote
3.4    Interaction Scenarios on Telephones
3.4.1  Client Remote
3.4.2  Client Local
3.4.3  Flip-Flop
4.     Framework Overview
5.     Client Local Interfaces
5.1    Discovering Capabilities
5.2    Pushing an Initial Interface Component
5.3    Updating an Interface Component
5.4    Terminating an Interface Component
6.     Client Remote Interfaces
6.1    Originating and Terminating Applications
6.2    Intermediary Applications
7.     Inter-Application Feature Interaction
7.1    Client Local UI
7.2    Client-Remote UI
8.     Intra Application Feature Interaction
9.     Examples
10.    Security Considerations
11.    Contributors
       Informative References
       Author's Address
       Intellectual Property and Copyright Statements
1. Introduction

The Session Initiation Protocol (SIP) [1] provides the ability for
users to initiate, manage, and terminate communications sessions.
Frequently, these sessions will involve a SIP application.  A SIP
application is defined as a program running on a SIP-based element
(such as a proxy or user agent) that provides some value-added
function to a user or system administrator.  Examples of SIP
applications include pre-paid calling card calls, conferencing, and
presence-based [3] call routing.

In order for most applications to function properly, they need input
from the user to guide their operation.  As an example, a pre-paid
calling card application requires the user to input their calling
card number, their PIN code, and the destination number they wish to
reach.  The process by which a user provides input to an application
is called "application interaction".

Application interaction can be either functional or stimulus.
Functional interaction requires the user agent to understand the
semantics of the application, whereas stimulus interaction does not.
Stimulus signaling allows applications to be built without requiring
modifications to the client.  Stimulus interaction is the subject of
this framework.  The framework provides a model for how users
interact with applications through user interfaces, and how user
interfaces and applications can be distributed throughout a network.
This model is then used to describe how applications can instantiate
and manage user interfaces.
2. Definitions

SIP Application: A SIP application is defined as a program running on
   a SIP-based element (such as a proxy or user agent) that provides
   some value-added function to a user or system administrator.
   Examples of SIP applications include pre-paid calling card calls,
   conferencing, and presence-based [3] call routing.

Application Interaction: The process by which a user provides input
   to an application.

Real-Time Application Interaction: Application interaction that takes
   place while an application instance is executing.  For example,
   when a user enters their PIN into a pre-paid calling card
   application, this is real-time application interaction.

Non-Real Time Application Interaction: Application interaction that
   takes place asynchronously with the execution of the application.
   Generally, non-real time application interaction is accomplished
   through provisioning.

Functional Application Interaction: Application interaction is
   functional when the user device has an understanding of the
   semantics of the application that the user is interacting with.

Stimulus Application Interaction: Application interaction is
   considered to be stimulus when the user device has no
   understanding of the semantics of the application that the user is
   interacting with.

User Interface (UI): The user interface provides the user with
   context in order to make decisions about what they want.  The user
   enters information into the user interface.  The user interface
   interprets the information and passes it to the application.

User Interface Component: A piece of user interface which operates
   independently of other pieces of the user interface.  For example,
   a user might have two separate web interfaces to a pre-paid
   calling card application - one for hanging up and making another
   call, and another for entering the username and PIN.

User Device: The software or hardware system that the user directly
   interacts with in order to communicate with the application.  An
   example of a user device is a telephone.  Another example is a PC
   with a web browser.

User Input: The "raw" information passed from a user to a user
   interface.  Examples of user input include a spoken word or a
   click on a hyperlink.

Client-Local User Interface: A user interface which is co-resident
   with the user device.

Client-Remote User Interface: A user interface which executes
   remotely from the user device.  In this case, a standardized
   interface is needed between them.  Typically, this is done through
   media sessions - audio, video, or application sharing.

Media Interaction: A means of separating a user and a user interface
   by connecting them with media streams.

Interactive Voice Response (IVR): An IVR is a type of user interface
   that allows users to speak commands to the application, and hear
   responses to those commands prompting for more information.

Prompt-and-Collect: The basic primitive of an IVR user interface.
   The user is presented with a voice option, and the user speaks
   their choice.

Barge-In: In an IVR user interface, a user is prompted to enter some
   information.  With some prompts, the user may enter the requested
   information before the prompt completes.  In that case, the prompt
   ceases.  The act of entering the information before completion of
   the prompt is referred to as barge-in.

Focus: A user interface component has focus when user input is fed to
   it, as opposed to any other user interface components.  This is
   not to be confused with the term focus within the SIP conferencing
   framework, which refers to the central user agent in a conference
   [4].

Focus Determination: The process by which the user device determines
   which user interface component will receive the user input.

Focusless User Interface: A user interface which has no ability to
   perform focus determination.  An example of a focusless user
   interface is the keypad on a telephone.

Feature Interaction: A class of problems which result when multiple
   applications or application components are trying to provide
   services to a user at the same time.

Inter-Application Feature Interaction: Feature interactions that
   occur between applications.

DTMF: Dual-Tone Multi-Frequency.  DTMF refers to a class of tones
   generated by circuit-switched telephony devices when the user
   presses a key on the keypad.  As a result, DTMF and keypad input
   are often used synonymously, when in fact one of them (DTMF) is
   merely a means of conveying the other (the keypad input) to a
   client-remote user interface (the switch, for example).

Application Instance: A single execution path of a SIP application.

Originating Application: A SIP application which acts as a UAC,
   calling the user.

Terminating Application: A SIP application which acts as a UAS,
   answering a call generated by a user.  IVR applications are
   terminating applications.

Intermediary Application: A SIP application which is neither the
   caller nor the callee, but rather a third party involved in a
   call.
3. A Model for Application Interaction

   +---+            +---+            +---+             +---+
   |   |            |   |            |   |             |   |
   |   |            | U |            | U |             | A |
   |   |   Input    | s |   Input    | s |   Results   | p |
   |   | ---------> | e | ---------> | e | ----------> | p |
   | U |            | r |            | r |             | l |
   | s |            |   |            |   |             | i |
   | e |            | D |            | I |             | c |
   | r |   Output   | e |   Output   | f |   Update    | a |
   |   | <--------- | v | <--------- | a | <.......... | t |
   |   |            | i |            | c |             | i |
   |   |            | c |            | e |             | o |
   |   |            | e |            |   |             | n |
   |   |            |   |            |   |             |   |
   +---+            +---+            +---+             +---+

            Figure 1: Model for Real-Time Interactions

Figure 1 presents a general model for how users interact with
applications.  Generally, users interact with a user interface
through a user device.  A user device can be a telephone, or it can
be a PC with a web browser.  Its role is to pass the user input from
the user to the user interface.  The user interface provides the user
with context in order to make decisions about what they want.  The
user enters information into the user interface.  The user interface
interprets the information and passes it to the application.  The
application may be able to modify the user interface based on this
application, when the user is prompted to enter their PIN, the prompt
should generally stop immediately once the first digit of the PIN is
entered.  This is referred to as barge-in.  After the user interface
collects the rest of the PIN, it can tell the user to "please wait
while processing".  The PIN can then be gradually transmitted to the
application.  In this example, the user interface has compensated for
a slow UI-to-application interface by asking the user to wait.
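The prompt-and-collect with barge-in behavior described above can be
sketched as follows.  This is an illustrative sketch only: the
PromptPlayer class and the digit source are invented stand-ins for a
real audio subsystem, and nothing here is defined by this framework.

```python
class PromptPlayer:
    """Illustrative stand-in for an audio prompt.  It records whether
    it was cut off early, so the barge-in behavior can be observed."""

    def __init__(self):
        self.playing = False
        self.barged_in = False

    def start(self):
        self.playing = True

    def stop(self):
        # Only a stop while still playing counts as barge-in.
        if self.playing:
            self.playing = False
            self.barged_in = True


def prompt_and_collect(player, digit_source, num_digits):
    """Play a prompt and collect digits, ceasing the prompt as soon as
    the first digit arrives (barge-in)."""
    player.start()
    digits = []
    for digit in digit_source:
        if not digits:
            player.stop()  # barge-in: stop the prompt on first input
        digits.append(digit)
        if len(digits) == num_digits:
            break
    return "".join(digits)
```

Once the digits are collected, the user interface is free to play a
"please wait" announcement while it forwards the result, which is the
compensation technique the paragraph above describes.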
The separation between user interface and application is absolutely
fundamental to the entire framework provided in this document.  Its
importance cannot be overstated.

With this basic model, we can begin to taxonomize the types of
systems that can be built.
3.1 Function vs. Stimulus

The first way to taxonomize the system is to consider the interface
between the UI and the application.  There are two fundamentally
different models for this interface.  In a functional interface, the
user interface has detailed knowledge about the application, and is,
application in order to change the way in which they render
information to the user, stimulus user interfaces are usually slower,
less user friendly, and less responsive than a functional
counterpart.  However, they allow for substantial innovation in
applications, since no standardization activity is needed to build a
new application, as long as it can interact with the user within the
confines of the user interface mechanism.
In SIP systems, functional interfaces are provided by extending the
SIP protocol to provide the needed functionality.  For example, the
SIP caller preferences specification [5] provides a functional
interface that allows a user to request applications to route the
call to specific types of user agents.  Functional interfaces are
important, but are not the subject of this framework.  The primary
goal of this framework is to address the role of stimulus interfaces
to SIP applications.
3.2 Real-Time vs. Non-Real Time

Application interaction systems can also be real-time or
non-real-time.  Non-real-time interaction allows the user to enter
information about application operation asynchronously with its
invocation.  Frequently, this is done through provisioning systems.
As an example, a user can set up the forwarding number for a
call-forward-on-no-answer application using a web page.  Real-time
interaction requires the user to interact with the application at the
time of its invocation.
3.3 Client-Local vs. Client-Remote

Another axis in the taxonomization is whether the user interface is
co-resident with the user device (which we refer to as a client-local
user interface), or the user interface runs in a host separated from
the client (which we refer to as a client-remote user interface).  In
a client-remote user interface, there exists some kind of protocol
between the client device and the UI that allows the client to
interact with the user interface over a network.
The most important way to separate the UI and the client device is
through media interaction.  In media interaction, the interface
between the user and the user interface is through media - audio,
video, messaging, and so on.  This is the classic mode of operation
for VoiceXML [2], where the user interface (also referred to as the
voice browser) runs on a platform in the network.  Users communicate
with the voice browser through the telephone network (or using a SIP
session).  The voice browser interacts with the application using
HTTP to convey the information collected from the user.
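As a rough illustration of that last step, the voice browser's
submission of collected input to the application can be modeled as an
ordinary form-encoded HTTP POST.  The URL and field names below are
invented for the example; the actual submission rules are defined by
the VoiceXML specification, not by this framework.

```python
from urllib.parse import urlencode

def build_submission(app_url, collected_fields):
    """Assemble the pieces of the HTTP POST a voice browser might use
    to convey collected user input to the application (illustrative
    sketch, not VoiceXML's normative behavior)."""
    body = urlencode(collected_fields)
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return "POST", app_url, headers, body
```

The point of the sketch is the division of labor: the voice browser
owns the media dialog with the user, while the application only ever
sees interpreted results arriving over HTTP.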
We refer to the second sub-case as a client-local user interface.  In
this case, the user interface runs co-located with the user.  The
interface between them is through the software that interprets the
user's input and passes it to the user interface.  The classic
example of this is the web.  In the web, the user interface is a web
(such as PCMU).  An alternative, and generally the preferred
approach, is to transmit the keypad input using RFC 2833 [7], which
provides an encoding mechanism for carrying keypad input within RTP.
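For concreteness, a single RFC 2833 telephone-event payload (the four
octets carried inside an RTP packet; the RTP header itself is omitted
here) can be sketched as follows.  This is a simplified illustration,
not a complete implementation of the payload format.

```python
import struct

# Named events for DTMF from RFC 2833: digits 0-9 map to event codes
# 0-9, '*' is event 10, and '#' is event 11.
DTMF_EVENTS = {**{str(d): d for d in range(10)}, "*": 10, "#": 11}

def encode_telephone_event(key, duration, end=False, volume=10):
    """Encode one telephone-event payload: an 8-bit event code, an
    End (E) bit marking the final packet of the event, a reserved
    bit, a 6-bit volume, and a 16-bit duration in RTP timestamp
    units, all in network byte order."""
    second_octet = (0x80 if end else 0x00) | (volume & 0x3F)
    return struct.pack("!BBH", DTMF_EVENTS[key], second_octet, duration)
```

A sender emits several such payloads per key press, with a growing
duration field, and sets the End bit in the final packets; the
receiver reconstructs the key press without any audio processing.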
In this classic model, the user interface would run on a server in
the IP network.  It would perform speech recognition and DTMF
recognition to derive the user intent, feed them through the user
interface, and provide the result to an application.
3.4.2 Client Local

An alternative model is for the entire user interface to reside on
the telephone.  The user interface can be a VoiceXML browser, running
speech recognition on the microphone input, and feeding the keypad
input directly into the script.  As discussed above, the VoiceXML
script could be rendered using text instead of voice, if the
telephone had a textual display.
3.4.3 Flip-Flop
A middle-ground approach is to flip back and forth between a
client-local and client-remote user interface. Many voice
applications are of the type which listen to the media stream and
wait for some specific trigger that kicks off a more complex user
interaction. The long pound in a pre-paid calling card application is
one example. Another example is a conference recording application,
where the user can press a key at some point in the call to begin
recording. When the key is pressed, the user hears a whisper to
inform them that recording has started.
The ideal way to support such an application is to install a
client-local user interface component that waits for the trigger to
kick off the real interaction. Once the trigger is received, the
application connects the user to a client-remote user interface that
can play announcements, collect more information, and so on.
The benefit of flip-flopping between a client-local and client-remote
user interface is cost. The client-local user interface will
eliminate the need to send media streams into the network just to
wait for the user to press the pound key on the keypad.
The Keypad Markup Language (KPML) was designed to support exactly
this kind of need [8]. It models the keypad on a phone, and allows an
application to be informed when any sequence of keys has been
pressed. However, KPML has no presentation component. Since user
interfaces generally require a response to user input, the
presentation will need to be done using a client-remote user
interface that gets instantiated as a result of the trigger.
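The "wait locally, then report" behavior can be illustrated without
KPML's actual syntax (which is defined in [8]). In the sketch below,
the monitor class, the "L" encoding of a long press, and the one
second threshold are all invented for the example; only the idea of
buffering keypad input and reporting on a pattern match comes from
the text above.

```python
import re

# Hypothetical client-local trigger monitor: buffer keypad input and
# invoke the report callback only when the trigger pattern (here, a
# long pound) appears.  Not KPML; an illustration of the concept.
class TriggerMonitor:
    def __init__(self, pattern, report):
        self.pattern = re.compile(pattern)
        self.report = report
        self.buffer = ""

    def key(self, digit, held_ms=0):
        # Encode a long press (>= 1 second) as the digit plus 'L'.
        self.buffer += digit + ("L" if held_ms >= 1000 else "")
        match = self.pattern.search(self.buffer)
        if match:
            self.report(match.group())
            self.buffer = ""

hits = []
mon = TriggerMonitor(r"#L", hits.append)   # watch for a long pound
mon.key("1"); mon.key("#", held_ms=250)    # no trigger yet
mon.key("#", held_ms=1500)                 # trigger fires
assert hits == ["#L"]
```

Until the trigger fires, no media or signaling leaves the phone, which
is exactly the cost saving described above.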
It is tempting to use a hybrid model, where a prompt-and-collect
application is implemented by using a client-remote user interface
that plays the prompts, and a client-local user interface, described
by KPML, that collects digits. However, this only complicates the
application. Firstly, the keypad input will be sent to both the media
stream and the KPML user interface. This requires the application to
sort out which user inputs are duplicates, a process that is very
complicated. Secondly, the primary benefit of KPML is to avoid having
a media stream towards a user interface. However, there is already a
media stream for the prompting, so there is no real savings.
4. Framework Overview
In this framework, we use the term "SIP application" to refer to a
broad set of functionality. A SIP application is a program running on
a SIP-based element (such as a proxy or user agent) that provides
some value-added function to a user or system administrator. SIP
applications can execute on behalf of a caller, a called party, or a
multitude of users at once.
Each application has a number of instances that are executing at any
given time. An instance represents a single execution path for an
interface, and in what format. In this framework, all client-local
user interface components are described by a markup language. A
markup language describes a logical flow of presentation of
information to the user, collection of information from the user, and
transmission of that information to an application. Examples of
markup languages include HTML, WML, VoiceXML, the Keypad Markup
Language (KPML) [8] and the Media Server Control Markup Language
(MSCML) [9].
The interface between the user interface component and the
application is typically markup-language specific. For those markups
which support rendering of information to a user, such as HTML, HTTP
form POST operations are used. For those markups where no information
is rendered to the user, the markup can play one of two roles. The
first is called "one shot". In the one-shot role, the markup waits
for a user to enter some information, and when they do, reports this
event to the application. The application then does something, and
the markup is no longer used. In the other modality, called
"monitor", the markup stays permanently resident, and reports
information back to an application continuously. However, the act of
reporting information back to the application does not cause the
installation of a new markup. In markups where one-shot or monitor
modalities are used, a SIP MESSAGE request is used to report the
status.
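For the rendering case, the report is an ordinary HTTP form POST. The
sketch below only constructs such a request (it does not send it);
the action URL and the field name are invented for the example.

```python
from urllib import parse, request

# Build (but do not send) the HTTP form POST that a rendered markup
# such as an HTML form would use to hand collected input back to the
# application.  URL and field names here are hypothetical.
def report_input(action_url, fields):
    body = parse.urlencode(fields).encode("ascii")
    req = request.Request(action_url, data=body, method="POST")
    req.add_header("Content-Type",
                   "application/x-www-form-urlencoded")
    return req

req = report_input("http://app.example.com/collect",
                   {"digits": "1234#"})
assert req.data == b"digits=1234%23"
assert req.get_method() == "POST"
```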
To create a client-local user interface, the application passes the
markup document (or a reference to it) in a SIP message to that
client. The SIP message can be one explicitly generated by the
application (in which case the application has to be a UA or B2BUA),
or it can be placed in a SIP message that passes by (in which case
the application can be running in a proxy).
Client local user interface components are always associated with the
dialog that the SIP message itself is associated with. Consequently,
which the application knows a UI can be created. However, the
application does need to connect the user device to the user
interface. This will require manipulation of media streams in order
to establish that connection.
Once a user interface component is created, the application needs to
be able to change it, and to remove it. Finally, more advanced
applications may require coupling between application components. The
framework supports rudimentary capabilities for such coupling.
5. Client Local Interfaces
One key component of this framework is support for client local user
interfaces.
5.1 Discovering Capabilities
A client local user interface can only be instantiated on a client if
the user device has the capabilities needed to do so. Specifically,
an application needs to know what markup languages, if any, are
supported by the client. For example, does the client support HTML?
VoiceXML? However, that information is not sufficient to determine if
a client local user interface can be instantiated. In order to
instantiate the user interface, the application needs to transfer the
markup document to the client. There are two ways in which the markup
document can be transferred. The application can send the client a
URI which the client can use to fetch the markup, or the markup can
be sent inline within the message. The application needs to know
which of these modes are supported, and in the case of indirection,
which URI schemes are supported to obtain the indirection.
Many applications will need to know these capabilities at the time an
application instance is first created. Since applications can be
created through SIP requests or responses, SIP needs to provide a
means to convey this information. This introduces several concrete
requirements for SIP:
REQ 1: A SIP request or response must be capable of conveying the set
   of markup languages supported by the UA that generated the request
   or response.
REQ 2: A SIP request or response must be capable of indicating
   whether a UA can obtain markups inline, or through an indirection.
   In the case of indirection, the UA must be capable of indicating
   what URI schemes it supports.
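Neither requirement fixes a syntax. Purely to make them concrete, the
sketch below parses one hypothetical capability value covering both
REQ 1 and REQ 2: supported markups, whether inline transfer is
possible, and the URI schemes usable for indirection. The value
format is invented for this example.

```python
# Parse a hypothetical capability string such as
#   "html; kpml; inline; schemes=http,https"
# into the three pieces of information REQ 1 and REQ 2 require.
def parse_markup_caps(value):
    caps = {"markups": [], "inline": False, "schemes": []}
    for token in (t.strip() for t in value.split(";")):
        if token == "inline":
            caps["inline"] = True
        elif token.startswith("schemes="):
            caps["schemes"] = token[len("schemes="):].split(",")
        elif token:
            caps["markups"].append(token)
    return caps

caps = parse_markup_caps("html; kpml; inline; schemes=http,https")
assert caps == {"markups": ["html", "kpml"], "inline": True,
                "schemes": ["http", "https"]}
```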
5.2 Pushing an Initial Interface Component
Once the application has determined that the UA is capable of
supporting client local user interfaces, the next step is for the
application to push an interface component to the user device.
Generally, we anticipate that interface components will need to be
created at various different points in a SIP session. Clearly, they
will need to be pushed during an initial INVITE, in both responses
(so as to place a component into the calling UA) and in the request
(so as to place a component into the called UA). As an example, a
conference recording application allows the users to record the media
for the session at any time. The application would like to push an
HTML user interface component to both the caller and callee at the
time the call is set up, allowing either to record the session. The
HTML component would have buttons to start and stop recording. To
push the HTML component to the caller, it needs to be pushed in the
200 OK (and possibly provisional response), and to push it to the
callee, in the INVITE itself.
To state the requirement more concretely:
REQ 3: An application must be able to add a reference to, or an
   inline version of, a user interface component into any request or
   response that passes through or is emanated from that application.
However, there will also be cases where the application needs to push
a new interface component to a UA, but it is not as a result of any
SIP message. As an example, a pre-paid calling card application will
set a timer that determines how long the call can proceed, given the
availability of funds in the user's account. When the timer fires,
the application would like to push a new interface component to the
calling UA, allowing them to click to add more funds.
In this case, there is no message already in transit that can be used
as a vehicle for pushing a user interface component. This requires
that applications can generate their own messages to push a new
component to a UA:
REQ 4: A UA application must be able to send a SIP message to the UA
   at the other end of the dialog, asking it to create a new
   interface component.
In all cases, the information passed from the application to the UA
must include more than just the interface component itself (or a
reference to it). The user must be able to decide whether or not it
wants to proceed with this application. To make that determination,
the user must have information about the application. Specifically,
it will need the name of the application, and an identifier of the
owner or administrator for the application. As an example, a typical
name would be "Prepaid Calling Card" and the owner could be
"voiceprovider.com".
REQ 5: Any user interface component passed to a client (either inline
   or through a reference) must also include markup meta-data,
   including a human readable name of the application, and an
   identifier of the owner of the application.
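REQ 3 and REQ 5 together might take a shape like the following sketch,
which attaches a component reference plus the REQ 5 meta-data to a set
of SIP headers. The "UI-Component" header name and its parameters are
entirely invented here; the framework states only the requirements,
not the syntax.

```python
# Hypothetical attachment of a UI component reference, with its name
# and owner meta-data, to the headers of a SIP message the application
# is relaying or generating.  Header syntax is invented.
def add_component_ref(sip_headers, uri, name, owner):
    hdrs = dict(sip_headers)  # copy; do not mutate the message in transit
    hdrs["UI-Component"] = '<%s>;name="%s";owner="%s"' % (uri, name, owner)
    return hdrs

invite = {"CSeq": "1 INVITE"}
invite = add_component_ref(invite,
                           "http://app.example.com/record.html",
                           "Conference Recording", "voiceprovider.com")
assert invite["UI-Component"].startswith("<http://app.example.com/")
```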
Clearly, there are security implications. The user will need to
verify the identity of the application owner, and be sure that the
user interface component is not being replayed, that is, it actually
belongs with this specific SIP message.
REQ 6: It must be possible for the client to validate the
   authenticity and integrity of the markup document (or its
   reference) and its associated meta-data. It must be possible for
   the client to verify that the information has not been replayed
   from a previous SIP message.
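One way to satisfy REQ 6 is sketched below with a shared-secret HMAC;
a real deployment would more likely use S/MIME or a similar mechanism,
and the shared secret here is assumed to be provisioned out of band.
Binding the signature to the dialog's Call-ID is what defeats replay
against a different SIP message.

```python
import hmac, hashlib

# Illustrative integrity/anti-replay check: the markup is signed
# together with the Call-ID of the dialog it was pushed on, so a
# captured component cannot be replayed into another dialog.
def sign(secret, markup, call_id):
    return hmac.new(secret, markup + b"|" + call_id.encode(),
                    hashlib.sha256).hexdigest()

secret = b"shared-secret"          # assumed provisioned out of band
markup = b"<html>...</html>"
tag = sign(secret, markup, "a84b4c76e66710@pc33.example.com")

# Verification succeeds only for the same markup and the same dialog:
assert hmac.compare_digest(
    tag, sign(secret, markup, "a84b4c76e66710@pc33.example.com"))
assert tag != sign(secret, markup, "other-call@pc33.example.com")
```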
If the user decides not to execute the user interface component, it
simply discards it. There is no explicit requirement for the user to
be able to inform the application that the component was discarded.
Effectively, the application will think that the component was
executed, but that the user never entered any information.
5.3 Updating an Interface Component
Once a user interface component has been created on a client, it can
be updated in two ways. The first way is the "normal" path inherent
to that component. The client enters some data, the user interface
transfers the information to the application (typically through
HTTP), and the result of that transfer brings a new markup document
describing an updated interface. This is referred to as a synchronous
update, since it is synchronized with user interaction.
However, synchronous updates are not sufficient for many
applications. Frequently, the interface will need to be updated
asynchronously by the application, without an explicit user action. A
good example of this is, once again, the pre-paid calling card
application. The application might like to update the user interface
when the timer runs out on the call. This introduces several
requirements:
REQ 7: It must be possible for an application to asynchronously push
   an update to an existing user interface component, either in a
   message that was already in transit, or by generating a new
   message.
REQ 8: It must be possible for the client to associate the new
   interface component with the one that it is supposed to replace,
   so that the old one can be removed.
Unfortunately, pushing of application components introduces a race
condition. What if the user enters data into the old component,
causing an HTTP request to the application, while an update of that
component is in progress? The client will get an interface component
in the HTTP response, and also get the new one in the SIP message.
Which one does the client use? There needs to be a way in which to
properly order the components:
REQ 9: It must be possible for the client to relatively order user
   interface updates it receives as the result of synchronous and
   asynchronous messaging.
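One conceivable resolution of this race, sketched below, is to tag
every component with a monotonically increasing version number and
discard any arrival, whether from the synchronous HTTP path or an
asynchronous SIP push, that is older than what is already installed.
The version-number convention is assumed for this sketch; the
framework only states the ordering requirement.

```python
# Illustrative REQ 9 ordering: keep a version per component slot and
# install an update only if it is strictly newer than what is shown.
class ComponentSlot:
    def __init__(self):
        self.version = -1
        self.markup = None

    def offer(self, version, markup):
        """Install markup only if newer; return True if installed."""
        if version <= self.version:
            return False
        self.version, self.markup = version, markup
        return True

slot = ComponentSlot()
assert slot.offer(1, "<form v1>")       # initial component
assert slot.offer(3, "<form v3>")       # asynchronous SIP push arrives
assert not slot.offer(2, "<form v2>")   # late HTTP response discarded
assert slot.markup == "<form v3>"
```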
5.4 Terminating an Interface Component
User interface components have a well defined lifetime. They are
created when the component is first pushed to the client. User
interface components are always associated with the SIP dialog on
which they were pushed. As such, their lifetime is bound by the
lifetime of the dialog. When the dialog ends, so does the interface
component.
This rule applies to early dialogs as well. If a user interface
component is passed in a provisional response to INVITE, and a
separate branch eventually answers the call, the component terminates
with the arrival of the 2xx. That's because the early dialog itself
terminates with the arrival of the 2xx.
However, there are some cases where the application would like to
terminate the user interface component before its natural termination
point. To do this, the application pushes a "null" update to the
client. This is an update that replaces the existing user interface
component with nothing.
REQ 10: It must be possible for an application to terminate a user
   interface component before its natural expiration.
The user can also terminate the user interface component. However,
there is no explicit signaling required in this case. The component
is simply dismissed. To the application, it appears as if the user
has simply ceased entering data.
6. Client Remote Interfaces
As an alternative to, or in conjunction with, client local user
interfaces, an application can make use of client remote user
interfaces. These user interfaces can execute co-resident with the
application itself (in which case no standardized interfaces between
the UI and the application need to be used), or they can run
separately. This framework assumes that the user interface runs on a
host that has a sufficient trust relationship with the application.
As such, the means for instantiating the user interface is not
considered here.
application. It is a terminating application because the user
explicitly calls it; i.e., it is the actual called party. An example
of an originating application is a wakeup call application, which
calls a user at a specified time in order to wake them up.
Because originating and terminating applications are a natural
termination point of the dialog, manipulation of the media session by
the application is trivial. Traditional SIP techniques for adding and
removing media streams, modifying codecs, and changing the address of
the recipient of the media streams, can be applied. Similarly, the
application can directly authenticate itself to the user through
S/MIME, since it is the peer UA in the dialog.
6.2 Intermediary Applications
Intermediary applications are, at the same time, more common than
originating/terminating applications, and more complex. Intermediary
applications are applications that are neither the actual caller nor
the called party. Rather, they represent a "third party" that wishes
to interact with the user. The classic example is the ubiquitous
pre-paid calling card application.
In order for the intermediary application to add a client remote user
interface, it needs to manipulate the media streams of the user agent
to terminate on that user interface. This also introduces a
fundamental feature interaction issue. Since the intermediary
application is not an actual participant in the call, how does the
user interact with the intermediary application, and its actual peer
in the dialog, at the same time? This is discussed in more detail in
Section 7.
7. Inter-Application Feature Interaction
The inter-application feature interaction problem is inherent to
stimulus signaling. Whenever there are multiple applications, there
are multiple user interfaces. When the user provides an input, to
which user interface is the input destined? That question is the
essence of the inter-application feature interaction problem.

Inter-application feature interaction is not an easy problem to
resolve. For now, we consider separately the issues for client-local
and client-remote user interface components.
clear to which application the user input is targeted.
As another example, consider the same two applications, but on a
"smart phone" that has a set of buttons, and next to each button, an
LCD display that can provide the user with an option. This user
interface can be represented using the Wireless Markup Language
(WML).
The phone would allocate some number of buttons to each application.
The prepaid calling card would get one button for its "hangup"
command, and the recording application would get one for its
"start/stop" command. The user can easily determine which application
to interact with by pressing the appropriate button. Pressing a
button determines focus and provides user input, both at the same
time.
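The button-allocation scheme above can be sketched as follows. This is
purely illustrative; the class and its methods are invented for this
example and are not defined by this framework:

```python
# Sketch: a "smart phone" allocates labeled buttons to applications.
# Pressing a button both determines focus and provides the input.

class ButtonAllocator:
    def __init__(self, num_buttons):
        self.bindings = {}              # button index -> (application, label)
        self.free = list(range(num_buttons))

    def allocate(self, app, label):
        """Give the next free button to an application; return its index."""
        button = self.free.pop(0)
        self.bindings[button] = (app, label)
        return button

    def press(self, button):
        """A press selects the bound application AND delivers its command."""
        app, label = self.bindings[button]
        return {"focused_app": app, "input": label}

phone = ButtonAllocator(num_buttons=4)
b1 = phone.allocate("prepaid-card", "hangup")
b2 = phone.allocate("recorder", "start/stop")

event = phone.press(b2)
# event == {"focused_app": "recorder", "input": "start/stop"}
```

Because the press both selects the application and carries the input,
no separate focus-selection step is ever needed on such a device.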
Unfortunately, not all devices will have these advanced displays. A
PSTN gateway, or a basic IP telephone, may only have a 12-key keypad.
The user interfaces for these devices are provided through the Keypad
Markup Language (KPML). Considering once again the feature
interaction case above, the pre-paid calling card application and the
call recording application would both pass a KPML document to the
device. When the user presses a button on the keypad, to which
document does the input apply? The user interface does not allow the
user to select. A user interface where the user cannot provide focus
is called a focusless user interface. This is quite a hard problem to
solve. This framework does not make any explicit normative
recommendation, but concludes that the best option is to send the
input to both user interfaces unless the markup in one interface has
indicated that it should be suppressed from others. This is a
sensible choice by analogy - it's exactly what the existing circuit
switched telephone network will do. It is an explicit non-goal to
provide a better mechanism for feature interaction resolution than
the PSTN on devices which have the same user interface as they do on
the PSTN. Devices with better displays, such as PCs or screen
phones, can benefit from the capabilities of this framework, allowing
the user to determine which application they are interacting with.
Indeed, when a user provides input on a focusless device, the input
must be passed to all client local user interfaces, AND all client
remote user interfaces, unless the markup tells the UI to suppress
the media. In the case of KPML, key events are passed to remote user
interfaces by encoding them in RFC 2833 [7]. Of course, since a
client cannot determine if a media stream terminates in a remote user
interface or not, these key events are passed in all audio media
streams unless the "Q" digit is used to suppress them.
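The delivery rule above can be sketched as follows. This is an
illustrative model only; the "suppresses" flag is an invented
stand-in for whatever suppression indication the markup carries:

```python
# Sketch: on a focusless device (e.g., a 12-key keypad), a key press is
# delivered to every user interface, local and remote, unless one of
# the markup documents has asked for the input to be suppressed from
# the others. The "suppresses" flag below is an invented stand-in.

class UserInterface:
    def __init__(self, name):
        self.name = name
        self.suppresses = False         # set if the markup claims the input
        self.received = []

def deliver_key(key, uis):
    """Deliver a key event; return the names of the UIs that received it."""
    exclusive = [ui for ui in uis if ui.suppresses]
    targets = exclusive if exclusive else uis
    for ui in targets:
        ui.received.append(key)
    return [ui.name for ui in targets]

prepaid = UserInterface("prepaid")
recorder = UserInterface("recorder")

both = deliver_key("#", [prepaid, recorder])   # no suppression: both see it
prepaid.suppresses = True
only = deliver_key("1", [prepaid, recorder])   # suppressed: prepaid alone
```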
7.2 Client-Remote UI
When the user interfaces run remotely, the determination of focus can
be much, much harder. There are three architectures supported in this
framework for determining focus. The first is a centralized server
model, the second is a pipe-and-filter model, and the third is a
client model.
7.2.1 Centralized Server
One approach to resolving the feature interaction is to deploy a
centralized server whose goal is to do just that. The user sends a
single copy of their media to this server, and the server is the sole
source of media towards the user. Each application that wishes to
interact with the user does so using a client local user interface.
However, the user interface is not instantiated on the client, it is
instantiated on this central server. The central server is presumed
to know enough about each application so that it can do a good job of
determining how media should be passed to each user interface
requested by each application. This is shown pictorially in Figure 2.
This model has minimal impact on the client, but it only works well
in a controlled environment where the entire set of applications is
known ahead of time.
7.2.2 Pipe-and-Filter
In order to resolve the interaction, each application acts as a B2BUA
and as a media relay. This is shown in Figure 3. Each application
takes its media from the "previous hop", which will be an end-user or
another B2BUA application, and passes some or all of it on to the
"next hop". Each application can pick off any media input it feels is
relevant to its operation, passing the result off to the next hop.
Furthermore, it can inject media in each direction as it so chooses.
Conceptually, each application pipes the media it receives to the
next hop, and can filter it appropriately before sending it on. Thus
the name, pipe-and-filter.
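The pipe-and-filter behavior can be sketched with a chain of
generators. The filter functions below are invented for illustration
and do not correspond to any actual protocol elements:

```python
# Sketch of the pipe-and-filter model: each application takes its input
# from the previous hop, picks off what is relevant to it, and passes
# the rest to the next hop. Here the "media" is a stream of key events.

def prepaid_filter(events):
    """Consume the long-pound (hangup) signal; forward everything else."""
    for ev in events:
        if ev != "long-#":
            yield ev                    # not ours: pipe it to the next hop

def recorder_filter(events):
    """Observe (record) every event, forwarding all of them unchanged."""
    recording = []
    for ev in events:
        recording.append(ev)
        yield ev

# Caller -> prepaid app -> recording app -> callee:
delivered = list(recorder_filter(prepaid_filter(["1", "2", "long-#"])))
# delivered == ["1", "2"]: the hangup signal never reaches the callee.
```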
The pipe-and-filter model describes the resolution of focus as
provided in the existing circuit-switched telephony network.
Of course, it is not strictly necessary for the application to always
be a focal point for media. The application can allow the media to
pass directly between participants when the application has no media
to present to the user. When the application does have media to
present to the user, it can execute a re-INVITE to move the media
streams to a central point of control.
An example of this is shown in Figure 4. In this example, there are
two applications - a prepaid calling card application and a call
recording application. The user makes a call to the prepaid number
(1). The prepaid application acts as a UAS, answering the INVITE (2-
3). It prompts the user to enter their calling card, PIN, and
destination number (4). Once the user has done that, the prepaid
application makes a call towards the destination number (5). This
passes through the recording application, which acts as a B2BUA with
media (i.e., it will also be a media intermediary), and forwards the
INVITE to the called party (6). The called party answers (7), and the
200 OKs and ACKs are propagated normally (8-10). At this point, both
the prepaid application and the call recording application are B2BUAs,
so that the media flows between the caller and the prepaid app (11),
then to the call recording app (12), and then to the called party
(13).
However, once the call is established, the prepaid calling card
application does not really wish to remain on the media path. All it
wants is to wait for the long pound which the caller uses to signal
the end of the call. To do that, it uses a re-INVITE (14) to both
remove itself from the media path, and to instantiate a client-local
user interface, using KPML, into the calling UA. That INVITE contains
no SDP, as it uses flow I from the third party call control
specification [10]. The 200 OK from the caller contains its SDP (15),
which is passed from the prepaid application to the call recording
application (16). Since the call recording application is a B2BUA, it
modifies the SDP to keep itself on the media path, passing that SDP
to the called party (17). The called party answers with its updated
SDP (18), which is passed to the call recording application, modified
by it, and passed to the prepaid application (19). The prepaid
application passes this SDP to the caller in an ACK (22), and then
generates an ACK back towards the call recording application (20-21).
Now, media flows from the caller to the call recording application
(23), and from there, towards the called party (24).
At some point later, the caller presses the long pound. This is
passed to the KPML document, which has a single rule waiting for that
sequence. The result is passed to the prepaid calling card
application (25). The calling card application now knows that it
needs to terminate the call with the called party. So, it sends a BYE
(27), which is propagated normally (28-30). Now, the prepaid
application needs to prompt the user for the next number. To do that,
it needs to re-establish a media connection to it, in order to
execute its client-remote user interface. To do that, it uses a re-
INVITE (31-33), connecting the application to the caller (34).
7.2.2.1 Client Resolution
Having the client resolve the interaction represents a fundamentally
different way of thinking about intermediary applications.
Instead of having intermediary applications be a B2BUA just to insert
themselves into the media stream, they are implemented as a UA (i.e.,
not back-to-back). Each application is a separate UA, and as such,
will create and maintain a separate dialog with the user that it
wishes to interact with. How does the user handle this multiplicity
of dialogs? Simply put, it acts like a focus. A focus, as defined in
the SIP conferencing framework [3], is a SIP element that terminates
multiple SIP dialogs, each of which represents a participant in the
conference. Effectively, the conferencing framework itself provides
+-+ +-+
|A| |A|
|p| |p|
|p| |p|
|1| |2|
| | | |
|U| |U|
|I| |I|
+-+ +-+
+---------+ +------+ +------+
| | | | | |
| Central |........>| App1 |..........>| App2 |
| Server | | | | |
| |+++ +------+ +------+
+---------+** ++++ .
^ + * **** ++++ .
. + * *** +++++ .
. + * **** ++++ .
. + * *** ++++ .
. + * **** ++++ .
. + * *** +++ V
+---+--+ **** +------+
| | ** | |
|Client| |Callee|
| | | |
+------+ +------+
+++++++ RTP Path
******* SIP Dialog
....... SIP INVITE Path
Figure 2: Centralized Server Resolution
+--------+ +--------+
| |+++++++++ | |
| App1 |********* | App2 |
| |........> | |
+--------+ +--------+
^ * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
. * + . * +
* + V * +
+--------+ +--------+
| | | |
| Caller | | Callee |
| | | |
+--------+ +--------+
+++++++ RTP Path
******* SIP Dialog
....... SIP INVITE Path
Figure 3: Pipe-and-Filter Model
the foundation upon which client resolution of multiple applications
will take place.
Each application has particular requirements on how it would like its
media stream treated in relation to the other media streams that the
focus may be managing. As an example, a prepaid calling card
application will generate media towards the client, in order to
inform them that they are running out of time in the call. The
Caller Prepaid App Recorder App Callee
|(1) INVITE | | |
|--------------->| | |
|(2) 200 OK | | |
|<---------------| | |
|(3) ACK | | |
|--------------->| | |
|(4) RTP | | |
|collect PIN | | |
|and number | | |
|................| | |
| |(5) INVITE | |
| |--------------->| |
| | |(6) INVITE |
| | |--------------->|
| | |(7) 200 OK |
| | |<---------------|
| | |(8) ACK |
| | |--------------->|
| |(9) 200 OK | |
| |<---------------| |
| |(10) ACK | |
| |--------------->| |
|(11) RTP | | |
|................| | |
| |(12) RTP | |
| |................| |
| | |(13) RTP |
| | |................|
|(14) INVITE | | |
|no SDP | | |
|KPML | | |
|<---------------| | |
|(15) 200 OK | | |
|SDP1 | | |
|--------------->| | |
| |(16) INVITE | |
| |SDP1 | |
| |--------------->| |
| | |(17) INVITE |
| | |SDP2 |
| | |--------------->|
| | |(18) 200 OK |
| | |SDP3 |
| | |<---------------|
| |(19) 200 OK | |
| |SDP4 | |
| |<---------------| |
| |(20) ACK | |
| |--------------->| |
| | |(21) ACK |
| | |--------------->|
|(22) ACK | | |
|SDP4 | | |
|<---------------| | |
|(23) RTP | | |
|.................................| |
| | |(24) RTP |
| | |................|
|Hit # | | |
|(25) HTTP POST | | |
|--------------->| | |
|(26) 200 OK | | |
|<---------------| | |
| |(27) BYE | |
| |--------------->| |
| | |(28) BYE |
| | |--------------->|
| | |(29) 200 OK |
| | |<---------------|
| |(30) 200 OK | |
| |<---------------| |
|(31) INVITE | | |
|<---------------| | |
|(32) 200 OK | | |
|--------------->| | |
|(33) ACK | | |
|<---------------| | |
|(34) RTP | | |
|................| | |
Figure 4: Pre-Paid Application with Pipe-and-Filter
application would like this announcement to be spoken more loudly
than the media from the other participants in the call (which is
usually just the other party in the call, but could include other
applications too!). Furthermore, the prepaid calling card application
would like to receive media from just the calling user, not from any
other applications or from the other participant in the call. To
implement this, the application uses the media policy control
protocol [3]. This protocol allows a participant in a conference to
inform the focus about its desired policies for media handling. Each
application would act as a client of this protocol, passing its
request to the media policy server, which actually runs on the end
user device.
The media policy server in the end user device would reconcile the
various requests, and generate the appropriate media streams towards
each application, and towards the other user in the call. Indeed, the
media policy server can reconcile the requests in any way it likes,
so long as it has sufficient information about what each application
wants to do. When the user device has a powerful user interface, the
user themselves can be asked to select which application their media
is targeted to. Effectively, the client determines the application
focus, just as in the client-local user interface case (Section 7.1).
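The reconciliation performed by the media policy server might be
sketched as follows. The policy vocabulary ("just me", "copy") is an
invented simplification of whatever the media policy control protocol
actually carries:

```python
# Sketch: the media policy server on the end device reconciles the
# requests made by each application. A "just me" request makes that
# participant the sole source and sink of media toward the user; the
# policy names here are an invented simplification.

def reconcile(requests):
    """requests maps participant -> policy; return the set of
    participants whose media is mixed toward the user right now."""
    exclusive = [p for p, policy in requests.items() if policy == "just me"]
    if exclusive:
        return {exclusive[-1]}          # the latest exclusive claim wins
    return set(requests)                # otherwise, everyone is mixed

# While collecting the PIN, the prepaid application claims exclusivity:
during_pin = reconcile({"prepaid": "just me", "callee": None})
# during_pin == {"prepaid"}

# During the call, nobody is exclusive; the recorder just gets a copy:
during_call = reconcile({"prepaid": None, "recorder": "copy",
                         "callee": None})
# during_call == {"prepaid", "recorder", "callee"}
```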
Figure 5 depicts this basic model pictorially. The calling device
makes an initial INVITE to setup a basic call with the called party.
This INVITE passes through two proxies, both of which kick off
applications (app1 and app2) as the request is proxied towards the
called party. The result is a single dialog setup between the caller
and called party (dialog C). However, the INVITE from the caller
indicated that the device is capable of acting as a focus. How did it
do that? It did so by indicating support for the SIP Join extension
[11] which allows a UA to request to be conferenced into an existing
dialog. As such, both app1 and app2, acting as a pure UAC, generate
an INVITE towards this focus, with a Join header requesting to be
added to a conference which includes the original dialog. The result
is two additional dialogs, dialog A and dialog B respectively, which
join the original dialog in their connection to a focus co-resident
with the caller. Both app1 and app2 use the media policy control
protocol to interact with the media policy server co-resident with
the user device (interaction not shown). This would require the
caller to have indicated that it supports a media policy control
server.
REQ 11: There must be a way for a UA to indicate that it
supports a media policy server function.
In this model, there may be a media stream from the called party,
app1, and app2, towards the mixer present in the calling UA. This
"may" is important. In many cases, each application is not really
actively generating media towards the user. It may only need to
sporadically interact with the user, and during those times, the
desired effect is for media from other applications, and the peer
user, to be suppressed. Therefore, a client can support this model of
resolution without ever needing to actually mix any media!
Interestingly, this model for resolving the interaction problem does
not introduce any new requirements into SIP. The existing
conferencing framework and its associated requirements provide all
the tools that are needed. For example, the framework will allow an
application to initiate a new dialog towards the endpoint focus,
allowing it to join the call without "ringing" the phone again.
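As a sketch, an application joining the endpoint focus would construct
an INVITE along these lines. The URIs, tags, and dialog identifiers
below are hypothetical and the request is abbreviated (it is not a
complete, valid SIP message), but the Join header follows the general
form defined by the SIP Join extension [11]:

```python
# Sketch: an application joins the conference hosted by the endpoint
# focus by sending a new INVITE whose Join header carries the target
# dialog's identifiers (Call-ID, to-tag, from-tag). All identifiers
# here are hypothetical, and the message is abbreviated.

def join_header(call_id, to_tag, from_tag):
    return f"Join: {call_id};to-tag={to_tag};from-tag={from_tag}"

invite = "\r\n".join([
    "INVITE sip:caller@example.com SIP/2.0",
    "From: <sip:prepaid-app@example.net>;tag=8a7f",
    "To: <sip:caller@example.com>",
    "Call-ID: new-dialog-1@example.net",
    join_header("orig-dialog@example.com", "314159", "271828"),
    "Content-Length: 0",
    "",
    "",
])
```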
Figure 6 shows a call flow for the example scenario of Section 7.2.2,
but using the client resolution architecture. The caller sends out an
initial INVITE to the prepaid application (1). This INVITE contains a
Supported header indicating the ability to receive INVITE requests
with Join headers. It also indicates that the UA supports a media
policy control server. This arrives at the pre-paid application. The
pre-paid application generates a 183 to the initial INVITE (2). Then,
it sends a brand new INVITE request (i.e., not a re-INVITE, and not
with the same dialog identifiers as the original INVITE) towards the
caller (3). This INVITE has a Join header containing the dialog
identifiers from the 183. This is received by the caller. The caller
mutates into a focus [3], and generates a 200 OK to the INVITE (4).
The Contact header field in this 200 OK contains the conference URI.
Effectively, the caller is now hosting a conference that has two
dialogs - one towards the prepaid application, and the other, an
early dialog. The prepaid application uses the media policy control
protocol, and informs the caller that it wishes to be the sole source
and sink of media (6). This media policy request could be presented
to the user, informing them that the prepaid calling card application
is now in focus. The application prompts the user for their calling
card number, their PIN, and the destination number. Once collected,
the prepaid calling card application acts as a B2BUA on the original
INVITE request, and forwards it to the call recording application
(8). Note that the prepaid application is a B2BUA on this dialog
because it needs to hang up the call. It does not act as a B2BUA with
media on this dialog; that is, it does not touch the SDP.
The forwarded INVITE is received by the call recording application.
At this point, it just proxies the request towards the called party
(9). It is not a B2BUA on this dialog, although it does record-route.
The called party receives the INVITE, and answers with a 200 OK (10).
This is propagated to the call recording application, which carefully
+------+ +------+
| | 2 | |
> | App1 | .............>| App2 |
. | | | | .
. +------+ +------+ .
. * ** .
. ** *** .
. * **** .
. *A *** .
1. ** *** .
. * ***B .
. ** *** .3
. * **** .
. * *** .
. ** *** .
+----*----**---------------+ .
| +----------+ | .
| | Endpoint | **** | .
| | Focus | ******* | .
| +----------+ ******* .
| * +-----+ +--------+| ******* V
| * |mixer| | Media || C******* +--------+
| * +-----+ | Policy || ****| |
| +------+ | Server || |+------+|
| | User | +--------+| || User ||
| +------+ | |+------+|
+--------------------------+ +--------+
Calling Device Called Device
........ Path of initial SIP INVITE
******** SIP Dialog
Figure 5: Architecture for Client Resolution
Caller Prepaid App Recorder App Callee
|(1) INVITE | | |
|--------------->| | |
|(2) 183 | | |
|<---------------| | |
|(3) INVITE | | |
|Join | | |
|<---------------| | |
|(4) 200 OK | | |
|--------------->| | |
|(5) ACK | | |
|<---------------| | |
|(6) MS-CTRL | | |
|just me | | |
|<---------------| | |
|(7) RTP | | |
|collect PIN | | |
|and number | | |
|................| | |
| |(8) INVITE | |
| |--------------->| |
| | |(9) INVITE |
| | |--------------->|
| | |(10) 200 OK |
| | |<---------------|
| |(11) 200 OK | |
| |<---------------| |
|(12) 200 OK | | |
|<---------------| | |
|(13) ACK | | |
|--------------->| | |
| |(14) ACK | |
| |--------------->| |
| | |(15) ACK |
| | |--------------->|
|(16) BYE | | |
|<---------------| | |
|(17) 200 OK | | |
|--------------->| | |
|(18) INVITE | | |
|Join,no media | | |
|KPML | | |
|<---------------| | |
|(19) 200 OK | | |
|--------------->| | |
|(20) ACK | | |
|<---------------| | |
|(21) INVITE | | |
|Join | | |
|<--------------------------------| |
|(22) 200 OK | | |
|-------------------------------->| |
|(23) ACK | | |
|<--------------------------------| |
|(24) MS-CTRL | | |
|fork to me | | |
|<--------------------------------| |
|Hits # | | |
|(25) HTTP POST | | |
|--------------->| | |
|(26) 200 OK | | |
|<---------------| | |
| |(27) BYE | |
| |--------------->| |
| | |(28) BYE |
| | |--------------->|
| | |(29) 200 OK |
| | |<---------------|
| |(30) 200 OK | |
| |<---------------| |
|(31) BYE | | |
|<--------------------------------| |
|(32) 200 OK | | |
|-------------------------------->| |
|(33) INVITE | | |
|enable | | |
|media | | |
|<---------------| | |
|(34) 200 OK | | |
|--------------->| | |
|(35) ACK | | |
|<---------------| | |
|(36) MS-CTRL | | |
|just me | | |
|<---------------| | |
|(37) RTP | | |
|................| | |
Figure 6: Prepaid Application with Client Resolution
notes the dialog identifier. This 200 OK is passed to the prepaid
application (11), which also notes the dialog identifier. The 200 OK
is passed towards the caller (12). The ACK is propagated back towards
the called party normally (13-15). The 200 OK will have the effect of
terminating the early dialog that was established by the pre-paid
calling card application. This leaves the caller with a hosted
conference with itself, and the pre-paid application as members,
along with a new dialog (outside of the conference) created from the
200 OK.
Knowing this is the case, the prepaid calling card application
terminates its previous dialog with the caller (16-17). This dialog
is not useful any more, since it is not joined with the dialog which
was actually created for the call. However, the prepaid calling card
application would like to be involved in the successful dialog. For
now, it doesn't need media, but it wishes to install a client-local
user interface, in KPML, to watch for the long pound. So, it sends an
INVITE with no media, with a Join header containing the dialog
identifier for the established call. The INVITE also contains a KPML
document (18). This INVITE completes successfully (19-20).
Now, the call recording application needs to receive a copy of the
media stream, in order to record it. To do that, it also generates an
INVITE towards the caller (21), with a Join header containing the
dialog identifiers from message 10. The INVITE indicates a receive
only media stream. This dialog completes successfully (22-23). Now,
the caller is hosting a conference which contains itself, the prepaid
calling card application (which is neither sending nor receiving
media),
the recording application (which is receiving media), and the called
party (which is sending and receiving media). The call recording
application instructs the media policy server in the UA (24) that it
would like to receive a copy of the media, including that received
from the called party. Note that there is no need for endpoint mixing
to support this conference.
The caller has their conversation. Eventually, they hit the long
pound to hang up. This results in an HTTP POST to the prepaid
application, based on the rules in the KPML (25). The prepaid calling
card application sends a BYE towards the recording application (27).
The recording application proxies it (28), and it completes normally
(29-30). Now, recall that the call recording application was actually
a combination of a proxy (for the original dialog), and a pure UA (to
record the media stream). Now that the call is over, it terminates
its dialog with the caller (31-32), and it is now out of the loop.
The prepaid calling card would now like to communicate with the
caller. It already has a dialog active with it. So, it merely
generates a re-INVITE on that dialog (33), adding media streams. This
dialog completes successfully (34-35). Now, the pre-paid application
uses the media policy control protocol to tell the caller that it is
the only party that should be sending or receiving a media stream
(36). The prepaid application can then prompt for the next number.
7.2.3 Comparison
There are important differences between the three models, and each
has pros and cons. We generally compare only the client and
pipe-and-filter models; the centralized server model is not generally
applicable, since it assumes centralized coordination of
applications.
The model in Section 7.2.2.1 has many benefits. First, it has
excellent security properties. Because each application has a direct
dialog with the user, and that dialog manages media streams directly
between the user and each application, the existing SIP security
tools can be directly used. S/MIME and potentially TLS (if there are
no intervening proxies between each application and the user device)
can provide for authentication services. The client device can know
the complete set of applications it is interacting with, since each
one can authenticate directly with the UA (and vice-versa). In the
model of Section 7.2.2, there is a single dialog between the user and
their "first" application. Therefore, the user cannot directly
authenticate each application, and vice-versa.
Similarly, each media stream can be properly secured using SRTP [12].
Because each application is a UA, and not a B2BUA, SRTP key exchanges
(using MIKEY, for example [13]) are done directly with the
application to which the media is being sent. In the model of Section
7.2.2, the applications are the terminating point of the signaling,
but may not even touch the media stream (once again, consider the
pre-paid calling card application). Such a configuration might
preclude the use of SRTP, since the intermediary application would
appear as a man-in-the-middle attacker!
B2BUAs also have well understood interactions with end-to-end
encryption. If the caller should encrypt their SDP, B2BUA
applications will not be able to manipulate it, and so the model of
Section 7.2.2 will simply fail. However, the endpoint-based model of
Section 7.2.2.1 still works in the presence of end-to-end encryption
of SDP. This is, of course, because there are no B2BUAs.
That leads to another benefit - feature transparency. B2BUAs can
interfere with the operation of features when messages are propagated
through them. This problem is completely eliminated in the
client-based architecture of Section 7.2.2.1.
There is another interesting benefit of the client-based architecture
- firewall traversal. In the application-based architecture of
Section 7.2.2, many applications will not need to always be on the
media path. The applications will use re-INVITEs to move the media
streams to themselves when needed, and then move them back when done.
The result of this, as far as the user is concerned, is that a single
media stream will, at times, appear to be coming from different
source IP addresses. This means that a SIP-enabled firewall (or one
controlled by MIDCOM [14]) will need to open a "cone" for the media
stream - allowing it to go to the user, but come from any source
address. Such cones are more insecure, and less desirable, than a
pinhole. With the client-based architecture of Section 7.2.2.1, a
SIP-enabled firewall can open a cone initially, and when the media
arrives from the application, close the cone to a pinhole by
restricting media packets to always have the same source IP address
from then on. This restriction is possible because media on a
particular dialog comes from a single source - the application or the
user, depending on which dialog. The source of the media does not
change within a single dialog, as it does in the model of Section
7.2.2.
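The cone-to-pinhole narrowing described above can be sketched as
follows; this is an illustrative model, not a real firewall
interface:

```python
# Sketch: a SIP-aware firewall first opens a "cone" (media accepted
# from any source), then narrows it to a pinhole locked to the first
# source address observed for the dialog's media stream.

class MediaPinhole:
    def __init__(self, local_port):
        self.local_port = local_port
        self.locked_source = None       # None means still an open cone

    def allow(self, src_ip):
        """Return True if a packet from src_ip may pass."""
        if self.locked_source is None:
            self.locked_source = src_ip # first packet narrows cone->pinhole
            return True
        return src_ip == self.locked_source

hole = MediaPinhole(local_port=49170)
assert hole.allow("192.0.2.10")         # first source locks the pinhole
assert hole.allow("192.0.2.10")         # same source is still allowed
assert not hole.allow("198.51.100.7")   # any other source is now dropped
```

The narrowing is only safe because, in the client-based architecture,
the media source never changes within a single dialog.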
TODO: A picture and some more words are needed here to
explain this.
Conceptually, the client-based architecture allows for a unified view
of applications. A SIP application that desires to instantiate a
remote client user interface is always a normal user agent, whether
it be a "terminating" type of application, or "intermediary" type of
application. These two cases therefore become merged into one.
Furthermore, the inter-application feature interaction between client
local user interfaces and client remote user interfaces becomes
unified - both become local focus determination problems.
Furthermore, much of the interactions between application components
(discussed in Section 8) are simplified because of the simple
correlation of a dialog to a single application.
Unfortunately, the benefits of the client-based architecture come at
a cost of complexity. End devices need to support a focus capability,
a media policy server function, and possibly a media mixer, although
the latter can probably be avoided. The model also requires the
client to construct a globally routable URI to represent its focus,
something which is not trivial in an IP network laden with NATs and
firewalls.
8. Intra Application Feature Interaction
An application can instantiate a multiplicity of user interface
components. For example, a single application can instantiate two
separate HTML components and one WML component. Furthermore, an
application can instantiate both client local and client remote user
interfaces.

The feature interaction issues between these components within the
same application are less severe. If an application has multiple
client user interface components, their interaction is resolved
identically to the inter-application case - through focus
determination. However, the problems in focusless user interfaces
(such as a keypad) generally won't exist, since the application can
generate user interfaces which do not overlap in their usage of an
input.
The real issue is that the optimal user experience frequently
requires some kind of coupling between the differing user interface
components. This is a classic problem in multi-modal user interfaces,
such as those described by Speech Application Language Tags (SALT).
As an example, consider a user interface where a user can either
press a labeled button to make a selection, or listen to a prompt,
and speak the desired selection. Ideally, when the user presses the
button, the prompt should cease immediately, since both of them were
targeted at collecting the same information in parallel. Such
interactions are best handled by markups which natively support such
interactions, such as SALT, and thus require no explicit support from
this framework.
There is, however, a very common interaction in voice-based
applications which merits support from this framework. Many
interactive voice response (IVR) systems allow a user to
"interrupt" a prompt by generating a response before the prompt
finishes. The ideal user experience is achieved by having the prompt
cease immediately when the user speaks the input. This is known as
barge-in.
In a traditional implementation of an IVR system, there would be a
client-remote user interface, rendered in VoiceXML. VoiceXML has
native support for barge-in. However, because the VoiceXML script is
interpreted remotely, there is a fundamental latency between the
client and the remote user interface. That is, when the user speaks
or presses a key, the speech or key must be transmitted to the
platform and interpreted, and then the VoiceXML server ceases playing
out media. For this to be observed by the client, the last media
packet must still travel from the VoiceXML server to the client,
through its playout buffers, and out the speaker system.
This framework allows for better performance. A VoiceXML user
interface can actually delegate a component of the user interface to
be interpreted on the client. Specifically, the collection of the
keypad input from the user can be delegated to the client by placing
a KPML-based user interface on the client solely for this purpose.
KPML has a barge-in feature as well. When the barge-in option is
selected, and user input matches a regular expression, all incoming
media streams associated with the application are muted, and the
playout buffers on the client are flushed. This situation persists
until the beginning of the next talkspurt, framed by the marker bit
in the RTP stream.
OPEN ISSUE: Is the marker bit the right way to do this?
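The mute-and-flush behavior above can be sketched as follows. This is
an illustrative sketch only; BargeInGate is an invented name, and a
real UA would also flush its jitter and playout buffers on barge-in.
Per RFC 1889, the marker (M) bit is the most significant bit of the
second octet of the RTP header:

```python
# Illustrative sketch: gate incoming RTP after a KPML barge-in,
# resuming playout at the next talkspurt, which the sender frames by
# setting the RTP marker (M) bit -- the high bit of header octet 1.
def rtp_marker_set(packet: bytes) -> bool:
    """True if the RTP marker bit is set (12 bytes = minimal header)."""
    return len(packet) >= 12 and bool(packet[1] & 0x80)

class BargeInGate:
    def __init__(self):
        self.muted = False

    def barge_in(self):
        # KPML input matched with barge-in enabled: mute all media
        # streams on the dialog (and, in a real UA, flush buffers).
        self.muted = True

    def accept(self, packet: bytes) -> bool:
        """Return True if this RTP packet should be played out."""
        if self.muted and rtp_marker_set(packet):
            self.muted = False   # new talkspurt begins; unmute
        return not self.muted
```

Note that the packet carrying the marker bit is itself played out,
since it is the first packet of the new talkspurt.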
In this framework, a client local user interface is bound to a
dialog. A media stream is said to be associated with that user
interface component if the media stream is managed on the same dialog
the user interface component is bound to. As a result, if a KPML
script results in a barge-in, all media streams on that dialog are
muted until their marker bits flip.
A similar delegation can occur by instantiating a VoiceXML-based user
interface on the client. That would allow barge-in to operate for
speech-driven IVR, in addition to keypad-driven IVR.
This capability can allow VoIP-based IVR applications to operate with
zero-latency barge-in, better than today's circuit-switched IVR
applications. This is shown in Figure 7, which demonstrates a call
flow for this example. The caller sends an INVITE to a VoiceXML
server (1). The VoiceXML server fetches the script to execute (2).
The script, returned in (3), indicates that a prompt should be
played, and that, if the user presses pound, barge-in should occur.
So, the VoiceXML server generates a KPML script that looks for pound,
and sets the barge flag to true. This is returned in the 200 OK (4).
The user is played the prompt, and presses pound in the middle. The
KPML interpreter notes this, and the UA ceases playout of the prompt
immediately. At the same time, the client generates a POST to the
VoiceXML server (7). The VoiceXML server now knows that pound has
been pressed. So, it fetches the next VoiceXML script (8), and
extracts from it the next KPML script, passed in the 200 OK response
to the POST from the client (10).
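The client's side of this delegation - matching delegated keypad
input, ceasing playout locally, and reporting the match upstream -
can be sketched as follows. This is a hypothetical sketch; names such
as KpmlScript, stop_playout, and http_post are illustrative and are
not taken from the KPML or VoiceXML specifications:

```python
# Hypothetical sketch of the client-side delegation loop; all names
# here are invented for illustration.
import re

class KpmlScript:
    """Stand-in for a delegated KPML script: a digit pattern plus a
    barge-in flag, as returned by the VoiceXML server in a 200 OK."""
    def __init__(self, regex: str, barge: bool):
        self.pattern = re.compile(regex)
        self.barge = barge

def on_digits(script: KpmlScript, digits: str, ua) -> bool:
    """Handle collected keypad input; returns True on a match."""
    if not script.pattern.fullmatch(digits):
        return False
    if script.barge:
        ua.stop_playout()   # cease the prompt locally, with no
                            # client-server round trip (zero latency)
    ua.http_post(digits)    # step (7): report the match; the reply
                            # carries the next KPML script (step (10))
    return True
```

The key design point is that the latency-sensitive action (stopping
the prompt) happens entirely on the client, while the round trip to
the server only fetches the next script.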
9. Examples

TODO.
10. Security Considerations
There are many security considerations associated with this
framework. It allows applications in the network to instantiate user
interface components on a client device. Such instantiations need to
be from authenticated applications, and also need to be authorized to
place a UI into the client. Indeed, the stronger requirement is
authorization. It is not so important to know the name of the
provider of the application, but rather, that the provider is
authorized to instantiate components.

Generally, an application should be considered authorized if it was
legitimately part of the call setup path. With this definition,
authorization can be enforced using the sips URI scheme when the call
is initiated.
11. Contributors
This document was produced as a result of discussions amongst the
application interaction design team. All members of this team
contributed significantly to the ideas embodied in this document. The
members of this team were:

   Eric Burger
   Cullen Jennings
   Robert Fairlie-Cuninghame
   Caller           VXML Server        Web Server
     |                   |                  |
     |(1) SIP INVITE     |                  |
     |------------------>|                  |
     |                   |(2) HTTP GET      |
     |                   |----------------->|
     |                   |(3) HTTP 200 OK   |
     |                   |VXML              |
     |                   |<-----------------|
     |(4) SIP 200 OK     |                  |
     |KPML               |                  |
     |<------------------|                  |
     |(5) SIP ACK        |                  |
     |------------------>|                  |
     |(6) RTP            |                  |
     |...................|                  |
     |                   |                  |
     |press #            |                  |
     |                   |                  |
     |playout ends       |                  |
     |                   |                  |
     |(7) HTTP POST      |                  |
     |------------------>|                  |
     |                   |(8) HTTP POST     |
     |                   |----------------->|
     |                   |(9) 200 OK        |
     |                   |VXML              |
     |                   |<-----------------|
     |(10) 200 OK        |                  |
     |KPML               |                  |
     |<------------------|                  |

               Figure 7: Zero-Latency Barge In

Informative References

   [1]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
        Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP:
        Session Initiation Protocol", RFC 3261, June 2002.

   [2]  McGlashan, S., Lucas, B., Porter, B., Rehor, K., Burnett, D.,
        Carter, J., Ferrans, J. and A. Hunt, "Voice Extensible Markup
        Language (VoiceXML) Version 2.0", W3C CR
        CR-voicexml20-20030220, February 2003.

   [3]  Day, M., Rosenberg, J. and H. Sugano, "A Model for Presence
        and Instant Messaging", RFC 2778, February 2000.

   [4]  Rosenberg, J., "A Framework for Conferencing with the Session
        Initiation Protocol",
        draft-ietf-sipping-conferencing-framework-00 (work in
        progress), May 2003.

   [5]  Rosenberg, J., Schulzrinne, H. and P. Kyzivat, "Caller
        Preferences and Callee Capabilities for the Session Initiation
        Protocol (SIP)", draft-ietf-sip-callerprefs-08 (work in
        progress), March 2003.

   [6]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications", RFC
        1889, January 1996.

   [7]  Schulzrinne, H. and S. Petrack, "RTP Payload for DTMF Digits,
        Telephony Tones and Telephony Signals", RFC 2833, May 2000.

   [8]  Burger, E., "Keypad Markup Language (KPML)",
        draft-burger-sipping-kpml-02 (work in progress), July 2003.

   [9]  Dyke, J., Burger, E. and A. Spitzer, "Media Server Control
        Markup Language (MSCML) and Protocol", draft-vandyke-mscml-02
        (work in progress), July 2003.

Author's Address

   Jonathan Rosenberg
   dynamicsoft
   600 Lanidex Plaza
   Parsippany, NJ 07054
   US

   Phone: +1 973 952-5000
   EMail: jdrosen@dynamicsoft.com
   URI:   http://www.jdrosen.net

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances
   of licenses to be made available, or the result of an attempt made
   to obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification
   can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.