SIPPING                                                     J. Rosenberg
Internet-Draft                                               dynamicsoft
Expires: December 29, 2003                                 June 30, 2003

    A Framework and Requirements for Application Interaction in the
                  Session Initiation Protocol (SIP)
          draft-rosenberg-sipping-app-interaction-framework-01

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 29, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This document describes a framework and requirements for the
   interaction between users and Session Initiation Protocol (SIP)
   based applications.  By interacting with applications, users can
   guide the way in which they operate.  The focus of this framework is
   stimulus signaling, which allows a user agent to interact with an
   application without knowledge of the semantics of that application.
   Stimulus signaling can occur to a user interface running locally
   with the client, or to a remote user interface, through media
   streams.  Stimulus signaling encompasses a wide range of mechanisms,
   ranging from clicking on hyperlinks, to pressing buttons, to
   traditional Dual Tone Multi Frequency (DTMF) input.  In all cases,
   stimulus signaling is supported through the use of markup languages,
   which play a key role in this framework.

Table of Contents

   1.     Introduction . . . . . . . . . . . . . . . . . . . . . . .  3
   2.     Definitions  . . . . . . . . . . . . . . . . . . . . . . .  4
   3.     A Model for Application Interaction  . . . . . . . . . . .  7
   3.1    Function vs. Stimulus  . . . . . . . . . . . . . . . . . .  8
   3.2    Real-Time vs. Non-Real Time  . . . . . . . . . . . . . . .  9
   3.3    Client-Local vs. Client-Remote . . . . . . . . . . . . . .  9
   3.4    Interaction Scenarios on Telephones  . . . . . . . . . . . 10
   3.4.1  Client Remote  . . . . . . . . . . . . . . . . . . . . . . 11
   3.4.2  Client Local . . . . . . . . . . . . . . . . . . . . . . . 11
   3.4.3  Flip-Flop  . . . . . . . . . . . . . . . . . . . . . . . . 11
   4.     Framework Overview . . . . . . . . . . . . . . . . . . . . 13
   5.     Client Local Interfaces  . . . . . . . . . . . . . . . . . 15
   5.1    Discovering Capabilities . . . . . . . . . . . . . . . . . 15
   5.2    Pushing an Initial Interface Component . . . . . . . . . . 15
   5.3    Updating an Interface Component  . . . . . . . . . . . . . 17
   5.4    Terminating an Interface Component . . . . . . . . . . . . 18
   6.     Client Remote Interfaces . . . . . . . . . . . . . . . . . 19
   6.1    Originating and Terminating Applications . . . . . . . . . 19
   6.2    Intermediary Applications  . . . . . . . . . . . . . . . . 19
   7.     Inter-Application Feature Interaction  . . . . . . . . . . 21
   7.1    Client Local UI  . . . . . . . . . . . . . . . . . . . . . 21
   7.2    Client-Remote UI . . . . . . . . . . . . . . . . . . . . . 22
   8.     Intra Application Feature Interaction  . . . . . . . . . . 23
   9.     Examples . . . . . . . . . . . . . . . . . . . . . . . . . 24
   10.    Security Considerations  . . . . . . . . . . . . . . . . . 25
   11.    Contributors . . . . . . . . . . . . . . . . . . . . . . . 26
          Informative References . . . . . . . . . . . . . . . . . . 27
          Author's Address . . . . . . . . . . . . . . . . . . . . . 28
          Intellectual Property and Copyright Statements . . . . . . 29

1. Introduction

   The Session Initiation Protocol (SIP) [1] provides the ability for
   users to initiate, manage, and terminate communications sessions.
   Frequently, these sessions will involve a SIP application.  A SIP
   application is defined as a program running on a SIP-based element
   (such as a proxy or user agent) that provides some value-added
   function to a user or system administrator.  Examples of SIP
   applications include pre-paid calling card calls, conferencing, and
   presence-based [3] call routing.

   In order for most applications to properly function, they need
   input from the user to guide their operation.
   As an example, a pre-paid calling card application requires the user
   to input their calling card number, their PIN code, and the
   destination number they wish to reach.  The process by which a user
   provides input to an application is called "application
   interaction".

   Application interaction can be either functional or stimulus.
   Functional interaction requires the user agent to understand the
   semantics of the application, whereas stimulus interaction does not.
   Stimulus signaling allows for applications to be built without
   requiring modifications to the client.  Stimulus interaction is the
   subject of this framework.  The framework provides a model for how
   users interact with applications through user interfaces, and how
   user interfaces and applications can be distributed throughout a
   network.  This model is then used to describe how applications can
   instantiate and manage user interfaces.

2. Definitions

   SIP Application: A SIP application is defined as a program running
      on a SIP-based element (such as a proxy or user agent) that
      provides some value-added function to a user or system
      administrator.  Examples of SIP applications include pre-paid
      calling card calls, conferencing, and presence-based [3] call
      routing.

   Application Interaction: The process by which a user provides input
      to an application.

   Real-Time Application Interaction: Application interaction that
      takes place while an application instance is executing.  For
      example, when a user enters their PIN into a pre-paid calling
      card application, this is real-time application interaction.

   Non-Real Time Application Interaction: Application interaction that
      takes place asynchronously with the execution of the
      application.  Generally, non-real time application interaction
      is accomplished through provisioning.
   Functional Application Interaction: Application interaction is
      functional when the user device has an understanding of the
      semantics of the application that the user is interacting with.

   Stimulus Application Interaction: Application interaction is
      considered to be stimulus when the user device has no
      understanding of the semantics of the application that the user
      is interacting with.

   User Interface (UI): The user interface provides the user with
      context in order to make decisions about what they want.  The
      user enters information into the user interface.  The user
      interface interprets the information, and passes it to the
      application.

   User Interface Component: A piece of user interface which operates
      independently of other pieces of the user interface.  For
      example, a user might have two separate web interfaces to a
      pre-paid calling card application - one for hanging up and
      making another call, and another for entering the username and
      PIN.

   User Device: The software or hardware system that the user directly
      interacts with in order to communicate with the application.  An
      example of a user device is a telephone.  Another example is a
      PC with a web browser.

   User Input: The "raw" information passed from a user to a user
      interface.  Examples of user input include a spoken word or a
      click on a hyperlink.

   Client-Local User Interface: A user interface which is co-resident
      with the user device.

   Client-Remote User Interface: A user interface which executes
      remotely from the user device.  In this case, a standardized
      interface is needed between them.  Typically, this is done
      through media sessions - audio, video, or application sharing.

   Media Interaction: A means of separating a user and a user
      interface by connecting them with media streams.
   Interactive Voice Response (IVR): An IVR is a type of user
      interface that allows users to speak commands to the
      application, and hear responses to those commands prompting for
      more information.

   Prompt-and-Collect: The basic primitive of an IVR user interface.
      The user is presented with a voice option, and the user speaks
      their choice.

   Barge-In: In an IVR user interface, a user is prompted to enter
      some information.  With some prompts, the user may enter the
      requested information before the prompt completes.  In that
      case, the prompt ceases.  The act of entering the information
      before completion of the prompt is referred to as barge-in.

   Focus: A user interface component has focus when user input is fed
      to it, as opposed to any other user interface components.  This
      is not to be confused with the term focus within the SIP
      conferencing framework, which refers to the central user agent
      in a conference [4].

   Focus Determination: The process by which the user device
      determines which user interface component will receive the user
      input.

   Focusless User Interface: A user interface which has no ability to
      perform focus determination.  An example of a focusless user
      interface is a keypad on a telephone.

   Feature Interaction: A class of problems which result when multiple
      applications or application components are trying to provide
      services to a user at the same time.

   Inter-Application Feature Interaction: Feature interactions that
      occur between applications.

   DTMF: Dual-Tone Multi-Frequency.  DTMF refers to a class of tones
      generated by circuit switched telephony devices when the user
      presses a key on the keypad.  As a result, DTMF and keypad input
      are often used synonymously, when in fact one of them (DTMF) is
      merely a means of conveying the other (the keypad input) to a
      client-remote user interface (the switch, for example).
   Application Instance: A single execution path of a SIP application.

   Originating Application: A SIP application which acts as a UAC,
      calling the user.

   Terminating Application: A SIP application which acts as a UAS,
      answering a call generated by a user.  IVR applications are
      terminating applications.

   Intermediary Application: A SIP application which is neither the
      caller nor the callee, but rather, a third party involved in a
      call.

3. A Model for Application Interaction

   +---+          +---+          +---+          +---+
   |   |          |   |          |   |          |   |
   |   |          | U |          | U |          | A |
   |   |  Input   | s |  Input   | s |  Results | p |
   |   | -------> | e | -------> | e | -------> | p |
   | U |          | r |          | r |          | l |
   | s |          |   |          |   |          | i |
   | e |          | D |          | I |          | c |
   | r |  Output  | e |  Output  | f |  Update  | a |
   |   | <------- | v | <------- | a | <....... | t |
   |   |          | i |          | c |          | i |
   |   |          | c |          | e |          | o |
   |   |          | e |          |   |          | n |
   |   |          |   |          |   |          |   |
   +---+          +---+          +---+          +---+

            Figure 1: Model for Real-Time Interactions

   Figure 1 presents a general model for how users interact with
   applications.  Generally, users interact with a user interface
   through a user device.  A user device can be a telephone, or it can
   be a PC with a web browser.  Its role is to pass the user input
   from the user, to the user interface.  The user interface provides
   the user with context in order to make decisions about what they
   want.  The user enters information into the user interface.  The
   user interface interprets the information, and passes it to the
   application.  The application may be able to modify the user
   interface based on this information.  Whether or not this is
   possible depends on the type of user interface.

   User interfaces are fundamentally about rendering and
   interpretation.  Rendering refers to the way in which the user is
   provided context.  This can be through hyperlinks, images, sounds,
   videos, text, and so on.
   Interpretation refers to the way in which the user interface takes
   the "raw" data provided by the user, and returns the result to the
   application in a meaningful format, abstracted from the particulars
   of the user interface.  As an example, consider a pre-paid calling
   card application.  The user interface worries about details such as
   what prompt the user is provided, whether the voice is male or
   female, and so on.  It is concerned with recognizing the speech
   that the user provides, in order to obtain the desired information.
   In this case, the desired information is the calling card number,
   the PIN code, and the destination number.  The application needs
   that data, and it doesn't matter to the application whether it was
   collected using a male prompt or a female one.

   User interfaces generally have real-time requirements towards the
   user.  That is, when a user interacts with the user interface, the
   user interface needs to react quickly, and that change needs to be
   propagated to the user right away.  However, the interface between
   the user interface and the application need not be that fast.
   Faster is better, but the user interface itself can frequently
   compensate for long latencies there.  In the case of a pre-paid
   calling card application, when the user is prompted to enter their
   PIN, the prompt should generally stop immediately once the first
   digit of the PIN is entered.  This is referred to as barge-in.
   After the user interface collects the rest of the PIN, it can tell
   the user to "please wait while processing".  The PIN can then be
   gradually transmitted to the application.  In this example, the
   user interface has compensated for a slow UI-to-application
   interface by asking the user to wait.

   The separation between user interface and application is absolutely
   fundamental to the entire framework provided in this document.  Its
   importance cannot be overstated.
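   The barge-in behavior in the pre-paid card example can be
   illustrated with a small VoiceXML document (a sketch only, assuming
   VoiceXML 2.0 syntax; the submit URI and field name are
   hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="prepaid">
    <!-- bargein="true" lets the first digit cut the prompt off -->
    <field name="pin" type="digits">
      <prompt bargein="true">Please enter your PIN.</prompt>
      <filled>
        <!-- the UI masks a slow UI-to-application link here -->
        <prompt>Please wait while processing.</prompt>
        <submit next="http://app.example.com/verify" namelist="pin"/>
      </filled>
    </field>
  </form>
</vxml>
```

   The VoiceXML interpreter (the user interface) handles the prompting
   and digit collection; the application sees only the collected PIN,
   delivered in the HTTP submission.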
   With this basic model, we can begin to taxonomize the types of
   systems that can be built.

3.1 Function vs. Stimulus

   The first way to taxonomize the system is to consider the interface
   between the UI and the application.  There are two fundamentally
   different models for this interface.  In a functional interface,
   the user interface has detailed knowledge about the application,
   and is, in fact, specific to the application.  The interface
   between the two components is through a functional protocol,
   capable of representing the semantics which can be exposed through
   the user interface.  Because the user interface has knowledge of
   the application, it can be optimally designed for that application.
   As a result, functional user interfaces are almost always the most
   user friendly, the fastest, and the most responsive.  However, in
   order to allow interoperability between user devices and
   applications, the details of the functional protocols need to be
   specified in standards.  This slows down innovation and limits the
   scope of applications that can be built.

   An alternative is a stimulus interface.  In a stimulus interface,
   the user interface is generic, totally ignorant of the details of
   the application.  Indeed, the application may pass instructions to
   the user interface describing how it should operate.  The user
   interface translates user input into "stimulus" - which are data
   understood only by the application, and not by the user interface.
   Because they are generic, and because they require communications
   with the application in order to change the way in which they
   render information to the user, stimulus user interfaces are
   usually slower, less user friendly, and less responsive than a
   functional counterpart.
   However, they allow for substantial innovation in applications,
   since no standardization activity is needed to build a new
   application, as long as it can interact with the user within the
   confines of the user interface mechanism.

   In SIP systems, functional interfaces are provided by extending the
   SIP protocol to provide the needed functionality.  For example, the
   SIP caller preferences specification [5] provides a functional
   interface that allows a user to request applications to route the
   call to specific types of user agents.  Functional interfaces are
   important, but are not the subject of this framework.  The primary
   goal of this framework is to address the role of stimulus
   interfaces to SIP applications.

3.2 Real-Time vs. Non-Real Time

   Application interaction systems can also be real-time or
   non-real-time.  Non-real-time interaction allows the user to enter
   information about application operation asynchronously with its
   invocation.  Frequently, this is done through provisioning systems.
   As an example, a user can set up the forwarding number for a
   call-forward on no-answer application using a web page.  Real-time
   interaction requires the user to interact with the application at
   the time of its invocation.

3.3 Client-Local vs. Client-Remote

   Another axis in the taxonomization is whether the user interface is
   co-resident with the user device (which we refer to as a
   client-local user interface), or the user interface runs in a host
   separated from the client (which we refer to as a client-remote
   user interface).  In a client-remote user interface, there exists
   some kind of protocol between the client device and the UI that
   allows the client to interact with the user interface over a
   network.

   The most important way to separate the UI and the client device is
   through media interaction.
   In media interaction, the interface between the user and the user
   interface is through media - audio, video, messaging, and so on.
   This is the classic mode of operation for VoiceXML [2], where the
   user interface (also referred to as the voice browser) runs on a
   platform in the network.  Users communicate with the voice browser
   through the telephone network (or using a SIP session).  The voice
   browser interacts with the application using HTTP to convey the
   information collected from the user.

   We refer to the second sub-case as a client-local user interface.
   In this case, the user interface runs co-located with the user.
   The interface between them is through the software that interprets
   the user's input and passes it to the user interface.  The classic
   example of this is the web.  In the web, the user interface is a
   web browser, and the interface is defined by the HTML document that
   it's rendering.  The user interacts directly with the user
   interface running in the browser.  The results of that user
   interface are sent to the application (running on the web server)
   using HTTP.

   It is important to note that whether the user interface is local or
   remote (in the case of media interaction) is not a property of the
   modality of the interface, but rather a property of the system.  As
   an example, it is possible for a web-based user interface to be
   provided with a client-remote user interface.  In such a scenario,
   video and application sharing media sessions can be used between
   the user and the user interface.  The user interface, still guided
   by HTML, now runs "in the network", remote from the client.
   Similarly, a VoiceXML document can be interpreted locally by a
   client device, with no media streams at all.  Indeed, the VoiceXML
   document can be rendered using text, rather than media, with no
   impact on the interface between the user interface and the
   application.
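   The web case can be made concrete with a small HTML sketch (the
   form action URI and field names are hypothetical); the browser is
   the client-local user interface, and the HTTP POST is the interface
   to the application:

```html
<!-- Client-local user interface rendered by a web browser.
     The result is reported to the application via HTTP POST. -->
<form method="post" action="http://app.example.com/card">
  Card number: <input type="text" name="card"/><br/>
  PIN: <input type="password" name="pin"/><br/>
  Destination: <input type="text" name="dest"/><br/>
  <input type="submit" value="Place Call"/>
</form>
```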
   It is also important to note that systems can be hybrid.  In a
   hybrid user interface, some aspects of it (usually those associated
   with a particular modality) run locally, and others run remotely.

3.4 Interaction Scenarios on Telephones

   This same model can apply to a telephone.  In a traditional
   telephone, the user interface consists of a 12-key keypad, a
   speaker, and a microphone.  Indeed, from here forward, the term
   "telephone" is used to represent any device that meets, at a
   minimum, the characteristics described in the previous sentence.
   Circuit-switched telephony applications are almost universally
   client-remote user interfaces.  In the Public Switched Telephone
   Network (PSTN), there is usually a circuit interface between the
   user and the user interface.  The user input from the keypad is
   conveyed using Dual-Tone Multi-Frequency (DTMF), and the microphone
   input as PCM encoded voice.

   In an IP-based system, there is more variability in how the system
   can be instantiated.  Both client-remote and client-local user
   interfaces to a telephone can be provided.

   In this framework, a PSTN gateway can be considered a "user proxy".
   It is a proxy for the user because it can provide, to a user
   interface on an IP network, input taken from a user on a circuit
   switched telephone.  The gateway may be able to run a client-local
   user interface, just as an IP telephone might.

3.4.1 Client Remote

   The most obvious instantiation is the "classic" circuit-switched
   telephony model.  In that model, the user interface runs remotely
   from the client.  The interface between the user and the user
   interface is through media, set up by SIP and carried over the Real
   Time Transport Protocol (RTP) [6].  The microphone input can be
   carried using any suitable voice encoding algorithm.  The keypad
   input can be conveyed in one of two ways.
   The first is to convert the keypad input to DTMF, and then convey
   that DTMF using a suitable encoding algorithm for it (such as
   PCMU).  An alternative, and generally the preferred approach, is to
   transmit the keypad input using RFC 2833 [7], which provides an
   encoding mechanism for carrying keypad input within RTP.

   In this classic model, the user interface would run on a server in
   the IP network.  It would perform speech recognition and DTMF
   recognition to derive the user's intent, feed it through the user
   interface, and provide the result to an application.

3.4.2 Client Local

   An alternative model is for the entire user interface to reside on
   the telephone.  The user interface can be a VoiceXML browser,
   running speech recognition on the microphone input, and feeding the
   keypad input directly into the script.  As discussed above, the
   VoiceXML script could be rendered using text instead of voice, if
   the telephone had a textual display.

3.4.3 Flip-Flop

   A middle-ground approach is to flip back and forth between a
   client-local and client-remote user interface.  Many voice
   applications are of the type which listen to the media stream and
   wait for some specific trigger that kicks off a more complex user
   interaction.  The long pound in a pre-paid calling card application
   is one example.  Another example is a conference recording
   application, where the user can press a key at some point in the
   call to begin recording.  When the key is pressed, the user hears a
   whisper to inform them that recording has started.

   The ideal way to support such an application is to install a
   client-local user interface component that waits for the trigger to
   kick off the real interaction.  Once the trigger is received, the
   application connects the user to a client-remote user interface
   that can play announcements, collect more information, and so on.
   The benefit of flip-flopping between a client-local and
   client-remote user interface is cost.  The client-local user
   interface will eliminate the need to send media streams into the
   network just to wait for the user to press the pound key on the
   keypad.

   The Keypad Markup Language (KPML) was designed to support exactly
   this kind of need [8].  It models the keypad on a phone, and allows
   an application to be informed when any sequence of keys has been
   pressed.  However, KPML has no presentation component.  Since user
   interfaces generally require a response to user input, the
   presentation will need to be done using a client-remote user
   interface that gets instantiated as a result of the trigger.

   It is tempting to use a hybrid model, where a prompt-and-collect
   application is implemented by using a client-remote user interface
   that plays the prompts, and a client-local user interface,
   described by KPML, that collects digits.  However, this only
   complicates the application.  Firstly, the keypad input will be
   sent to both the media stream and the KPML user interface.  This
   requires the application to sort out which user inputs are
   duplicates, a process that is very complicated.  Secondly, the
   primary benefit of KPML is to avoid having a media stream towards a
   user interface.  However, there is already a media stream for the
   prompting, so there is no real savings.

4. Framework Overview

   In this framework, we use the term "SIP application" to refer to a
   broad set of functionality.  A SIP application is a program running
   on a SIP-based element (such as a proxy or user agent) that
   provides some value-added function to a user or system
   administrator.  SIP applications can execute on behalf of a caller,
   a called party, or a multitude of users at once.

   Each application has a number of instances that are executing at
   any given time.
   An instance represents a single execution path for an application.
   Each instance has a well-defined lifecycle.  It is established as a
   result of some event.  That event can be a SIP event, such as the
   reception of a SIP INVITE request, or it can be a non-SIP event,
   such as a web form post or even a timer.  Application instances
   also have a specific end time.  Some instances have a lifetime that
   is coupled with a SIP transaction or dialog.  For example, a proxy
   application might begin when an INVITE arrives, and terminate when
   the call is answered.  Other applications have a lifetime that
   spans multiple dialogs or transactions.  For example, a
   conferencing application instance may exist so long as there are
   any dialogs connected to it.  When the last dialog terminates, the
   application instance terminates.  Other applications have a
   lifetime that is completely decoupled from SIP events.

   It is fundamental to the framework described here that multiple
   application instances may interact with a user during a single SIP
   transaction or dialog.  Each instance may be for the same
   application, or different applications.  Each of the applications
   may be completely independent, in that they may be owned by
   different providers, and may not be aware of each other's
   existence.  Similarly, there may be application instances
   interacting with the caller, and instances interacting with the
   callee, both within the same transaction or dialog.

   The first step in the interaction with the user is to instantiate
   one or more user interface components for the application instance.
   A user interface component is a single piece of the user interface
   that is defined by a logical flow that is not synchronously coupled
   with any other component.  In other words, each component runs more
   or less independently.
   A user interface component can be instantiated in one of the user
   devices (for a client-local user interface), or within a network
   element (for a client-remote user interface).  If a client-local
   user interface is to be used, the application needs to determine
   whether or not the user device is capable of supporting a
   client-local user interface, and in what format.  In this
   framework, all client-local user interface components are described
   by a markup language.  A markup language describes a logical flow
   of presentation of information to the user, collection of
   information from the user, and transmission of that information to
   an application.  Examples of markup languages include HTML, WML,
   VoiceXML, the Keypad Markup Language (KPML) [8] and the Media
   Server Control Markup Language (MSCML) [9].

   The interface between the user interface component and the
   application is typically markup-language specific.  For those
   markups which support rendering of information to a user, such as
   HTML, HTTP form POST operations are used.  For those markups where
   no information is rendered to the user, the markup can play one of
   two roles.  The first is called "one shot".  In the one-shot role,
   the markup waits for a user to enter some information, and when
   they do, reports this event to the application.  The application
   then does something, and the markup is no longer used.  In the
   other modality, called "monitor", the markup stays permanently
   resident, and reports information back to an application
   continuously.  However, the act of reporting information back to
   the application does not cause the installation of a new markup.
   In markups where one-shot or monitor modalities are used, a SIP
   MESSAGE request is used to report the status.

   To create a client-local user interface, the application passes the
   markup document (or a reference to it) in a SIP message to that
   client.
The SIP message can be one explicitly generated by the application (in which case the application has to be a UA or B2BUA), or the component can be placed in a SIP message that passes by (in which case the application can be running in a proxy).

Client local user interface components are always associated with the dialog that the SIP message itself is associated with. Consequently, user interface components cannot be placed in messages that are not associated with a dialog.

If a user interface component is to be instantiated in the network, there is no need to determine the capabilities of the device on which the user interface is instantiated. Presumably, it is on a device on which the application knows a UI can be created. However, the application does need to connect the user device to the user interface. This will require manipulation of media streams in order to establish that connection.

Once a user interface component is created, the application needs to be able to change it, and to remove it. Finally, more advanced applications may require coupling between application components. The framework supports rudimentary capabilities there.

5. Client Local Interfaces

One key component of this framework is support for client local user interfaces.

5.1 Discovering Capabilities

A client local user interface can only be instantiated on a client if the user device has the capabilities needed to do so. Specifically, an application needs to know what markup languages, if any, are supported by the client. For example, does the client support HTML? VoiceXML? However, that information is not sufficient to determine if a client local user interface can be instantiated. In order to instantiate the user interface, the application needs to transfer the markup document to the client. There are two ways in which the markup document can be transferred.
The application can send the client a URI which the client can use to fetch the markup, or the markup can be sent inline within the message. The application needs to know which of these modes are supported, and in the case of indirection, which URI schemes are supported for obtaining the markup.

Many applications will need to know these capabilities at the time an application instance is first created. Since applications can be created through SIP requests or responses, SIP needs to provide a means to convey this information. This introduces several concrete requirements for SIP:

REQ 1: A SIP request or response must be capable of conveying the set of markup languages supported by the UA that generated the request or response.

REQ 2: A SIP request or response must be capable of indicating whether a UA can obtain markups inline, or through an indirection. In the case of indirection, the UA must be capable of indicating what URI schemes it supports.

5.2 Pushing an Initial Interface Component

Once the application has determined that the UA is capable of supporting client local user interfaces, the next step is for the application to push an interface component to the user device.

Generally, we anticipate that interface components will need to be created at various different points in a SIP session. Clearly, they will need to be pushed during an initial INVITE, in both responses (so as to place a component into the calling UA) and in the request (so as to place a component into the called UA). As an example, a conference recording application allows the users to record the media for the session at any time. The application would like to push an HTML user interface component to both the caller and callee at the time the call is set up, allowing either to record the session. The HTML component would have buttons to start and stop recording.
To push the HTML component to the caller, it needs to be pushed in the 200 OK (and possibly a provisional response), and to push it to the callee, in the INVITE itself.

To state the requirement more concretely:

REQ 3: An application must be able to add a reference to, or an inline version of, a user interface component into any request or response that passes through or is emanated from that application.

However, there will also be cases where the application needs to push a new interface component to a UA, but not as a result of any SIP message. As an example, a pre-paid calling card application will set a timer that determines how long the call can proceed, given the availability of funds in the user's account. When the timer fires, the application would like to push a new interface component to the calling UA, allowing the user to click to add more funds.

In this case, there is no message already in transit that can be used as a vehicle for pushing a user interface component. This requires that applications be able to generate their own messages to push a new component to a UA:

REQ 4: A UA application must be able to send a SIP message to the UA at the other end of the dialog, asking it to create a new interface component.

In all cases, the information passed from the application to the UA must include more than just the interface component itself (or a reference to it). The user must be able to decide whether or not they want to proceed with this application. To make that determination, the user must have information about the application. Specifically, they will need the name of the application, and an identifier of the owner or administrator of the application. As an example, a typical name would be "Prepaid Calling Card" and the owner could be "voiceprovider.com".
REQ 5: Any user interface component passed to a client (either inline or through a reference) must also include markup meta-data, including a human-readable name of the application, and an identifier of the owner of the application.

Clearly, there are security implications. The user will need to verify the identity of the application owner, and be sure that the user interface component is not being replayed; that is, that it actually belongs with this specific SIP message.

REQ 6: It must be possible for the client to validate the authenticity and integrity of the markup document (or its reference) and its associated meta-data. It must be possible for the client to verify that the information has not been replayed from a previous SIP message.

If the user decides not to execute the user interface component, it is simply discarded. There is no explicit requirement for the user to be able to inform the application that the component was discarded. Effectively, the application will think that the component was executed, but that the user never entered any information.

5.3 Updating an Interface Component

Once a user interface component has been created on a client, it can be updated in two ways. The first way is the "normal" path inherent to that component. The user enters some data, the user interface transfers the information to the application (typically through HTTP), and the result of that transfer brings a new markup document describing an updated interface. This is referred to as a synchronous update, since it is synchronized with user interaction.

However, synchronous updates are not sufficient for many applications. Frequently, the interface will need to be updated asynchronously by the application, without an explicit user action. A good example of this is, once again, the pre-paid calling card application.
The application might like to update the user interface when the timer runs out on the call. This introduces several requirements:

REQ 7: It must be possible for an application to asynchronously push an update to an existing user interface component, either in a message that was already in transit, or by generating a new message.

REQ 8: It must be possible for the client to associate the new interface component with the one that it is supposed to replace, so that the old one can be removed.

Unfortunately, pushing of application components introduces a race condition. What if the user enters data into the old component, causing an HTTP request to the application, while an update of that component is in progress? The client will get an interface component in the HTTP response, and also get the new one in the SIP message. Which one does the client use? There needs to be a way to properly order the components:

REQ 9: It must be possible for the client to relatively order user interface updates it receives as the result of synchronous and asynchronous messaging.

5.4 Terminating an Interface Component

User interface components have a well-defined lifetime. They are created when the component is first pushed to the client. User interface components are always associated with the SIP dialog on which they were pushed. As such, their lifetime is bound by the lifetime of the dialog. When the dialog ends, so does the interface component.

This rule applies to early dialogs as well. If a user interface component is passed in a provisional response to INVITE, and a separate branch eventually answers the call, the component terminates with the arrival of the 2xx. That is because the early dialog itself terminates with the arrival of the 2xx.
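The lifetime rule above amounts to simple bookkeeping on the client: components are stored per dialog and vanish with the dialog, early dialogs included. The following Python sketch is illustrative only; the names are invented and nothing here is a normative data model.

```python
# Sketch of dialog-bound component lifetime (illustrative, non-normative).

class Client:
    def __init__(self):
        self.components = {}            # dialog id -> list of component names

    def push_component(self, dialog_id, component):
        self.components.setdefault(dialog_id, []).append(component)

    def dialog_terminated(self, dialog_id):
        # When a dialog ends (including an early dialog that loses to
        # another branch at the 2xx), its components end with it.
        self.components.pop(dialog_id, None)


client = Client()
client.push_component("early-branch-1", "prepaid-ui")   # from a 180 response
client.push_component("early-branch-2", "prepaid-ui")

# Branch 2 answers with a 2xx: the early dialog on branch 1 terminates,
# and so does the component that was pushed on it.
client.dialog_terminated("early-branch-1")

print(sorted(client.components))        # ['early-branch-2']
```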
However, there are some cases where the application would like to terminate the user interface component before its natural termination point. To do this, the application pushes a "null" update to the client. This is an update that replaces the existing user interface component with nothing.

REQ 10: It must be possible for an application to terminate a user interface component before its natural expiration.

The user can also terminate the user interface component. However, no explicit signaling is required in this case. The component is simply dismissed. To the application, it appears as if the user has simply ceased entering data.

6. Client Remote Interfaces

As an alternative to, or in conjunction with, client local user interfaces, an application can make use of client remote user interfaces. These user interfaces can execute co-resident with the application itself (in which case no standardized interfaces between the UI and the application need to be used), or they can run separately. This framework assumes that the user interface runs on a host that has a sufficient trust relationship with the application. As such, the means for instantiating the user interface is not considered here.

The primary issue is to connect the user device to the remote user interface. Doing so requires the manipulation of media streams between the client and the user interface. Such manipulation can only be done by user agents. There are two types of user agent applications within this framework: originating/terminating applications, and intermediary applications.

6.1 Originating and Terminating Applications

Originating and terminating applications are applications which are themselves the originator or the final recipient of a SIP invitation. They are "pure" user agent applications, not back-to-back user agents.
The classic example of such an application is an interactive voice response (IVR) application, which is typically a terminating application. It is a terminating application because the user explicitly calls it; i.e., it is the actual called party. An example of an originating application is a wakeup call application, which calls a user at a specified time in order to wake them up.

Because originating and terminating applications are a natural termination point of the dialog, manipulation of the media session by the application is trivial. Traditional SIP techniques for adding and removing media streams, modifying codecs, and changing the address of the recipient of the media streams can be applied. Similarly, the application can directly authenticate itself to the user through S/MIME, since it is the peer UA in the dialog.

6.2 Intermediary Applications

Intermediary applications are, at the same time, more common than originating/terminating applications, and more complex. Intermediary applications are applications that are neither the actual caller nor the called party. Rather, they represent a "third party" that wishes to interact with the user. The classic example is the ubiquitous pre-paid calling card application.

In order for the intermediary application to add a client remote user interface, it needs to manipulate the media streams of the user agent to terminate on that user interface. This also introduces a fundamental feature interaction issue. Since the intermediary application is not an actual participant in the call, how does the user interact with the intermediary application, and with its actual peer in the dialog, at the same time? This is discussed in more detail in Section 7.

7. Inter-Application Feature Interaction

The inter-application feature interaction problem is inherent to stimulus signaling.
Whenever there are multiple applications, there are multiple user interfaces. When the user provides an input, to which user interface is the input destined? That question is the essence of the inter-application feature interaction problem.

Inter-application feature interaction is not an easy problem to resolve. For now, we consider separately the issues for client-local and client-remote user interface components.

7.1 Client Local UI

When the user interface itself resides locally on the client device, the feature interaction problem is actually much simpler. The end device knows explicitly about each application, and therefore can present the user with each one separately. When the user provides input, the client device can determine to which user interface the input is destined. The user interface to which input is destined is referred to as the application in focus, and the means by which the focused application is selected is called focus determination.

Generally speaking, focus determination is purely a local operation. In the PC universe, focus determination is provided by window managers. Each application does not know about focus; it merely receives the user input that has been targeted to it when it is in focus. This basic concept applies to SIP-based applications as well.

Focus determination will frequently be trivial, depending on the user interface type. Consider a user that makes a call from a PC. The call passes through a pre-paid calling card application and a call recording application. Both of these wish to interact with the user. Both push an HTML-based user interface to the user. On the PC, each user interface would appear as a separate window. The user interacts with the call recording application by selecting its window, and with the pre-paid calling card application by selecting its window.
Focus determination is literally provided by the PC window manager. It is clear to which application the user input is targeted.

As another example, consider the same two applications, but on a "smart phone" that has a set of buttons, and next to each button, an LCD display that can provide the user with an option. This user interface can be represented using the Wireless Markup Language (WML).

The phone would allocate some number of buttons to each application. The prepaid calling card application would get one button for its "hangup" command, and the recording application would get one for its "start/stop" command. The user can easily determine which application to interact with by pressing the appropriate button. Pressing a button determines focus and provides user input, both at the same time.

Unfortunately, not all devices will have these advanced displays. A PSTN gateway, or a basic IP telephone, may only have a 12-key keypad. The user interfaces for these devices are provided through the Keypad Markup Language (KPML). Considering once again the feature interaction case above, the pre-paid calling card application and the call recording application would both pass a KPML document to the device. When the user presses a button on the keypad, to which document does the input apply? The user interface does not allow the user to select. A user interface where the user cannot provide focus is called a focusless user interface. This is quite a hard problem to solve. This framework does not make any explicit normative recommendation, but concludes that the best option is to send the input to both user interfaces unless the markup in one interface has indicated that it should be suppressed from others. This is a sensible choice by analogy: it is exactly what the existing circuit-switched telephone network will do.
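On a focusless device, the behavior this framework suggests can be sketched as a broadcast dispatcher: every key press goes to every resident user interface, unless some markup has claimed that input for itself, in which case it is suppressed from the others. The code below is an illustration with invented names and a made-up suppression representation, not a normative algorithm.

```python
# Sketch of input dispatch on a focusless device (illustrative).

def dispatch_key(key, interfaces):
    """Deliver one key press to all resident UIs, unless a markup has
    indicated that input matching this key should be suppressed from
    the other interfaces."""
    claimers = [ui for ui in interfaces if key in ui.get("suppress", set())]
    targets = claimers if claimers else interfaces
    return [ui["name"] for ui in targets]


# Two applications, each with a KPML-style document on the device.
prepaid = {"name": "prepaid", "suppress": {"#"}}   # '#' claimed for hangup
recorder = {"name": "recorder", "suppress": set()}

print(dispatch_key("1", [prepaid, recorder]))   # ['prepaid', 'recorder']
print(dispatch_key("#", [prepaid, recorder]))   # ['prepaid']
```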
It is an explicit non-goal to provide a better mechanism for feature interaction resolution than the PSTN on devices which have the same user interface as they do on the PSTN. Devices with better displays, such as PCs or screen phones, can benefit from the capabilities of this framework, allowing the user to determine which application they are interacting with.

Indeed, when a user provides input on a focusless device, the input must be passed to all client local user interfaces, AND all client remote user interfaces, unless the markup tells the UI to suppress the media. In the case of KPML, key events are passed to remote user interfaces by encoding them in RFC 2833 [7]. Of course, since a client cannot determine whether or not a media stream terminates in a remote user interface, these key events are passed in all audio media streams unless the "Q" digit is used to suppress them.

7.2 Client-Remote UI

When the user interfaces run remotely, the determination of focus can be much, much harder. There are many architectures that can be deployed to handle the interaction. None are ideal. However, all are beyond the scope of this specification.

8. Intra-Application Feature Interaction

An application can instantiate a multiplicity of user interface components. For example, a single application can instantiate two separate HTML components and one WML component. Furthermore, an application can instantiate both client local and client remote user interfaces.

The feature interaction issues between these components within the same application are less severe. If an application has multiple client user interface components, their interaction is resolved identically to the inter-application case: through focus determination.
However, the problems in focusless user interfaces (such as a keypad) generally won't exist, since the application can generate user interfaces which do not overlap in their usage of an input.

The real issue is that the optimal user experience frequently requires some kind of coupling between the differing user interface components. This is a classic problem in multi-modal user interfaces, such as those described by Speech Application Language Tags (SALT). As an example, consider a user interface where a user can either press a labeled button to make a selection, or listen to a prompt and speak the desired selection. Ideally, when the user presses the button, the prompt should cease immediately, since both of them were targeted at collecting the same information in parallel. Such interactions are best handled by markups which natively support them, such as SALT, and thus require no explicit support from this framework.

9. Examples

TODO.

10. Security Considerations

There are many security considerations associated with this framework. It allows applications in the network to instantiate user interface components on a client device. Such instantiations need to be from authenticated applications, and also need to be authorized to place a UI into the client. Indeed, the stronger requirement is authorization. It is not so important to know the name of the provider of the application, but rather, that the provider is authorized to instantiate components.

Generally, an application should be considered authorized if it was legitimately part of the call setup path. With this definition, authorization can be enforced using the sips URI scheme when the call is initiated.

11. Contributors

This document was produced as a result of discussions amongst the application interaction design team.
All members of this team contributed significantly to the ideas embodied in this document. The members of this team were:

Eric Burger
Cullen Jennings
Robert Fairlie-Cuninghame

Informative References

[1] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

[2] McGlashan, S., Lucas, B., Porter, B., Rehor, K., Burnett, D., Carter, J., Ferrans, J. and A. Hunt, "Voice Extensible Markup Language (VoiceXML) Version 2.0", W3C CR CR-voicexml20-20030220, February 2003.

[3] Day, M., Rosenberg, J. and H. Sugano, "A Model for Presence and Instant Messaging", RFC 2778, February 2000.

[4] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol", draft-ietf-sipping-conferencing-framework-00 (work in progress), May 2003.

[5] Rosenberg, J., Schulzrinne, H. and P. Kyzivat, "Caller Preferences and Callee Capabilities for the Session Initiation Protocol (SIP)", draft-ietf-sip-callerprefs-08 (work in progress), March 2003.

[6] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[7] Schulzrinne, H. and S. Petrack, "RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals", RFC 2833, May 2000.

[8] Burger, E., "Keypad Markup Language (KPML)", draft-burger-sipping-kpml-02 (work in progress), July 2003.

[9] Van Dyke, J., Burger, E. and A. Spitzer, "Media Server Control Markup Language (MSCML) and Protocol", draft-vandyke-mscml-02 (work in progress), July 2003.
Author's Address

Jonathan Rosenberg
dynamicsoft
600 Lanidex Plaza
Parsippany, NJ 07054
US

Phone: +1 973 952-5000
EMail: jdrosen@dynamicsoft.com
URI: http://www.jdrosen.net

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this 1058 document itself may not be modified in any way, such as by removing 1059 the copyright notice or references to the Internet Society or other 1060 Internet organizations, except as needed for the purpose of 1061 developing Internet standards in which case the procedures for 1062 copyrights defined in the Internet Standards process must be 1063 followed, or as required to translate it into languages other than 1064 English. 1066 The limited permissions granted above are perpetual and will not be 1067 revoked by the Internet Society or its successors or assignees. 1069 This document and the information contained herein is provided on an 1070 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1071 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1072 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1073 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1074 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1076 Acknowledgement 1078 Funding for the RFC Editor function is currently provided by the 1079 Internet Society.