SIPPING                                                     J. Rosenberg
Internet-Draft                                               dynamicsoft
Expires: December 29, 2003                                 June 30, 2003

    A Framework and Requirements for Application Interaction in the
                  Session Initiation Protocol (SIP)
          draft-rosenberg-sipping-app-interaction-framework-01

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 29, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2003).  All Rights Reserved.

Abstract

   This document describes a framework and requirements for the
   interaction between users and Session Initiation Protocol (SIP)
   based applications.  By interacting with applications, users can
   guide the way in which they operate.  The focus of this framework is
   stimulus signaling, which allows a user agent to interact with an
   application without knowledge of the semantics of that application.
   Stimulus signaling can occur to a user interface running locally
   with the client, or to a remote user interface, through media
   streams.  Stimulus signaling encompasses a wide range of mechanisms,
   ranging from clicking on hyperlinks, to pressing buttons, to
   traditional Dual Tone Multi Frequency (DTMF) input.  In all cases,
   stimulus signaling is supported through the use of markup languages,
   which play a key role in this framework.

Table of Contents

   1.     Introduction . . . . . . . . . . . . . . . . . . . . . . .  3
   2.     Definitions  . . . . . . . . . . . . . . . . . . . . . . .  4
   3.     A Model for Application Interaction  . . . . . . . . . . .  7
   3.1    Function vs. Stimulus  . . . . . . . . . . . . . . . . . .  8
   3.2    Real-Time vs. Non-Real Time  . . . . . . . . . . . . . . .  9
   3.3    Client-Local vs. Client-Remote . . . . . . . . . . . . . .  9
   3.4    Interaction Scenarios on Telephones  . . . . . . . . . . . 10
   3.4.1  Client Remote  . . . . . . . . . . . . . . . . . . . . . . 11
   3.4.2  Client Local . . . . . . . . . . . . . . . . . . . . . . . 11
   3.4.3  Flip-Flop  . . . . . . . . . . . . . . . . . . . . . . . . 11
   4.     Framework Overview . . . . . . . . . . . . . . . . . . . . 13
   5.     Client Local Interfaces  . . . . . . . . . . . . . . . . . 15
   5.1    Discovering Capabilities . . . . . . . . . . . . . . . . . 15
   5.2    Pushing an Initial Interface Component . . . . . . . . . . 15
   5.3    Updating an Interface Component  . . . . . . . . . . . . . 17
   5.4    Terminating an Interface Component . . . . . . . . . . . . 18
   6.     Client Remote Interfaces . . . . . . . . . . . . . . . . . 19
   6.1    Originating and Terminating Applications . . . . . . . . . 19
   6.2    Intermediary Applications  . . . . . . . . . . . . . . . . 19
   7.     Inter-Application Feature Interaction  . . . . . . . . . . 21
   7.1    Client Local UI  . . . . . . . . . . . . . . . . . . . . . 21
   7.2    Client-Remote UI . . . . . . . . . . . . . . . . . . . . . 22
   8.     Intra Application Feature Interaction  . . . . . . . . . . 23
   9.     Examples . . . . . . . . . . . . . . . . . . . . . . . . . 24
   10.    Security Considerations  . . . . . . . . . . . . . . . . . 25
   11.    Contributors . . . . . . . . . . . . . . . . . . . . . . . 26
          Informative References . . . . . . . . . . . . . . . . . . 27
          Author's Address . . . . . . . . . . . . . . . . . . . . . 28
          Intellectual Property and Copyright Statements . . . . . . 29

1. Introduction

   The Session Initiation Protocol (SIP) [1] provides the ability for
   users to initiate, manage, and terminate communications sessions.
   Frequently, these sessions will involve a SIP application.  A SIP
   application is defined as a program running on a SIP-based element
   (such as a proxy or user agent) that provides some value-added
   function to a user or system administrator.  Examples of SIP
   applications include pre-paid calling card calls, conferencing, and
   presence-based [3] call routing.

   In order for most applications to properly function, they need
   input from the user to guide their operation.
   As an example, a pre-paid calling card application requires the user
   to input their calling card number, their PIN code, and the
   destination number they wish to reach.  The process by which a user
   provides input to an application is called "application
   interaction".

   Application interaction can be either functional or stimulus.
   Functional interaction requires the user agent to understand the
   semantics of the application, whereas stimulus interaction does not.
   Stimulus signaling allows for applications to be built without
   requiring modifications to the client.  Stimulus interaction is the
   subject of this framework.  The framework provides a model for how
   users interact with applications through user interfaces, and how
   user interfaces and applications can be distributed throughout a
   network.  This model is then used to describe how applications can
   instantiate and manage user interfaces.

2. Definitions

   SIP Application: A SIP application is defined as a program running
      on a SIP-based element (such as a proxy or user agent) that
      provides some value-added function to a user or system
      administrator.  Examples of SIP applications include pre-paid
      calling card calls, conferencing, and presence-based [3] call
      routing.

   Application Interaction: The process by which a user provides input
      to an application.

   Real-Time Application Interaction: Application interaction that
      takes place while an application instance is executing.  For
      example, when a user enters their PIN into a pre-paid calling
      card application, this is real-time application interaction.

   Non-Real Time Application Interaction: Application interaction that
      takes place asynchronously with the execution of the
      application.  Generally, non-real time application interaction
      is accomplished through provisioning.
   Functional Application Interaction: Application interaction is
      functional when the user device has an understanding of the
      semantics of the application that the user is interacting with.

   Stimulus Application Interaction: Application interaction is
      considered to be stimulus when the user device has no
      understanding of the semantics of the application that the user
      is interacting with.

   User Interface (UI): The user interface provides the user with
      context in order to make decisions about what they want.  The
      user enters information into the user interface.  The user
      interface interprets the information, and passes it to the
      application.

   User Interface Component: A piece of user interface which operates
      independently of other pieces of the user interface.  For
      example, a user might have two separate web interfaces to a
      pre-paid calling card application - one for hanging up and
      making another call, and another for entering the username and
      PIN.

   User Device: The software or hardware system that the user directly
      interacts with in order to communicate with the application.  An
      example of a user device is a telephone.  Another example is a
      PC with a web browser.

   User Input: The "raw" information passed from a user to a user
      interface.  Examples of user input include a spoken word or a
      click on a hyperlink.

   Client-Local User Interface: A user interface which is co-resident
      with the user device.

   Client-Remote User Interface: A user interface which executes
      remotely from the user device.  In this case, a standardized
      interface is needed between them.  Typically, this is done
      through media sessions - audio, video, or application sharing.

   Media Interaction: A means of separating a user and a user
      interface by connecting them with media streams.
   Interactive Voice Response (IVR): An IVR is a type of user
      interface that allows users to speak commands to the
      application, and hear responses to those commands prompting for
      more information.

   Prompt-and-Collect: The basic primitive of an IVR user interface.
      The user is presented with a voice option, and the user speaks
      their choice.

   Barge-In: In an IVR user interface, a user is prompted to enter
      some information.  With some prompts, the user may enter the
      requested information before the prompt completes.  In that
      case, the prompt ceases.  The act of entering the information
      before completion of the prompt is referred to as barge-in.

   Focus: A user interface component has focus when user input is fed
      to it, as opposed to any other user interface components.  This
      is not to be confused with the term focus within the SIP
      conferencing framework, which refers to the central user agent
      in a conference [4].

   Focus Determination: The process by which the user device
      determines which user interface component will receive the user
      input.

   Focusless User Interface: A user interface which has no ability to
      perform focus determination.  An example of a focusless user
      interface is a keypad on a telephone.

   Feature Interaction: A class of problems which result when multiple
      applications or application components are trying to provide
      services to a user at the same time.

   Inter-Application Feature Interaction: Feature interactions that
      occur between applications.

   DTMF: Dual-Tone Multi-Frequency.  DTMF refers to a class of tones
      generated by circuit switched telephony devices when the user
      presses a key on the keypad.  As a result, DTMF and keypad input
      are often used synonymously, when in fact one of them (DTMF) is
      merely a means of conveying the other (the keypad input) to a
      client-remote user interface (the switch, for example).
   Application Instance: A single execution path of a SIP application.

   Originating Application: A SIP application which acts as a UAC,
      calling the user.

   Terminating Application: A SIP application which acts as a UAS,
      answering a call generated by a user.  IVR applications are
      terminating applications.

   Intermediary Application: A SIP application which is neither the
      caller nor the callee, but rather, a third party involved in a
      call.

3. A Model for Application Interaction

   +---+          +---+          +---+          +---+
   |   |          |   |          |   |          |   |
   |   |          | U |          | U |          | A |
   |   |  Input   | s |  Input   | s |  Results | p |
   |   | -------> | e | -------> | e | -------> | p |
   | U |          | r |          | r |          | l |
   | s |          |   |          |   |          | i |
   | e |          | D |          | I |          | c |
   | r |  Output  | e |  Output  | f |  Update  | a |
   |   | <------- | v | <------- | a | <....... | t |
   |   |          | i |          | c |          | i |
   |   |          | c |          | e |          | o |
   |   |          | e |          |   |          | n |
   |   |          |   |          |   |          |   |
   +---+          +---+          +---+          +---+

            Figure 1: Model for Real-Time Interactions

   Figure 1 presents a general model for how users interact with
   applications.  Generally, users interact with a user interface
   through a user device.  A user device can be a telephone, or it can
   be a PC with a web browser.  Its role is to pass the user input
   from the user, to the user interface.  The user interface provides
   the user with context in order to make decisions about what they
   want.  The user enters information into the user interface.  The
   user interface interprets the information, and passes it to the
   application.  The application may be able to modify the user
   interface based on this information.  Whether or not this is
   possible depends on the type of user interface.

   User interfaces are fundamentally about rendering and
   interpretation.  Rendering refers to the way in which the user is
   provided context.  This can be through hyperlinks, images, sounds,
   videos, text, and so on.
   Interpretation refers to the way in which the user interface takes
   the "raw" data provided by the user, and returns the result to the
   application in a meaningful format, abstracted from the particulars
   of the user interface.  As an example, consider a pre-paid calling
   card application.  The user interface worries about details such as
   what prompt the user is provided, whether the voice is male or
   female, and so on.  It is concerned with recognizing the speech
   that the user provides, in order to obtain the desired information.
   In this case, the desired information is the calling card number,
   the PIN code, and the destination number.  The application needs
   that data, and it doesn't matter to the application whether it was
   collected using a male prompt or a female one.

   User interfaces generally have real-time requirements towards the
   user.  That is, when a user interacts with the user interface, the
   user interface needs to react quickly, and that change needs to be
   propagated to the user right away.  However, the interface between
   the user interface and the application need not be that fast.
   Faster is better, but the user interface itself can frequently
   compensate for long latencies there.  In the case of a pre-paid
   calling card application, when the user is prompted to enter their
   PIN, the prompt should generally stop immediately once the first
   digit of the PIN is entered.  This is referred to as barge-in.
   After the user interface collects the rest of the PIN, it can tell
   the user to "please wait while processing".  The PIN can then be
   gradually transmitted to the application.  In this example, the
   user interface has compensated for a slow UI-to-application
   interface by asking the user to wait.

   The separation between user interface and application is absolutely
   fundamental to the entire framework provided in this document.  Its
   importance cannot be overstated.
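   The barge-in behavior in the pre-paid card example can be
   illustrated with a small VoiceXML document (a sketch only, assuming
   VoiceXML 2.0 syntax; the submit URI and field name are
   hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="prepaid">
    <!-- bargein="true" lets the first digit cut the prompt off -->
    <field name="pin" type="digits">
      <prompt bargein="true">Please enter your PIN.</prompt>
      <filled>
        <!-- the UI masks a slow UI-to-application link here -->
        <prompt>Please wait while processing.</prompt>
        <submit next="http://app.example.com/verify" namelist="pin"/>
      </filled>
    </field>
  </form>
</vxml>
```

   The VoiceXML interpreter (the user interface) handles the prompting
   and digit collection; the application sees only the collected PIN,
   delivered in the HTTP submission.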
   With this basic model, we can begin to taxonomize the types of
   systems that can be built.

3.1 Function vs. Stimulus

   The first way to taxonomize the system is to consider the interface
   between the UI and the application.  There are two fundamentally
   different models for this interface.  In a functional interface,
   the user interface has detailed knowledge about the application,
   and is, in fact, specific to the application.  The interface
   between the two components is through a functional protocol,
   capable of representing the semantics which can be exposed through
   the user interface.  Because the user interface has knowledge of
   the application, it can be optimally designed for that application.
   As a result, functional user interfaces are almost always the most
   user friendly, the fastest, and the most responsive.  However, in
   order to allow interoperability between user devices and
   applications, the details of the functional protocols need to be
   specified in standards.  This slows down innovation and limits the
   scope of applications that can be built.

   An alternative is a stimulus interface.  In a stimulus interface,
   the user interface is generic, totally ignorant of the details of
   the application.  Indeed, the application may pass instructions to
   the user interface describing how it should operate.  The user
   interface translates user input into "stimulus" - which are data
   understood only by the application, and not by the user interface.
   Because they are generic, and because they require communications
   with the application in order to change the way in which they
   render information to the user, stimulus user interfaces are
   usually slower, less user friendly, and less responsive than a
   functional counterpart.
   However, they allow for substantial innovation in applications,
   since no standardization activity is needed to build a new
   application, as long as it can interact with the user within the
   confines of the user interface mechanism.

   In SIP systems, functional interfaces are provided by extending the
   SIP protocol to provide the needed functionality.  For example, the
   SIP caller preferences specification [5] provides a functional
   interface that allows a user to request applications to route the
   call to specific types of user agents.  Functional interfaces are
   important, but are not the subject of this framework.  The primary
   goal of this framework is to address the role of stimulus
   interfaces to SIP applications.

3.2 Real-Time vs. Non-Real Time

   Application interaction systems can also be real-time or
   non-real-time.  Non-real-time interaction allows the user to enter
   information about application operation asynchronously with its
   invocation.  Frequently, this is done through provisioning systems.
   As an example, a user can set up the forwarding number for a
   call-forward on no-answer application using a web page.  Real-time
   interaction requires the user to interact with the application at
   the time of its invocation.

3.3 Client-Local vs. Client-Remote

   Another axis in the taxonomization is whether the user interface is
   co-resident with the user device (which we refer to as a
   client-local user interface), or the user interface runs in a host
   separated from the client (which we refer to as a client-remote
   user interface).  In a client-remote user interface, there exists
   some kind of protocol between the client device and the UI that
   allows the client to interact with the user interface over a
   network.

   The most important way to separate the UI and the client device is
   through media interaction.
   In media interaction, the interface between the user and the user
   interface is through media - audio, video, messaging, and so on.
   This is the classic mode of operation for VoiceXML [2], where the
   user interface (also referred to as the voice browser) runs on a
   platform in the network.  Users communicate with the voice browser
   through the telephone network (or using a SIP session).  The voice
   browser interacts with the application using HTTP to convey the
   information collected from the user.

   We refer to the second sub-case as a client-local user interface.
   In this case, the user interface runs co-located with the user.
   The interface between them is through the software that interprets
   the user's input and passes it to the user interface.  The classic
   example of this is the web.  In the web, the user interface is a
   web browser, and the interface is defined by the HTML document that
   it's rendering.  The user interacts directly with the user
   interface running in the browser.  The results of that user
   interface are sent to the application (running on the web server)
   using HTTP.

   It is important to note that whether the user interface is local or
   remote (in the case of media interaction) is not a property of the
   modality of the interface, but rather a property of the system.  As
   an example, it is possible for a web-based user interface to be
   provided with a client-remote user interface.  In such a scenario,
   video and application sharing media sessions can be used between
   the user and the user interface.  The user interface, still guided
   by HTML, now runs "in the network", remote from the client.
   Similarly, a VoiceXML document can be interpreted locally by a
   client device, with no media streams at all.  Indeed, the VoiceXML
   document can be rendered using text, rather than media, with no
   impact on the interface between the user interface and the
   application.
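   The web case can be made concrete with a small HTML sketch (the
   form action URI and field names are hypothetical); the browser is
   the client-local user interface, and the HTTP POST is the interface
   to the application:

```html
<!-- Client-local user interface rendered by a web browser.
     The result is reported to the application via HTTP POST. -->
<form method="post" action="http://app.example.com/card">
  Card number: <input type="text" name="card"/><br/>
  PIN: <input type="password" name="pin"/><br/>
  Destination: <input type="text" name="dest"/><br/>
  <input type="submit" value="Place Call"/>
</form>
```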
   It is also important to note that systems can be hybrid.  In a
   hybrid user interface, some aspects of it (usually those associated
   with a particular modality) run locally, and others run remotely.

3.4 Interaction Scenarios on Telephones

   This same model can apply to a telephone.  In a traditional
   telephone, the user interface consists of a 12-key keypad, a
   speaker, and a microphone.  Indeed, from here forward, the term
   "telephone" is used to represent any device that meets, at a
   minimum, the characteristics described in the previous sentence.
   Circuit-switched telephony applications are almost universally
   client-remote user interfaces.  In the Public Switched Telephone
   Network (PSTN), there is usually a circuit interface between the
   user and the user interface.  The user input from the keypad is
   conveyed using Dual-Tone Multi-Frequency (DTMF), and the microphone
   input as PCM encoded voice.

   In an IP-based system, there is more variability in how the system
   can be instantiated.  Both client-remote and client-local user
   interfaces to a telephone can be provided.

   In this framework, a PSTN gateway can be considered a "user proxy".
   It is a proxy for the user because it can provide, to a user
   interface on an IP network, input taken from a user on a circuit
   switched telephone.  The gateway may be able to run a client-local
   user interface, just as an IP telephone might.

3.4.1 Client Remote

   The most obvious instantiation is the "classic" circuit-switched
   telephony model.  In that model, the user interface runs remotely
   from the client.  The interface between the user and the user
   interface is through media, set up by SIP and carried over the Real
   Time Transport Protocol (RTP) [6].  The microphone input can be
   carried using any suitable voice encoding algorithm.  The keypad
   input can be conveyed in one of two ways.
   The first is to convert the keypad input to DTMF, and then convey
   that DTMF using a suitable encoding algorithm for it (such as
   PCMU).  An alternative, and generally the preferred approach, is to
   transmit the keypad input using RFC 2833 [7], which provides an
   encoding mechanism for carrying keypad input within RTP.

   In this classic model, the user interface would run on a server in
   the IP network.  It would perform speech recognition and DTMF
   recognition to derive the user's intent, feed it through the user
   interface, and provide the result to an application.

3.4.2 Client Local

   An alternative model is for the entire user interface to reside on
   the telephone.  The user interface can be a VoiceXML browser,
   running speech recognition on the microphone input, and feeding the
   keypad input directly into the script.  As discussed above, the
   VoiceXML script could be rendered using text instead of voice, if
   the telephone had a textual display.

3.4.3 Flip-Flop

   A middle-ground approach is to flip back and forth between a
   client-local and client-remote user interface.  Many voice
   applications are of the type which listen to the media stream and
   wait for some specific trigger that kicks off a more complex user
   interaction.  The long pound in a pre-paid calling card application
   is one example.  Another example is a conference recording
   application, where the user can press a key at some point in the
   call to begin recording.  When the key is pressed, the user hears a
   whisper to inform them that recording has started.

   The ideal way to support such an application is to install a
   client-local user interface component that waits for the trigger to
   kick off the real interaction.  Once the trigger is received, the
   application connects the user to a client-remote user interface
   that can play announcements, collect more information, and so on.
   The benefit of flip-flopping between a client-local and
   client-remote user interface is cost.  The client-local user
   interface will eliminate the need to send media streams into the
   network just to wait for the user to press the pound key on the
   keypad.

   The Keypad Markup Language (KPML) was designed to support exactly
   this kind of need [8].  It models the keypad on a phone, and allows
   an application to be informed when any sequence of keys has been
   pressed.  However, KPML has no presentation component.  Since user
   interfaces generally require a response to user input, the
   presentation will need to be done using a client-remote user
   interface that gets instantiated as a result of the trigger.

   It is tempting to use a hybrid model, where a prompt-and-collect
   application is implemented by using a client-remote user interface
   that plays the prompts, and a client-local user interface,
   described by KPML, that collects digits.  However, this only
   complicates the application.  Firstly, the keypad input will be
   sent to both the media stream and the KPML user interface.  This
   requires the application to sort out which user inputs are
   duplicates, a process that is very complicated.  Secondly, the
   primary benefit of KPML is to avoid having a media stream towards a
   user interface.  However, there is already a media stream for the
   prompting, so there is no real savings.

4. Framework Overview

   In this framework, we use the term "SIP application" to refer to a
   broad set of functionality.  A SIP application is a program running
   on a SIP-based element (such as a proxy or user agent) that
   provides some value-added function to a user or system
   administrator.  SIP applications can execute on behalf of a caller,
   a called party, or a multitude of users at once.

   Each application has a number of instances that are executing at
   any given time.
   An instance represents a single execution path for an application.
   Each instance has a well-defined lifecycle.  It is established as a
   result of some event.  That event can be a SIP event, such as the
   reception of a SIP INVITE request, or it can be a non-SIP event,
   such as a web form post or even a timer.  Application instances
   also have a specific end time.  Some instances have a lifetime that
   is coupled with a SIP transaction or dialog.  For example, a proxy
   application might begin when an INVITE arrives, and terminate when
   the call is answered.  Other applications have a lifetime that
   spans multiple dialogs or transactions.  For example, a
   conferencing application instance may exist so long as there are
   any dialogs connected to it.  When the last dialog terminates, the
   application instance terminates.  Other applications have a
   lifetime that is completely decoupled from SIP events.

   It is fundamental to the framework described here that multiple
   application instances may interact with a user during a single SIP
   transaction or dialog.  Each instance may be for the same
   application, or different applications.  Each of the applications
   may be completely independent, in that they may be owned by
   different providers, and may not be aware of each other's
   existence.  Similarly, there may be application instances
   interacting with the caller, and instances interacting with the
   callee, both within the same transaction or dialog.

   The first step in the interaction with the user is to instantiate
   one or more user interface components for the application instance.
   A user interface component is a single piece of the user interface
   that is defined by a logical flow that is not synchronously coupled
   with any other component.  In other words, each component runs more
   or less independently.
   A user interface component can be instantiated in one of the user
   devices (for a client-local user interface), or within a network
   element (for a client-remote user interface).  If a client-local
   user interface is to be used, the application needs to determine
   whether or not the user device is capable of supporting a
   client-local user interface, and in what format.  In this
   framework, all client-local user interface components are described
   by a markup language.  A markup language describes a logical flow
   of presentation of information to the user, collection of
   information from the user, and transmission of that information to
   an application.  Examples of markup languages include HTML, WML,
   VoiceXML, the Keypad Markup Language (KPML) [8] and the Media
   Server Control Markup Language (MSCML) [9].

   The interface between the user interface component and the
   application is typically markup-language specific.  For those
   markups which support rendering of information to a user, such as
   HTML, HTTP form POST operations are used.  For those markups where
   no information is rendered to the user, the markup can play one of
   two roles.  The first is called "one shot".  In the one-shot role,
   the markup waits for a user to enter some information, and when
   they do, reports this event to the application.  The application
   then does something, and the markup is no longer used.  In the
   other modality, called "monitor", the markup stays permanently
   resident, and reports information back to an application
   continuously.  However, the act of reporting information back to
   the application does not cause the installation of a new markup.
   In markups where one-shot or monitor modalities are used, a SIP
   MESSAGE request is used to report the status.

   To create a client-local user interface, the application passes the
   markup document (or a reference to it) in a SIP message to that
   client.
The SIP message can be one explicitly generated by the application (in which case the application has to be a UA or B2BUA), or the component can be placed in a SIP message that passes by (in which case the application can be running in a proxy).

Client local user interface components are always associated with the dialog that the SIP message itself is associated with. Consequently, user interface components cannot be placed in messages that are not associated with a dialog.

If a user interface component is to be instantiated in the network, there is no need to determine the capabilities of the device on which the user interface is instantiated. Presumably, it is on a device on which the application knows a UI can be created. However, the application does need to connect the user device to the user interface. This will require manipulation of media streams in order to establish that connection.

Once a user interface component is created, the application needs to be able to change it, and to remove it. Finally, more advanced applications may require coupling between application components. The framework supports rudimentary capabilities there.

5. Client Local Interfaces

One key component of this framework is support for client local user interfaces.

5.1 Discovering Capabilities

A client local user interface can only be instantiated on a client if the user device has the capabilities needed to do so. Specifically, an application needs to know what markup languages, if any, are supported by the client. For example, does the client support HTML? VoiceXML? However, that information is not sufficient to determine if a client local user interface can be instantiated. In order to instantiate the user interface, the application needs to transfer the markup document to the client. There are two ways in which the markup document can be transferred.
The application can send the client a URI which the client can use to fetch the markup, or the markup can be sent inline within the message. The application needs to know which of these modes are supported, and in the case of indirection, which URI schemes are supported for obtaining the markup.

Many applications will need to know these capabilities at the time an application instance is first created. Since applications can be created through SIP requests or responses, SIP needs to provide a means to convey this information. This introduces several concrete requirements for SIP:

REQ 1: A SIP request or response must be capable of conveying the set of markup languages supported by the UA that generated the request or response.

REQ 2: A SIP request or response must be capable of indicating whether a UA can obtain markups inline, or through an indirection. In the case of indirection, the UA must be capable of indicating what URI schemes it supports.

5.2 Pushing an Initial Interface Component

Once the application has determined that the UA is capable of supporting client local user interfaces, the next step is for the application to push an interface component to the user device.

Generally, we anticipate that interface components will need to be created at various different points in a SIP session. Clearly, they will need to be pushed during an initial INVITE, in both responses (so as to place a component into the calling UA) and in the request (so as to place a component into the called UA). As an example, a conference recording application allows the users to record the media for the session at any time. The application would like to push an HTML user interface component to both the caller and callee at the time the call is set up, allowing either to record the session. The HTML component would have buttons to start and stop recording.
To push the HTML component to the caller, it needs to be pushed in the 200 OK (and possibly a provisional response), and to push it to the callee, in the INVITE itself.

To state the requirement more concretely:

REQ 3: An application must be able to add a reference to, or an inline version of, a user interface component into any request or response that passes through or is emanated from that application.

However, there will also be cases where the application needs to push a new interface component to a UA, but not as a result of any SIP message. As an example, a pre-paid calling card application will set a timer that determines how long the call can proceed, given the availability of funds in the user's account. When the timer fires, the application would like to push a new interface component to the calling UA, allowing the user to click to add more funds.

In this case, there is no message already in transit that can be used as a vehicle for pushing a user interface component. This requires that applications be able to generate their own messages to push a new component to a UA:

REQ 4: A UA application must be able to send a SIP message to the UA at the other end of the dialog, asking it to create a new interface component.

In all cases, the information passed from the application to the UA must include more than just the interface component itself (or a reference to it). The user must be able to decide whether or not they want to proceed with this application. To make that determination, the user must have information about the application. Specifically, they will need the name of the application, and an identifier of the owner or administrator of the application. As an example, a typical name would be "Prepaid Calling Card" and the owner could be "voiceprovider.com".
REQ 5: Any user interface component passed to a client (either inline or through a reference) must also include markup meta-data, including a human-readable name of the application, and an identifier of the owner of the application.

Clearly, there are security implications. The user will need to verify the identity of the application owner, and be sure that the user interface component is not being replayed; that is, that it actually belongs with this specific SIP message.

REQ 6: It must be possible for the client to validate the authenticity and integrity of the markup document (or its reference) and its associated meta-data. It must be possible for the client to verify that the information has not been replayed from a previous SIP message.

If the user decides not to execute the user interface component, it is simply discarded. There is no explicit requirement for the user to be able to inform the application that the component was discarded. Effectively, the application will think that the component was executed, but that the user never entered any information.

5.3 Updating an Interface Component

Once a user interface component has been created on a client, it can be updated in two ways. The first way is the "normal" path inherent to that component. The user enters some data, the user interface transfers the information to the application (typically through HTTP), and the result of that transfer brings a new markup document describing an updated interface. This is referred to as a synchronous update, since it is synchronized with user interaction.

However, synchronous updates are not sufficient for many applications. Frequently, the interface will need to be updated asynchronously by the application, without an explicit user action. A good example of this is, once again, the pre-paid calling card application.
The application might like to update the user interface when the timer runs out on the call. This introduces several requirements:

REQ 7: It must be possible for an application to asynchronously push an update to an existing user interface component, either in a message that was already in transit, or by generating a new message.

REQ 8: It must be possible for the client to associate the new interface component with the one that it is supposed to replace, so that the old one can be removed.

Unfortunately, pushing of application components introduces a race condition. What if the user enters data into the old component, causing an HTTP request to the application, while an update of that component is in progress? The client will get an interface component in the HTTP response, and also get the new one in the SIP message. Which one does the client use? There needs to be a way to properly order the components:

REQ 9: It must be possible for the client to relatively order user interface updates it receives as the result of synchronous and asynchronous messaging.

5.4 Terminating an Interface Component

User interface components have a well-defined lifetime. They are created when the component is first pushed to the client. User interface components are always associated with the SIP dialog on which they were pushed. As such, their lifetime is bound by the lifetime of the dialog. When the dialog ends, so does the interface component.

This rule applies to early dialogs as well. If a user interface component is passed in a provisional response to INVITE, and a separate branch eventually answers the call, the component terminates with the arrival of the 2xx. That is because the early dialog itself terminates with the arrival of the 2xx.
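The lifetime rule above amounts to simple bookkeeping on the client: components are stored per dialog and vanish with the dialog, early dialogs included. The following Python sketch is illustrative only; the names are invented and nothing here is a normative data model.

```python
# Sketch of dialog-bound component lifetime (illustrative, non-normative).

class Client:
    def __init__(self):
        self.components = {}            # dialog id -> list of component names

    def push_component(self, dialog_id, component):
        self.components.setdefault(dialog_id, []).append(component)

    def dialog_terminated(self, dialog_id):
        # When a dialog ends (including an early dialog that loses to
        # another branch at the 2xx), its components end with it.
        self.components.pop(dialog_id, None)


client = Client()
client.push_component("early-branch-1", "prepaid-ui")   # from a 180 response
client.push_component("early-branch-2", "prepaid-ui")

# Branch 2 answers with a 2xx: the early dialog on branch 1 terminates,
# and so does the component that was pushed on it.
client.dialog_terminated("early-branch-1")

print(sorted(client.components))        # ['early-branch-2']
```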
However, there are some cases where the application would like to terminate the user interface component before its natural termination point. To do this, the application pushes a "null" update to the client. This is an update that replaces the existing user interface component with nothing.

REQ 10: It must be possible for an application to terminate a user interface component before its natural expiration.

The user can also terminate the user interface component. However, no explicit signaling is required in this case. The component is simply dismissed. To the application, it appears as if the user has simply ceased entering data.

6. Client Remote Interfaces

As an alternative to, or in conjunction with, client local user interfaces, an application can make use of client remote user interfaces. These user interfaces can execute co-resident with the application itself (in which case no standardized interfaces between the UI and the application need to be used), or they can run separately. This framework assumes that the user interface runs on a host that has a sufficient trust relationship with the application. As such, the means for instantiating the user interface is not considered here.

The primary issue is to connect the user device to the remote user interface. Doing so requires the manipulation of media streams between the client and the user interface. Such manipulation can only be done by user agents. There are two types of user agent applications within this framework: originating/terminating applications, and intermediary applications.

6.1 Originating and Terminating Applications

Originating and terminating applications are applications which are themselves the originator or the final recipient of a SIP invitation. They are "pure" user agent applications, not back-to-back user agents.
The classic example of such an application is an interactive voice response (IVR) application, which is typically a terminating application. It is a terminating application because the user explicitly calls it; i.e., it is the actual called party. An example of an originating application is a wakeup call application, which calls a user at a specified time in order to wake them up.

Because originating and terminating applications are a natural termination point of the dialog, manipulation of the media session by the application is trivial. Traditional SIP techniques for adding and removing media streams, modifying codecs, and changing the address of the recipient of the media streams can be applied. Similarly, the application can directly authenticate itself to the user through S/MIME, since it is the peer UA in the dialog.

6.2 Intermediary Applications

Intermediary applications are, at the same time, more common than originating/terminating applications, and more complex. Intermediary applications are applications that are neither the actual caller nor the called party. Rather, they represent a "third party" that wishes to interact with the user. The classic example is the ubiquitous pre-paid calling card application.

In order for the intermediary application to add a client remote user interface, it needs to manipulate the media streams of the user agent to terminate on that user interface. This also introduces a fundamental feature interaction issue. Since the intermediary application is not an actual participant in the call, how does the user interact with the intermediary application, and with its actual peer in the dialog, at the same time? This is discussed in more detail in Section 7.

7. Inter-Application Feature Interaction

The inter-application feature interaction problem is inherent to stimulus signaling.
Whenever there are multiple applications, there are multiple user interfaces. When the user provides an input, to which user interface is the input destined? That question is the essence of the inter-application feature interaction problem.

Inter-application feature interaction is not an easy problem to resolve. For now, we consider separately the issues for client-local and client-remote user interface components.

7.1 Client Local UI

When the user interface itself resides locally on the client device, the feature interaction problem is actually much simpler. The end device knows explicitly about each application, and therefore can present the user with each one separately. When the user provides input, the client device can determine to which user interface the input is destined. The user interface to which input is destined is referred to as the application in focus, and the means by which the focused application is selected is called focus determination.

Generally speaking, focus determination is purely a local operation. In the PC universe, focus determination is provided by window managers. Each application does not know about focus; it merely receives the user input that has been targeted to it when it is in focus. This basic concept applies to SIP-based applications as well.

Focus determination will frequently be trivial, depending on the user interface type. Consider a user that makes a call from a PC. The call passes through a pre-paid calling card application and a call recording application. Both of these wish to interact with the user. Both push an HTML-based user interface to the user. On the PC, each user interface would appear as a separate window. The user interacts with the call recording application by selecting its window, and with the pre-paid calling card application by selecting its window.
Focus determination is literally provided by the PC window manager. It is clear to which application the user input is targeted.

As another example, consider the same two applications, but on a "smart phone" that has a set of buttons, and next to each button, an LCD display that can provide the user with an option. This user interface can be represented using the Wireless Markup Language (WML).

The phone would allocate some number of buttons to each application. The prepaid calling card application would get one button for its "hangup" command, and the recording application would get one for its "start/stop" command. The user can easily determine which application to interact with by pressing the appropriate button. Pressing a button determines focus and provides user input, both at the same time.

Unfortunately, not all devices will have these advanced displays. A PSTN gateway, or a basic IP telephone, may only have a 12-key keypad. The user interfaces for these devices are provided through the Keypad Markup Language (KPML). Considering once again the feature interaction case above, the pre-paid calling card application and the call recording application would both pass a KPML document to the device. When the user presses a button on the keypad, to which document does the input apply? The user interface does not allow the user to select. A user interface where the user cannot provide focus is called a focusless user interface. This is quite a hard problem to solve. This framework does not make any explicit normative recommendation, but concludes that the best option is to send the input to both user interfaces unless the markup in one interface has indicated that it should be suppressed from others. This is a sensible choice by analogy: it is exactly what the existing circuit-switched telephone network will do.
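On a focusless device, the behavior this framework suggests can be sketched as a broadcast dispatcher: every key press goes to every resident user interface, unless some markup has claimed that input for itself, in which case it is suppressed from the others. The code below is an illustration with invented names and a made-up suppression representation, not a normative algorithm.

```python
# Sketch of input dispatch on a focusless device (illustrative).

def dispatch_key(key, interfaces):
    """Deliver one key press to all resident UIs, unless a markup has
    indicated that input matching this key should be suppressed from
    the other interfaces."""
    claimers = [ui for ui in interfaces if key in ui.get("suppress", set())]
    targets = claimers if claimers else interfaces
    return [ui["name"] for ui in targets]


# Two applications, each with a KPML-style document on the device.
prepaid = {"name": "prepaid", "suppress": {"#"}}   # '#' claimed for hangup
recorder = {"name": "recorder", "suppress": set()}

print(dispatch_key("1", [prepaid, recorder]))   # ['prepaid', 'recorder']
print(dispatch_key("#", [prepaid, recorder]))   # ['prepaid']
```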
It is an explicit non-goal to provide a better mechanism for feature interaction resolution than the PSTN on devices which have the same user interface as they do on the PSTN. Devices with better displays, such as PCs or screen phones, can benefit from the capabilities of this framework, allowing the user to determine which application they are interacting with.

Indeed, when a user provides input on a focusless device, the input must be passed to all client local user interfaces, AND all client remote user interfaces, unless the markup tells the UI to suppress the media. In the case of KPML, key events are passed to remote user interfaces by encoding them in RFC 2833 [7]. Of course, since a client cannot determine whether or not a media stream terminates in a remote user interface, these key events are passed in all audio media streams unless the "Q" digit is used to suppress them.

7.2 Client-Remote UI

When the user interfaces run remotely, the determination of focus can be much, much harder. There are many architectures that can be deployed to handle the interaction. None are ideal. However, all are beyond the scope of this specification.

8. Intra-Application Feature Interaction

An application can instantiate a multiplicity of user interface components. For example, a single application can instantiate two separate HTML components and one WML component. Furthermore, an application can instantiate both client local and client remote user interfaces.

The feature interaction issues between these components within the same application are less severe. If an application has multiple client user interface components, their interaction is resolved identically to the inter-application case: through focus determination.
However, the problems in focusless user interfaces (such as a keypad) generally won't exist, since the application can generate user interfaces which do not overlap in their usage of an input.

The real issue is that the optimal user experience frequently requires some kind of coupling between the differing user interface components. This is a classic problem in multi-modal user interfaces, such as those described by Speech Application Language Tags (SALT). As an example, consider a user interface where a user can either press a labeled button to make a selection, or listen to a prompt and speak the desired selection. Ideally, when the user presses the button, the prompt should cease immediately, since both of them were targeted at collecting the same information in parallel. Such interactions are best handled by markups which natively support them, such as SALT, and thus require no explicit support from this framework.

9. Examples

TODO.

10. Security Considerations

There are many security considerations associated with this framework. It allows applications in the network to instantiate user interface components on a client device. Such instantiations need to be from authenticated applications, and also need to be authorized to place a UI into the client. Indeed, the stronger requirement is authorization. It is not so important to know the name of the provider of the application, but rather, that the provider is authorized to instantiate components.

Generally, an application should be considered authorized if it was legitimately part of the call setup path. With this definition, authorization can be enforced using the sips URI scheme when the call is initiated.

11. Contributors

This document was produced as a result of discussions amongst the application interaction design team.
All members of this team contributed significantly to the ideas embodied in this document. The members of this team were:

Eric Burger
Cullen Jennings
Robert Fairlie-Cuninghame

Informative References

[1] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

[2] McGlashan, S., Lucas, B., Porter, B., Rehor, K., Burnett, D., Carter, J., Ferrans, J. and A. Hunt, "Voice Extensible Markup Language (VoiceXML) Version 2.0", W3C CR CR-voicexml20-20030220, February 2003.

[3] Day, M., Rosenberg, J. and H. Sugano, "A Model for Presence and Instant Messaging", RFC 2778, February 2000.

[4] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol", draft-ietf-sipping-conferencing-framework-00 (work in progress), May 2003.

[5] Rosenberg, J., Schulzrinne, H. and P. Kyzivat, "Caller Preferences and Callee Capabilities for the Session Initiation Protocol (SIP)", draft-ietf-sip-callerprefs-08 (work in progress), March 2003.

[6] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[7] Schulzrinne, H. and S. Petrack, "RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals", RFC 2833, May 2000.

[8] Burger, E., "Keypad Markup Language (KPML)", draft-burger-sipping-kpml-02 (work in progress), July 2003.

[9] Van Dyke, J., Burger, E. and A. Spitzer, "Media Server Control Markup Language (MSCML) and Protocol", draft-vandyke-mscml-02 (work in progress), July 2003.
Author's Address

Jonathan Rosenberg
dynamicsoft
600 Lanidex Plaza
Parsippany, NJ 07054
US

Phone: +1 973 952-5000
EMail: jdrosen@dynamicsoft.com
URI: http://www.jdrosen.net

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this 1058 document itself may not be modified in any way, such as by removing 1059 the copyright notice or references to the Internet Society or other 1060 Internet organizations, except as needed for the purpose of 1061 developing Internet standards in which case the procedures for 1062 copyrights defined in the Internet Standards process must be 1063 followed, or as required to translate it into languages other than 1064 English. 1066 The limited permissions granted above are perpetual and will not be 1067 revoked by the Internet Society or its successors or assignees. 1069 This document and the information contained herein is provided on an 1070 "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1071 TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1072 BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1073 HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1074 MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1076 Acknowledgement 1078 Funding for the RFC Editor function is currently provided by the 1079 Internet Society.