SIPPING                                                     J. Rosenberg
Internet-Draft                                               dynamicsoft
Expires: April 19, 2004                                 October 20, 2003

    A Framework for Application Interaction in the Session Initiation
                             Protocol (SIP)
            draft-ietf-sipping-app-interaction-framework-00

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 19, 2004.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document describes a framework and requirements for the
interaction between users and Session Initiation Protocol (SIP) based
applications. By interacting with applications, users can guide the way
in which they operate. The focus of this framework is stimulus
signaling, which allows a user agent to interact with an application
without knowledge of the semantics of that application. Stimulus
signaling can occur to a user interface running locally with the
client, or to a remote user interface, through media streams. Stimulus
signaling encompasses a wide range of mechanisms, ranging from clicking
on hyperlinks, to pressing buttons, to traditional Dual Tone Multi
Frequency (DTMF) input. In all cases, stimulus signaling is supported
through the use of markup languages, which play a key role in this
framework.

Table of Contents

1.     Introduction
2.     Definitions
3.     A Model for Application Interaction
3.1    Functional vs. Stimulus
3.2    Real-Time vs. Non-Real Time
3.3    Client-Local vs. Client-Remote
3.4    Presentation Capable vs. Presentation Free
3.5    Interaction Scenarios on Telephones
3.5.1  Client Remote
3.5.2  Client Local
3.5.3  Flip-Flop
4.     Framework Overview
5.     Client Local Interfaces
5.1    Discovering Capabilities
5.2    Pushing an Initial Interface Component
5.3    Updating an Interface Component
5.4    Terminating an Interface Component
6.     Client Remote Interfaces
6.1    Originating and Terminating Applications
6.2    Intermediary Applications
7.     Inter-Application Feature Interaction
7.1    Client Local UI
7.2    Client-Remote UI
8.     Intra Application Feature Interaction
9.     Examples
10.    Security Considerations
11.    Contributors
       Informative References
       Author's Address
       Intellectual Property and Copyright Statements

1. Introduction

The Session Initiation Protocol (SIP) [1] provides the ability for
users to initiate, manage, and terminate communications sessions.
Frequently, these sessions will involve a SIP application. A SIP
application is defined as a program running on a SIP-based element
(such as a proxy or user agent) that provides some value-added function
to a user or system administrator. Examples of SIP applications include
pre-paid calling card calls, conferencing, and presence-based [3] call
routing.

In order for most applications to properly function, they need input
from the user to guide their operation. As an example, a pre-paid
calling card application requires the user to input their calling card
number, their PIN code, and the destination number they wish to reach.
The process by which a user provides input to an application is called
"application interaction".

Application interaction can be either functional or stimulus.
Functional interaction requires the user agent to understand the
semantics of the application, whereas stimulus interaction does not.
Stimulus signaling allows for applications to be built without
requiring modifications to the client. Stimulus interaction is the
subject of this framework. The framework provides a model for how users
interact with applications through user interfaces, and how user
interfaces and applications can be distributed throughout a network.
This model is then used to describe how applications can instantiate
and manage user interfaces.

2. Definitions

SIP Application: A SIP application is defined as a program running on a
SIP-based element (such as a proxy or user agent) that provides some
value-added function to a user or system administrator. Examples of SIP
applications include pre-paid calling card calls, conferencing, and
presence-based [3] call routing.

Application Interaction: The process by which a user provides input to
an application.
Real-Time Application Interaction: Application interaction that takes
place while an application instance is executing. For example, when a
user enters their PIN into a pre-paid calling card application, this is
real-time application interaction.

Non-Real Time Application Interaction: Application interaction that
takes place asynchronously with the execution of the application.
Generally, non-real time application interaction is accomplished
through provisioning.

Functional Application Interaction: Application interaction is
functional when the user device has an understanding of the semantics
of the application that the user is interacting with.

Stimulus Application Interaction: Application interaction is considered
to be stimulus when the user device has no understanding of the
semantics of the application that the user is interacting with.

User Interface (UI): The user interface provides the user with context
in order to make decisions about what they want. The user enters
information into the user interface. The user interface interprets the
information, and passes it to the application.

User Interface Component: A piece of user interface which operates
independently of other pieces of the user interface. For example, a
user might have two separate web interfaces to a pre-paid calling card
application - one for hanging up and making another call, and another
for entering the username and PIN.

User Device: The software or hardware system that the user directly
interacts with in order to communicate with the application. An example
of a user device is a telephone. Another example is a PC with a web
browser.

User Input: The "raw" information passed from a user to a user
interface. Examples of user input include a spoken word or a click on a
hyperlink.
Client-Local User Interface: A user interface which is co-resident with
the user device.

Client-Remote User Interface: A user interface which executes remotely
from the user device. In this case, a standardized interface is needed
between them. Typically, this is done through media sessions - audio,
video, or application sharing.

Media Interaction: A means of separating a user and a user interface by
connecting them with media streams.

Interactive Voice Response (IVR): An IVR is a type of user interface
that allows users to speak commands to the application, and hear
responses to those commands prompting for more information.

Prompt-and-Collect: The basic primitive of an IVR user interface. The
user is presented with a voice option, and the user speaks their
choice.

Barge-In: In an IVR user interface, a user is prompted to enter some
information. With some prompts, the user may enter the requested
information before the prompt completes. In that case, the prompt
ceases. The act of entering the information before completion of the
prompt is referred to as barge-in.

Focus: A user interface component has focus when user input is provided
to it, as opposed to any other user interface components. This is not
to be confused with the term focus within the SIP conferencing
framework, which refers to the center user agent in a conference [4].

Focus Determination: The process by which the user device determines
which user interface component will receive the user input.

Focusless User Interface: A user interface which has no ability to
perform focus determination. An example of a focusless user interface
is a keypad on a telephone.

Presentation Capable UI: A user interface which can prompt the user,
collect input, and then prompt the user with new information based on
those results.
Presentation Free UI: A user interface which cannot prompt the user
with information.

Feature Interaction: A class of problems which result when multiple
applications or application components are trying to provide services
to a user at the same time.

Inter-Application Feature Interaction: Feature interactions that occur
between applications.

DTMF: Dual-Tone Multi-Frequency. DTMF refers to a class of tones
generated by circuit-switched telephony devices when the user presses a
key on the keypad. As a result, DTMF and keypad input are often used
synonymously, when in fact one of them (DTMF) is merely a means of
conveying the other (the keypad input) to a client-remote user
interface (the switch, for example).

Application Instance: A single execution path of a SIP application.

Originating Application: A SIP application which acts as a UAC, calling
the user.

Terminating Application: A SIP application which acts as a UAS,
answering a call generated by a user. IVR applications are terminating
applications.

Intermediary Application: A SIP application which is neither the caller
nor the callee, but rather, a third party involved in a call.

3. A Model for Application Interaction

   +---+            +---+            +---+             +---+
   |   |            |   |            |   |             |   |
   |   |            | U |            | U |             | A |
   |   |   Input    | s |   Input    | s |   Results   | p |
   |   | ---------> | e | ---------> | e | ----------> | p |
   | U |            | r |            | r |             | l |
   | s |            |   |            |   |             | i |
   | e |            | D |            | I |             | c |
   | r |   Output   | e |   Output   | f |   Update    | a |
   |   | <--------- | v | <--------- | a | <.......... | t |
   |   |            | i |            | c |             | i |
   |   |            | c |            | e |             | o |
   |   |            | e |            |   |             | n |
   |   |            |   |            |   |             |   |
   +---+            +---+            +---+             +---+

          Figure 1: Model for Real-Time Interactions

Figure 1 presents a general model for how users interact with
applications. Generally, users interact with a user interface through a
user device.
A user device can be a telephone, or it can be a PC with a web browser.
Its role is to pass the user input from the user to the user interface.
The user interface provides the user with context in order to make
decisions about what they want. The user enters information into the
user interface. The user interface interprets the information, and
passes it to the application. The application may be able to modify the
user interface based on this information. Whether or not this is
possible depends on the type of user interface.

User interfaces are fundamentally about rendering and interpretation.
Rendering refers to the way in which the user is provided context. This
can be through hyperlinks, images, sounds, videos, text, and so on.
Interpretation refers to the way in which the user interface takes the
"raw" data provided by the user, and returns the result to the
application in a meaningful format, abstracted from the particulars of
the user interface. As an example, consider a pre-paid calling card
application. The user interface worries about details such as what
prompt the user is provided, whether the voice is male or female, and
so on. It is concerned with recognizing the speech that the user
provides, in order to obtain the desired information. In this case, the
desired information is the calling card number, the PIN code, and the
destination number. The application needs that data, and it doesn't
matter to the application whether it was collected using a male prompt
or a female one.

User interfaces generally have real-time requirements towards the user.
That is, when a user interacts with the user interface, the user
interface needs to react quickly, and that change needs to be
propagated to the user right away. However, the interface between the
user interface and the application need not be that fast.
Faster is better, but the user interface itself can frequently
compensate for long latencies there. In the case of a pre-paid calling
card application, when the user is prompted to enter their PIN, the
prompt should generally stop immediately once the first digit of the
PIN is entered. This is referred to as barge-in. After the user
interface collects the rest of the PIN, it can tell the user to "please
wait while processing". The PIN can then be gradually transmitted to
the application. In this example, the user interface has compensated
for a slow UI-to-application interface by asking the user to wait.

The separation between user interface and application is absolutely
fundamental to the entire framework provided in this document. Its
importance cannot be overstated.

With this basic model, we can begin to taxonomize the types of systems
that can be built.

3.1 Functional vs. Stimulus

The first way to taxonomize the system is to consider the interface
between the UI and the application. There are two fundamentally
different models for this interface. In a functional interface, the
user interface has detailed knowledge about the application, and is, in
fact, specific to the application. The interface between the two
components is through a functional protocol, capable of representing
the semantics which can be exposed through the user interface. Because
the user interface has knowledge of the application, it can be
optimally designed for that application. As a result, functional user
interfaces are almost always the most user friendly, the fastest, and
the most responsive. However, in order to allow interoperability
between user devices and applications, the details of the functional
protocols need to be specified in standards. This slows down innovation
and limits the scope of applications that can be built.
An alternative is a stimulus interface. In a stimulus interface, the
user interface is generic, totally ignorant of the details of the
application. Indeed, the application may pass instructions to the user
interface describing how it should operate. The user interface
translates user input into "stimulus" - data understood only by the
application, and not by the user interface. Because they are generic,
and because they require communications with the application in order
to change the way in which they render information to the user,
stimulus user interfaces are usually slower, less user friendly, and
less responsive than a functional counterpart. However, they allow for
substantial innovation in applications, since no standardization
activity is needed to build a new application, as long as it can
interact with the user within the confines of the user interface
mechanism. The web is an example of a stimulus user interface to
applications.

In SIP systems, functional interfaces are provided by extending the SIP
protocol to provide the needed functionality. For example, the SIP
caller preferences specification [5] provides a functional interface
that allows a user to request applications to route the call to
specific types of user agents. Functional interfaces are important, but
are not the subject of this framework. The primary goal of this
framework is to address the role of stimulus interfaces to SIP
applications.

3.2 Real-Time vs. Non-Real Time

Application interaction systems can also be real-time or non-real-time.
Non-real-time interaction allows the user to enter information about
application operation asynchronously with its invocation. Frequently,
this is done through provisioning systems. As an example, a user can
set up the forwarding number for a call-forward-on-no-answer
application using a web page.
Real-time interaction requires the user to interact with the
application at the time of its invocation.

3.3 Client-Local vs. Client-Remote

Another axis in the taxonomization is whether the user interface is
co-resident with the user device (which we refer to as a client-local
user interface), or the user interface runs on a host separated from
the client (which we refer to as a client-remote user interface). In a
client-remote user interface, there exists some kind of protocol
between the client device and the UI that allows the client to interact
with the user interface over a network.

The most important way to separate the UI and the client device is
through media interaction. In media interaction, the interface between
the user and the user interface is through media - audio, video,
messaging, and so on. This is the classic mode of operation for
VoiceXML [2], where the user interface (also referred to as the voice
browser) runs on a platform in the network. Users communicate with the
voice browser through the telephone network (or using a SIP session).
The voice browser interacts with the application using HTTP to convey
the information collected from the user.

In the second case, a client-local user interface, the user interface
runs co-located with the user. The interface between them is through
the software that interprets the user's input and passes it to the user
interface. The classic example of this is the web. In the web, the user
interface is a web browser, and the interface is defined by the HTML
document that it's rendering. The user interacts directly with the user
interface running in the browser. The results of that user interface
are sent to the application (running on the web server) using HTTP.
It is important to note that whether the user interface is local or
remote (in the case of media interaction) is not a property of the
modality of the interface, but rather a property of the system. As an
example, it is possible for a web-based user interface to be provided
as a client-remote user interface. In such a scenario, video and
application sharing media sessions can be used between the user and the
user interface. The user interface, still guided by HTML, now runs "in
the network", remote from the client. Similarly, a VoiceXML document
can be interpreted locally by a client device, with no media streams at
all. Indeed, the VoiceXML document can be rendered using text, rather
than media, with no impact on the interface between the user interface
and the application.

It is also important to note that systems can be hybrid. In a hybrid
user interface, some aspects of it (usually those associated with a
particular modality) run locally, and others run remotely.

3.4 Presentation Capable vs. Presentation Free

A user interface can be capable of presenting information to the user
(a presentation capable UI), or it can be capable only of collecting
user input (a presentation free UI). These are very different types of
user interfaces. A presentation capable UI can provide the user with
feedback after every input, providing the context for collecting the
next input. As a result, presentation capable user interfaces require
an update to the information provided to the user after each input. The
web is a classic example of this. After every input (i.e., a click),
the browser provides the input to the application and fetches the next
page to render. In a presentation free user interface, this is not the
case.
Since the user is not provided with feedback, these user interfaces
tend to merely collect information as it is entered, and pass it to the
application.

Another difference is that a presentation-free user interface cannot
support the concept of a focus. As a result, if multiple applications
wish to gather input from the user, there is no way for the user to
select which application the input is destined for. The input provided
to applications through presentation-free user interfaces is more of a
broadcast or notification operation, as a result.

3.5 Interaction Scenarios on Telephones

This same model can apply to a telephone. In a traditional telephone,
the user interface consists of a 12-key keypad, a speaker, and a
microphone. Indeed, from here forward, the term "telephone" is used to
represent any device that meets, at a minimum, the characteristics
described in the previous sentence. Circuit-switched telephony
applications are almost universally client-remote user interfaces. In
the Public Switched Telephone Network (PSTN), there is usually a
circuit interface between the user and the user interface. The user
input from the keypad is conveyed using Dual-Tone Multi-Frequency
(DTMF), and the microphone input as PCM-encoded voice.

In an IP-based system, there is more variability in how the system can
be instantiated. Both client-remote and client-local user interfaces to
a telephone can be provided.

In this framework, a PSTN gateway can be considered a "user proxy". It
is a proxy for the user because it can provide, to a user interface on
an IP network, input taken from a user on a circuit-switched telephone.
The gateway may be able to run a client-local user interface, just as
an IP telephone might.

3.5.1 Client Remote

The most obvious instantiation is the "classic" circuit-switched
telephony model.
In that model, the user interface runs remotely from the client. The
interface between the user and the user interface is through media, set
up by SIP and carried over the Real Time Transport Protocol (RTP) [7].
The microphone input can be carried using any suitable voice encoding
algorithm. The keypad input can be conveyed in one of two ways. The
first is to convert the keypad input to DTMF, and then convey that DTMF
using a suitable encoding algorithm for it (such as PCMU). An
alternative, and generally the preferred approach, is to transmit the
keypad input using RFC 2833 [8], which provides an encoding mechanism
for carrying keypad input within RTP.

In this classic model, the user interface would run on a server in the
IP network. It would perform speech recognition and DTMF recognition to
derive the user intent, feed them through the user interface, and
provide the result to an application.

3.5.2 Client Local

An alternative model is for the entire user interface to reside on the
telephone. The user interface can be a VoiceXML browser, running speech
recognition on the microphone input, and feeding the keypad input
directly into the script. As discussed above, the VoiceXML script could
be rendered using text instead of voice, if the telephone had a textual
display.

3.5.3 Flip-Flop

A middle-ground approach is to flip back and forth between a
client-local and client-remote user interface. Many voice applications
are of the type which listen to the media stream and wait for some
specific trigger that kicks off a more complex user interaction. The
long pound in a pre-paid calling card application is one example.
Another example is a conference recording application, where the user
can press a key at some point in the call to begin recording. When the
key is pressed, the user hears a whisper to inform them that recording
has started.
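The trigger-detection step described above can be sketched in code. The
fragment below is illustrative only: the payload layout and the event
code for '#' follow RFC 2833, but the function names and the duration
threshold are assumptions, not part of this framework.

```python
import struct

# RFC 2833 assigns a named event to each key; '#' is event code 11.
POUND = 11

def decode_telephone_event(payload: bytes):
    """Decode a 4-byte RFC 2833 telephone-event payload:
    event (8 bits), E bit + R bit + volume (8 bits), duration (16 bits)."""
    event, flags, duration = struct.unpack("!BBH", payload)
    end = bool(flags & 0x80)  # E bit: set when the event has ended
    volume = flags & 0x3F
    return event, end, volume, duration

def is_long_pound(payload: bytes, min_duration: int = 8000) -> bool:
    # Hypothetical trigger policy: a completed '#' held for at least
    # min_duration timestamp units (8000 units is 1 s at an 8 kHz clock).
    event, end, _volume, duration = decode_telephone_event(payload)
    return event == POUND and end and duration >= min_duration

# A '#' held for 12000 timestamp units (1.5 s), end bit set, volume 10:
long_pkt = struct.pack("!BBH", POUND, 0x80 | 10, 12000)
# The same key released after only 400 units (too short to trigger):
short_pkt = struct.pack("!BBH", POUND, 0x80 | 10, 400)
```

A client-local component of this sort would run such a check on each
incoming telephone-event packet and, on a match, hand control to the
client-remote user interface.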
The ideal way to support such an application is to install a
client-local user interface component that waits for the trigger to
kick off the real interaction. Once the trigger is received, the
application connects the user to a client-remote user interface that
can play announcements, collect more information, and so on.

The benefit of flip-flopping between a client-local and client-remote
user interface is cost. The client-local user interface eliminates the
need to send media streams into the network just to wait for the user
to press the pound key on the keypad.

The Keypad Markup Language (KPML) was designed to support exactly this
kind of need [10]. It models the keypad on a phone, and allows an
application to be informed when any sequence of keys has been pressed.
However, KPML has no presentation component. Since user interfaces
generally require a response to user input, the presentation will need
to be done using a client-remote user interface that gets instantiated
as a result of the trigger.

It is tempting to use a hybrid model, where a prompt-and-collect
application is implemented by using a client-remote user interface that
plays the prompts, and a client-local user interface, described by
KPML, that collects digits. However, this only complicates the
application. Firstly, the keypad input will be sent to both the media
stream and the KPML user interface. This requires the application to
sort out which user inputs are duplicates, a process that is very
complicated. Secondly, the primary benefit of KPML is to avoid having a
media stream towards a user interface. However, there is already a
media stream for the prompting, so there is no real savings.

4. Framework Overview

In this framework, we use the term "SIP application" to refer to a
broad set of functionality.
A SIP application is a program running on a SIP-based element (such as
a proxy or user agent) that provides some value-added function to a
user or system administrator. SIP applications can execute on behalf of
a caller, a called party, or a multitude of users at once.

Each application has a number of instances that are executing at any
given time. An instance represents a single execution path for an
application. Each instance has a well-defined lifecycle. It is
established as a result of some event. That event can be a SIP event,
such as the reception of a SIP INVITE request, or it can be a non-SIP
event, such as a web form post or even a timer. Application instances
also have a specific end time. Some instances have a lifetime that is
coupled with a SIP transaction or dialog. For example, a proxy
application might begin when an INVITE arrives, and terminate when the
call is answered. Other applications have a lifetime that spans
multiple dialogs or transactions. For example, a conferencing
application instance may exist so long as there are any dialogs
connected to it. When the last dialog terminates, the application
instance terminates. Other applications have a lifetime that is
completely decoupled from SIP events.

It is fundamental to the framework described here that multiple
application instances may interact with a user during a single SIP
transaction or dialog. Each instance may be for the same application,
or different applications. Each of the applications may be completely
independent, in that they may be owned by different providers, and may
not be aware of each other's existence. Similarly, there may be
application instances interacting with the caller, and instances
interacting with the callee, both within the same transaction or
dialog.
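The lifecycle rules above can be illustrated with a small sketch. The
class and method names here are assumptions for illustration, not part
of the framework; the termination policy shown is the conferencing one
from the text, where the instance lives as long as any dialog remains
connected.

```python
class ApplicationInstance:
    """Illustrative model of an application instance: created by some
    triggering event (SIP or non-SIP), and terminated by an
    instance-specific policy - here, when the last dialog goes away."""

    def __init__(self, triggering_event: str):
        # e.g. "INVITE", "web form post", "timer"
        self.triggering_event = triggering_event
        self.dialogs: set[str] = set()
        self.terminated = False

    def dialog_connected(self, dialog_id: str) -> None:
        self.dialogs.add(dialog_id)

    def dialog_terminated(self, dialog_id: str) -> None:
        self.dialogs.discard(dialog_id)
        if not self.dialogs:  # last dialog gone -> instance ends
            self.terminated = True

conf = ApplicationInstance("INVITE")
conf.dialog_connected("dialog-1")
conf.dialog_connected("dialog-2")
conf.dialog_terminated("dialog-1")  # one dialog left; instance survives
conf.dialog_terminated("dialog-2")  # last dialog gone; instance ends
```

A proxy-style instance would instead terminate on a transaction event
(the call being answered), and a fully decoupled instance would ignore
dialog state altogether.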
The first step in the interaction with the user is to instantiate one or more user interface components for the application instance. A user interface component is a single piece of the user interface that is defined by a logical flow not synchronously coupled with any other component. In other words, each component runs more or less independently.

A user interface component can be instantiated in one of the user agents in a dialog (for a client-local user interface), or within a network element (for a client-remote user interface). If a client-local user interface is to be used, the application needs to determine whether or not the user agent is capable of supporting a client-local user interface, and in what format. In this framework, all client-local user interface components are described by a markup language. A markup language describes a logical flow of presentation of information to the user, collection of information from the user, and transmission of that information to an application. Examples of markup languages include HTML, WML, VoiceXML, and the Keypad Markup Language (KPML) [10].

Unlike an application instance, which has a very flexible lifetime, a user interface component has a very fixed lifetime. A user interface component is always associated with a dialog. The component can be created at any point after the dialog (or early dialog) is created; however, it terminates when the dialog terminates. The component can be terminated earlier by the user agent, and possibly by the application, but its lifetime never exceeds that of its associated dialog.

There are two ways to create a client-local interface component. For interface components that are presentation capable, the application sends a REFER [9] request to the user agent.
The Refer-To header field contains an HTTP URI that points to the markup for the user interface. For interface components that are presentation free (such as those defined by KPML), the application sends a SUBSCRIBE request to the user agent. The body of the SUBSCRIBE request contains a filter, which, in this case, is the markup that defines when information is to be sent to the application in a NOTIFY.

If a user interface component is to be instantiated in the network, there is no need to determine the capabilities of the device on which the user interface is instantiated; presumably, it is a device on which the application knows a UI can be created. However, the application does need to connect the user device to the user interface. This will require manipulation of media streams in order to establish that connection.

The interface between the user interface component and the application depends on the type of user interface. For presentation capable user interfaces, such as those described by HTML and VoiceXML, HTTP form POST operations are used. For presentation free user interfaces, a SIP NOTIFY is used. The differing needs and capabilities of these two user interface types, as described in Section 3.4, are what drive the different choices for the interactions. Since presentation capable user interfaces require an update to the presentation every time user data is entered, they are a good match for HTTP. Since presentation free user interfaces merely transmit user input to the application, a NOTIFY is more appropriate.

Indeed, for presentation free user interfaces, there are two different modalities of operation. The first is called "one shot". In the one-shot modality, the markup waits for a user to enter some information and, when they do, reports this event to the application.
The application then does something, and the markup is no longer used. In the other modality, called "monitor", the markup stays permanently resident and reports information back to an application until termination of the associated dialog.

5. Client Local Interfaces

One key component of this framework is support for client-local user interfaces.

5.1 Discovering Capabilities

A client-local user interface can only be instantiated on a user agent if the user agent supports that type of user interface component. Support for client-local user interface components is declared by both the UAC and the UAS in their Accept, Allow, Contact and Allow-Events header fields. If the Allow header field indicates support for the SIP SUBSCRIBE method, and the Allow-Events header field indicates support for the [TBD] package, the UA can instantiate presentation free user interface components. The specific markup languages that can be supported are indicated in the Accept header field. If the Allow header field indicates support for the SIP REFER method, and the Contact header field contains UA capabilities [6] that indicate support for the HTTP URI scheme, the UA supports presentation capable user interface components. The specific markups that are supported are likewise indicated in the Accept header field.

The Accept, Allow, Contact and Allow-Events header fields are sent in dialog-initiating requests and responses. As a result, an application will generally need to wait for a dialog-initiating request or response to pass by before it can examine the contents of these headers and determine what kinds of user interface components the UA supports. Because these headers are examined by intermediaries, a UA that wishes to support client-local user interfaces should not encrypt them.
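The capability check above can be sketched as follows. The parsing is deliberately naive (real parsing of Contact feature parameters per [6] is more involved), and "tbd-ui" is a placeholder for the as-yet-unnamed [TBD] event package.

```python
def ui_capabilities(headers):
    """Naive sketch: infer which UI component types a UA supports from
    the header fields of a dialog-initiating request or response."""
    allow   = {m.strip().upper() for m in headers.get("Allow", "").split(",")}
    events  = {e.strip() for e in headers.get("Allow-Events", "").split(",")}
    accept  = {t.strip() for t in headers.get("Accept", "").split(",")}
    contact = headers.get("Contact", "")

    return {
        # Presentation free: SUBSCRIBE allowed + the [TBD] event package.
        "presentation_free": "SUBSCRIBE" in allow and "tbd-ui" in events,
        # Presentation capable: REFER allowed + Contact capabilities
        # indicating support for the HTTP URI scheme.
        "presentation_capable": "REFER" in allow and "http" in contact,
        # Supported markup languages, from Accept.
        "markups": accept,
    }

caps = ui_capabilities({
    "Allow": "INVITE, ACK, BYE, SUBSCRIBE, NOTIFY, REFER",
    "Allow-Events": "tbd-ui",
    "Accept": "text/html, application/kpml+xml",
    "Contact": '<sip:ua@example.com;gruu>;schemes="sip,http"',
})
```

In a real implementation the "schemes" feature tag would be parsed as a quoted, comma-separated list rather than matched as a substring.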
5.2 Pushing an Initial Interface Component

Once the application has determined that the UA is capable of supporting client-local user interfaces, the next step is for the application to push an interface component to the user device.

Generally, we anticipate that interface components will need to be created at various points in a SIP session: during session setup, or after the session is established. A user interface component is always associated with a specific dialog, however.

To create a presentation capable UI component on the UA, the application sends a REFER request to the UA. This REFER is sent to the Globally Routable UA URI (GRUU) [12] advertised by that UA in the Contact header field of the dialog-initiating request or response sent by that UA. This means that any UA which wants to support this framework has to support GRUUs. Note that this REFER request creates a separate dialog between the application and the UA.

OPEN ISSUE: This document has evolved into one that really is describing normative behavior. We could split the document in half, one of which is an informational framework, and the other is a standards track mechanism document. Or, we could have a single framework document that just happens to be standards track.

The Refer-To header field of the REFER request contains an HTTP URI that references the markup document to be fetched. The application should identify itself in the From header field of the request. Once the markup is fetched, the UA renders it, and the user can interact with it as needed.

To create a presentation free user interface component, the application sends a SUBSCRIBE request to the UA. The SUBSCRIBE is sent to the GRUU advertised by the UA. Note that this SUBSCRIBE request also creates a separate dialog. The SUBSCRIBE request is for the [TBD] event package.
The body of the SUBSCRIBE request contains the markup document that defines the conditions under which the application wishes to be notified of user input. The application should identify itself in the From header field of the request.

Since the UI components are bound to the lifetime of the dialog, the UA needs to know which dialog each component is associated with. To make this determination, a UA MUST use a unique GRUU in the Contact header field of each dialog. This uniqueness is across dialogs terminating at that UA, and can be achieved by using the grid URI parameter defined in [12].

OPEN ISSUE: This would require a UA to always use a unique GRUU in each dialog, since it doesn't know whether an application will try to create a UI component. Is that OK?

To authenticate themselves, it is RECOMMENDED that applications use the SIP identity mechanism [11] in the REFER or SUBSCRIBE requests they generate. A UA will need to authorize these SUBSCRIBE and REFER requests. To do this, a UA SHOULD accept any REFER or SUBSCRIBE sent to the GRUU it used for that dialog. This implies that only elements privy to the INVITE requests and responses could send a REFER or SUBSCRIBE to the UA. The usage of the sips URI scheme provides cryptographic assurances that only elements on the call setup path could see such information. Therefore, it is RECOMMENDED that UAs compliant to this specification use sips whenever possible. A client SHOULD use grid parameters with sufficient randomness to eliminate the possibility of an attacker guessing the GRUU.

5.3 Updating an Interface Component

Once a user interface component has been created on a client, it can be updated. The means for updating it depends on the type of UI component.

Presentation capable UI components are updated using techniques already in place for those markups.
In particular, user input will cause an HTTP POST operation to push the user input to the application. The result of the POST operation is a new markup that the UI is supposed to use. This allows the UI to be updated in response to user action. Some markups, such as HTML, provide the ability to force a refresh after a certain period of time, so that the UI can be updated without user input. Those mechanisms can be used here as well. However, there is no support for an asynchronous push of an updated UI component from the application to the user agent. A new REFER request to the same GRUU would create a new UI component rather than updating any components already in place.

For presentation free UI components, the story is different. The application can update the filter at any time by generating a SUBSCRIBE refresh with the new filter. The UA will immediately begin using the new filter.

5.4 Terminating an Interface Component

User interface components have a well-defined lifetime. They are created when the component is first pushed to the client. User interface components are always associated with the SIP dialog on which they were pushed. As such, their lifetime is bound by the lifetime of the dialog: when the dialog ends, so does the interface component.

However, there are some cases where the application would like to terminate the user interface component before its natural termination point. For presentation capable user interfaces, this is not possible. For presentation free user interfaces, the application can terminate the component by sending a SUBSCRIBE with Expires equal to zero. This terminates the subscription, which removes the UI component.

A client can remove a UI component at any time. For presentation capable UI components, this is analogous to the user dismissing the web form window.
There is no mechanism provided for reporting this kind of event to the application. The application needs to be prepared to time out and never receive input from a user. For presentation free user interfaces, the UA can explicitly terminate the subscription. This will result in the generation of a NOTIFY with a Subscription-State header field equal to "terminated".

6. Client Remote Interfaces

As an alternative to, or in conjunction with, client-local user interfaces, an application can make use of client-remote user interfaces. These user interfaces can execute co-resident with the application itself (in which case no standardized interfaces between the UI and the application need to be used), or they can run separately. This framework assumes that the user interface runs on a host that has a sufficient trust relationship with the application. As such, the means for instantiating the user interface is not considered here.

The primary issue is to connect the user device to the remote user interface. Doing so requires the manipulation of media streams between the client and the user interface. Such manipulation can only be done by user agents. There are two types of user agent applications within this framework: originating/terminating applications and intermediary applications.

6.1 Originating and Terminating Applications

Originating and terminating applications are applications which are themselves the originator or the final recipient of a SIP invitation. They are "pure" user agent applications, not back-to-back user agents. The classic example of such an application is an interactive voice response (IVR) application, which is typically a terminating application. It is a terminating application because the user explicitly calls it; i.e., it is the actual called party.
An example of an originating application is a wakeup call application, which calls a user at a specified time in order to wake them up.

Because originating and terminating applications are a natural termination point of the dialog, manipulation of the media session by the application is trivial. Traditional SIP techniques for adding and removing media streams, modifying codecs, and changing the address of the recipient of the media streams can be applied. Similarly, the application can directly authenticate itself to the user through S/MIME, since it is the peer UA in the dialog.

6.2 Intermediary Applications

Intermediary applications are, at the same time, more common than originating/terminating applications, and more complex. Intermediary applications are applications that are neither the actual caller nor the called party. Rather, they represent a "third party" that wishes to interact with the user. The classic example is the ubiquitous prepaid calling card application.

In order for an intermediary application to add a client-remote user interface, it needs to manipulate the media streams of the user agent so that they terminate on that user interface. This also introduces a fundamental feature interaction issue. Since the intermediary application is not an actual participant in the call, how does the user interact with the intermediary application and its actual peer in the dialog at the same time? This is discussed in more detail in Section 7.

7. Inter-Application Feature Interaction

The inter-application feature interaction problem is inherent to stimulus signaling. Whenever there are multiple applications, there are multiple user interfaces. When the user provides an input, to which user interface is the input destined? That question is the essence of the inter-application feature interaction problem.
Inter-application feature interaction is not an easy problem to resolve. For now, we consider separately the issues for client-local and client-remote user interface components.

7.1 Client Local UI

When the user interface itself resides locally on the client device, the feature interaction problem is actually much simpler. The end device knows explicitly about each application, and therefore can present the user with each one separately. When the user provides input, the client device can determine to which user interface the input is destined. The user interface to which input is destined is referred to as the application in focus, and the means by which the focused application is selected is called focus determination.

Generally speaking, focus determination is purely a local operation. In the PC universe, focus determination is provided by window managers. Each application does not know about focus; it merely receives the user input that has been targeted to it when it is in focus. This basic concept applies to SIP-based applications as well.

Focus determination will frequently be trivial, depending on the user interface type. Consider a user that makes a call from a PC. The call passes through a prepaid calling card application and a call recording application. Both of these wish to interact with the user, and both push an HTML-based user interface to the user. On the PC, each user interface would appear as a separate window. The user interacts with the call recording application by selecting its window, and with the prepaid calling card application by selecting its window. Focus determination is literally provided by the PC window manager. It is clear to which application the user input is targeted.
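The window-manager analogy can be made concrete with a small sketch. The API here is hypothetical and not part of this framework; it only illustrates that input is delivered solely to the component currently in focus, and that focus selection is a purely local action.

```python
class FocusManager:
    """Hypothetical sketch of local focus determination."""

    def __init__(self):
        self.components = {}
        self.focused = None

    def register(self, name, handler):
        self.components[name] = handler
        if self.focused is None:
            self.focused = name        # first registered component gets focus

    def select(self, name):
        # e.g. the user clicks that component's window
        self.focused = name

    def user_input(self, data):
        # Input goes only to the focused component; the others never see it.
        self.components[self.focused](data)

received = []
fm = FocusManager()
fm.register("prepaid-card",   lambda d: received.append(("prepaid-card", d)))
fm.register("call-recording", lambda d: received.append(("call-recording", d)))
fm.select("call-recording")
fm.user_input("stop")
```

Note that the applications themselves know nothing about focus; they simply receive the input routed to them, mirroring the window-manager behavior described above.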
As another example, consider the same two applications, but on a "smart phone" that has a set of buttons and, next to each button, an LCD display that can present the user with an option. This user interface can be represented using the Wireless Markup Language (WML).

The phone would allocate some number of buttons to each application. The prepaid calling card application would get one button for its "hangup" command, and the recording application would get one for its "start/stop" command. The user can easily determine which application to interact with by pressing the appropriate button. Pressing a button determines focus and provides user input at the same time.

Unfortunately, not all devices will have these advanced displays. A PSTN gateway, or a basic IP telephone, may only have a 12-key keypad. The user interfaces for these devices are provided through the Keypad Markup Language (KPML). Considering once again the feature interaction case above, the prepaid calling card application and the call recording application would both pass a KPML document to the device. When the user presses a button on the keypad, to which document does the input apply? The user interface does not allow the user to select. A user interface where the user cannot provide focus is called a focusless user interface. This is quite a hard problem to solve. This framework does not make any explicit normative recommendation, but concludes that the best option is to send the input to both user interfaces unless the markup in one interface has indicated that it should be suppressed from others. This is a sensible choice by analogy: it is exactly what the existing circuit-switched telephone network will do. It is an explicit non-goal to provide a better mechanism for feature interaction resolution than the PSTN on devices which have the same user interface as they do on the PSTN.
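The fan-out rule for focusless devices can be sketched as follows. This is illustrative only (the function and names are hypothetical): a key event is delivered to every client-local user interface and, unless some markup has requested suppression, into every audio stream, which is how it would reach any client-remote user interfaces.

```python
def dispatch_key(digit, local_uis, audio_streams, suppress_media=False):
    """Deliver one key event on a focusless device (sketch). Every
    client-local UI receives it; audio streams receive it too, unless
    a markup has indicated the input should be suppressed from others."""
    deliveries = [("local", ui, digit) for ui in local_uis]
    if not suppress_media:
        deliveries += [("media", stream, digit) for stream in audio_streams]
    return deliveries

# Two KPML documents on the device, one audio stream toward the network:
events = dispatch_key("#", ["prepaid-kpml", "recording-kpml"], ["audio-1"])
```

The device cannot tell whether a given audio stream terminates in a remote user interface, which is why, absent suppression, the event goes to every stream rather than to a selected one.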
Devices with better displays, such as PCs or screen phones, can benefit from the capabilities of this framework, allowing the user to determine which application they are interacting with.

Indeed, when a user provides input on a focusless device, the input must be passed to all client-local user interfaces AND all client-remote user interfaces, unless the markup tells the UI to suppress the media. In the case of KPML, key events are passed to remote user interfaces by encoding them in RFC 2833 [8]. Of course, since a client cannot determine whether a media stream terminates in a remote user interface, these key events are passed in all audio media streams unless the "Q" digit is used to suppress them.

7.2 Client-Remote UI

When the user interfaces run remotely, the determination of focus can be much, much harder. There are many architectures that can be deployed to handle the interaction. None are ideal; however, all are beyond the scope of this specification.

8. Intra Application Feature Interaction

An application can instantiate a multiplicity of user interface components. For example, a single application can instantiate two separate HTML components and one WML component. Furthermore, an application can instantiate both client-local and client-remote user interfaces.

The feature interaction issues between these components within the same application are less severe. If an application has multiple client user interface components, their interaction is resolved identically to the inter-application case: through focus determination. However, the problems in focusless user interfaces (such as a keypad) generally won't exist, since the application can generate user interfaces which do not overlap in their usage of an input.
The real issue is that the optimal user experience frequently requires some kind of coupling between the differing user interface components. This is a classic problem in multimodal user interfaces, such as those described by Speech Application Language Tags (SALT). As an example, consider a user interface where a user can either press a labeled button to make a selection, or listen to a prompt and speak the desired selection. Ideally, when the user presses the button, the prompt should cease immediately, since both were targeted at collecting the same information in parallel. Such interactions are best handled by markups which natively support them, such as SALT, and thus require no explicit support from this framework.

9. Examples

TODO.

10. Security Considerations

There are many security considerations associated with this framework. It allows applications in the network to instantiate user interface components on a client device. Such instantiations need to come from authenticated applications, and also need to be authorized to place a UI on the client. Indeed, the stronger requirement is authorization. It is not so important to know the name of the provider of the application, but rather, that the provider is authorized to instantiate components.

Generally, an application should be considered authorized if it was legitimately part of the call setup path. With this definition, authorization can be enforced using the sips URI scheme when the call is initiated.

11. Contributors

This document was produced as a result of discussions amongst the application interaction design team. All members of this team contributed significantly to the ideas embodied in this document.
The members of this team were:

   Eric Burger
   Cullen Jennings
   Robert Fairlie-Cuninghame

Informative References

   [1]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M. and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

   [2]  McGlashan, S., Lucas, B., Porter, B., Rehor, K., Burnett, D., Carter, J., Ferrans, J. and A. Hunt, "Voice Extensible Markup Language (VoiceXML) Version 2.0", W3C CR CR-voicexml20-20030220, February 2003.

   [3]  Day, M., Rosenberg, J. and H. Sugano, "A Model for Presence and Instant Messaging", RFC 2778, February 2000.

   [4]  Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol", draft-ietf-sipping-conferencing-framework-00 (work in progress), May 2003.

   [5]  Rosenberg, J., Schulzrinne, H. and P. Kyzivat, "Caller Preferences for the Session Initiation Protocol (SIP)", draft-ietf-sip-callerprefs-09 (work in progress), July 2003.

   [6]  Rosenberg, J., "Indicating User Agent Capabilities in the Session Initiation Protocol (SIP)", draft-ietf-sip-callee-caps-00 (work in progress), June 2003.

   [7]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

   [8]  Schulzrinne, H. and S. Petrack, "RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals", RFC 2833, May 2000.

   [9]  Sparks, R., "The Session Initiation Protocol (SIP) Refer Method", RFC 3515, April 2003.

   [10] Burger, E., "Keypad Stimulus Protocol (KPML)", draft-ietf-sipping-kpml-00 (work in progress), September 2003.

   [11] Peterson, J., "Enhancements for Authenticated Identity Management in the Session Initiation Protocol (SIP)", draft-ietf-sip-identity-01 (work in progress), March 2003.
   [12] Rosenberg, J., "Obtaining and Using Globally Routable User Agent (UA) URIs (GRUU) in the Session Initiation Protocol (SIP)", draft-rosenberg-sip-gruu-00 (work in progress), October 2003.

Author's Address

   Jonathan Rosenberg
   dynamicsoft
   600 Lanidex Plaza
   Parsippany, NJ 07054
   US

   Phone: +1 973 952-5000
   EMail: jdrosen@dynamicsoft.com
   URI:   http://www.jdrosen.net

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.
   This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

   The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the Internet Society.