Mediactrl                                                       D. Burke
Internet-Draft                                                    Google
Intended status: Standards Track                                M. Scott
Expires: August 12, 2009                                         Genesys
                                                             Feb 8, 2009


                SIP Interface to VoiceXML Media Services
                    draft-ietf-mediactrl-vxml-04.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 12, 2009.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Abstract

   This document describes a SIP interface to VoiceXML media services.
   Commonly, application servers controlling media servers use this
   protocol for pure VoiceXML processing capabilities.  This protocol is
   an adjunct to the full MEDIACTRL protocol and packages mechanism.

Comments

   Please send comments on this draft to the MEDIACTRL mailing list,
   mediactrl@ietf.org.

Table of Contents

   1.  Introduction
     1.1.  Use Cases
       1.1.1.  IVR Services with Application Servers
       1.1.2.  PSTN IVR Service Node
       1.1.3.  3GPP IMS Media Resource Function (MRF)
       1.1.4.  CCXML <-> VoiceXML Interaction
       1.1.5.  Other Use Cases
     1.2.  Terminology
   2.  VoiceXML Session Establishment and Termination
     2.1.  Service Identification
     2.2.  Initiating a VoiceXML Session
     2.3.  Preparing a VoiceXML Session
     2.4.  Session Variable Mappings
     2.5.  Terminating a VoiceXML Session
     2.6.  Examples
       2.6.1.  Basic Session Establishment
       2.6.2.  VoiceXML Session Preparation
       2.6.3.  MRCP Establishment
   3.  Media Support
     3.1.  Offer/Answer
     3.2.  Early Media
     3.3.  Modifying the Media Session
     3.4.  Audio and Video Codecs
     3.5.  DTMF
   4.  Returning Data to the Application Server
     4.1.  HTTP Mechanism
     4.2.  SIP Mechanism
   5.  Outbound Calling
   6.  Call Transfer
     6.1.  Blind
     6.2.  Bridge
     6.3.  Consultation
   7.  Contributors
   8.  Acknowledgements
   9.  Security Considerations
   10. IANA Considerations
   11. Changes since last version:
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Appendix A.  Notes on Normative References
   Authors' Addresses

1.  Introduction

   VoiceXML [VXML20], [VXML21] is a World Wide Web Consortium (W3C)
   standard for creating audio and video dialogs that feature
   synthesized speech, digitized audio, recognition of spoken and DTMF
   key input, recording of audio and video, telephony, and
   mixed-initiative conversations.  VoiceXML allows Web-based
   development and content delivery paradigms to be used with
   interactive video and voice response applications.

   This document describes a SIP [RFC3261] interface to VoiceXML media
   services.  Commonly, application servers controlling media servers
   use this protocol for pure VoiceXML processing capabilities.  SIP is
   responsible for initiating a media session to the VoiceXML media
   server and simultaneously triggering the execution of a specified
   VoiceXML application.  This protocol is an adjunct to the full
   MEDIACTRL protocol and packages mechanism.

   The interface described here leverages a mechanism for identifying
   dialog media services first described in [RFC4240].  The interface
   has been updated and extended to support the W3C Recommendation for
   VoiceXML 2.0 [VXML20] and VoiceXML 2.1 [VXML21].  A set of commonly
   implemented functions and extensions has been specified, including
   VoiceXML dialog preparation, outbound calling, video media support,
   and transfers.
   VoiceXML session variable mappings have been defined for SIP with an
   extensible mechanism for passing application-specific values into the
   VoiceXML application.  Mechanisms for returning data to the
   Application Server have also been added.

1.1.  Use Cases

   The VoiceXML media service user in this document is generically
   referred to as an Application Server.  In practice, it is intended
   that the interface defined by this document is applicable across a
   wide range of use cases.  Several intended use cases are described
   below.

1.1.1.  IVR Services with Application Servers

   SIP Application Servers provide services to users of the network.
   Typically, there may be several Application Servers in the same
   network, each specialized in providing a particular service.
   Throughout this specification and without loss of generality, we
   posit the presence of an Application Server specialized in providing
   IVR services.  A typical configuration for this use case is
   illustrated below.

                      +--------------+
                      |              |
                      | Application  |\
                      |    Server    | \
                      |              |  \ HTTP
                  SIP +--------------+   \
                     /               \    \
    +-------------+ /       SIP       \ +--------------+
    |             |/                   \|              |
    |     SIP     |                     |   VoiceXML   |
    |  User Agent |      RTP/SRTP       | Media Server |
    |             |=====================|              |
    +-------------+                     +--------------+

   Assuming the Application Server also supports HTTP, the VoiceXML
   application may be hosted on it and served up via HTTP [RFC2616].
   Note, however, that the Web model allows the VoiceXML application to
   be hosted on a separate (HTTP) Application Server from the (SIP)
   Application Server that interacts with the VoiceXML Media Server via
   this specification.  It is also possible for a static VoiceXML
   application to be stored locally on the VoiceXML Media Server,
   leveraging the VoiceXML 2.1 [VXML21] <data> mechanism to interact
   with a Web/Application Server when dynamic behavior is required.
   The viability of static VoiceXML applications is further enhanced by
   the mechanisms defined in Section 2.4, through which the Application
   Server can make session-specific information available within the
   VoiceXML session context.

   The approach described in this document is sometimes termed the
   "delegation model": the Application Server is essentially delegating
   programmatic control of the human-machine interactions to one or more
   VoiceXML documents running on the VoiceXML Media Server.  During the
   human-machine interactions, the Application Server remains in the
   signaling path and can respond to results returned from the VoiceXML
   Media Server or other external network events.

1.1.2.  PSTN IVR Service Node

   While this document is intended to enable enhanced use of VoiceXML as
   a component of larger systems and services, it is intended that
   devices that are completely unaware of this specification remain
   capable of invoking VoiceXML services offered by a VoiceXML Media
   Server compliant with this document.  A typical configuration for
   this use case is as follows:

    +-------------+         SIP         +--------------+
    |             |---------------------|              |
    |   IP/PSTN   |                     |   VoiceXML   |
    |   Gateway   |      RTP/SRTP       | Media Server |
    |             |=====================|              |
    +-------------+                     +--------------+

   Note also that beyond the invocation and termination of a VoiceXML
   dialog, the semantics defined for call transfers using REFER are
   intended to be compatible with standard, existing IP/PSTN gateways.

1.1.3.  3GPP IMS Media Resource Function (MRF)

   The 3GPP IP Multimedia Subsystem (IMS) [TS23002] defines a Media
   Resource Function (MRF) used to offer media processing services such
   as conferencing, transcoding, and prompt/collect.  The capabilities
   offered by VoiceXML are ideal for offering richer media processing
   services in the context of the MRF.
   In this architecture, the interface defined here corresponds to the
   "Mr" interface to the MRFC; the implementation of this interface
   might use separated MRFC and MRFP elements (as per the IMS
   architecture), or might be an integrated MRF (as is common practice).

    +----------+
    |   App    |
    |  Server  |
    +----------+
         |
         | SIP (ISC)
         |
    +----------+   SIP (Mr)    +--------------+
    |  S-CSCF  |---------------|   VoiceXML   |
    |          |               |     MRF      |
    +----------+               +--------------+
                                      ||
                                      || RTP/SRTP (Mb)
                                      ||

   The above diagram is highly simplified and shows a subset of nodes
   typically involved in MRF interactions.  It should be noted that
   while the MRF will primarily be used by the Application Server via
   the S-CSCF, it is also possible for calls to be routed directly to
   the MRF without the involvement of an Application Server.

   Although the above is described in terms of the 3GPP IMS
   architecture, it is intended that it is also applicable to 3GPP2,
   NGN, and PacketCable architectures that are converging with 3GPP IMS
   standards.

1.1.4.  CCXML <-> VoiceXML Interaction

   Call Control eXtensible Markup Language (CCXML) 1.0 [CCXML10]
   applications provide services mainly through controlling the
   interaction between Connections, Conferences, and Dialogs.  Although
   CCXML is capable of supporting arbitrary dialog environments,
   VoiceXML is commonly used as a dialog environment in conjunction with
   CCXML applications; CCXML is specifically designed to effectively
   support the use of VoiceXML.  CCXML 1.0 defines language elements
   that allow for Dialogs to be prepared, started, and terminated; it
   further allows for data to be returned by the dialog environment, for
   call transfers to be requested (by the dialog) and responded to by
   the CCXML application, and for arbitrary eventing between the CCXML
   application and running dialog application.

   The interface described in this document can be used by CCXML 1.0
   implementations to control VoiceXML Media Servers.  Note, however,
   that some CCXML language features require eventing facilities between
   CCXML and VoiceXML sessions that go beyond what is defined in this
   specification.  For example, VoiceXML-controlled call transfers and
   mid-dialog application-defined events cannot be fully realized using
   this specification alone.  A SIP event package [RFC3265] MAY be used
   in addition to this specification to provide extended eventing.

1.1.5.  Other Use Cases

   In addition to the use cases described in some detail above, there
   are a number of other intended use cases that are not described in
   detail, such as:

   1.  Use of a VoiceXML Media Server as an adjunct to an IP-based PBX/
       ACD, possibly to provide voicemail/messaging, automated
       attendant, or other capabilities.

   2.  Invocation and control of a VoiceXML session that provides the
       voice modality component in a multimodal system.

1.2.  Terminology

   Application Server:  A SIP Application Server hosts and executes
      services, in particular by terminating SIP sessions on a media
      server.  The Application Server MAY also act as an HTTP server
      [RFC2616] in interactions with media servers.

   VoiceXML Media Server:  A VoiceXML interpreter including a SIP-based
      interpreter context and the requisite media processing
      capabilities to support VoiceXML functionality.

   VoiceXML Session:  A VoiceXML Session is a multimedia session
      comprising at least a SIP user agent, a VoiceXML Media Server,
      the data streams between them, and an executing VoiceXML
      application.

   VoiceXML Dialog:  Equivalent to VoiceXML Session.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  VoiceXML Session Establishment and Termination

   This section describes how to establish a VoiceXML Session, with or
   without preparation, and how to terminate a session.  This section
   also addresses how session information is made available to VoiceXML
   applications.

2.1.  Service Identification

   The SIP Request-URI is used to identify the VoiceXML media service.
   The user part of the SIP Request-URI is fixed to "dialog".  This is
   done to ensure compatibility with [RFC4240], since this document
   extends the dialog interface defined in that specification, and
   because this convention from [RFC4240] is widely adopted by existing
   media servers.

   Standardizing the SIP Request-URI including the user part also
   improves interoperability between application servers and media
   servers, and reduces the provisioning overhead that would be required
   if use of a media server by an application server required an
   individually provisioned URI.  In this respect, this document (and
   [RFC4240]) do not add semantics to the user part, but rather
   standardize the way that targets on media servers are provisioned.
   Further, since application servers - and not human beings - are
   generally the clients of media servers, issues such as interpretation
   and internationalization do not apply.

   Exposing a VoiceXML media service with a well-known address may
   increase the risk of exploitation; the VoiceXML Media Server is
   therefore RECOMMENDED to use standard SIP mechanisms to authenticate
   endpoints, as discussed in Section 9.

   The initial VoiceXML document is specified with the "voicexml"
   parameter.  In addition, parameters are defined that control how the
   VoiceXML Media Server fetches the specified VoiceXML document.  The
   list of parameters defined by this specification is as follows (note
   that the parameter names are case-insensitive):

   voicexml:  URI of the initial VoiceXML document to fetch.
      This will typically contain an HTTP URI, but may use other URI
      schemes, for example to refer to local, static VoiceXML documents.
      If the "voicexml" parameter is omitted, the VoiceXML Media Server
      may select the initial VoiceXML document by other means, such as
      by applying a default, or may reject the request.

   maxage:  Used to set the max-age value of the Cache-Control header in
      conjunction with VoiceXML documents fetched using HTTP, as per
      [RFC2616].  If omitted, the VoiceXML Media Server will use a
      default value.

   maxstale:  Used to set the max-stale value of the Cache-Control
      header in conjunction with VoiceXML documents fetched using HTTP,
      as per [RFC2616].  If omitted, the VoiceXML Media Server will use
      a default value.

   method:  Used to set the HTTP method applied in the fetch of the
      initial VoiceXML document.  Allowed values are "get" or "post"
      (case-insensitive).  Default is "get".

   postbody:  Used to set the application/x-www-form-urlencoded encoded
      [HTML4] HTTP body for "post" requests (or is otherwise ignored).

   ccxml:  This parameter is used to specify a "JSON value" [RFC4627]
      that is mapped to the session.connection.ccxml VoiceXML session
      variable (see Section 2.4).

   aai:  This parameter is used to specify a "JSON value" [RFC4627] that
      is mapped to the session.connection.aai VoiceXML session variable
      (see Section 2.4).

   Other application-specific parameters may be added to the Request-URI
   and are exposed in VoiceXML session variables (see Section 2.4).

   Formally, the Request-URI for the VoiceXML media service has a fixed
   user part "dialog".  Seven URI parameters are defined (see the
   definition of uri-parameter in Section 25.1 of [RFC3261]).

   dialog-param   = "voicexml=" vxml-url ; vxml-url follows the URI
                                         ; syntax defined in [RFC3986]

   maxage-param   = "maxage=" 1*DIGIT

   maxstale-param = "maxstale=" 1*DIGIT

   method-param   = "method=" ("get" / "post")

   postbody-param = "postbody=" token

   ccxml-param    = "ccxml=" json-value

   aai-param      = "aai=" json-value

   json-value     = false /
                    null /
                    true /
                    object /
                    array /
                    number /
                    string ; defined in [RFC4627]

   Parameters of the Request-URI in subsequent re-INVITEs are ignored.
   One consequence of this is that the VoiceXML Media Server cannot be
   instructed by the Application Server to change the executing VoiceXML
   Application after a VoiceXML Session has been started.

   Special characters contained in the dialog-param, postbody-param,
   ccxml-param, and aai-param values must be URL-encoded ("escaped") as
   required by the SIP URI syntax, for example '?' (%3f), '=' (%3d), and
   ';' (%3b).  The VoiceXML Media Server MUST therefore unescape these
   parameter values before making use of them or exposing them to
   running VoiceXML applications.  It is important that the VoiceXML
   Media Server unescape the parameter values only once, since the
   desired VoiceXML URI value could itself be URL-encoded, for example.

   Since some applications may choose to transfer confidential
   information, the VoiceXML Media Server MUST support the sips: scheme
   as discussed in Section 9.

   Informative note: With respect to the postbody-param value, since the
   application/x-www-form-urlencoded content itself escapes
   non-alphanumeric characters by inserting %HH replacements, the
   escaping rules above will result in the '%' characters being further
   escaped in addition to the '&' and '=' name/value separators.
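
   Informative sketch: the escaping rules above might be exercised as
   follows in Python.  This is a non-normative illustration under stated
   assumptions: the function names, the set of characters left
   unescaped, and the simple ';'-based splitting are choices made for
   this example only and are not defined by this specification.  The
   sketch escapes each parameter value exactly once when building the
   Request-URI and unescapes it exactly once when reading it back, so
   that an already form-encoded postbody value survives intact.

```python
import json
from urllib.parse import quote, unquote

# Characters left unescaped in parameter values. This set is an assumption
# made for readability; a real implementation would follow the paramchar
# rule of the SIP URI grammar in [RFC3261].
SAFE_CHARS = "/:"


def build_request_uri(host, params):
    """Return sip:dialog@host;name=value;... with each value escaped once."""
    parts = ["sip:dialog@" + host]
    for name, value in params.items():
        parts.append(name + "=" + quote(value, safe=SAFE_CHARS))
    return ";".join(parts)


def parse_params(request_uri):
    """Return the URI parameters, each value unescaped exactly once."""
    params = {}
    for raw in request_uri.split(";")[1:]:
        name, _, value = raw.partition("=")
        # Parameter names are case-insensitive; values keep their case.
        params[name.lower()] = unquote(value)
    return params


uri = build_request_uri(
    "mediaserver.example.com",
    {
        "voicexml": "http://appserver.example.com/promptcollect.vxml",
        "method": "post",
        # Already form-urlencoded, so its '%' is escaped a second time.
        "postbody": "name=J%20Doe&id=7",
        "aai": json.dumps({"x": 1, "y": True}, separators=(",", ":")),
    },
)
params = parse_params(uri)
print(uri)
print(params["postbody"])         # the form-encoded body, still encoded
print(json.loads(params["aai"]))  # the decoded JSON value
```

   Unescaping a second time in parse_params would corrupt the postbody
   value (its "%20" would become a space), which is the reason the text
   above requires that values be unescaped only once.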

   As an example, the following SIP Request-URI identifies the use of
   VoiceXML media services, with
   'http://appserver.example.com/promptcollect.vxml' as the initial
   VoiceXML document, to be fetched with max-age/max-stale values of
   3600s/0s respectively:

      sip:dialog@mediaserver.example.com; \
          voicexml=http://appserver.example.com/promptcollect.vxml; \
          maxage=3600;maxstale=0

2.2.  Initiating a VoiceXML Session

   A VoiceXML Session is initiated via the Application Server using a
   SIP INVITE.  Typically, the Application Server will be specialized in
   providing VoiceXML services.  At a minimum, the Application Server
   may behave as a simple proxy by rewriting the Request-URI received
   from the User Agent to a Request-URI suitable for consumption by the
   VoiceXML Media Server (as specified in Section 2.1).  For example, a
   User Agent might present a dialed number:

      tel:+1-201-555-0123

   which the Application Server maps to a directory assistance
   application on the VoiceXML Media Server with a Request-URI of:

      sip:dialog@ms1.example.com; \
          voicexml=http://as1.example.com/da.vxml

   Certain header values in the INVITE message to the VoiceXML Media
   Server are mapped into VoiceXML session variables and are specified
   in Section 2.4.

   On receipt of the INVITE, the VoiceXML Media Server issues a
   provisional response, 100 Trying, and commences the fetch of the
   initial VoiceXML document.  The 200 OK response indicates that the
   VoiceXML document has been fetched and parsed correctly and is ready
   for execution.  Application execution commences on receipt of the ACK
   (except if the dialog is being prepared as specified in Section 2.3).
   Note that the 100 Trying response will usually be sent on receipt of
   the INVITE in accordance with [RFC3261], since the VoiceXML Media
   Server cannot in general guarantee that the initial fetch will
   complete in less than 200 ms.
   However, certain implementations may be able to guarantee response
   times to the initial INVITE, and thus may not need to send a 100
   Trying response.

   As an optimization, prior to sending the 200 OK response, the
   VoiceXML Media Server MAY execute the application up to the point of
   the first VoiceXML waiting state or prompt flush.

   A VoiceXML Media Server, like any SIP User Agent, may be unable to
   accept the INVITE request for a variety of reasons.  For instance, an
   SDP offer contained in the INVITE might require the use of codecs
   that are not supported by the Media Server.  In such cases, the Media
   Server should respond as defined by [RFC3261].  However, there are
   error conditions specific to VoiceXML, as follows:

   1.  If the Request-URI does not conform to this specification, a 400
       Bad Request MUST be returned (unless it is used to select other
       services not defined by this specification).

   2.  If a URI parameter in the Request-URI is repeated, then the
       request MUST be rejected with a 400 Bad Request response.

   3.  If the Request-URI does not include a "voicexml" parameter, and
       the VoiceXML Media Server does not elect to use a default page,
       the VoiceXML Media Server MUST return a final response of 400 Bad
       Request, and SHOULD include a Warning header with a 3-digit code
       of 399 and a human readable error message.

   4.  If the VoiceXML document cannot be fetched or parsed, the
       VoiceXML Media Server MUST return a final response of 500 Server
       Internal Error and SHOULD include a Warning header with a 3-digit
       code of 399 and a human readable error message.

   Informative note: Certain applications may pass a significant
   amount of data to the VoiceXML dialog in the form of Request-URI
   parameters.  This may cause the total size of the INVITE request to
   exceed the MTU of the underlying network.
   In such cases, applications/implementations must take care either to
   use a transport appropriate to these larger messages (such as TCP),
   or to use alternative means of passing the required information to
   the VoiceXML dialog (such as supplying a unique session identifier in
   the initial VoiceXML URI and later using that identifier as a key to
   retrieve data from the HTTP server).

2.3.  Preparing a VoiceXML Session

   In certain scenarios, it is beneficial to prepare a VoiceXML Session
   for execution prior to running it.  A previously prepared VoiceXML
   Session is expected to execute with minimal delay when instructed to
   do so.

   If a media-less SIP dialog is established with the initial INVITE to
   the VoiceXML Media Server, the VoiceXML Application will not execute
   after receipt of the ACK.  To run the VoiceXML Application, the
   Application Server must issue a re-INVITE to establish a media
   session.

   A media-less SIP dialog can be established by sending SDP containing
   no media lines in the initial INVITE.  Alternatively, if no SDP is
   sent in the initial INVITE, the VoiceXML Media Server will include an
   offer in the 200 OK message, which can be responded to with an answer
   in the ACK with the media port(s) set to 0.

   Once a VoiceXML Application is running, a re-INVITE which disables
   the media streams (i.e. sets the ports to 0) will not otherwise
   affect the executing application (except that recognition actions
   initiated while the media streams are disabled will result in noinput
   timeouts).

2.4.  Session Variable Mappings

   The standard VoiceXML session variables are assigned values according
   to:

   session.connection.local.uri:  Evaluates to the SIP URI specified in
      the To: header of the initial INVITE.

   session.connection.remote.uri:  Evaluates to the SIP URI specified in
      the From: header of the initial INVITE.

   session.connection.redirect:  This array is populated by information
      contained in the History-Info [RFC4244] header in the initial
      INVITE or is otherwise undefined.  Each entry (hi-entry) in the
      History-Info header is mapped, in reverse order, into an element
      of the session.connection.redirect array.  Properties of each
      element of the array are determined as follows:

      *  uri - Set to the hi-targeted-to-uri value of the History-Info
         entry

      *  pi - Set to 'true' if hi-targeted-to-uri contains a
         'Privacy=history' parameter, or if the INVITE Privacy header
         includes 'history'; 'false' otherwise

      *  si - Set to the value of the 'si' parameter if it exists,
         undefined otherwise

      *  reason - Set verbatim to the value of the 'Reason' parameter of
         hi-targeted-to-uri

   session.connection.protocol.name:  Evaluates to "sip".  Note that
      this is intended to reflect the use of SIP in general, and does
      not distinguish between whether the media server was accessed via
      SIP or SIPS procedures.

   session.connection.protocol.version:  Evaluates to "2.0".

   session.connection.protocol.sip.headers:  This is an associative
      array where each key in the array is the non-compact name of a SIP
      header in the initial INVITE converted to lower-case (note the
      case conversion does not apply to the header value).  If multiple
      header fields of the same field name are present, the values are
      combined into a single comma-separated value.  Implementations
      MUST at a minimum include the Call-ID header and MAY include other
      headers.  For example,
      session.connection.protocol.sip.headers["call-id"] evaluates to
      the Call-ID of the SIP dialog.

   session.connection.protocol.sip.requesturi:  This is an associative
      array where the array keys and values are formed from the URI
      parameters on the SIP Request-URI of the initial INVITE.
      The array key is the URI parameter name converted to lower-case
      (note the case conversion does not apply to the parameter value).
      The corresponding array value is obtained by evaluating the URI
      parameter value as a "JSON value" [RFC4627] in the case of the
      ccxml-param and aai-param values and otherwise as a string.  In
      addition, the array's toString() function returns the full SIP
      Request-URI.  For example, assuming a Request-URI of sip:dialog@
      example.com;voicexml=http://example.com;aai=%7b"x":1%2c"y":true%7d
      then session.connection.protocol.sip.requesturi["voicexml"]
      evaluates to "http://example.com",
      session.connection.protocol.sip.requesturi["aai"].x evaluates to 1
      (type Number), session.connection.protocol.sip.requesturi["aai"].y
      evaluates to true (type Boolean), and
      session.connection.protocol.sip.requesturi evaluates to the
      complete Request-URI (type String) 'sip:dialog@
      example.com;voicexml=http://example.com;aai={"x":1,"y":true}'.

   session.connection.aai:  Evaluates to
      session.connection.protocol.sip.requesturi["aai"]

   session.connection.ccxml:  Evaluates to
      session.connection.protocol.sip.requesturi["ccxml"]

   session.connection.protocol.sip.media:  This is an array where each
      array element is an object with the following properties:

      *  type - This required property indicates the type of the media
         associated with the stream.  The value is a string.  It is
         strongly recommended that the following values are used for
         common types of media: "audio" for audio media, and "video" for
         video media.

      *  direction - This required property indicates the
         directionality of the media relative to
         session.connection.originator.  Defined values are sendrecv,
         sendonly, recvonly, and inactive.

      *  format - This property is optional.  If defined, the value of
         the property is an array.
Each array element is an object 626 which specifies information about one format of the media 627 (there is an array element for each payload type on the 628 m-line). The object contains at least one property called name 629 whose value is the MIME subtype of the media format (MIME 630 subtypes are registered in [RFC4855]). Other properties may be 631 defined with string values; these correspond to required and, 632 if defined, optional parameters of the format. 634 As a consequence of this definition, there is an array entry in 635 session.connection.protocol.sip.media for each non-disabled m-line 636 for the negotiated media session. Note that this session variable 637 is updated if the media session characteristics for the VoiceXML 638 Session change (i.e. due to a re-INVITE). For an example, 639 consider a connection with bi-directional G.711 mu-law audio 640 sampled at 8kHz. In this case, 641 session.connection.protocol.sip.media[0].type evaluates to 642 "audio", session.connection.protocol.sip.media[0].direction to 643 "sendrecv", and 644 session.connection.protocol.sip.media[0].format[0].name evaluates 645 to "audio/PCMU" and 646 session.connection.protocol.sip.media[0].format[0].rate evaluates 647 to "8000". 649 Note that when accessing SIP headers and Request-URI parameters via 650 the session.connection.protocol.sip.headers and 651 session.connection.protocol.sip.requesturi associative arrays defined 652 above, applications can choose between two semantically equivalent 653 ways of referring to the array. For example, either of the following 654 can be used to access a Request-URI parameter named 'foo': 656 session.connection.protocol.sip.requesturi["foo"] 657 session.connection.protocol.sip.requesturi.foo 659 However, it is important to note that not all SIP header names or 660 Request-URI parameter names are valid ECMAScript identifiers, and as 661 such, can only be accessed using the first form (array notation). 
662     For example, the Call-ID header can only be accessed as
663     session.connection.protocol.sip.headers["call-id"]; attempting to
664     access the same value as
665     session.connection.protocol.sip.headers.call-id would result in an
666     error.

668  2.5.  Terminating a VoiceXML Session

670     The Application Server can terminate a VoiceXML Session by issuing a
671     BYE to the VoiceXML Media Server. Upon receipt of a BYE in the
672     context of an existing VoiceXML Session, the VoiceXML Media Server
673     MUST send a 200 OK response, and MUST throw a
674     'connection.disconnect.hangup' event to the VoiceXML application. If
675     the Reason header [RFC3326] is present on the BYE Request, then the
676     value of the Reason header is provided verbatim via the '_message'
677     variable within the catch element's anonymous variable scope.

679     The VoiceXML Media Server may also initiate termination of the
680     session by issuing a BYE request. This will typically occur as a
681     result of encountering a <disconnect> or <exit> in the VoiceXML
682     application, due to the VoiceXML application running to completion,
683     or due to unhandled errors within the VoiceXML application.

685     See Section 4 for mechanisms to return data to the Application
686     Server.

688  2.6.  Examples

690  2.6.1.  Basic Session Establishment

692     This example illustrates an Application Server setting up a VoiceXML
693     Session on behalf of a User Agent.
695                          SIP               VoiceXML               HTTP
696    User              Application             Media             Application
697    Agent               Server               Server               Server
698      |                    |                    |                    |
699      |(1) INVITE [offer]  |                    |                    |
700      |------------------->|(2) INVITE [offer]  |                    |
701      |(3) 100 Trying      |------------------->|                    |
702      |<-------------------|(4) 100 Trying      |                    |
703      |                    |<-------------------|                    |
704      |                    |                    |                    |
705      |                    |                    |(5) GET             |
706      |                    |                    |------------------->|
707      |                    |                    |(6) 200 OK [VXML]   |
708      |                    |                    |<-------------------|
709      |                    |                    |                    |
710      |                    |(7) 200 OK [answer] |                    |
711      |(8) 200 OK [answer] |<-------------------|                    |
712      |<-------------------|                    |                    |
713      |(9) ACK             |                    |                    |
714      |------------------->|(10) ACK            |                    |
715      |                    |------------------->|     (execute       |
716      |(11) RTP/SRTP       |                    |      VoiceXML      |
717      |.........................................|    application)    |
718      |                    |                    |                    |

720  2.6.2.  VoiceXML Session Preparation

722     This example demonstrates the preparation of a VoiceXML Session. In
723     this example, the VoiceXML session is prepared prior to placing an
724     outbound call to a User Agent, and is started as soon as the User
725     Agent answers.

727     The [answer1:0] notation is used to indicate an SDP answer with the
728     media ports set to 0.
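As an informal illustration of the [answer1:0] notation, the Python sketch below rewrites an SDP body so that every m-line carries port 0. The function name zero_media_ports and the sample SDP used to exercise it are assumptions made for this illustration; they are not defined by this specification.

```python
def zero_media_ports(sdp):
    """Return a copy of an SDP body with the port of every m-line set to 0.

    Illustrative only: assumes well-formed m-lines of the form
    "m=<media> <port> <proto> <fmt> ...".
    """
    out = []
    for line in sdp.splitlines():
        if line.startswith("m="):
            # Split into media type, port, and the rest (proto + formats).
            media, _port, rest = line[2:].split(" ", 2)
            line = f"m={media} 0 {rest}"
        out.append(line)
    # SDP lines are terminated with CRLF.
    return "\r\n".join(out) + "\r\n"
```

Only the port field changes; the media type, transport, and payload types of each m-line are carried over untouched.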
730                          SIP                VoiceXML               HTTP
731    User              Application              Media             Application
732    Agent               Server                Server               Server
733      |                    |                     |                    |
734      |                    |(1) INVITE           |                    |
735      |                    |-------------------->|                    |
736      |                    |(2) 100 Trying       |                    |
737      |                    |<--------------------|                    |
738      |                    |                     |                    |
739      |                    |                     |(3) GET             |
740      |                    |                     |------------------->|
741      |                    |                     |(4) 200 OK [VXML]   |
742      |                    |                     |<-------------------|
743      |                    |                     |                    |
744      |                    |(5) 200 OK [offer1]  |                    |
745      |                    |<--------------------|                    |
746      |                    |(6) ACK [answer1:0]  |                    |
747      |(7) INVITE          |-------------------->|                    |
748      |<-------------------|                     |                    |
749      |(8) 200 OK [offer2] |                     |                    |
750      |------------------->|(9) INVITE [offer2'] |                    |
751      |                    |-------------------->|                    |
752      |                    |(10) 100 Trying      |                    |
753      |                    |<--------------------|                    |
754      |                    |(11) 200 OK [answer2]|                    |
755      |(12) ACK [answer2]  |<--------------------|                    |
756      |<-------------------|(13) ACK             |                    |
757      |                    |-------------------->|     (execute       |
758      |(14) RTP/SRTP       |                     |      VoiceXML      |
759      |..........................................|    application)    |
760      |                    |                     |                    |

762     Implementation detail: offer2' is derived from offer2 - it duplicates
763     the m-lines and a-lines from offer2. However, offer2' differs from
764     offer2 since it must contain the same o-line as used in answer1:0 but
765     with the version number incremented. Also, if offer1 has more
766     m-lines than offer2, then offer2' must be padded with extra
767     (rejected) m-lines.

769  2.6.3.  MRCP Establishment

771     MRCP [MRCPv2] is a protocol that enables clients such as a VoiceXML
772     Media Server to control media service resources such as speech
773     synthesizers, recognizers, verifiers and identifiers residing in
774     servers on the network.

776     The example below illustrates how a VoiceXML Media Server may
777     establish an MRCP session in response to an initial INVITE.
779 VoiceXML HTTP 780 User Media MRCPv2 Application 781 Agent Server Server Server 782 | | | | 783 |(1) INVITE [offer1] | | | 784 |------------------->| | | 785 |(2) 100 Trying | | | 786 |<-------------------|(3) GET | | 787 | |---------------------------------------->| 788 | | | | 789 | |(4) 200 OK [VXML] | | 790 | |<----------------------------------------| 791 | | | | 792 | |(5) INVITE [offer2] | | 793 | |--------------------->| | 794 | | | | 795 | |(6) 200 OK [answer2] | | 796 | |<---------------------| | 797 | | | | 798 | |(7) ACK | | 799 | |--------------------->| | 800 | | | | 801 | |(8) MRCP connection | | 802 | |<-------------------->| | 803 |(9) 200 OK [answer1]| | | 804 |<-------------------| | | 805 | | | | 806 |(10) ACK | | | 807 |------------------->| | | 808 | | | | 809 |(11) RTP/SRTP | | | 810 .............................................| | 811 | | | | 813 In this example, the VoiceXML Media Server is responsible for 814 establishing a session with the MRCPv2 Media Resource Server prior to 815 sending the 200 OK response to the initial INVITE. The VoiceXML 816 Media Server will perform the appropriate offer/answer with the 817 MRCPv2 Media Resource Server based on the SDP capabilities of the 818 Application Server and the MRCPv2 Media Resource Server. The 819 VoiceXML Media Server will change the offer received from step 1 to 820 establish a MRCPv2 session in step (5) and will re-write the SDP to 821 include an m-line for each MRCPv2 resource to be used and other 822 required SDP modifications as specified by MRCPv2. Once the VoiceXML 823 Media Server performs the offer/answer with the MRCPv2 Media Resource 824 Server, it will establish a MRCPv2 control channel in step (8). The 825 MRCPv2 resource is deallocated when the VoiceXML Media Server 826 receives or sends a BYE (not shown). 828 3. Media Support 830 This section describes the mandatory and optional media support 831 required by this interface. 833 3.1. 
Offer/Answer 835 The VoiceXML Media Server MUST support the standard offer/answer 836 mechanism of [RFC3264]. In particular, if an SDP offer is not 837 present in the INVITE, the VoiceXML Media Server will make an offer 838 in the 200 OK response listing its supported codecs. 840 3.2. Early Media 842 The VoiceXML Media Server MAY support early establishment of media 843 streams as described in [RFC3960]. This allows the Application 844 Server to establish media streams between a user agent and the 845 VoiceXML Media Server in parallel with the initial VoiceXML document 846 being processed (which may involve dynamic VoiceXML page generation 847 and interaction with databases or other systems). This is useful 848 primarily for minimizing the delay in starting a VoiceXML Session, 849 particularly in cases where a session with the user agent already 850 exists but the media stream associated with that session needs to be 851 redirected to a VoiceXML Media Server. 853 The following flow demonstrates the use of early media (using the 854 Gateway model defined in [RFC3960]): 856 SIP VoiceXML HTTP 857 User Application Media Application 858 Agent Server Server Server 859 | | | | 860 |..(existing session)..| | | 861 | |(1) INVITE | | 862 | |------------------>| | 863 | | |(2) HTTP GET | 864 | | |------------------>| 865 | |(3) 183 [offer] | | 866 |(4) re-INVITE [offer] |<------------------| | 867 |<---------------------| | | 868 |(5) 200 OK [answer] | | | 869 |--------------------->| | | 870 |(6) ACK | | | 871 |<---------------------| | | 872 | | (7) PRACK [answer]| | 873 | |------------------>| | 874 | | (8) PRACK 200 OK | | 875 | |<------------------| | 876 |(9) RTP/SRTP | | | 877 |..........................................| | 878 | | |(10) 200 OK [VXML] | 879 | | |<------------------| 880 | | | | 881 | |(11) 200 OK | | 882 | |<------------------| | 883 | |(12) ACK | | 884 | |------------------>| (execute | 885 | | | VoiceXML | 886 | | | application) | 887 | | | | 889 
Although [RFC3960] prefers the use of the Application Server model 890 for early media over the Gateway model, the primary issue with the 891 Gateway model - forking - is significantly less common when issuing 892 requests to VoiceXML Media Servers. This is because VoiceXML Media 893 Servers respond to all requests with 200 OK responses in the absence 894 of unusual errors, and typically do so within several hundred 895 milliseconds. This makes them unlikely targets in forking scenarios, 896 since alternative targets of the forking process would virtually 897 never be able to respond more quickly than an automated system, 898 unless they are themselves automated systems - in which case there is 899 little point in setting up a response time race between two automated 900 systems. Issues with ringing tone generation in the Gateway model 901 are also mitigated, both by the typically quick 200 OK response time, 902 and because this specification mandates that no media packets are 903 generated until the receipt of an ACK (thus eliminating the need for 904 the user agent to perform media packet analysis). 906 Note that the offer of early media by a VoiceXML Media Server does 907 not imply that the referenced VoiceXML application can always be 908 fetched and executed successfully. For instance, if the HTTP 909 Application Server were to return a 4xx response in step 10 above, or 910 if the provided VoiceXML content was not valid, the VoiceXML Media 911 Server would still return a 500 response (as per section 2.2). At 912 this point, it would be the responsibility of the application server 913 to tear down any media streams established with the media server. 915 3.3. Modifying the Media Session 917 The VoiceXML Media Server MUST allow the media session to be modified 918 via a re-INVITE and SHOULD support the UPDATE method [RFC3311] for 919 the same purpose. 
In particular, it MUST be possible to change 920 streams between sendrecv, sendonly, and recvonly as specified in 921 [RFC3264]. 923 Unidirectional streams are useful for announcement- or listening-only 924 (hotword). The preferred mechanism for putting the media session on 925 hold is specified in [RFC3264], i.e. the UA modifies the stream to be 926 sendonly and mutes its own stream. Modification of the media session 927 does not affect VoiceXML application execution (except that 928 recognition actions initiated while on hold will result in noinput 929 timeouts). 931 3.4. Audio and Video Codecs 933 For the purposes of achieving a basic level of interoperability, this 934 section specifies a minimal subset of codecs and RTP [RFC3550] 935 payload formats that MUST be supported by the VoiceXML Media Server. 937 For audio-only applications, G.711 mu-law and A-law MUST be supported 938 using the RTP payload type 0 and 8 [RFC3551]. Other codecs and 939 payload formats MAY be supported. 941 Video telephony applications, which employ a video stream in addition 942 to the audio stream, are possible in VoiceXML 2.0/2.1 through the use 943 of multimedia file container formats such as the .3gp [TS26244] and 944 .mp4 formats [IEC14496-14]. Video support is optional for this 945 specification. If video is supported then: 947 1. H.263 Baseline [RFC4629] MUST be supported. For legacy reasons, 948 the 1996 version of H.263 MAY be supported using the RTP payload 949 format defined in [RFC2190] (payload type 34 [RFC3551]). 951 2. AMR-NB audio [RFC4867] SHOULD be supported. 953 3. MPEG-4 video [RFC3016] SHOULD be supported. 955 4. MPEG-4 AAC audio [RFC3016] SHOULD be supported. 957 5. Other codecs and payload formats MAY be supported. 959 Video record operations carried out by the VoiceXML Media Server 960 typically require receipt of an intra-frame before the recording can 961 commence. 
        The VoiceXML Media Server SHOULD use the mechanism
962     described in [RFC4585] to request that a new intra-frame be sent.

964     Since some applications may choose to transfer confidential
965     information, the VoiceXML Media Server MUST support Secure RTP (SRTP)
966     [RFC3711] as discussed in Section 9.

968  3.5.  DTMF

970     DTMF events [RFC4733] MUST be supported. When the user agent does
971     not indicate support for [RFC4733], the VoiceXML Media Server MAY
972     perform DTMF detection using other means such as detecting DTMF tones
973     in the audio stream. Implementation note: the reason why only
974     [RFC4733] telephone-events must be used when the user agent indicates
975     support of it is to avoid the risk of double detection of DTMF if
976     detection on the audio stream was simultaneously applied.

978  4.  Returning Data to the Application Server

980     This section discusses the mechanisms for returning data (e.g.
981     collected utterance or digit information) from the VoiceXML Media
982     Server to the Application Server.

984  4.1.  HTTP Mechanism

986     At any time during the execution of the VoiceXML application, data
987     can be returned to the Application Server via an HTTP POST using
988     standard VoiceXML elements such as <submit> or <subdialog>. Notably,
989     the <data> element in VoiceXML 2.1 [VXML21] allows data to be sent to
990     the Application Server efficiently without requiring a VoiceXML page
991     transition and is ideal for short VoiceXML applications such as
992     "prompt and collect".

994     For most applications, it is necessary to correlate the information
995     being passed over HTTP with a particular VoiceXML Session. One way
996     this can be achieved is to include the SIP Call-ID (accessible in
997     VoiceXML via the session.connection.protocol.sip.headers array)
998     within the HTTP POST fields.
        Alternatively, a unique "POST-back URI"
999      can be specified as an application-specific URI parameter in the
1000     Request-URI of the initial INVITE (accessible in VoiceXML via the
1001     session.connection.protocol.sip.requesturi array).

1003     Since some applications may choose to transfer confidential
1004     information, the VoiceXML Media Server MUST support the https: scheme
1005     as discussed in Section 9.

1007  4.2.  SIP Mechanism

1009     Data can be returned to the Application Server via the expr or
1010     namelist attribute on <exit> or the namelist attribute on
1011     <disconnect>. A VoiceXML Media Server MUST support encoding of the
1012     expr / namelist data in the message body of a BYE request sent from
1013     the VoiceXML Media Server as a result of encountering the <exit> or
1014     <disconnect> element. A VoiceXML Media Server MAY support inclusion
1015     of the expr / namelist data in the message body of the 200 OK message
1016     in response to a received BYE request (i.e. when the VoiceXML
1017     Application responds to the connection.disconnect.hangup event and
1018     subsequently executes an <exit> element with the expr or namelist
1019     attribute specified).

1021     Note that sending expr/namelist data in the 200 OK response requires
1022     that the VoiceXML Media Server delay the final response to the
1023     received BYE request until the VoiceXML Application's post-disconnect
1024     final processing state terminates. This mechanism is subject to the
1025     constraint that the VoiceXML Media Server must respond before the
1026     UAC's timer F expires (defaults to 32 seconds). Moreover, for
1027     unreliable transports, the UAC will retransmit the BYE request
1028     according to the rules of [RFC3261]. The VoiceXML Media Server
1029     SHOULD implement the recommendations of [RFC4320] regarding when to
1030     send the 100 Trying provisional response to the BYE request.
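The body encoding specified in the remainder of this section (conversion of each value to its JSON equivalent, application/x-www-form-urlencoded encoding with UTF-8 %HH escapes, and the reserved __exit and __reason names) can be sketched informally in Python as follows. The function name encode_bye_body is an assumption made for this illustration. Leaving the double quote character unescaped is also an assumption, chosen so that the output matches the result-content examples tabulated in this section; a stricter reading of [HTML4] would escape it as %22.

```python
import json
from urllib.parse import quote_plus


def encode_bye_body(variables, reason="exit"):
    """Build the BYE message body for expr/namelist data (illustrative).

    variables: mapping of names to values, in namelist order. For the
    expr attribute, pass {"__exit": value}.
    reason: value of the reserved __reason name ("exit" or "disconnect").
    """
    fields = []
    for name, value in variables.items():
        # Convert to the JSON value equivalent; keep non-ASCII characters
        # as-is so that they are escaped below as UTF-8 octets.
        text = json.dumps(value, ensure_ascii=False, separators=(",", ":"))
        # quote_plus emits uppercase %HH escapes of the UTF-8 octets.
        # Assumption: '"' is left unescaped to match the section's examples.
        fields.append(quote_plus(name) + "=" + quote_plus(text, safe='"'))
    fields.append("__reason=" + quote_plus(reason))
    return "&".join(fields)
```

For instance, encode_bye_body({"id": 1234, "pin": 9999}) yields a 30-octet body, which agrees with the Content-Length shown in the BYE example later in this section.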
1032     If a VoiceXML Application executes a <disconnect> [VXML21] and then
1033     subsequently executes an <exit> with namelist information, the
1034     namelist information from the <exit> element is discarded.

1036     Namelist variables are first converted to their JSON value equivalent
1037     [RFC4627] and encoded in the message body using the application/
1038     x-www-form-urlencoded format content type [HTML4]. The behavior
1039     resulting from specifying a recording variable in the namelist or an
1040     ECMAScript object with circular references is not defined. If the
1041     expr attribute is specified on the <exit> element instead of the
1042     namelist attribute, the reserved name __exit is used.

1044     To allow the application server to differentiate between a BYE
1045     resulting from a <disconnect> and one resulting from an <exit>, the
1046     reserved name __reason is used, with a value of "disconnect" (without
1047     brackets) to reflect the use of VoiceXML's <disconnect> element, and
1048     a value of "exit" (without brackets) to reflect an explicit <exit> in
1049     the VoiceXML document. If the session terminates for other reasons
1050     (such as the media server encountering an error), this parameter may
1051     be omitted, or may take on platform-specific values prefixed with an
1052     underscore.

1054     This specification extends the application/x-www-form-urlencoded
1055     format by replacing non-ASCII characters with one or more octets of
1056     the UTF-8 representation of the character, with each octet in turn
1057     replaced by %HH, where HH represents the uppercase hexadecimal
1058     notation for the octet value and % is a literal character. As a
1059     consequence, the Content-Type header field in a BYE message containing
1060     expr/namelist data MUST be set to
1061     application/x-www-form-urlencoded;charset=utf-8.

1062     The following table provides some examples of usage and the
1063     corresponding result content.
1065     +----------------------------------------------------------------+
1066     | Usage                        | Result Content                  |
1067     |------------------------------|---------------------------------|
1068     | <exit/>                      | __reason=exit                   |
1069     | <exit expr="5"/>             | __exit=5&__reason=exit          |
1070     | <exit expr="'done'"/>        | __exit="done"&__reason=exit     |
1071     | <exit expr="userAuthorized"/>| __exit=true&__reason=exit       |
1072     | <exit namelist="pin errors"/>| pin=1234&errors=0&__reason=exit |
1073     +----------------------------------------------------------------+
1074      assuming the following VoiceXML variables and values:
1075        userAuthorized = true
1076        pin = 1234
1077        errors = 0

1079     For example, consider the VoiceXML snippet:

1081     ...
1082     <exit namelist="id pin"/>
1083     ...

1085     If id equals 1234 and pin equals 9999, say, the BYE message would
1086     look similar to:

1088     BYE sip:user@pc33.example.com SIP/2.0
1089     Via: SIP/2.0/UDP 192.0.2.4;branch=z9hG4bKnashds10
1090     Max-Forwards: 70
1091     From: sip:dialog@example.com;tag=a6c85cf
1092     To: sip:user@example.com;tag=1928301774
1093     Call-ID: a84b4c76e66710
1094     CSeq: 231 BYE
1095     Content-Type: application/x-www-form-urlencoded;charset=utf-8
1096     Content-Length: 30

1098     id=1234&pin=9999&__reason=exit

1100     Since some applications may choose to transfer confidential
1101     information, the VoiceXML Media Server MUST support the S/MIME
1102     encoding of SIP message bodies as discussed in Section 9.

1104  5.  Outbound Calling

1106     Outbound calls can be triggered via the Application Server using
1107     third party call control [RFC3725].

1109     Flow IV from [RFC3725] is recommended in conjunction with the
1110     VoiceXML Session preparation mechanism. This flow has several
1111     advantages over others, namely:

1113     1.  Selection of a VoiceXML Media Server and preparation of the
1114         VoiceXML Application can occur before the call is placed to avoid
1115         the callee experiencing delays.

1117     2.  Avoids timing difficulties that could occur with other flows due
1118         to the time taken to fetch and parse the initial VoiceXML
1119         document.

1121     3.  The flow is IPv6 compatible.
1123     An example flow for an Application Server initiated outbound call is
1124     provided in section 2.6.2.

1126  6.  Call Transfer

1128     While VoiceXML is at its core a dialog language, it also provides
1129     optional call transfer capability. VoiceXML's transfer capability is
1130     particularly suited to the PSTN IVR Service Node use-case described
1131     in section 1.1.2. It is NOT RECOMMENDED to use VoiceXML's call
1132     transfer capability in networks involving Application Servers.
1133     Rather, the Application Server itself can provide call routing
1134     functionality by taking signaling actions based on the data returned
1135     to it from the VoiceXML Media Server via HTTP or in the SIP BYE
1136     message.

1138     If VoiceXML transfer is supported, the mechanism described in this
1139     section MUST be employed. The transfer flows specified here are
1140     selected on the basis that they provide the best interworking across
1141     a wide range of SIP devices. CCXML<->VoiceXML implementations, which
1142     require tight-coupling in the form of bi-directional eventing to
1143     support all transfer types defined in VoiceXML, may benefit from
1144     other approaches, such as the use of SIP event packages [RFC3265].

1146     In what follows, the provisional responses have been omitted for
1147     clarity.

1149  6.1.  Blind

1151     The blind transfer sequence is initiated by the VoiceXML Media Server
1152     via a REFER message [RFC3515] on the original SIP dialog. The
1153     Refer-To header contains the URI for the called party, as specified
1154     via the 'dest' or 'destexpr' attributes on the VoiceXML <transfer>
1155     tag.

1157     If the REFER request is accepted, in which case the VoiceXML Media
1158     Server will receive a 2xx response, the VoiceXML Media Server throws
1159     the connection.disconnect.transfer event and will terminate the
1160     VoiceXML Session with a BYE message. For blind transfers,
1161     implementations MAY use [RFC4488] to suppress the implicit
1162     subscription associated with the REFER message.
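The outcome handling for a blind transfer, combining the 2xx case above with the non-2xx mapping tabulated in the remainder of this section, can be sketched informally in Python as follows. The names BLIND_TRANSFER_RESULTS and blind_transfer_outcome are assumptions made for this illustration, as is the labelling of each result as an "event" or a form item "variable" value, which follows the usual VoiceXML convention that error.* and connection.* names are events.

```python
# Non-2xx final responses to the REFER that map to a specific event.
BLIND_TRANSFER_RESULTS = {
    404: ("event", "error.connection.baddestination"),
    405: ("event", "error.unsupported.transfer.blind"),
    503: ("event", "error.connection.noresource"),
}


def blind_transfer_outcome(status):
    """Map the final response to a REFER onto the blind transfer result.

    status: final SIP status code, or None if no response was received.
    Returns a (kind, name) tuple where kind is "event" or "variable".
    """
    if status is None:
        return ("variable", "network_busy")
    if 200 <= status < 300:
        # REFER accepted: the session ends with a BYE after this event.
        return ("event", "connection.disconnect.transfer")
    # Any other 3xx/4xx/5xx/6xx response falls back to 'unknown'.
    return BLIND_TRANSFER_RESULTS.get(status, ("variable", "unknown"))
```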
1164     If the REFER request results in a non-2xx response, the <transfer>'s
1165     form item variable (or event raised) depends on the SIP response and
1166     is specified in the following table. Note that this indicates that
1167     the transfer request was rejected.

1169     +-------------------------+-----------------------------------+
1170     | SIP Response            | <transfer> variable / event       |
1171     +-------------------------+-----------------------------------+
1172     | 404 Not Found           | error.connection.baddestination   |
1173     | 405 Method Not Allowed  | error.unsupported.transfer.blind  |
1174     | 503 Service Unavailable | error.connection.noresource       |
1175     | (No response)           | network_busy                      |
1176     | (Other 3xx/4xx/5xx/6xx) | unknown                           |
1177     +-------------------------+-----------------------------------+

1179     An example is illustrated below (provisional responses and NOTIFY
1180     messages corresponding to provisional responses have been omitted for
1181     clarity).

1183  User Agent 1        VoiceXML        User Agent 2
1184    (Caller)        Media Server        (Callee)
1185        |                 |                 |
1186        |(0) RTP/SRTP     |                 |
1187        |.................|                 |
1188        |                 |                 |
1189        |(1) REFER        |                 |
1190        |<----------------|                 |
1191        |(2) 202 Accepted |                 |
1192        |---------------->|                 |
1193        |(3) BYE          |                 |
1194        |<----------------|                 |
1195        |(4) 200 OK       |                 |
1196        |---------------->|                 |
1197        |                 | Stop RTP (0)    |
1198        |(5) INVITE                         |
1199        |---------------------------------->|
1200        |(6) 200 OK                         |
1201        |<----------------------------------|
1202        |(7) NOTIFY       |                 |
1203        |---------------->|                 |
1204        |(8) 200 OK       |                 |
1205        |<----------------|                 |
1206        |(9) ACK                            |
1207        |---------------------------------->|
1208        |(10) RTP/SRTP                      |
1209        |...................................|
1210        |                 |                 |

1212     If the "aai" or "aaiexpr" attribute is present on <transfer>, it is
1213     appended to the Refer-To URI as a parameter named "aai" in the REFER
1214     method. Reserved characters are URL-encoded as required for SIP/SIPS
1215     URIs [RFC3261]. The mapping of values outside of the ASCII range is
1216     platform specific.

1218  6.2.
Bridge

1220     The bridge transfer function results in the creation of a small
1221     multi-party session involving the Caller, the VoiceXML Media Server,
1222     and the Callee. The VoiceXML Media Server invites the Callee to the
1223     session and will eject the Callee if the transfer is terminated.

1225     If the "aai" or "aaiexpr" attribute is present on <transfer>, it is
1226     appended to the Request-URI in the INVITE as a URI parameter named
1227     "aai". Reserved characters are URL-encoded as required for SIP/SIPS
1228     URIs [RFC3261]. The mapping of values outside of the ASCII range is
1229     platform specific.

1231     During the transfer attempt, audio specified in the transferaudio
1232     attribute of <transfer> is streamed to User Agent 1. A VoiceXML
1233     Media Server MAY play early media received from the Callee to the
1234     Caller if the transferaudio attribute is omitted.

1236     The bridge transfer sequence is illustrated below. The VoiceXML
1237     Media Server (acting as a UAC) makes a call to User Agent 2 with the
1238     same codecs used by User Agent 1. When the call setup is complete,
1239     RTP flows between User Agent 2 and the VoiceXML Media Server. This
1240     stream is mixed with User Agent 1's.

1242  User Agent 1          VoiceXML          User Agent 2
1243    (Caller)          Media Server          (Callee)
1244        |                   |                   |
1245        |(0) RTP/SRTP       |                   |
1246        |...................|                   |
1247        |                   |                   |
1248        |                   |(1) INVITE [offer] |
1249        |                   |------------------>|
1250        |                   |(2) 200 OK [answer]|
1251        |                   |<------------------|
1252        |                   |(3) ACK            |
1253        |                   |------------------>|
1254        |                   |(4) RTP/SRTP       |
1255        |            mix    |...................|
1256        |            (0)+(4)|                   |

1258     If a final response to the INVITE is not received from User Agent 2
1259     and the connecttimeout expires (specified as an attribute of
1260     <transfer>), the VoiceXML Media Server will issue a CANCEL to
1261     terminate the transaction and the <transfer>'s form item variable is
1262     set to noanswer.
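The "aai" handling described earlier in this section can be sketched informally in Python as follows. The function name append_aai is an assumption made for this illustration. The sketch escapes every character outside the unreserved alphanumeric set, which is more conservative than SIP URI syntax strictly requires, and it escapes non-ASCII characters as UTF-8 octets, which is only one possible choice given that this mapping is platform specific.

```python
from urllib.parse import quote


def append_aai(uri, aai):
    """Append an 'aai' URI parameter to a SIP or SIPS URI (illustrative).

    Assumption: non-ASCII characters are escaped as UTF-8 octets; the
    specification leaves this mapping platform specific.
    """
    # URI parameters precede any "?headers" part of a SIP URI.
    base, sep, headers = uri.partition("?")
    return f"{base};aai={quote(aai, safe='')}{sep}{headers}"
```

The same helper would apply to the Refer-To URI used for blind transfers, since the parameter name and escaping rules are the same in both cases.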
1264     If the INVITE results in a non-2xx response, the <transfer>'s form
1265     item variable (or event raised) depends on the SIP response and is
1266     specified in the following table.

1268     +-------------------------+-----------------------------------+
1269     | SIP Response            | <transfer> variable / event       |
1270     +-------------------------+-----------------------------------+
1271     | 404 Not Found           | error.connection.baddestination   |
1272     | 405 Method Not Allowed  | error.unsupported.transfer.bridge |
1273     | 408 Request Timeout     | noanswer                          |
1274     | 486 Busy Here           | busy                              |
1275     | 503 Service Unavailable | error.connection.noresource       |
1276     | (No response)           | network_busy                      |
1277     | (Other 3xx/4xx/5xx/6xx) | unknown                           |
1278     +-------------------------+-----------------------------------+

1280     Once the transfer is established, the VoiceXML Media Server can
1281     "listen" to the media stream from User Agent 1 to perform speech or
1282     DTMF hotword, which when matched results in a near-end disconnect,
1283     i.e. the VoiceXML Media Server issues a BYE to User Agent 2 and the
1284     VoiceXML Application continues with User Agent 1. A BYE will also be
1285     issued to User Agent 2 if the call duration exceeds the maximum
1286     duration specified in the maxtime attribute on <transfer>.

1288     If User Agent 2 issues a BYE during the transfer, the transfer
1289     terminates and the VoiceXML <transfer>'s form item variable receives
1290     the value far_end_disconnect. If User Agent 1 issues a BYE during
1291     the transfer, the transfer terminates and the VoiceXML event
1292     connection.disconnect.transfer is thrown.

1294  6.3.  Consultation

1296     The consultation transfer (also called attended transfer [SIPEX]) is
1297     similar to a blind transfer except that the outcome of the transfer
1298     call setup is known and the Caller is not dropped as a result of an
1299     unsuccessful transfer attempt.
1301     Consultation transfer commences with the same flow as for bridge
1302     transfer except that the RTP streams are not mixed at step (4) and
1303     error.unsupported.transfer.consultation supplants
1304     error.unsupported.transfer.bridge. Assuming a new SIP dialog with
1305     User Agent 2 is created, the remainder of the sequence follows as
1306     illustrated below (provisional responses and NOTIFY messages
1307     corresponding to provisional responses have been omitted for
1308     clarity). Consultation transfer makes use of the Replaces: header
1309     [RFC3891] such that User Agent 1 calls User Agent 2 and replaces the
1310     latter's SIP dialog with the VoiceXML Media Server with a new SIP
1311     dialog between the Caller and Callee.

1313  User Agent 1        VoiceXML        User Agent 2
1314    (Caller)        Media Server        (Callee)
1315        |                 |                 |
1316        |(0) RTP/SRTP     |                 |
1317        |.................|(4) RTP/SRTP     |
1318        |                 |.................|
1319        |(5) REFER        |                 |
1320        |<----------------|                 |
1321        |(6) 202 Accepted |                 |
1322        |---------------->|                 |
1323        |(7) INVITE Replaces:ms1.example.com|
1324        |---------------------------------->|
1325        |(8) 200 OK                         |
1326        |<----------------------------------|
1327        |(9) ACK                            |
1328        |---------------------------------->|
1329        |(10) RTP/SRTP                      |
1330        |...................................|
1331        |                 |(11) BYE         |
1332        |                 |<----------------|
1333        |                 |(12) 200 OK      |
1334        |                 |---------------->| Stop
1335        |(13) NOTIFY      |                 | RTP (4)
1336        |---------------->|                 |
1337        |(14) 200 OK      |                 |
1338        |<----------------|                 |
1339        |(15) BYE         |                 |
1340        |<----------------|                 |
1341        |(16) 200 OK      |                 |
1342        |---------------->| Stop            |
1343        |                 | RTP (0)         |

1345     If a response other than 202 Accepted is received in response to the
1346     REFER request sent to User Agent 1, the transfer terminates, and an
1347     error.unsupported.transfer.consultation event is raised. In
1348     addition, a BYE is sent to User Agent 2 to terminate the established
1349     outbound leg.
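The NOTIFY handling for consultation transfer, as specified in the next paragraph, depends on the status line carried in the sipfrag message body. The following Python sketch classifies such a body informally. The function name classify_sipfrag and the three return labels are assumptions made for this illustration; only the status code of the first line is examined.

```python
def classify_sipfrag(body):
    """Classify a message/sipfrag NOTIFY body by its status line.

    Returns "pending" for 1xx, "succeeded" for 2xx, and "failed" for any
    other final response (labels are illustrative, not normative).
    """
    lines = body.strip().splitlines()
    # A status line looks like "SIP/2.0 200 OK".
    fields = lines[0].split(None, 2) if lines else []
    if len(fields) < 2 or not fields[1].isdigit():
        raise ValueError("sipfrag body does not start with a status line")
    code = int(fields[1])
    if code < 200:
        return "pending"    # provisional response: keep waiting
    if code < 300:
        return "succeeded"  # throw connection.disconnect.transfer, BYE UA 1
    return "failed"         # transfer terminates, variable set to 'unknown'
```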
1351 The VoiceXML Media Server uses receipt of a NOTIFY message with a 1352 sipfrag message of 200 OK to determine that the consultation transfer 1353 has succeeded. When this occurs, the connection.disconnect.transfer 1354 event will be thrown to the VoiceXML application, and a BYE is sent 1355 to User Agent 1 to terminate the session. A NOTIFY message with a 1356 non-2xx final response sipfrag message body will result in the 1357 transfer terminating and the associated VoiceXML input item variable 1358 being set to 'unknown'. Note that as a consequence of this 1359 mechanism, implementations MUST NOT use [RFC4488] to suppress the 1360 implicit subscription associated with the REFER message for 1361 consultation transfers. 1363 7. Contributors 1365 The bulk of the early work for this effort was carried out on weekly 1366 teleconferences and over e-mail. The authors would particularly like 1367 to recognize the contributions of R. J. Auburn (Voxeo), Jeff Haynie 1368 (Hakano), and Scott McGlashan (Hewlett-Packard). 1370 8. Acknowledgements 1372 This document owes its genesis to the expired Internet-Draft, "A SIP 1373 Interface to VoiceXML Dialog Servers", authored by J. Rosenberg, P. 1374 Mataga, and D. Ladd. The following people had input to the current 1375 document: 1377 R. J. Auburn (Voxeo) 1379 Hans Bjurstrom (Hewlett-Packard) 1381 Emily Candell (Comverse) 1383 Peter Danielsen (Lucent) 1385 Brian Frasca (Tellme) 1387 Jeff Haynie (Hakano) 1389 Scott McGlashan (Hewlett-Packard) 1391 Matt Oshry (Tellme) 1393 Rao Surapaneni (Tellme) 1395 The authors would like to acknowledge the support of Cullen Jennings 1396 and the Mediactrl chairs, Eric Burger and Spencer Dawkins. 1398 9. Security Considerations 1400 Exposing a VoiceXML media service with a well-known address may 1401 enhance the possibility of exploitation (for example an invoked 1402 network service may trigger a billing event). 
   The VoiceXML Media Server is RECOMMENDED to use standard SIP
   mechanisms [RFC3261] to authenticate requesting endpoints and
   authorize per local policy.

   Some applications may choose to transfer confidential information to
   or from the VoiceXML Media Server. To provide data confidentiality,
   the VoiceXML Media Server MUST implement the sips: and https: schemes
   in addition to S/MIME message body encoding as described in
   [RFC3261].

   The VoiceXML Media Server MUST support Secure RTP (SRTP) [RFC3711] to
   provide confidentiality, authentication, and replay protection for
   RTP media streams (including RTCP control traffic).

   To mitigate the possibility of denial-of-service attacks, the
   VoiceXML Media Server is RECOMMENDED (in addition to authenticating
   and authorizing endpoints as described above) to provide mechanisms
   for implementing local policies such as time-limiting of VoiceXML
   application execution.

10.  IANA Considerations

   IANA SHALL register the following parameters in the SIP/SIPS URI
   Parameters registry, following the specification required policy of
   RFC 3969:

   Parameter Name  Predefined Values  Reference
   --------------  -----------------  ---------
   maxage          no                 TBD
   maxstale        no                 TBD
   method          "get" / "post"     TBD
   postbody        no                 TBD
   ccxml           no                 TBD
   aai             no                 TBD

11.  Changes since last version:

   o  Tightened up Security Considerations per comments from IESG review

   o  Added missing ccxml and aai IANA registrations

   o  Miscellaneous typos

12.  References

12.1.  Normative References

   [HTML4]    Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
              Specification", W3C Recommendation, Dec 1999.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T.
              Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1",
              RFC 2616, June 1999.

   [RFC3016]  Kikuchi, Y., Nomura, T., Fukunaga, S., Matsui, Y., and H.
              Kimata, "RTP Payload Format for MPEG-4 Audio/Visual
              Streams", RFC 3016, November 2000.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
              with Session Description Protocol (SDP)", RFC 3264,
              June 2002.

   [RFC3265]  Roach, A., "Session Initiation Protocol (SIP)-Specific
              Event Notification", RFC 3265, June 2002.

   [RFC3311]  Rosenberg, J., "The Session Initiation Protocol (SIP)
              UPDATE Method", RFC 3311, October 2002.

   [RFC3326]  Schulzrinne, H., Oran, D., and G. Camarillo, "The Reason
              Header Field for the Session Initiation Protocol (SIP)",
              RFC 3326, December 2002.

   [RFC3515]  Sparks, R., "The Session Initiation Protocol (SIP) Refer
              Method", RFC 3515, April 2003.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
              Video Conferences with Minimal Control", STD 65, RFC 3551,
              July 2003.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, March 2004.

   [RFC3725]  Rosenberg, J., Peterson, J., Schulzrinne, H., and G.
              Camarillo, "Best Current Practices for Third Party Call
              Control (3pcc) in the Session Initiation Protocol (SIP)",
              BCP 85, RFC 3725, April 2004.

   [RFC3891]  Mahy, R., Biggs, B., and R. Dean, "The Session Initiation
              Protocol (SIP) "Replaces" Header", RFC 3891,
              September 2004.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC4244]  Barnes, M., "An Extension to the Session Initiation
              Protocol (SIP) for Request History Information", RFC 4244,
              November 2005.

   [RFC4320]  Sparks, R., "Actions Addressing Identified Issues with the
              Session Initiation Protocol's (SIP) Non-INVITE
              Transaction", RFC 4320, January 2006.

   [RFC4488]  Levin, O., "Suppression of Session Initiation Protocol
              (SIP) REFER Method Implicit Subscription", RFC 4488,
              May 2006.

   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
              "Extended RTP Profile for Real-time Transport Control
              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
              July 2006.

   [RFC4627]  Crockford, D., "The application/json Media Type for
              JavaScript Object Notation (JSON)", RFC 4627, July 2006.

   [RFC4629]  Ott, J., Bormann, C., Sullivan, G., Wenger, S., and R.
              Even, "RTP Payload Format for ITU-T Rec. H.263 Video",
              RFC 4629, January 2007.

   [RFC4733]  Schulzrinne, H. and T. Taylor, "RTP Payload for DTMF
              Digits, Telephony Tones, and Telephony Signals", RFC 4733,
              December 2006.

   [RFC4855]  Casner, S., "Media Type Registration of RTP Payload
              Formats", RFC 4855, February 2007.

   [RFC4867]  Sjoberg, J., Westerlund, M., Lakaniemi, A., and Q. Xie,
              "RTP Payload Format and File Storage Format for the
              Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband
              (AMR-WB) Audio Codecs", RFC 4867, April 2007.

   [VXML20]   McGlashan, S., Burnett, D., Carter, J., Danielsen, P.,
              Ferrans, J., Hunt, A., Lucas, B., Porter, B., Rehor, K.,
              and S. Tryphonas, "Voice Extensible Markup Language
              (VoiceXML) Version 2.0", W3C Recommendation, March 2004.

   [VXML21]   Oshry, M., Auburn, R J., Baggia, P., Bodell, M., Burke,
              D., Burnett, D., Candell, E., Kilic, H., McGlashan, S.,
              Lee, A., Porter, B., and K. Rehor, "Voice Extensible
              Markup Language (VoiceXML) Version 2.1", W3C Candidate
              Recommendation, June 2005.

12.2.  Informative References

   [CCXML10]  Auburn, R J., "Voice Browser Call Control: CCXML Version
              1.0", W3C Working Draft (work in progress), June 2005.

   [IEC14496-14]
              "Information technology. Coding of audio-visual objects.
              MP4 file format", ISO/IEC 14496-14:2003, October 2003.

   [MRCPv2]   Shanmugham, S. and D. Burnett, "Media Resource Control
              Protocol Version 2", draft-ietf-speechsc-mrcpv2-13 (work
              in progress), Sep 2007.

   [RFC2190]  Zhu, C., "RTP Payload Format for H.263 Video Streams",
              RFC 2190, September 1997.

   [RFC3960]  Camarillo, G. and H. Schulzrinne, "Early Media and Ringing
              Tone Generation in the Session Initiation Protocol (SIP)",
              RFC 3960, December 2004.

   [RFC3969]  Camarillo, G., "The Internet Assigned Number Authority
              (IANA) Uniform Resource Identifier (URI) Parameter
              Registry for the Session Initiation Protocol (SIP)",
              BCP 99, RFC 3969, December 2004.

   [RFC4240]  Burger, E., Van Dyke, J., and A. Spitzer, "Basic Network
              Media Services with SIP", RFC 4240, December 2005.

   [SIPEX]    Johnston, A., Sparks, R., Cunningham, C., Donovan, S., and
              K. Summers, "Session Initiation Protocol Examples",
              draft-ietf-sipping-service-examples (work in progress),
              July 2005.

   [TS23002]  "3rd Generation Partnership Project: Network architecture
              (Release 6)", 3GPP TS 23.002 v6.6.0, December 2004.

   [TS26244]  "Transparent end-to-end packet switched streaming service
              (PSS); 3GPP file format (3GP)", 3GPP TS 26.244 v6.4.0,
              December 2004.

Appendix A.  Notes on Normative References

   We make a "downref" normative reference to [RFC4627] - an
   Informational Draft describing a proprietary (but extremely popular)
   format.

Authors' Addresses

   Dave Burke
   Google
   Belgrave House, 76 Buckingham Palace Road
   London  SW1W 9TQ
   United Kingdom

   Email: daveburke@google.com


   Mark Scott
   Genesys
   1120 Finch Avenue West, 8th floor
   Toronto, Ontario  M3J 3H7
   Canada

   Email: Mark.Scott@genesyslab.com