Internet Engineering Task Force                                   SIP WG
Internet Draft                              Rosenberg/Mataga/Schulzrinne
draft-rosenberg-sip-app-components-00.txt        dynamicsoft/Columbia U.
November 15, 2000
Expires: May 2001

          An Application Server Component Architecture for SIP

STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as work in progress.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   An application server is defined as an entity that is capable of
   providing advanced features to users. Examples of features include
   call forwarding, call screening, debit card calling, web interactive
   voice response, etc. However, the set of functions needed to enable a
   broad range of such applications is quite large - it includes speech
   recognition, DTMF recognition and digit collection, text-to-speech
   synthesis, database interfacing, audio and video coding and decoding,
   audio and video bridging and mixing, and signaling, to name a few.
   Supporting such a large set of functions on the same box presents a
   major challenge.
   To solve this problem, the industry is proposing a
   decomposition of the application server into two components - a media
   server that handles the media component, and an application server
   that handles the call control, data, and signaling. The interface
   that has been proposed between these two elements is a control
   mechanism along the lines of MGCP or Megaco. In this paper, we
   propose an orthogonal decomposition, which breaks an application
   server into application server components. Each component is an
   application server in its own right, providing a well-defined
   function that by itself may be a complete, but simpler, application.

1 Introduction

   An observable trend in VoIP systems is the continuing decomposition
   of monolithic elements into component subparts, with the
   corresponding development of standardized interfaces between
   components. This kind of decomposition can be observed in the
   MGCP/Megaco [1] gateway decomposition of a large gateway into a
   signaling gateway (SG), media gateway (MG) and media gateway
   controller (MGC), often referred to as a softswitch. Following that
   decomposition, the softswitch was further decomposed into a pure call
   control component (still referred to as a softswitch) and an
   application server (AS), which provides features and services. The AS
   was then decomposed, breaking it into a signaling piece (still
   referred to as an application server), and a media server (MS), which
   provides the media components of applications. Protocols like MGCP
   [2] and Megaco [3] have been proposed as the interface between an AS
   and MS.

   This paper proposes an additional decomposition of an application
   server into application server components (ASCs). This decomposition
   is orthogonal to the MS/AS decomposition, and differs significantly
   in its goals and benefits.
   The primary motivation is the recognition
   that most complex (and interesting) applications require a common set
   of core pieces - speech recognition and text-to-speech, translation
   services, conference servers, messaging servers, etc. Each of these
   components is complex and a full-fledged application in its own
   right. In most cases, a complex application does not care about
   the details of the operation of the component. In many cases, these
   components run on separate servers, and would often be provided by
   separate providers. What is needed, then, is a well-defined,
   distributed interface to these application server components. Here,
   we motivate a distributed decomposition of applications into
   components, and then show why, for many of these, the interface is
   best realized as a distributed session establishment and
   termination interface that follows a standardized pattern of
   addressing and parameter passing. We believe the Session Initiation
   Protocol (SIP) [4] is ideally suited for such an interface.

2 Why Decompose

   The first question to address is "why decompose an application
   server".

   Decomposition is the act of breaking a large, monolithic system into
   a number of smaller components that interact according to specified
   behaviors. Decomposition of large components offers a number of
   benefits:

   Scale. As systems need to serve more and more users, there are
        two approaches to scaling up. One is to buy increasingly
        faster hardware, so that the monolithic servers can keep up
        with increasing use. The second is to distribute the work
        across components, so that multiple servers perform the
        work. Distribution is fundamentally cheaper, since the cost
        of large monolithic systems increases exponentially with
        capacity, compared to the linear increase in cost with
        multiple, smaller units.
        Distribution of work can be done
        through load balancing, where each server remains
        homogeneous, but the work is spread across numerous
        servers, or it can be done through specialization, where
        the work is split into separate functions, and each
        function placed on a separate server. Specialization is
        ideal in cases where different parts of the work have
        different requirements. As an example, a component of an
        application may require special purpose hardware. This
        component can be distributed to a specialized processor,
        with a normal off-the-shelf processor handling the more
        generic software tasks. Several of the components that we
        are describing fit into this category (such as the TTS
        server).

   Sharing of resources. By decomposing a server into components, a
        many-to-many interaction between them becomes possible.
        This means that one component can provide services to many
        other components. This provides for sharing of resources,
        which ultimately results in cost reduction.

   Expertise. Building a complex application requires expertise in
        call control, media services, compression, web, speech
        recognition, etc. It is highly unlikely that one
        organization will have enough expertise in all of these to
        build them all. By decomposing an application server into
        subpieces, organizations with expertise in one particular
        piece can build that one. The result is that the complete
        system can be composed of best-of-breed components.

   Speed of deployment. By decomposing, upgrading existing
        applications and deploying new ones becomes simpler. The
        decomposition provides isolation. This isolation means that
        one component can be changed or improved without affecting
        others. That makes it easy to add new features to an
        application, or to deploy a new one by using components
        already deployed.

   Decomposition does have its drawbacks.
   Primary amongst them is
   security. In general, the more boxes in a system, and the more they
   interact with each other, the more complex the security is. As a
   result, any distributed system has inherently more complex security
   issues. Another drawback is reliability. A system with multiple
   boxes, where the system requires all boxes to work in order to
   function, is less reliable than a system with a single box which must
   work.

3 Tightly Coupled Decomposition

   As an example of decomposition, it has been proposed to break the
   application server into a signaling and control component (the AS),
   plus a media server component (the MS). This decomposition is shown
   in Figure 1.

   Calls arrive at the AS component over SIP. The AS then accesses the
   MS using MGCP, and learns the IP address and port where the media for
   the call can be sent. This is returned in the 200 OK response by the
   AS. The AS then begins to instruct the MS to perform specific
   functions - collect digits, play tones and announcements, and
   report the digits and tones back to the AS for further processing.
   Typically, the MGCP interface between the two devices is fairly
   "busy"; there is a lot of messaging for complex applications.

   In this model, there is a tightly coupled relationship between the MS
   and AS. The MS cannot function without the AS, and the AS needs to
   perform tight, low-level controls over the detailed operation of the
   media server.

   To some degree, breaking of an application server into these two
   components represents an implementation detail of how one builds a
   large, monolithic application server. It is not generally possible
   for the two components to be owned by separate providers. In fact, it
   has yet to be shown that complete interoperability and integration
   are possible with two components from different vendors, let alone
   different providers.
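
   The reliability cost of requiring every box to work, noted at the
   end of Section 2 and relevant to the tightly coupled AS/MS pair, can
   be illustrated with simple arithmetic. The sketch below is only an
   illustration: it assumes that boxes fail independently, and the
   99.9% availability figure is a hypothetical value, not a
   measurement.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a system that works only if every box works.

    Assumes the boxes fail independently of one another.
    """
    return prod(availabilities)


# Hypothetical figure: each box is up 99.9% of the time.
single_box = serial_availability([0.999])
as_plus_ms = serial_availability([0.999, 0.999])

print(f"single box: {single_box:.4%}")
print(f"AS + MS:    {as_plus_ms:.4%}")
```

   Under these assumptions the two-box system is available slightly
   less often than either box alone, and the gap widens as more
   mandatory boxes are added.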

               ....................
               .                  .
               .  +-------------+ .
               .  |             | .
      SIP      .  |             | .
     -------------+     AS      | .
               .  |             | .
               .  |             | .
               .  |             | .
               .  +-------------+ .
               .         |        .
               .         |        .
               .         |        .
               .         |MGCP    .
               .         |        .
               .         |        .
               .         |        .
               .  +-------------+ .
               .  |             | .
               .  |             | .
      RTP      .  |             | .
     -------------+     MS      | .
               .  |             | .
               .  |             | .
               .  +-------------+ .
               .                  .
               ....................
               Complete Application
                      Server

               Figure 1: MGCP-based decomposition

   This decomposition also does not provide a true separation of
   function. Most applications that require media interaction (IVR,
   credit card and debit card, etc.) have very cleanly separated media
   phases and signaling phases. The details of the media interactions
   are usually not important to the signaling component, and vice
   versa. As an example, consider a debit card application. The
   application starts with the user making a call. As part of the call
   processing, interaction is needed with the user via the media stream
   to determine the debit card number. The precise set of menu
   operations and interactions used to obtain this number is not
   important to the call/signaling processing piece; only the result
   (the number) is important. Once the number is returned, media
   processing ceases, and data and call processing commence. The debit
   card is looked up in a subscriber database, and if enough time
   remains, the call is completed. The signaling component monitors the
   call, and when the card has run out of minutes, the call is
   terminated.

   Consider the case where the application provider decides that the
   menus presented for debit card collection are confusing, and they
   need to be changed. This change really affects the media processing
   only; ideally, we would like to have no change whatsoever in the data
   processing and signaling part of the application.
   However, in the
   decomposition afforded by MGCP, the AS component contains both the
   signaling and call control, in addition to the control of the IVR
   menus and processing. Thus, the AS needs to be updated, even
   though what has changed is really an IVR component.

   The MGCP decomposition also presents a burden for software developers
   on the AS. They need to understand, and program, the detailed
   interactions with the MS that are provided by MGCP, in addition to
   the detailed signaling and data processing operations. The developers
   will also need to build and manage the low level state representing
   the controlled entity, which can be painful. The result is longer
   development times, less code reuse, and slower innovation.

   It has been argued that one of the benefits of the MGCP decomposition
   is that it offloads the "burden" of call control from the media
   server. However, from a complexity standpoint, the MGCP processing
   required is probably on par with (if not more than) the simple
   amount of call control and SIP processing needed if SIP were used
   directly.

   From a reliability perspective, an MGCP style decomposition is less
   desirable. Since the components are strongly coupled, the system will
   fail if any one of the pieces fails. Failure can also be
   introduced because of additional network resources needed for
   communications between the boxes. The result is that the MGCP
   decomposition may actually increase the probability of failure, as
   compared to no decomposition at all.

   Another decomposition that has been proposed is to break a proxy into
   a routing and call control component, plus a services component. The
   interface between the two is then a transactional interface for
   services, similar in concept to INAP, based upon state transitions
   within a call model.
   This is another form of tight coupling, since it
   requires the services component to have detailed knowledge of the
   operational model of the call control component. We believe that this
   decomposition is limiting, for the same reasons the AS/MS
   decomposition is limiting.

4 The Decoupled Model

4.1 Architecture

   As a result of this, we see the master/slave decomposition as being
   ideal for a single vendor to build a large system. However, this
   decomposition does not solve the other distribution needs we have
   motivated above. As a result, we propose that the AS be decomposed
   into an application component responsible for coordinating the
   overall execution of the application (called the controller), and
   application server components that provide pieces of the overall
   application. These components are only loosely coupled with the
   coordinating application server. The loose coupling implies that the
   interaction between them is the same as the interaction between the
   user and the coordinating application server, which is, in turn, the
   same as the interaction between the application server components and
   other application server components. The components can easily be
   from separate vendors, and the interactions support the needed
   security and routing features to allow them even to be owned by
   separate providers.

   The architecture is shown in Figure 2.

   The goal of the decoupling is to break the application into as
   coarse-grained pieces as possible. Each component (the coordinator
   included) should need to know as little as possible about the
   detailed operations performed by other components. A coarse-grained
   decomposition means that there is a clean and simple break in the
   functionality provided by the components. This enables significantly
   simpler interfaces between those components.

   Each component is really interested in passing a request for service
   to another, letting the other component perform its task, and then
   getting the final result of the task back as an output. From a
   software engineering perspective, this represents the classic
   function call; the call signaling component is making a function call
   to the media part. It is interested only in the return value - the
   debit card number, for example - and does not really care about the
   implementation of it. From a protocol perspective, this is a classic
   client-server system. The client makes a request of the server, and
   the server does whatever it needs to do to return the final response.
   The problem more closely resembles the client-server system than the
   function call, however, because we need the interaction to be
   across the network, rather than between code within the same process:
   one of the key concepts here is that components can
   be provided by separate service providers.

   In such a model, where does the state for the sessions live? Here, we
   define a session as the complete set of interactions amongst all
   components for the delivery of the service. Thus, a session might
   span multiple protocols, and even multiple calls. Not surprisingly,
   session state is distributed amongst the components, and the
   distribution follows the architectural model of Figure 2. The top
   level server, the controller, maintains the high level pieces of
   state that deal with overall delivery of the service, and the state
   required to coordinate the interactions with the component servers.
   Each component server maintains only the state needed to execute
   its component, and to manage interactions with components below
   it. A component server does not know about the complete service
   being delivered, and does not know about sibling servers.
   This aspect
   of our model - hierarchical distribution of session state - leads to
   one of the primary benefits of the architecture - ease of
   development. Someone building a new application by reusing existing
   components only needs to manage the high level state for delivery of
   the service. State related to the details of operation of one of the
   components - timings between digits in an IVR server, for example -
   is not relevant to the coordinator, and does not need to be managed.

   The difference between classic RPC or client/server interactions and
   the interactions between the components here is that the relationship
   between the components represents a long lived association (i.e., a
   session), during which a session level service is being provided,
   rather than a simple input/output service. As an example, consider a
   component providing continuous real-time text-to-speech translation
   services. The application coordinator that wishes to use this service
   acts as a client, initiating a request for service to the server (in
   this case, the TTS server). However, the text is not passed as an
   "argument" to the TTS server; it is continually streamed for the
   duration of an active session, and the TTS server continuously
   streams back the speech version of the text, which is the output of
   the service.

   Another example is a voice messaging server. The messaging server
   provides basic services like message drop, message retrieve, and
   message management. Each of these represents a procedure that can be
   executed by a client component. To drop a message, for example, the
   client component would initiate a session with the messaging server.
   A prompt would be played over that session, something like "please
   record your message for Joe now", and then the messaging server takes
   the media input stream, records it, and saves it. When it is done,
   the session is terminated.
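
   The hierarchical distribution of session state described above can
   be sketched in a few lines. This is a minimal illustration, not a
   prescribed design: the class names, the inter-digit timeout value,
   and the example caller URI are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IvrComponentSession:
    """State private to a (hypothetical) IVR component server."""
    inter_digit_timeout_s: float = 4.0          # detail the controller never sees
    digits: List[str] = field(default_factory=list)

    def collect(self, digit: str) -> None:
        self.digits.append(digit)

    def final_result(self) -> str:
        return "".join(self.digits)


@dataclass
class ControllerSession:
    """High level state the controller keeps for one service instance."""
    caller: str
    card_number: Optional[str] = None

    def on_component_done(self, result: str) -> None:
        # Only the coarse-grained result crosses the component boundary.
        self.card_number = result


ivr = IvrComponentSession()
for d in "4111":
    ivr.collect(d)

ctrl = ControllerSession(caller="sip:alice@example.com")
ctrl.on_component_done(ivr.final_result())
print(ctrl)
```

   The controller holds no digit-level or timing state; replacing the
   IVR component with a different implementation would leave
   ControllerSession unchanged.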

                       +-----------+
                       |           |
                       |           |
                       |    AS     |
                       |coordinator|
                       |           |
                       |           |
                       +-----------+
               SIP, --  \         ---
             RTP? --     \           ---- SIP,
                --        \              ---- RTP?
              --           \ SIP,            ----
            --              \ RTP?               ----
          --                 \                       --
   +----------+         +-----\----+         +----------+
   |          |         |          |         |          |
   |          |         |          |         |          |
   |          |         |          |         |          |
   |   ASC    |         |   ASC    |         |   ASC    |
   |          |         |          |         |          |
   |          |         |          |         |          |
   +----------+         +----------+         +----------+
                                 \             /
                         /        \ SIP,      /
                        / SIP,     \ RTP?    /
                       / RTP?       \       / SIP,
                      /              \     / RTP?
                     /            +----------+
            +----------+          |          |
            |          |          |          |
            |          |          |          |
            |          |          |   ASC    |
            |   ASC    |          |          |
            |          |          |          |
            |          |          +----------+
            +----------+

                   Figure 2: Decoupled Architecture

   In some cases, the session may require a "side channel" over which
   intermediate data is passed, needed to control the session
   interactions from that point forward. IVR is the classic example. In
   some cases the coordinating application server can kick off the IVR
   script, and then only get back the final result - a menu option, a
   credit card number, or what have you. In other cases, the
   coordinating component may need to get intermediate results, so that
   it can guide the operation of the IVR moving forward. This requires a
   companion control channel that provides data output from the
   component server back to the client, and then returns further high
   level instructions from the client back to the server.

   There is a thin line in some cases between this control channel and
   the tightly coupled interactions of a master-slave MGCP relationship.
   However, the loosely coupled nature of the interaction can be
   maintained by using coarse-grained data passing over a distributed
   client-server protocol, such as HTTP or CORBA.
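
   As one hedged illustration of coarse-grained data passing over HTTP,
   the sketch below builds (but does not send) a POST request that a
   component server might use to report an intermediate result to the
   controller. The callback URL, the JSON encoding, and the field names
   are assumptions made for this example; this document does not define
   a format for the control channel.

```python
import json
import urllib.request


def build_intermediate_report(callback_url, session_id, event, value):
    """Return a POST request carrying one coarse-grained result.

    The payload layout is hypothetical; only whole results (for example
    a chosen menu option) are reported, never per-digit events.
    """
    body = json.dumps(
        {"session": session_id, "event": event, "value": value}
    ).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_intermediate_report(
    "http://controller.example.com/sessions",  # hypothetical controller URL
    "abc123",
    "menu-choice",
    "2",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

   Sending the request (for example with urllib.request.urlopen) is
   left out so that the sketch runs without a controller to talk to.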

   From this architectural description, it is clear that a client-server
   session establishment protocol, which allows for passing of
   parameters that describe the service, is the ideal mechanism to
   coordinate the interaction between components. Clearly, SIP is
   perfect in such a role.

   Following the example above, an IVR application server component
   would be completely responsible for the execution of the IVR piece of
   an application, including both the media and the signaling call
   control. It would know the menus to maneuver through, and it would
   know when to collect digits and present prompts. The coordinating
   application server would request service from the IVR component by
   initiating a call to it (possibly using third party call control [5]
   to direct the media directly to the IVR without passing through
   itself; more on that below). The application component takes the
   media from the incoming call, running it against the IVR application.
   When the IVR is done, the final result - in this case, the credit
   card number - is passed back to the coordinating AS, possibly through
   an HTTP POST operation. The coordinating AS then terminates the call
   with the IVR.

4.2 Benefits of the Decoupling

   This decoupled interaction between components provides several
   important benefits:

   Separation of Businesses. The decoupled interaction between
        components is needed to allow the components to be provided
        by separate providers. Master-slave control interactions do
        not work well across vendors, let alone across
        service providers. By allowing separate providers to offer
        the components, new businesses can be created that
        specialize in the piece they are providing.

   Rapid Development.
        Since the components can easily be placed in
        separate boxes from separate vendors, or even in separate
        providers, we achieve a separation of function that allows
        each piece to be developed in complete isolation. We also
        get reuse of components for new applications. This allows
        for rapid service creation.

   Better Interoperability. It can be argued that the decoupled
        interaction between components is more likely to be
        interoperable than a master-slave mechanism. This is
        largely based on the assumption that a master-slave
        interaction requires a lot more messaging and exchange
        between the components, whereas the decoupled client-server
        mechanism requires less. The less information that passes
        back and forth, the easier it is to interoperate.

   Architectural Flexibility. The loose coupling of the components
        means that a server, such as a conferencing application or
        IVR, need not be implemented as an actual server. Rather,
        complex networks of components, with proxies providing
        routing of requests in arbitrarily complex ways, can be
        built to provide a service. Since the interaction is SIP,
        the application controller accessing the service doesn't
        know whether it is communicating with a single server or a
        network built in this fashion. That allows ASPs flexibility
        in how they can construct their service networks.

   Reliability. The loose coupling of the components improves
        reliability compared to a tight coupling. That is because
        the system can probably still continue to operate after the
        failure of a single component. For example, if a TTS server
        fails during a session, an application server can use a
        server from a completely different provider, or it can use
        a media server instead, converting the text to VoiceXML
        scripts. Depending on the service, the TTS component could
        possibly be skipped altogether.
        Note, however, that the
        reliability is still not as good as a monolithic system.
        Having ten identical boxes each running a complete set of
        services is better than spreading the service across ten
        boxes, where the failure of some subset causes total
        failure.

5 Architecture for the Interfaces

   Up to now, we have been fairly vague about exactly how such an
   interface would work in practice. We have argued that it is SIP, but
   not described in detail how SIP is actually used for this function.

   SIP (along with SDP [6]) clearly provides the facilities for
   initiation and termination of the sessions between the controller and
   components, and for specification of the media addresses to and from
   which media is sent. However, SIP leaves a lot of flexibility in
   terms of naming, additional message content, session duration, and
   control. Here, we discuss each of these in turn.

5.1 Naming

   In any remote procedure call system, a key component is naming. The
   identified resource must be properly addressed so that the underlying
   message passing system can properly determine where the request
   should go.

   The same is true in SIP. Messages are routed based on the request
   URI, as it serves as the primary naming tool for routing messages. In
   its application to AS component interaction, the request URI serves
   as the primary tool to identify the resource to which the session is
   addressed. A critical piece of defining a session level service that
   can be accessed by SIP is defining the naming of the resources within
   that service. This point cannot be overstated.

   As an example, consider a conferencing service. In this case, the
   primary resource that is being accessed is a mixing service. We would
   like to have a way to identify which conference is being addressed by
   any given call. All calls for the same conference are bridged
   together.
   By default, the bridging would operate in an N-1
   configuration (that is, each user receives a mixed media stream that
   represents all of the users other than themselves). Conferences can
   be set up in two ways - ad-hoc, which are not pre-established at all,
   and exist so long as there is a participant in them, and scheduled,
   where they exist for a certain period of time.

   One might imagine that a conferencing service breaks its URI
   namespace into two pieces - one piece that represents ad-hoc
   conferences, and another that represents scheduled conferences.
   Ad-hoc conferences are addressed using a URI whose user part ends in
   ".adhoc" at conferences.com. All users who initiate a call to the URI
   sip:as9dahas89.adhoc@conferences.com are bridged together. The
   conference state is established when the first call to a conference
   occurs, and destroyed when the last call terminates. In contrast,
   scheduled conferences might be named with a user part ending in
   ".scheduled" at conferences.com, so that a call to
   sip:conference12.scheduled@conferences.com allows a user access to a
   pre-arranged conference.

   There are several benefits to naming ad-hoc conferences vs. scheduled
   ones in this fashion. The primary one is convenience; the name makes
   the type of conference apparent to any entities that are
   interested. Secondly, it can avoid certain misconfigurations. Let's
   say there are no conventions for naming of ad-hoc versus scheduled
   conferences. I am asked to join a scheduled conference
   (conf2321@conferences.com), but I mis-type the URL in my browser
   (conf2123@conferences.com). I don't want this to drop me into an
   ad-hoc conference where I sit for 15 minutes thinking others will
   eventually join. If ad-hoc conferences are named differently, a call
   to conf2123@conferences.com is never going to be an ad-hoc
   conference, and so my call will be rejected immediately.
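
   The naming convention above lends itself to a simple check on the
   request URI. The following sketch is illustrative only: the function
   name is hypothetical, the parsing ignores ports and URI parameters,
   and the ".adhoc" and ".scheduled" suffixes are taken from the
   example URIs in the text rather than from any standard.

```python
from typing import Optional, Tuple

SERVICE_DOMAIN = "conferences.com"  # example domain used in the text


def classify_conference_uri(uri: str) -> Optional[Tuple[str, str]]:
    """Return (kind, conference_id), or None if the URI does not
    follow the ad-hoc/scheduled naming convention."""
    scheme, colon, rest = uri.partition(":")
    if not colon or scheme.lower() != "sip":
        return None
    user, at, host = rest.partition("@")
    if not at or host.lower() != SERVICE_DOMAIN:
        return None
    conf_id, dot, kind = user.rpartition(".")
    if not dot or not conf_id or kind not in ("adhoc", "scheduled"):
        return None
    return kind, conf_id


for uri in (
    "sip:as9dahas89.adhoc@conferences.com",
    "sip:conference12.scheduled@conferences.com",
    "sip:conf2123@conferences.com",  # no suffix: rejected, never ad-hoc
):
    print(uri, "->", classify_conference_uri(uri))
```

   A URI that matches neither form yields None, which corresponds to
   the immediate rejection described above rather than the silent
   creation of an ad-hoc conference.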

For an application server to use a conferencing service as a
component, the AS must know the URI namespace conventions used to
identify the various conferences. The above information, for example,
would be provided by the conferencing provider to its customers.

This same concept of using the request URI as a service identifier
has been described in detail for voicemail systems [7].

The great advantage of using the request URI as a service identifier
comes because of the combination of two facts. First, unlike in the
PSTN, where numbers are limited, URIs come from an infinite space.
They are plentiful, and they are free. Secondly, the primary function
of SIP is call routing through manipulations of the request URI. In
the traditional SIP application, this URI represents people. However,
the URI can also represent services, as we propose here. This means
we can apply the routing services SIP provides to routing of calls to
services. The result - the problem of service invocation and service
location becomes a routing problem, for which SIP provides a scalable
and flexible solution. Since there is such a vast namespace of
services, we can explicitly name each service in a finely granular
way. This allows the distribution of services across the network. In
the conferencing example above, since we have separated the names of
ad-hoc conferences from scheduled conferences, we can program proxies
to route calls for ad-hoc conferences to one set of servers, and
calls for scheduled ones to another, possibly even in a different
provider. In fact, since each conference itself is given a URI, we
can distribute conferences across servers, and easily guarantee that
calls for the same conference always get routed to the same server.

This is in stark contrast to conferences in the telephone network,
where the equivalent of the URI - the phone number - is scarce.
An entire conferencing provider generally has one or two numbers.
Conference IDs must be obtained through IVR interactions with the
caller, or through a human attendant. This makes it difficult to
distribute conferences across servers all over the network, since the
PSTN routing only knows about the dialed number.

Care must be taken not to push this concept too far. Naming of
services should not become so fine-grained that all parameters
associated with the service simply become encoded into the request
URI as well. The right level of granularity can be determined based
on routing. If a service is represented by multiple URLs, but
requests for each of those URLs are always routed in the same way,
the naming is too fine-grained.

5.2 Additional Message Content

Sometimes, connecting to a service requires the service to know
additional information that is not appropriate for the request URI.
As an example, the conferencing server might need to know the name,
address, phone number, company, and email address of the
participants, which it converts to speech and uses as an announcement
when the user joins and leaves the bridge.

This kind of content can easily be carried in the body of the SIP
messages used to establish and manage the session with the service.
For simple data, SIP headers may be appropriate. In the conferencing
example above, the conferencing service might mandate that a vCard be
attached to all INVITEs, in order to provide that information.

When existing data formats (like a vCard) are not defined to provide
the needed information, it can be encoded in an XML document, for
example, and carried along in the INVITE.

Each service would need to specify the content that it needs in order
to process the session invitation.
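
As an illustration of carrying such content, the following is a
minimal sketch of an INVITE whose body holds both the session
description and a vCard. It assumes a multipart/mixed body and the
text/vcard media type; neither choice is mandated by the text above.
The function name build_invite is hypothetical, and headers such as
Via, Contact, and Max-Forwards are omitted for brevity, so the output
is not a complete SIP request.

```python
import uuid

CRLF = "\r\n"


def build_invite(request_uri: str, from_uri: str, sdp: str, vcard: str) -> bytes:
    """Assemble an abbreviated INVITE carrying SDP plus a vCard."""
    boundary = "boundary-" + uuid.uuid4().hex
    # Assumption: the service asked for a vCard alongside the SDP.
    parts = [("application/sdp", sdp), ("text/vcard", vcard)]
    body = ""
    for content_type, content in parts:
        body += f"--{boundary}{CRLF}"
        body += f"Content-Type: {content_type}{CRLF}{CRLF}"
        body += f"{content}{CRLF}"
    body += f"--{boundary}--{CRLF}"
    body_bytes = body.encode("utf-8")

    head = [
        f"INVITE {request_uri} SIP/2.0",
        f"To: <{request_uri}>",
        f"From: <{from_uri}>;tag={uuid.uuid4().hex[:8]}",
        f"Call-ID: {uuid.uuid4().hex}",
        "CSeq: 1 INVITE",
        f'Content-Type: multipart/mixed;boundary="{boundary}"',
        # Content-Length counts the bytes of the body only.
        f"Content-Length: {len(body_bytes)}",
    ]
    return (CRLF.join(head) + CRLF + CRLF).encode("utf-8") + body_bytes
```

The request URI still selects the service; the body only supplies
parameters that do not belong in the name.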

5.3 Session Duration

The duration of the session that is established with a server depends
entirely on the nature of the service. For example, for a conference,
the initiation of the call begins the mixing service for that user,
and the termination of the call results in that user leaving the
conference.

For an IVR service, the INVITE request begins the interaction with
the service. Once the INVITE transaction completes, the IVR would
play out the initial prompt, and begin collecting data from the
caller. How the IVR terminates depends on its usage. When the
initiator of the service is an application server, we would argue
that in almost all cases, it should be the responsibility of the
controller to determine when the interaction is complete (and thus
terminate the call with a BYE). However, when the initiator is an end
user, the IVR will usually be the one to terminate the session. We
discuss IVR interactions in more detail below in Section 6.1.

5.4 Third Party Call Control

Third party call control, as defined in [5], plays an integral role
in this architecture.

In many cases, the controller orchestrating a service wishes to
invoke the resources of an IVR or conferencing server. However, the
AS is not the actual source of the media that drives the IVR. The
source of the media is the end user that initiated the call to the
controller. What is needed, then, is a way for the AS to call the IVR
or conferencing server, and pass it the media information of the end
user. Similarly, the media address of the IVR server (described in
the SDP from the media server) needs to be passed to the end user
that initiated the call. By using third party call control, an
application server can direct the media of the end user to and from
the components that it is using to provide the application.
Once one service is complete, the controller can move the media to a
different component. SIP re-INVITEs also allow the controller to
request the caller to send multiple media streams, one, for example,
containing only DTMF and tones. This allows for DTMF control of
services without carrying DTMF in SIP itself.

Figure 3 shows how we use a component server to collect DTMF input
for a service; specifically, a simple (and perhaps useless) service
that allows a caller to press '1' to indicate that they want to put
the call on hold. The service is, in principle, useless, since hold
is so common that the end user can do this themselves. However, it is
useful for example purposes.

The caller sends an INVITE request to the called party (1), which is
routed to a server handling calls for the domain of the called party.
In this case, the server is an application server. The AS decides
that it would like to offer the caller advanced services based on
DTMF events sent mid-call. As a result, it decides to invoke the
services of a media server component. The AS will use third party
call control mechanisms to have the caller send any DTMF related
media to the media server, in addition to sending its media to the
called party. To accomplish this, the AS sends an INVITE to the media
server (2), with an indication that the media stream is send only
(this is accomplished using the sendonly SDP attribute [6]). The
request URI of this INVITE binds that session to a service that looks
for any in-band DTMF, and reports it back to the AS through an HTTP
GET or POST operation. In Section 6.1, we show how this is easily
done with a VoiceXML driven IVR server.

The media server responds with a 200 OK (3) that contains SDP with
the address to which the media should be sent. The application server
ACKs this response (4), and holds on to that SDP.
The AS then proxies the original INVITE request (5), and the called
party answers the call (6). This acceptance is proxied upstream (7),
and then acknowledged (8,9). At this point, media is flowing between
the caller and called party (10). The next step for the AS is to get
a stream of DTMF digits to flow from the caller to the media server.
To do this, it sends a re-INVITE to the caller (11). This re-INVITE
contains the same SDP as the response (6) from the called party, but
with the addition of a new media line. This media line is audio, and
contains a single codec, the RTP payload format for DTMF and tones
[8]. The connection address and port are from the SDP returned from
the media server. This tells the caller to send an additional media
stream to the media server, using only the DTMF codec. The result is
that RTP packets are sent only when the caller presses a button on
the phone.

The caller accepts this re-INVITE (12), and the AS acknowledges it
(13). Now, DTMF only RTP is flowing between the caller and the media
server (14). At some point later, the caller presses the 1 key
(which, for example, might imply call hold). This is processed by the
media server, and the result is an HTTP request being sent to the AS
(15). The HTTP request contains the value of the collected digit. The
AS receives this request, answers it (16), and knows that the user
keyed in a 1. Recognizing this input as call hold, the AS sends a
re-INVITE to the called party (17). The SDP in this re-INVITE is the
same as the SDP in the original INVITE from the caller (1), except
that the connection address is set to zero, indicating call hold. The
called party accepts the re-INVITE (18), and this is ACKed by the AS
(19). The called party is now on hold.

Note that the call flow remains unchanged if the stimulus were based
on voice recognition instead of DTMF.
The only difference would be that a general purpose codec, such as
G.711, would be used instead of RFC 2833 for communications between
the caller and the media server. This achieves an important
unification. Independent of the type of stimulus - voice, DTMF, or,
in fact, direct HTTP requests from the caller (if they were using a
softphone), the service execution code is unchanged.

Others have proposed that DTMF digits be carried in SIP directly from
the caller to the AS [9,10]. However, this approach does not work
for anything beyond DTMF, while our approach works for DTMF, speech,
and web interfaces. Another drawback of the DTMF-in-SIP approach is
that all entities on the call signaling path will receive any DTMF
digits dialed by the called party. Furthermore, since the caller
doesn't know if there is an entity interested in DTMF, it is required
to send DTMF within SIP messages all the time, even if no entity is
interested.

  Caller       Coordinator       Media Server         Callee
  |                |                  |                |
  |(1) SIP INV     |                  |                |
  |--------------->|(2) SIP INV       |                |
  |                |----------------->|                |
  |                |(3) 200 OK        |                |
  |                |<-----------------|                |
  |                |(4) SIP ACK       |                |
  |                |----------------->|                |
  |                |(5) SIP INV       |                |
  |                |---------------------------------->|
  |                |(6) 200 OK        |                |
  |(7) 200 OK      |<----------------------------------|
  |<---------------|                  |                |
  |(8) SIP ACK     |                  |                |
  |--------------->|(9) SIP ACK       |                |
  |                |---------------------------------->|
  |(10) RTP        |                  |                |
  |....................................................|
  |                |                  |                |
  |(11) SIP INV    |                  |                |
  |<---------------|                  |                |
  |(12) 200 OK     |                  |                |
  |--------------->|                  |                |
  |(13) SIP ACK    |                  |                |
  |<---------------|                  |                |
  |(14) RTP        |                  |                |
  |...................................|                |
  |                |                  |                |
  |                |(15) HTTP GET     |                |
  |                |<-----------------|                |
  |                |(16) 200 OK       |                |
  |                |----------------->|                |
  |                |                  |                |
  |                |(17) SIP INV      |                |
  |                |------------------+--------------->|
  |                |(18) 200 OK       |                |
  |                |<-----------------+----------------|
  |                |(19) SIP ACK      |                |
  |                |------------------+--------------->|
  |                |                  |                |

  Figure 3: Call Flow for DTMF Enabled Hold Service

There have been proposals for adding a subscription/notification
mechanism on top of this to avoid this problem. However, this further
complicates the system by adding a requirement for the caller to
support a subscription and notification service just for DTMF.

Our approach fits well within the existing SIP framework, and
requires no additional work from the end users. Furthermore, it
transparently supports multiple application server components
receiving DTMF. This is because an AS is able to send a DTMF stream
to a component by adding a new media line to the list of media
streams being sent by the caller. The list of media streams being
sent by the caller is observed by each AS through the initial INVITE,
along with any subsequent re-INVITEs which might modify it. Consider
the situation with two application servers, A and B, depicted in
Figure 4. The original call setup starts with the caller, flows
through A, then B, then the called party. At some point later, A
sends a re-INVITE (10) to the caller, adding a media stream, just as
described in Figure 3. The SDP in this INVITE will be the same as
provided by the caller in message (1), plus the additional DTMF
stream. Note that this re-INVITE does not pass through B. Now, B
decides to add a media stream for DTMF. So, it sends a re-INVITE
(13). This goes first to A. As far as A is concerned, this re-INVITE
is from the called party. A computes the difference between what it
believes the called party should perceive as the set of media
streams, and what is in the re-INVITE (13).
This difference (the additional DTMF stream added by B) is added to
the SDP that A had sent to the caller previously (10), and the result
is sent in a re-INVITE to the caller (14). This SDP now contains the
media streams meant for the actual called party, along with two DTMF
streams; one for A, and one for B. The caller thus sends DTMF to both
servers.

A further advantage of our approach is that the DTMF can even be sent
using multicast, since it is being sent in RTP rather than as part of
SIP. This allows for tremendous scalability, if needed, in the number
of entities receiving the DTMF streams.

5.5 Side Channels

Side channels are used for passing of events from the application
server components back to the client, and for passing control
commands from the client to the application server component.

Unfortunately, side channels complicate the simple session level
interface between components. It is our belief, at least for the
components described here, that only minimal side channels are
needed. Specifically, the only service below that requires one to be
effective is the IVR service, for which HTTP forms an ideal side
channel. If the side channel becomes so complex as to introduce
extensive synchronization, bandwidth, and transactional issues, the
relationship between the components becomes tightly coupled once
more, and the benefits we are espousing here begin to disappear.

As such, we believe that a reasonable side channel for decoupled
server interactions is defined as follows:

   o The event reporting and control components have no real time
     requirements.

   o Event reporting from the component back to the client
     accessing it is infrequent; specifically, the intervals are
     much larger than the round trip times between the client and
     the component.

   o Control from the client to the component is infrequent;
     specifically, the intervals are much larger than the round
     trip times between the client and component.

   o Event reporting is coarsely granular, so that the client does
     not need to explicitly subscribe to specific events in order
     to avoid being overwhelmed with data.

   o The amount of data passed in both the events and in the
     control is small.

   o There are no requirements for transaction support.

Note that protocols like MGCP and Megaco do not meet these
requirements, as they require tight timing, synchronization, and
explicit subscriptions. HTTP, as used in VoiceXML, however, does meet
these requirements.

6 Patterns for Accessing Components

In this section, we propose a set of patterns that define the
interaction of a controller with an application server component.
These patterns manifest themselves in the description of the service
invoked when a session is initiated, a discussion of the naming
conventions of the service, and a description of any back channel
used for control and data passing.

  Caller          A                B              Callee
  |               |                |                 |
  |(1) SIP INV    |                |                 |
  |-------------->|(2) SIP INV     |                 |
  |               |--------------->|(3) SIP INV      |
  |               |                |---------------->|
  |               |                |(4) 200 OK       |
  |               |(5) 200 OK      |<----------------|
  |(6) 200 OK     |<---------------|                 |
  |<--------------|                |                 |
  |(7) SIP ACK    |                |                 |
  |-------------->|(8) SIP ACK     |                 |
  |               |--------------->|(9) SIP ACK      |
  |               |                |---------------->|
  |(10) SIP INV   |                |                 |
  |<--------------|                |                 |
  |(11) 200 OK    |                |                 |
  |-------------->|                |                 |
  |(12) SIP ACK   |                |                 |
  |<--------------|                |                 |
  |               |                |                 |
  |               |(13) SIP INV    |                 |
  |(14) SIP INV   |<---------------|                 |
  |<--------------|                |                 |
  |(15) 200 OK    |                |                 |
  |-------------->|(16) 200 OK     |                 |
  |               |--------------->|                 |
  |               |(17) SIP ACK    |                 |
  |(18) SIP ACK   |<---------------|                 |
  |<--------------|                |                 |
  |               |                |                 |

  Figure 4: Multiple Application Servers and DTMF

6.1 Interactive Voice Response Services

We have touched upon the basics of the interaction between a
controller and an IVR server. The controller initiates a call to the
server, the server executes some kind of IVR service, and data is
returned to the controller. A number of questions still need to be
answered, however:

   1. How is the IVR service identified?

   2. How can the controller specify the details of the dialog
      the IVR carries out with the user?

   3. How does data from the IVR get passed back to the
      controller?

   4. How is intermediate control performed (e.g., to interrupt
      or reset IVR based on some event at the controller, in this
      case)?

We believe that VoiceXML [11] represents the ideal partner for SIP in
the development of distributed IVR servers. VoiceXML is an XML based
scripting language for describing IVR services at an abstract level.
VoiceXML supports DTMF recognition, speech recognition,
text-to-speech, and playing out of recorded media files.
The results of the data collected from the user are passed to a
controlling entity through an HTTP form POST operation. The
controller can then return another script, or terminate the
interaction with the IVR server.

From a naming perspective, the primary issue is how a request URI is
associated with a script to invoke when the call is answered. We see
three primary mechanisms:

   1. There is a one-to-one binding of the address in the request
      URI to a script to execute. These bindings are published by
      the provider of the IVR service.

   2. The initial script to execute is actually carried as
      content in the body of the SIP INVITE request. The request
      URI indicates that the desired service is execution of
      content in the request (e.g., sip:executebody@servers.com).

   3. The initial script to execute is fetched by the VoiceXML
      server; the URL to fetch it from is passed in the SIP
      INVITE message that initiates the IVR session. This can be
      accomplished either with the application/uri MIME type as a
      body, or using the new *-Info headers [12] which provide
      references to content to fetch.

We believe that the third approach is probably the best one. SIP is
not the ideal transfer mechanism. Passing a URI allows a far better
transfer tool, namely HTTP, to be used to actually fetch the script
back from the controller.

HTTP is then also used to pass back form data from the IVR to the
controller. The results of the HTTP POST can also contain additional
VoiceXML scripts to execute. It represents the side channel discussed
in Section 5.5.

Note that in some cases, there need to be interactions between the
HTTP server that receives the HTTP POST requests, and the controller
that initiates and terminates the SIP sessions with the IVR. This is
the case when the data collected by the VoiceXML server is used to
guide signaling behavior.
For example, a pre-paid calling application might use the IVR to
collect the user's PIN code. The PIN code is looked up, and the
number of minutes remaining is determined. This amount of time must
be known to the SIP controller, as it will need to hang up the call
once this time expires. Some kind of session sharing mechanism is
needed between the SIP controller and the HTTP server in this case.

Figure 5 shows the interaction between an application server acting
in a coordinating role, and an IVR server component. In this example,
consider an application where the user makes a call, but the system
needs additional information to determine where to forward it. The
user is prompted for the information, and once the name of the
desired called party is obtained and looked up, the call is completed
to the requested destination.

First, in step (1), the caller sends an INVITE to the controller. The
controller then creates a brand new call to the IVR application
server (2), using the SDP from the INVITE in (1). The IVR accepts the
call (3), and the SDP from that acceptance is returned in a 183
response to the caller (4). The call to the IVR is acked (5), and now
a media stream exists between the caller and the IVR server. The IVR
server, in step (6), fetches the initial VoiceXML script to execute,
which is returned by the controller (7). The prompts are played to
the caller, and the identity of the called party is collected. This
is passed to the controller through another POST (8), which returns
an empty VoiceXML script (9)[1]. With the interaction with the IVR
server complete, the controller hangs up with it (10 and 11). The
information the controller got in the POST (8) is used to determine
the next hop SIP server, and the initial INVITE is proxied there
(12).

It's important to observe that all call control related to executing
the service lives within the controlling application server.
The IVR application server deals strictly with the media component.
This division of work, as we have discussed above, allows for
independent evolution of the call control and media components of
services. For example, if the desired called party did not have a
reachable SIP address, but they did have an email address, the call
could be redirected to a mailto URL. To support this twist, only the
controlling application server code need change. The media component
remains completely and totally unchanged.

_________________________
[1] Note that it is unusual for an empty script to be returned; this
    is because we want the AS to maintain control of the call
    signaling.

Readers familiar with VoiceXML will observe that VoiceXML almost
achieves this perfect separation. It lacks any call control excepting
two tags - for call transfer and call termination. These tags are
clearly not sufficient for many services. Our architecture would
argue that instead of adding call control to VoiceXML, all control
should be removed, so that call control can be left to other server
components.

The separation of the control from the media component also allows
the media component to change without affecting the control
component. In fact, because of the HTTP interface between the two,
the media server can be completely removed and replaced with a normal
web browser, with only a small effect on the call control component.
As an example, if the calling party was coming from a web enabled SIP
client (known by the presence of the Accept header with text/html as
a value in the INVITE request), the controller could return an HTTP
URL in the 183 with an actual web form that gets filled out by the
caller. This would be instead of using an IVR server to collect the
data.
Interestingly, the representation of the collected data is identical
in both cases. Both use an HTTP POST operation to send the data to
the controller. This allows the data collection code in the
controller to be unified across both voice access and web access.

6.2 Conferencing Servers

Conferencing servers today vary in type and complexity. Some are
dialup only, supporting IVR access. Others support ad-hoc
conferencing with web interfaces. Others still support three way
calling as part of a PBX system.

We observe once more that all of these conferencing "servers" are
really conferencing applications that are just bundled as a server.
These conferencing applications can be decomposed into components in
exactly the way we have described above. At the core of each of these
conferencing applications is a mixing service. This service is
responsible for taking N audio or video streams, mixing them
according to some matrix, and returning the mixed stream to each
participant.
Issues such as conference policy, provisioning of conferences, and
authentication are all completely separate and outside of this basic
mixing component.

  Caller                  Controller                 IVR Server
  |                         |                          |
  | INVITE (1)              |                          |
  |------------------------>|                          |
  |                         | INVITE (2)               |
  |                         |------------------------->|
  |                         | 200 OK (3)               |
  |                         |<-------------------------|
  | 183 (4)                 |                          |
  |<------------------------|                          |
  |                         | ACK (5)                  |
  |                         |------------------------->|
  | MEDIA                   |                          |
  |----------------------------------------------------|
  |                         |                          |
  |                         | HTTP GET (6)             |
  |                         |<-------------------------|
  |                         | HTTP 200 OK (7)          |
  |                         |------------------------->|
  |                         |                          |
  |                         | HTTP GET (8)             |
  |                         |<-------------------------|
  |                         | HTTP 200 OK (9)          |
  |                         |------------------------->|
  |                         |                          |
  |                         | BYE (10)                 |
  |                         |------------------------->|
  |                         | 200 OK (11)              |
  |                         |<-------------------------|
  |                         |                          |
  |                         | INVITE (12)              |
  |                         |--------------------------------------->
  |                         |                          |

  Figure 5: Interaction of App Server and IVR Component

For this reason, we argue that a large variety of conferencing
applications can be easily constructed by having the mixing service
as a separate application server component.

What does the interface to such a mixing server look like? For the
call control interface, users would join a conference by calling the
server. The server would answer the call, thus appearing as a SIP
UAS. The media sent from the user is mixed with other users in the
conference, and the media sent back to the user is the mixed stream.
The user can leave the conference by sending a BYE to the server, and
the server can kick a user out of the conference by sending the user
a BYE.

Since the primary resource being accessed is a conference, it is no
surprise that we would argue that the request URI of an incoming call
defines the conference a user is mixed in to. In other words, all
users that call the server with the same request URI are mixed
together. The conferences are not defined by Call-ID or other SIP
header fields. Using the request URI has tremendous advantages from a
routing and naming perspective, as we have discussed more generally
above.

It is not necessary (in fact, not even advisable) for the
conferencing server to require that the URIs that define the
conference be set up ahead of time. Conference lifecycles in the
mixing server are very simple. Conference state is created when the
first call arrives for a particular URI, and ends when the last user
with a call to that URI hangs up. This model allows the same mixing
server to support both ad-hoc conferences and pre-arranged
conferences. Pre-arranged conferences are handled through policy
and control in a coordinating server external to the mixing server.
This server lives entirely in the call control and signaling plane,
not in the media plane.

SIP (and RTP, of course) alone is not sufficient for complete usage
of a conferencing server. Media mixing policies (effectively, the
matrix indicating which users hear which other users, and with what
relative volumes) need to be set. Information on the status of the
conference, such as the identity of the current speaker, number of
users currently being mixed, etc., may need to be reported back to
some control entity. These represent the requirements for the side
channel. In IVR servers, the side channel used HTTP. We argue that to
unify these concepts, HTTP is ideally suited here as well.
Updates to the mixing policy can be made through HTTP POST requests
against the mixing server, using well defined interfaces (possibly
SOAP). Similarly, information about the status of the conference can
be obtained through HTTP GET operations against the mixing server.
The side channel here meets the requirements outlined in Section 5.5;
it is not real time in nature, does not require transactional
support, and passes relatively infrequent data and control. In fact,
such a side channel will often not be needed at all. In 90% of cases,
the default mixing policy (the so-called N-1 matrix, where each user
hears everyone but themselves, all at equal volume, with no floor
control) will suffice.

Fans of the INFO method [13] will ask: instead of using HTTP for the
control, why not INFO? This would eliminate the need for an
additional protocol, after all. The answer is the same as to why SIP
should not simply replace HTTP - the two have different strengths and
weaknesses. SIP is a poor data transfer protocol. It has insufficient
support for transfer of medium to large data sets, which is important
here. Furthermore, we may want to allow an entity separate from the
one that initiated the session to control the session. Usage of INFO
would only work from the same device (because of the sequence
numbering).

In the next few sections, we show how this basic application server
component can be used, along with a controller and other components,
to build more complex conferencing applications.

6.2.1 Web Scheduled Conference Services

In this application, we'd like a conferencing service where all
conferences must be pre-scheduled. The pre-scheduling is done through
a web page.
At the page, the user will enter the start time of the conference (a
stop time is not mandatory), the maximum number of attendees, and the
identities of the attendees (if known). Once the form is submitted,
the server returns a SIP URL representing the conference.

To implement this, we use a coordinating application server that has
a SIP and HTTP interface, along with the mixing application server
just described.

Figure 6 shows a call flow for this service. A web client is first
used to submit the information. Let us suppose a simple case where
the conference can have up to two participants, and the conference
starts immediately. The HTTP POST representing the form data is sent
to the controller (1). It stores the information for the conference
in a local data store, and chooses a SIP URL for the conference. This
URL can be anything, so long as it is different from any URLs handed
out so far by the controller. The URL is returned to the web client
in step (2). As an additional convenience feature, the URL could be
emailed to the participants. This would require the controller to
have an SMTP interface, in addition to HTTP and SIP. Note that this
SIP URL points to the controller, NOT the mixing server.

A few moments later, the first participant calls in using a SIP
INVITE (3). The call is routed to the controller. It checks the
conference ID. It finds that the policy permits up to two
participants (not a practical example, but it simplifies the call
flow). It stores data indicating that one participant has now joined,
and then proxies the INVITE request in step (4) to the mixer. The
request URI in this request will have the same user part as (3), but
the host part now represents the mixer.
The mixer receives the INVITE, creates the initial conference state
(as this is the first call for that URL), and returns a 200 OK (5),
which is forwarded to the caller (6), and then ACKed (7 and 8).

In step (9), the second caller calls in. The controller sees that
only one participant is on the call so far, so the second call is
accepted. The controller stores the fact that there are now 2
participants, and proxies the INVITE (10). The INVITE is accepted by
the mixer (11), and the response forwarded to the second caller (12),
and then ACKed (13 and 14). The two participants A and B can now hear
each other.

A third caller then calls in (15). The controller checks its records,
and notices that this conference is now full. So, it rejects the
INVITE (16), which is acknowledged (17).

The astute reader will observe that, strictly speaking, the HTTP
server does not really need to be co-resident with the SIP server in
the controller. The initial conference setup can be stored in a
database by a web server, and the controller can simply read this
database. However, in more complex cases, we may wish to have web
access to learn dynamic information about the conference as it
progresses (for example, which users are in the conference). For this
kind of dynamic session state, using a shared database between
components is cumbersome. Rather, an integrated HTTP/SIP server is
much better suited, where integrated implies only that it has built
in mechanisms for session state sharing between the SIP and HTTP
components.

For this simple conferencing service, it was sufficient for the
controller to act as a proxy. That's because it does not need to
forcibly kick anyone out of the conference once they are in. To
support that kind of functionality, third party call control is
needed. Let us examine a more complex service in the next section.
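
The controller's admission decision in this call flow can be sketched
in a few lines of Python. This is only an illustrative sketch under
stated assumptions, not part of the proposed architecture: the class
and method names, the example.com host names, and the 404 response
for an unknown conference are all hypothetical. The 500 response
mirrors step (16) of Figure 6. No SIP stack is involved; on_invite()
only returns what a proxy would do with the request.

```python
from dataclasses import dataclass, field


@dataclass
class Conference:
    """Per-conference state kept by the controller (hypothetical layout)."""
    max_participants: int
    participants: set = field(default_factory=set)


class Controller:
    """Toy admission logic only; no SIP stack or network I/O is involved."""

    def __init__(self, controller_host, mixer_host):
        self.controller_host = controller_host
        self.mixer_host = mixer_host
        self.conferences = {}  # user part of the conference URL -> Conference

    def schedule(self, conf_id, max_participants):
        # Corresponds to the HTTP POST in step (1). The returned URL points
        # at the controller, not at the mixer.
        if conf_id in self.conferences:
            raise ValueError("conference URL already handed out")
        self.conferences[conf_id] = Conference(max_participants)
        return f"sip:{conf_id}@{self.controller_host}"

    def on_invite(self, request_uri, caller):
        # Returns ("proxy", new_request_uri) or ("reject", status_code).
        user = request_uri.split(":", 1)[1].split("@", 1)[0]
        conf = self.conferences.get(user)
        if conf is None:
            return ("reject", 404)  # assumption: unknown conference
        if len(conf.participants) >= conf.max_participants:
            return ("reject", 500)  # Figure 6 shows "500 Full" in step (16)
        conf.participants.add(caller)
        # Same user part as the incoming request; host part is now the mixer.
        return ("proxy", f"sip:{user}@{self.mixer_host}")
```

With a two-party conference, callers A and B are proxied to the mixer
and caller C is rejected, matching steps (3) through (17).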

  |     |     |     |  (1) HTTP POST  |                     |
  |---------------------------------->|                     |
  |     |     |     |  (2) 200 OK     |                     |
  |<----------------------------------|                     |
  |     |     |     |                 |                     |
  |     |     |     |  (3) INVITE     |                     |
  |     |---------------------------->| (4) INVITE          |
  |     |     |     |                 |-------------------->|
  |     |     |     |                 | (5) 200 OK          |
  |     |     |     |  (6) 200 OK     |<--------------------|
  |     |<----------------------------|                     |
  |     |     |     |  (7) ACK        |                     |
  |     |---------------------------->| (8) ACK             |
  |     |     |     |                 |-------------------->|
  |     |     |     |                 |                     |
  |     |     |     |  (9) INVITE     |                     |
  |     |     |---------------------->| (10) INVITE         |
  |     |     |     |                 |-------------------->|
  |     |     |     |                 | (11) 200 OK         |
  |     |     |     |  (12) 200 OK    |<--------------------|
  |     |     |<----------------------|                     |
  |     |     |     |  (13) ACK       |                     |
  |     |     |---------------------->| (14) ACK            |
  |     |     |     |                 |-------------------->|
  |     |     |     |                 |                     |
  |     |     |     |  (15) INVITE    |                     |
  |     |     |     |---------------->|                     |
  |     |     |     |  (16) 500 Full  |                     |
  |     |     |     |<----------------|                     |
  |     |     |     |  (17) ACK       |                     |
  |     |     |     |---------------->|                     |
  |     |     |     |                 |                     |

 Web    A     B     C            Controller               Mixer

Figure 6: Web Scheduled Conference Services

6.2.2 Web Scheduled, IVR supported, Time Limited Conference

In this more complex example, we once again wish to use a web
interface to set up the conferences. However, we wish to add a stop
time. If participants are still in the conference as the stop time
approaches, a warning announcement is played 10 minutes before it,
and they are kicked off when it arrives. In addition, when a user
joins the conference, before they are added, they hear an
announcement that states the name of the person that set up the
conference, and what the start and stop times are. They are then
asked to speak their name. Then, they are dropped in. The conference
server then speaks their name, so that everyone knows who just
joined.
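
The two time-driven events in this service (the warning announcement
and the forced end of the conference) can be derived from the stop
time alone. The following is a minimal sketch, assuming the
controller keeps the stop time as a Python datetime; the function
name and the convention of returning None for a timer that has
already passed are hypothetical choices, not something the text
prescribes.

```python
from datetime import datetime, timedelta

# "10 minutes prior" to the stop time, as described in the text.
WARNING_LEAD = timedelta(minutes=10)


def conference_timers(stop_time, now):
    """Return (warning_at, stop_at); an entry is None if already past."""
    warning_at = stop_time - WARNING_LEAD
    return (
        warning_at if warning_at > now else None,
        stop_time if stop_time > now else None,
    )
```

A controller would arm one timer per non-None value: the first
triggers the warning announcement, the second disconnects the
remaining participants.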

This seemingly complex service is very easily constructed by adding
an IVR server as described above. Now, we have a controller, a mixing
server, and an IVR server, all working together to build the service.
Each provides a specific component towards the overall solution, yet
each is an application server in its own right, with both signaling
and media interfaces.

We assume that the web setup is done as above. This time, the stop
time is provided, along with the name of the person setting up the
conference.

The call flow for the initial participant is shown in Figure 7.

The initial participant sends an INVITE, which is forwarded to the
controller. The controller matches the request URI against the
conference that the user wishes to join. The controller recognizes
that it needs to play an announcement. So, in step (2), it initiates
a call to an IVR server. This call is accepted in step (3), and the
resulting SDP is passed back to the UAC in step (4) in a provisional
response. After ACKing the call with the IVR in step (5), the
controller receives an HTTP GET to fetch the root VoiceXML script in
step (6). The controller dynamically generates the VoiceXML script
and returns it in step (7); its content will cause the server to read
out "Welcome to the conference, Bob. The call will start at 10 am,
and end at 11 am." The name of the caller, Bob, is extracted from the
INVITE (1).

Once the prompt has been played, the IVR server prompts the caller
for their name, and the result is recorded into a file. Then, the
VoiceXML server attempts to fetch the next VoiceXML script from the
controller (8). Before responding, the controller reconnects the
media stream from the media server into the conference bridge. To do
this, it first sends an INVITE to the conferencing server, using SDP
indicating send only (9).
The server accepts (10), and the controller ACKs (11). The SDP from
the acceptance (10) is passed in a re-INVITE (12) to the IVR server.
The IVR server then accepts (13) and the controller ACKs (14). Now, a
unidirectional media stream from the IVR server into the conference
bridge is set up. The controller returns the next VoiceXML script
(15), which tells the IVR server to play the previously recorded file
into the conference, announcing the joining user. Once this is done,
the IVR server fetches the next script (16), and gets back an empty
response (17). The controller then disconnects from the IVR server
(18, 19). Finally, the controller re-INVITEs the conference server
(20), updating the SDP to be that from the initial INVITE (1). The
SDP from the acceptance (21) is passed on to the caller (22). Now,
the caller is connected to the mixer as the first user in the
conference.

The second user would join in much the same way.

Approximately 10 minutes before the end of the conference, a timer
fires inside of the controller. It is time to play a warning
announcement into the conference. The call flow for this is shown in
Figure 8.

The basic idea is to initiate a call to the IVR server and mixer,
connect them using third party call control, and then have the IVR
server play the announcement into the conference. The controller then
hangs up.

In step (1), the controller sends an INVITE to the mixer with a
single audio stream on hold (i.e., "empty"). The request URI of the
request is that of the conference. The mixer returns a 200 OK in step
(2), and an ACK is sent in (3). The SDP from (2) is then used in step
(4) to call the IVR server, which answers with its SDP in step (5).
This is used in a re-INVITE (7, 8, 9) to the mixer to update the IP
address and port as that of the IVR server.
The IVR server then fetches the root VoiceXML document from the
controller (11). This document instructs the server to read out some
kind of conference warning - "Warning, your conference will end in 10
minutes". Once this is done, the IVR server fetches the next document
(13), which is empty. The controller then hangs up with both the IVR
server (15) and the mixer (17), disconnecting the IVR server from the
conference.

These examples demonstrate the component model we are proposing. The
mixing component does not have application level intelligence. It has
a call control interface, allowing it to exist anywhere (and be
provided by any ASP service) and yet be a callable resource by other
application server components. By combining a controller with an IVR
server and the mixing server, complex and useful applications can be
constructed in a distributed fashion.

Caller        Controller         IVR Server         Mixing Server
   |               |                  |                   |
   | (1) INVITE    |                  |                   |
   |-------------->| (2) INVITE       |                   |
   |               |----------------->|                   |
   |               | (3) 200 OK       |                   |
   | (4) 183       |<-----------------|                   |
   |<--------------|                  |                   |
   |               | (5) ACK          |                   |
   |               |----------------->|                   |
   |               | (6) HTTP GET     |                   |
   |               |<.................|                   |
   |               | (7) 200 OK       |                   |
   |               |.................>|                   |
   |               |                  |                   |
   |               | (8) HTTP GET     |                   |
   |               |<.................|                   |
   |               | (9) INVITE       |                   |
   |               |------------------------------------->|
   |               | (10) 200 OK      |                   |
   |               |<-------------------------------------|
   |               | (11) ACK         |                   |
   |               |------------------------------------->|
   |               | (12) INVITE      |                   |
   |               |----------------->|                   |
   |               | (13) 200 OK      |                   |
   |               |<-----------------|                   |
   |               | (14) ACK         |                   |
   |               |----------------->|                   |
   |               |                  |                   |
   |               | (15) 200 OK      |                   |
   |               |.................>|                   |
   |               | (16) HTTP GET    |                   |
   |               |<.................|                   |
   |               | (17) 200 OK      |                   |
   |               |.................>|                   |
   |               | (18) BYE         |                   |
   |               |----------------->|                   |
   |               | (19) 200 OK      |                   |
   |               |<-----------------|                   |
   |               | (20) INVITE      |                   |
   |               |------------------------------------->|
   |               | (21) 200 OK      |                   |
   | (22) 200 OK   |<-------------------------------------|
   |<--------------|                  |                   |
   | (23) ACK      |                  |                   |
   |-------------->| (24) ACK         |                   |
   |               |------------------------------------->|
   |               |                  |                   |

Caller        Controller         IVR Server         Mixing Server

Figure 7: Call flow for the initial participant

     |                       |                          |
     | (1) INVITE empty SDP  |                          |
     |---------------------->|                          |
     | (2) 200 OK SDP A      |                          |
     |<----------------------|                          |
     | (3) ACK               |                          |
     |---------------------->|                          |
     |                       | (4) INV SDP A            |
     |------------------------------------------------->|
     | (5) 200 OK SDP B      |                          |
     |<-------------------------------------------------|
     |                       | (6) ACK                  |
     |------------------------------------------------->|
     | (7) INV SDP B         |                          |
     |---------------------->|                          |
     | (8) 200 OK SDP A      |                          |
     |<----------------------|                          |
     | (9) ACK               |                          |
     |---------------------->|                          |
     |                       | (11) HTTP GET            |
     |<-------------------------------------------------|
     |                       | (12) 200 OK              |
     |------------------------------------------------->|
     |                       |                          |
     |                       | (13) HTTP GET            |
     |<-------------------------------------------------|
     |                       | (14) 200 OK              |
     |------------------------------------------------->|
     |                       |                          |
     | (15) BYE              |                          |
     |------------------------------------------------->|
     |                       | (16) 200 OK              |
     |<-------------------------------------------------|
     | (17) BYE              |                          |
     |---------------------->|                          |
     | (18) 200 OK           |                          |
     |<----------------------|                          |
     |                       |                          |

Controller                 Mixer                   IVR Server

Figure 8: Advanced Web Scheduled Conference Service: Warning
Announcement

6.3 Continuous Text-to-Speech

Another example of an application server component is a continuous
Text-to-Speech (TTS) converter.
This kind of service allows a real time text stream (encapsulated in
RTP using the RTP payload format for text [14]) to be received, which
is then converted to speech and returned as an audio stream encoded
using a traditional speech codec, be it G.723.1, G.711, or what have
you.

Like the IVR server and mixing server, the TTS server acts as a user
agent server. It answers incoming calls, and basically mirrors
incoming text back as speech. It continues to do so until the call is
hung up by the initiating client.

A TTS service can be done using VoiceXML with an IVR server, as in
the examples above. However, the difference is that here, the text
stream to be converted is in the data path, not the control path. The
stream is likely to be generated by other entities in the system, not
the controller.

6.3.1 Service Interface

It is likely that the text-to-speech conversion process differs
significantly depending on the language. As such, separate URIs
SHOULD be used for language specific TTS services. Specifically, the
convention sip:-@ is RECOMMENDED. The language tags SHOULD be
selected from the set defined in RFC1766 [15].

One of the unfortunate limitations of SDP is that it is not currently
possible for a single media stream to be composed of separate media
formats in each direction. The text over RTP stream is, in fact,
based on the top level text MIME type (text/t140). As a result, two
media streams are needed for this service - a unidirectional audio
stream and a unidirectional text stream.

First, the client INVITEs the server. The SDP MUST indicate two media
streams. One stream MUST be of type audio. It SHOULD contain the set
of audio codecs acceptable to the client. The stream MUST be marked
as receive-only. The other stream MUST be of type text.
It MUST contain a single codec, which is a dynamic payload type
number bound to text/t140. The stream MUST be marked as send-only.
The 200 OK response from the TTS server that accepts the call has SDP
with two media lines, one of type audio, and one of type text, in the
same order the streams appeared in the INVITE, as mandated by
RFC2543. The audio stream SHOULD contain a subset of the codecs
listed in the audio stream in the INVITE. The audio stream MUST be
marked as send-only. The text stream MUST contain a single codec,
which is a dynamic payload type number bound to text/t140. The stream
MUST be marked as receive-only.

The client then ACKs the request. The TTS server SHOULD attempt to
convert all text received on the incoming text stream to speech, and
return the resulting speech on the outgoing audio stream.

6.3.2 Hearing Impaired Service

The TTS server is extremely useful in supporting hearing impaired
services. Examples of such services are described in describes a
service where a controller accesses a TTS service.

6.4 Messaging Servers

Another type of application server component is a messaging server.
Messaging servers allow callers to record audio messages for users on
the system. Users can also call into the server to retrieve these
messages, delete them, and file them. The system operates through the
use of voice prompts combined with DTMF detection and/or speech
recognition. The prompts that are played are context dependent. A
messaging server can be viewed as a specialized version of an IVR
server with an application specific controller associated with it. In
fact, a messaging server can be implemented in exactly this way.
However, the combination is also usefully viewed as a component in
its own right, due to the frequent need for messaging components in
more complex applications.

6.4.1 Service Interface

The service interface for communicating with a messaging server is
described in detail in [7]. The interface provides well known URIs
for the most common resources within a messaging server - user
specific message drops with a variety of drop conditions (called
party busy, called party not there, etc.), message retrievals using a
variety of authentication mechanisms (PIN, SIP level authentication),
and message drops that are not user specific, so that the target user
is queried for as part of the interface.

6.4.2 Web Enabled Message Drops

An example usage of this application component is a web front end
that allows users to leave voicemail for company employees through
the company web page. The page has a URL for each company employee.
If some user A clicks on a URL for employee B, A's phone rings. When
A picks up, they hear a greeting to record a message for employee B.

The call flow for this application combines third party call control
with access to the service. It is shown in Figure 9.

The caller, from a web page, clicks on the URL for the user they wish
to leave a message for. The result is an HTTP request (1) to the
controller. The URI in this request would be some controller-specific
identifier that tells the controller what it needs to do. The
controller then calls the user (3) using an SDP with a single media
stream on hold initially. This is accepted (4), and the resulting SDP
is used in an INVITE to the messaging server (6). The URI of this
INVITE is that for message drop with standard greeting
(sip:sub-jdrosen-deposit@voiceserver.com). The call is accepted (7)
and the 200 OK is used in a re-INVITE to the caller (9) to set the
address of the media stream to that of the voicemail server. After
the call is accepted (10) and ACKed (11), the caller hears the voice
drop prompt for the messaging server, and can record their message.

  |      |              |                      |
  |      | (1) HTTP GET |                      |
  |-------------------->|                      |
  |      | (2) 200 OK   |                      |
  |<--------------------|                      |
  |      | (3) INV      |                      |
  |      |<-------------|                      |
  |      | (4) 200 OK   |                      |
  |      |------------->|                      |
  |      | (5) ACK      |                      |
  |      |<-------------|                      |
  |      |              | (6) INV              |
  |      |              |--------------------->|
  |      |              | (7) 200 OK           |
  |      |              |<---------------------|
  |      |              | (8) ACK              |
  |      |              |--------------------->|
  |      | (9) INV      |                      |
  |      |<-------------|                      |
  |      | (10) 200 OK  |                      |
  |      |------------->|                      |
  |      | (11) ACK     |                      |
  |      |<-------------|                      |
  |      |              |                      |

 Web    SIP        Controller              Messaging
      Caller                                Server

Figure 9: Web Enabled Message Drops

7 Security Considerations

In many cases, authorization may be needed before a caller is allowed
access to a session level resource. Traditional SIP level
authentication mechanisms can be used to accomplish this. Note,
however, that in many cases the caller is the controller, which is
acting as a third party call controller. In these cases, a two level
trust model is really needed. The trust relationship in such
situations is really between the session level resource and the
controller (perhaps through an explicit business arrangement), and
then between the controller and the caller. Thus, controllers should
authenticate themselves to session resources they contact, rather
than trying to proxy credentials from the caller.

8 Conclusion

In this paper, we have argued that rapid deployment of complex
communications applications will require a distributed model where
application components are spread across the network. These
components could be offered by separate providers, for example,
enabling an ASP component model to evolve. We have observed that many
of the components can be described as having some kind of session
level resource that can be communicated with, usually in an automated
fashion. Access to these resources is typically parameterized. As a
result, SIP access, using the request URI as a service indicator, is
an ideal way to communicate across these components.

To validate this model, we examined the specific service interfaces
that would be defined by IVR servers, conferencing servers,
text-to-speech servers and messaging servers. We gave call flows of
complex applications built up from these components using the
specified interfaces.

9 Authors' Addresses

Jonathan Rosenberg
dynamicsoft
72 Eagle Rock Avenue
First Floor
East Hanover, NJ 07936
email: jdrosen@dynamicsoft.com

Peter Mataga
dynamicsoft
72 Eagle Rock Avenue
First Floor
East Hanover, NJ 07936
email: jdrosen@dynamicsoft.com

Henning Schulzrinne
Columbia University
M/S 0401
1214 Amsterdam Ave.
New York, NY 10027-7003
email: schulzrinne@cs.columbia.edu

10 Bibliography

[1] N. Greene, M. Ramalho, and B. Rosen, "Media gateway control
protocol architecture and requirements," Request for Comments 2805,
Internet Engineering Task Force, Apr. 2000.

[2] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett,
"Media gateway control protocol (MGCP) version 1.0," Request for
Comments 2705, Internet Engineering Task Force, Oct. 1999.

[3] F. Cuervo, N. Greene, C. Huitema, A. Rayhan, B. Rosen, and J.
Segers, "Megaco protocol 0.8," Request for Comments 2885, Internet
Engineering Task Force, Aug. 2000.

[4] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:
session initiation protocol," Request for Comments 2543, Internet
Engineering Task Force, Mar. 1999.

[5] J. Rosenberg, H. Schulzrinne, and J. Peterson, "Third party call
control in SIP," Internet Draft, Internet Engineering Task Force,
Mar. 2000. Work in progress.

[6] M. Handley and V. Jacobson, "SDP: session description protocol,"
Request for Comments 2327, Internet Engineering Task Force, Apr.
1998.

[7] B. Campbell and R. Sparks, "Control of service context using SIP
Request-URI," Internet Draft, Internet Engineering Task Force, Oct.
2000. Work in progress.

[8] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
telephony tones and telephony signals," Request for Comments 2833,
Internet Engineering Task Force, May 2000.

[9] V. Bharatia, E. Cave, and B. Culpepper, "SIP INFO method for
event reporting," Internet Draft, Internet Engineering Task Force,
Apr. 2000. Work in progress.

[10] T. Choudhuri, C. Haun, P. Sollee, S. Orton, and S. Whynot, "SIP
INFO method for DTMF digit transport and collection," Internet Draft,
Internet Engineering Task Force, Apr. 2000. Work in progress.

[11] VoiceXML Forum, "Voice extensible markup language (VoiceXML)
version 1.00," VoiceXML Forum specification, VoiceXML Forum, Mar.
2000.

[12] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:
session initiation protocol," Internet Draft, Internet Engineering
Task Force, Aug. 2000. Work in progress.

[13] S. Donovan, "The SIP INFO method," Request for Comments 2976,
Internet Engineering Task Force, Oct. 2000.

[14] G. Hellstrom, "RTP payload for text conversation," Request for
Comments 2793, Internet Engineering Task Force, May 2000.

[15] H. Alvestrand, "Tags for the identification of languages,"
Request for Comments 1766, Internet Engineering Task Force, Mar.
1995.

Table of Contents

1        Introduction
2        Why Decompose
3        Tightly Coupled Decomposition
4        The Decoupled Model
4.1      Architecture
4.2      Benefits of the Decoupling
5        Architecture for the Interfaces
5.1      Naming
5.2      Additional Message Content
5.3      Session Duration
5.4      Third Party Call Control
5.5      Side Channels
6        Patterns for Accessing Components
6.1      Interactive Voice Response Services
6.2      Conferencing Servers
6.2.1    Web Scheduled Conference Services
6.2.2    Web Scheduled, IVR supported, Time Limited Conference
6.3      Continuous Text-to-Speech
6.3.1    Service Interface
6.3.2    Hearing Impaired Service
6.4      Messaging Servers
6.4.1    Service Interface
6.4.2    Web Enabled Message Drops
7        Security Considerations
8        Conclusion
9        Authors' Addresses
10       Bibliography