idnits 2.17.1 

draft-rosenberg-sip-app-components-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 32) being 111 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 1506: '...   SHOULD be used for language specifi...'
     RFC 2119 keyword, line 1508: '...   RECOMMENDED. The language tags SHOU...'
     RFC 2119 keyword, line 1518: '...he server. The SDP MUST indicate a two...'
     RFC 2119 keyword, line 1519: '...eams. One stream MUST be of type audio...'
     RFC 2119 keyword, line 1520: '...able to the client. The stream MUST be...'
     (7 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 1017 has weird spacing: '...;  this  is  b...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 15, 2000) is 8562 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? '1' on line 1684 looks like a reference

  -- Missing reference section? '2' on line 1688 looks like a reference

  -- Missing reference section? '3' on line 1692 looks like a reference

  -- Missing reference section? '4' on line 1696 looks like a reference

  -- Missing reference section? '5' on line 1701 looks like a reference

  -- Missing reference section? '6' on line 1705 looks like a reference

  -- Missing reference section? '7' on line 1709 looks like a reference

  -- Missing reference section? '8' on line 1713 looks like a reference

  -- Missing reference section? '9' on line 1717 looks like a reference

  -- Missing reference section? '10' on line 1721 looks like a reference

  -- Missing reference section? '11' on line 1725 looks like a reference

  -- Missing reference section? '12' on line 1729 looks like a reference

  -- Missing reference section? '13' on line 1733 looks like a reference

  -- Missing reference section? '14' on line 1736 looks like a reference

  -- Missing reference section? '15' on line 1739 looks like a reference


     Summary: 5 errors (**), 0 flaws (~~), 3 warnings (==), 17 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                   SIP WG
3	Internet Draft                              Rosenberg/Mataga/Schulzrinne
4	draft-rosenberg-sip-app-components-00.txt        dynamicsoft/Columbia U.
5	November 15, 2000
6	Expires: May 2001

8	          An Application Server Component Architecture for SIP

10	STATUS OF THIS MEMO

12	   This document is an Internet-Draft and is in full conformance with
13	   all provisions of Section 10 of RFC2026.

15	   Internet-Drafts are working documents of the Internet Engineering
16	   Task Force (IETF), its areas, and its working groups.  Note that
17	   other groups may also distribute working documents as Internet-
18	   Drafts.

20	   Internet-Drafts are draft documents valid for a maximum of six months
21	   and may be updated, replaced, or obsoleted by other documents at any
22	   time.  It is inappropriate to use Internet-Drafts as reference
23	   material or to cite them other than as work in progress.

25	   The list of current Internet-Drafts can be accessed at
26	   http://www.ietf.org/ietf/1id-abstracts.txt

28	   The list of Internet-Draft Shadow Directories can be accessed at
29	   http://www.ietf.org/shadow.html.

31	Abstract

33	   An application server is defined as an entity that is capable of
34	   providing advanced features to users. Examples of features include
35	   call forwarding, call screening, debit card calling, web interactive
36	   voice response, etc. However, the set of functions needed to enable a
37	   broad range of such applications is quite large - it includes speech
38	   recognition, DTMF recognition and digit collection, text-to-speech
39	   synthesis, database interfacing, audio and video coding and decoding,
40	   audio and video bridging and mixing, and signaling, to name a few.
41	   Supporting such a large set of functions on the same box presents a
42	   major challenge. To solve this problem, the industry is proposing a
43	   decomposition of the application server into two components - a media
44	   server that handles the media component, and an application server
45	   that handles the call control, data, and signaling. The interface
46	   that has been proposed between these two elements is a control
47	   mechanism along the lines of MGCP or Megaco. In this paper, we
48	   propose an orthogonal decomposition, which breaks an application
49	   server into application server components. Each component represents
50	   an application server in its own right, but it provides a well defined
51	   component that by itself may be a complete, but simpler, application.

53	1 Introduction

55	   An observable trend in VoIP systems is the continuing decomposition
56	   of monolithic elements into component subparts, with the
57	   corresponding development of standardized interfaces between
58	   components. This kind of decomposition can be observed in the
59	   MGCP/megaco [1] gateway decomposition of a large gateway into a
60	   signaling gateway (SG), media gateway (MG) and media gateway
61	   controller (MGC), often referred to as a softswitch. Following that
62	   decomposition, the softswitch was further decomposed into a pure call
63	   control component (still referred to as a softswitch) and an
64	   application server (AS), which provides features and services. The AS
65	   was then decomposed, breaking it into a signaling piece (still
66	   referred to as an application server), and a media server (MS), which
67	   provides the media components of applications. Protocols like MGCP
68	   [2] and Megaco [3] have been proposed as the interface between an AS
69	   and MS.

71	   This paper proposes an additional decomposition of an application
72	   server into application server components (ASCs). This decomposition
73	   is orthogonal to the MS/AS decomposition, and differs significantly
74	   in its goals and benefits. The primary motivation is the recognition
75	   that most complex (and interesting) applications require a common set
76	   of core pieces - speech recognition and text-to-speech, translation
77	   services, conference servers, messaging servers, etc. Each of these
78	   components is complex and a full-fledged application in its own
79	   right. In most cases, a complex application really doesn't care about
80	   the details of the operation of the component. In many cases, these
81	   components run on separate servers, and often, would be provided by
82	   separate providers. What is needed, then, is a well-defined,
83	   distributed interface to these application server components. Here,
84	   we motivate a distributed decomposition of applications into
85	   components, and then show why, for many of these, the interface is
86	   ideally suited for a distributed, session establishment and
87	   termination interface that follows a standardized pattern of
88	   addressing and parameter passing. We believe the Session Initiation
89	   Protocol (SIP) [4] is ideally suited for such an interface.

91	2 Why Decompose

93	   The first question to address is "why decompose an application
94	   server".

96	   Decomposition is the act of breaking a large, monolithic system into
97	   a number of smaller components that interact according to specified
98	   behaviors. Decomposition of large components offers a number of
99	   benefits:

101	        Scale. As systems need to serve more and more users, there are
102	             two approaches to scaling up. One is to buy increasingly
103	             faster hardware, so that the monolithic servers can keep up
104	             with increasing use. The second is to distribute the work
105	             across components, so that multiple servers perform the
106	             work. Distribution is fundamentally cheaper, since the cost
107	             of large monolithic systems increases exponentially with
108	             capacity, compared to the linear increase in cost with
109	             multiple, smaller units. Distribution of work can be done
110	             through load balancing, where each server remains
111	             homogeneous, but the work is spread across numerous
112	             servers, or it can be done through specialization, where
113	             the work is split into separate functions, and each
114	             function placed on a separate server. Specialization is
115	             ideal in cases where the work has different requirements
116	             for it to be completed. As an example, a component of an
117	             application may require special purpose hardware. This
118	             component can be distributed to a specialized processor, with
119	             a normal off the shelf processor handling the more generic
120	             software tasks. Several of the components that we are
121	             describing fit into this category (such as the TTS server).

123	        Sharing of resources. By decomposing a server into components, a
124	             many-to-many interaction between them becomes possible.
125	             This means that one component can provide services to many
126	             other components. This provides for sharing of resources,
127	             which ultimately results in cost reduction.

129	        Expertise. Building a complex application requires expertise in
130	             call control, media services, compression, web, speech
131	             recognition, etc. It is highly unlikely that one
132	             organization will have enough expertise in all of these to
133	             build them all. By decomposing an application server into
134	             subpieces, organizations with expertise in one particular
135	             piece can build that one. The result is that the complete
136	             system can be composed of best in breed components.

138	        Speed of deployment. By decomposing, upgrading existing
139	             applications and deploying new ones becomes simpler. The
140	             decomposition provides isolation. This isolation means that
141	             one component can be changed or improved without affecting
142	             others. That makes it easy to add new features to an
143	             application, or to deploy a new one by using components
144	             already deployed.

146	   Decomposition does have its drawbacks. Primary amongst them is
147	   security. In general, the more boxes in a system, and the more they
148	   interact with each other, the more complex the security is. As a
149	   result, any distributed system has inherently more complex security
150	   issues. Another drawback is reliability. A system with multiple
151	   boxes, where the system requires all boxes to work in order to
152	   function, is less reliable than a system with a single box which must
153	   work.
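
   The reliability drawback can be made concrete with a small worked
   example. The sketch below is illustrative only: it assumes that
   boxes fail independently and that every box has the same
   availability, neither of which is stated in this document. The
   function name and the 0.99 figure are hypothetical.

```python
def series_availability(per_box: float, boxes: int) -> float:
    """Availability of a system that needs every one of `boxes` boxes up.

    Assumption (not from the draft): failures are independent and each
    box has the same availability `per_box`.
    """
    if not 0.0 <= per_box <= 1.0:
        raise ValueError("per_box must be a probability between 0 and 1")
    if boxes < 1:
        raise ValueError("boxes must be at least 1")
    # All boxes must work at once, so the probabilities multiply.
    return per_box ** boxes


if __name__ == "__main__":
    # Compare a single box with systems that depend on 2 and 5 boxes.
    for n in (1, 2, 5):
        print(n, series_availability(0.99, n))
```

   Under these assumptions the availability can only fall as boxes are
   added, which is the sense in which a multi-box system that needs
   every box is less reliable than a single box.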

155	3 Tightly Coupled Decomposition

157	   As an example of decomposition, it has been proposed to break the
158	   application server into a signaling and control component (the AS),
159	   plus a media server component (the MS). This decomposition is shown
160	   in Figure 1.

162	   Calls arrive at the AS component over SIP. The AS then accesses the
163	   MS using MGCP, and learns the IP address and port where the media for
164	   the call can be sent. This is returned in the 200 OK response by the
165	   AS. The AS then begins to instruct the MS to perform specific
166	   functions - collect digits, play tones and announcements, and to
167	   report the digits and tones back to the AS for further processing.
168	   Typically, the MGCP interface between the two devices is fairly
169	   "busy"; there is a lot of messaging for complex applications.

171	   In this model, there is a tightly coupled relationship between the MS
172	   and AS. The MS cannot function without the AS, and the AS needs to
173	   perform tight, low-level controls over the detailed operation of the
174	   media server.

176	   To some degree, breaking of an application server into these two
177	   components represents an implementation detail of how one builds a
178	   large, monolithic application server. It is not generally possible
179	   for the two components to be owned by separate providers. In fact, it
180	   has yet to be shown that complete interoperability and integration is
181	   possible with two components from different vendors, let alone
182	   different providers.

184	   This decomposition also does not provide a true separation of
185	   function. Most applications that require media interaction (IVR,
186	   credit card and debit card, etc.) have very cleanly separated media
187	   phases and signaling phases. The details of the media interactions
188	                         ....................
189	                         .                  .
190	                         . +-------------+  .
191	                         . |             |  .
192	               SIP       . |             |  .
193	              -------------+      AS     |  .
194	                         . |             |  .
195	                         . |             |  .
196	                         . |             |  .
197	                         . +-------------+  .
198	                         .        |         .
199	                         .        |         .
200	                         .        |         .
201	                         .        |MGCP     .
202	                         .        |         .
203	                         .        |         .
204	                         .        |         .
205	                         . +-------------+  .
206	                         . |             |  .
207	                         . |             |  .
208	               RTP       . |             |  .
209	              -------------+     MS      |  .
210	                         . |             |  .
211	                         . |             |  .
212	                         . +-------------+  .
213	                         .                  .
214	                         ....................
215	                          Complete Application
216	                          Server

218	   Figure 1: MGCP-based decomposition

220	   are usually not important to the signaling component, and vice
221	   versa. As an example, consider a debit card application. The
222	   application starts with the user making a call. As part of the call
223	   processing, interaction is needed with the user via the media stream
224	   to determine the debit card number. The precise set of menu
225	   operations and interactions used to obtain this number aren't
226	   important to the call/signaling processing piece; only the result
227	   (the number), is important. Once the number is returned, media
228	   processing ceases, and data and call processing commence. The debit
229	   card is looked up in a subscriber database, and if enough time
230	   remains, the call is completed. The signaling component monitors the
231	   call, and when the card has run out of minutes, the call is
232	   terminated.

234	   Consider the case where the application provider decides that the
235	   menus presented for debit card collection are confusing, and they
236	   need to be changed. This change really affects the media processing
237	   only; ideally, we would like to have no change whatsoever in the data
238	   processing and signaling part of the application. However, in the
239	   decomposition afforded by MGCP, the AS component contains both the
240	   signaling and call control, in addition to the control of the IVR
241	   menus and processing. Thus, the AS needs to be updated, even
242	   though what has changed is really an IVR component.

244	   The MGCP decomposition also presents a burden for software developers
245	   on the AS. They need to understand, and program, the detailed
246	   interactions with the MS that are provided by MGCP, in addition to
247	   the detailed signaling and data processing operations. The developers
248	   will also need to build and manage the low level state representing
249	   the controlled entity, which can be painful. The result is longer
250	   development times, less code reuse, and slower innovation.

252	   It has been argued that one of the benefits of the MGCP decomposition
253	   is that it offloads the "burden" of call control from the media
254	   server. However, from a complexity standpoint, the MGCP processing
255	   required is probably on par with (if not more than), the simple
256	   amount of call control and SIP processing needed if SIP were used
257	   directly.

259	   From a reliability perspective, an MGCP style decomposition is less
260	   desirable. Since the components are strongly coupled, the system will
261	   fail as soon as any one of the pieces fails. Failure can also be
262	   introduced because of additional network resources needed for
263	   communications between the boxes. The result is that the MGCP
264	   decomposition may actually increase the probability of failure, as
265	   compared to no decomposition at all.

267	   Another decomposition that has been proposed is to break a proxy into
268	   a routing and call control component, plus a services component. The
269	   interface between the two is then a transactional interface for
270	   services, similar in concept to INAP, based upon state transitions
271	   within a call model. This is another form of tight coupling, since it
272	   requires the services component to have detailed knowledge of the
273	   operational model of the call control component. We believe that this
274	   decomposition is limiting, for the same reasons the AS/MS
275	   decomposition is limiting.

277	4 The Decoupled Model
278	4.1 Architecture

280	   As a result of this, we see the master/slave decomposition as being
281	   ideal for a single vendor to build a large system. However, this
282	   decomposition does not solve the other distribution needs we have
283	   motivated above. As a result, we propose that the AS be decomposed
284	   into an application component responsible for coordinating the
285	   overall execution of the application (called the controller), and
286	   application server components that provide pieces of the overall
287	   application. These components are only loosely coupled with the
288	   coordinating application server. The loose coupling implies that the
289	   interaction between them is the same as the interaction between the
290	   user and the coordinating application server, which is, in turn, the
291	   same as the interaction between the application server components and
292	   other application server components. The components can easily be
293	   from separate vendors, and the interactions support the needed
294	   security and routing features to allow them to be owned by separate
295	   providers, even.

297	   The architecture is shown in Figure 2.

299	   The goal of the decoupling is to break the application into as
300	   coarse-grained pieces as possible. Each component (the coordinator
301	   included) should need to know as little as possible about the
302	   detailed operations performed by other components. A coarse-grained
303	   decomposition means that there is a clean and simple break in the
304	   functionality provided by the components. This enables significantly
305	   simpler interfaces between those components.

307	   Each component is really interested in passing a request for service
308	   to another, letting the other component perform its task, and then
309	   getting the final result of the task back as an output. From a
310	   software engineering perspective, this represents the classic
311	   function call; the call signaling component is making a function call
312	   to the media part. It is interested only in the return value - the
313	   debit card number, for example - and does not really care about the
314	   implementation of it. From a protocol perspective, this is a classic
315	   client-server system. The client makes a request of the server, and
316	   the server does whatever it needs to do to return the final response.
317	   The problem more closely resembles the client-server system than the
318	   function call, however. This is because we need the interaction to be
319	   across the network, rather than between code within the same process.
320	   This is because one of the key concepts here is that components can
321	   be provided by separate service providers.

323	   In such a model, where does the state for the sessions live? Here, we
324	   define a session as the complete set of interactions amongst all
325	   components for the delivery of the service. Thus, a session might
326	   span multiple protocols, and even multiple calls. Not surprisingly,
327	   session state is distributed amongst the components, and the
328	   distribution follows the architectural model of Figure 2. The top
329	   level server, the controller, maintains the high level pieces of
330	   state that deal with overall delivery of the service, and the state
331	   required to coordinate the interactions with the component servers.
332	   Each component server maintains only the state needed to execute
333	   its component, and to manage interactions with components below
334	   it. A component server does not know about the complete service
335	   being delivered, and does not know about sibling servers. This aspect
336	   of our model, hierarchical distribution of session state, leads to
337	   one of the primary benefits of the architecture - ease of
338	   development. Someone building a new application by reusing existing
339	   components only needs to manage the high level state for delivery of
340	   the service. State related to the details of operation of one of the
341	   components, such as timings between digits in an IVR server, is
342	   not relevant to the coordinator, and does not need to be managed.
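
   The hierarchical split of session state can be sketched in a few
   lines. This is a minimal illustration, not an implementation from
   this document: the class names, the phase labels, and the use of
   in-process method calls in place of SIP sessions are all
   assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IvrComponent:
    """Hypothetical component server: holds only digit-collection state."""
    digits: dict = field(default_factory=dict)  # session id -> digits so far

    def start(self, session_id: str) -> None:
        self.digits[session_id] = ""

    def on_digit(self, session_id: str, digit: str) -> None:
        # Low-level detail that the controller never sees.
        self.digits[session_id] += digit

    def finish(self, session_id: str) -> str:
        # Hand back the final result and drop the component-level state.
        return self.digits.pop(session_id)


@dataclass
class Controller:
    """Hypothetical controller: holds only service-level state."""
    ivr: IvrComponent
    phase: dict = field(default_factory=dict)    # session id -> service phase
    results: dict = field(default_factory=dict)  # session id -> final result

    def begin(self, session_id: str) -> None:
        self.phase[session_id] = "collecting"
        self.ivr.start(session_id)

    def complete(self, session_id: str) -> None:
        self.results[session_id] = self.ivr.finish(session_id)
        self.phase[session_id] = "authorizing"


if __name__ == "__main__":
    controller = Controller(IvrComponent())
    controller.begin("s1")
    for d in "1234":
        controller.ivr.on_digit("s1", d)  # stands in for media arriving at the IVR
    controller.complete("s1")
    print(controller.phase["s1"], controller.results["s1"])
```

   The controller never stores partial digits, and the component never
   learns which service phase it is serving.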

344	   The difference between classic RPC or client/server interactions and
345	   the interactions between the components here is that the relationship
346	   between the components represents a long lived association (i.e., a
347	   session), during which a session level service is being provided,
348	   rather than a simple input/output service. As an example, consider a
349	   component providing continuous real-time text-to-speech translation
350	   services. The application coordinator that wishes to use this service
351	   acts as a client, initiating a request for service to the server (in
352	   this case, the TTS server). However, the text is not passed as an
353	   "argument" to the TTS server, it is continually streamed for the
354	   duration of an active session, and the TTS server would continuously
355	   stream back the speech version of the text, which is the output of
356	   the service.

358	   Another example is a voice messaging server. The messaging server
359	   provides basic services like message drop, message retrieve, and
360	   message management. Each of these represents a procedure that can be
361	   executed by a client component. To drop a message, for example, the
362	   client component would initiate a session with the messaging server.
363	   A prompt would be played over that session, something like "please
364	   record your message for Joe now", and then the component takes the
365	   media input stream, records it, and saves it. When it is done, the
366	   session is terminated.

368	   In some cases, the session may require a "side channel" over which
369	   intermediate data is passed, needed to control the session
370	   interactions from that point forward. IVR is the classic example. In
371	   some cases the coordinating application server can kick off the IVR
372	   script, and then only get back the final result - a menu option, a
373	                            +-----------+
374	                            |           |
375	                            |           |
376	                            |  AS       |
377	                            |coordinator|
378	                            |           |
379	                            |           |
380	                            +-----------+
381	                   SIP,     --    \    ---
382	                    RTP?  --       \      ----      SIP,
383	                        --         \          ----   RTP?
384	                      --            \ SIP,        ----
385	                    --               \ RTP?           ----
386	                  --                 \                    --
387	            +----------+        +-----\----+         +----------+
388	            |          |        |          |         |          |
389	            |          |        |          |         |          |
390	            |          |        |          |         |          |
391	            |  ASC     |        |    ASC   |         |   ASC    |
392	            |          |        |          |         |          |
393	            |          |        |          |         |          |
394	            +----------+        +----------+         +----------+
395	                                       \                    /
396	               /                        \\  SIP,           /
397	              / SIP,                      \  RTP?        //
398	             /   RTP?                      \\           / SIP,
399	            /                                \         /   RTP?
400	           /                               +----------+
401	     +----------+                          |          |
402	     |          |                          |          |
403	     |          |                          |          |
404	     |          |                          |   ASC    |
405	     |   ASC    |                          |          |
406	     |          |                          |          |
407	     |          |                          +----------+
408	     +----------+

410	   Figure 2: Decoupled Architecture
411	   credit card number, or what have you. In other cases, the
412	   coordinating component may need to get intermediate results, so that
413	   it can guide the operation of the IVR moving forward. This requires a
414	   companion control channel that provides data output from the
415	   component server back to the client, and then returns further high
416	   level instructions from the client back to the server.

418	   There is a thin line in some cases between this control channel and
419	   the tightly coupled interactions of a master-slave MGCP relationship.
420	   However, the loosely coupled nature of the interaction can be
421	   maintained by using coarse-grained data passing over a distributed
422	   client-server protocol, such as HTTP or Corba.

424	   From this architectural description, it is clear that a client-server
425	   session establishment protocol, which allows for passing of
426	   parameters that describe service, is the ideal mechanism to
427	   coordinate the interaction between components. Clearly, SIP is
428	   perfect in such a role.

430	   Following the example above, an IVR application server component
431	   would be completely responsible for the execution of the IVR piece of
432	   an application, including both the media and the signaling call
433	   control. It would know the menus to maneuver through, and it would
434	   know when to collect digits and present prompts. The coordinating
435	   application server would request service from the IVR component by
436	   initiating a call to it (possibly using third party call control [5]
437	   to direct the media directly to the IVR without passing through
438	   itself; more on that below). The application component takes the
439	   media from the incoming call, running it against the IVR application.
440	   When the IVR is done, the final result - in this case, the credit
441	   card number, is passed back to the coordinating AS, possibly through
442	   an HTTP POST operation. The coordinating AS then terminates the call
443	   with the IVR.
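
   The message sequence just described (initiate a call to the IVR,
   receive the final result, terminate the call) can be traced with a
   small in-memory simulation. This is a sketch under stated
   assumptions: no SIP or HTTP is actually spoken, the callback stands
   in for the HTTP POST, and all class and method names are
   hypothetical.

```python
class IvrServer:
    """Hypothetical IVR component; `result` is the value its script yields."""

    def __init__(self, result: str) -> None:
        self.result = result
        self.calls = {}  # call id -> callback used to report the final result

    def invite(self, call_id: str, report) -> None:
        # Stands in for accepting a SIP INVITE from the coordinator.
        self.calls[call_id] = report

    def run_script(self, call_id: str) -> None:
        # Stands in for the menus and digit collection, then the HTTP POST.
        report = self.calls[call_id]
        report(call_id, self.result)

    def bye(self, call_id: str) -> None:
        # Stands in for receiving a SIP BYE.
        self.calls.pop(call_id, None)


class Coordinator:
    """Hypothetical coordinating AS that only wants the final result."""

    def __init__(self, ivr: IvrServer) -> None:
        self.ivr = ivr
        self.trace = []
        self.results = {}

    def collect(self, call_id: str) -> str:
        self.trace.append("INVITE")
        self.ivr.invite(call_id, self.on_result)
        # In a real deployment the script would run on its own as media arrives.
        self.ivr.run_script(call_id)
        return self.results[call_id]

    def on_result(self, call_id: str, value: str) -> None:
        self.trace.append("POST result")
        self.results[call_id] = value
        self.trace.append("BYE")
        self.ivr.bye(call_id)


if __name__ == "__main__":
    coordinator = Coordinator(IvrServer("4111"))
    print(coordinator.collect("call-1"), coordinator.trace)
```

   The coordinator sees three events per IVR interaction regardless of
   how many prompts and digits the script involves.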

445	4.2 Benefits of the Decoupling

447	   This decoupled interaction between components provides several
448	   important benefits:

450	        Separation of Businesses. The decoupled interaction between
451	             components is needed to allow the components to be provided
452	             by separate providers. Master-slave control interactions do
453	             not work well across service providers, let alone across
454	             vendors. By allowing separate providers to offer the
455	             components, new businesses can be created that specialize
456	             in the piece they are providing.

458	        Rapid Development. Since the components can easily be placed in
459	             separate boxes from separate vendors, or even in separate
460	             providers, we achieve a separation of function that allows
461	             each piece to be developed in complete isolation. We also
462	             get reuse of components for new applications. This allows
463	             for rapid service creation.

465	        Better Interoperability. It can be argued that the decoupled
466	             interaction between components is more likely to be
467	             interoperable than a master-slave mechanism. This is
468	             largely based on the assumption that a master-slave
469	             interaction requires a lot more messaging and exchange
470	             between the components, whereas the decoupled client-server
471	             mechanism requires less. The less information that passes
472	             back and forth, the easier it is to interoperate.

474	        Architectural Flexibility. The loose coupling of the components
475	             means that a server, such as a conferencing application or
476	             IVR, need not be implemented as an actual server. Rather,
477	             complex networks of components, with proxies providing
478	             routing of requests in arbitrarily complex ways, can be
479	             built to provide a service. Since the interaction is SIP,
480	             the application controller accessing the service doesn't
481	             know whether it is communicating with a single server or a
482	             network built in this fashion. That allows ASPs flexibility
483	             in how they can construct their service networks.

485	        Reliability. The loose coupling of the components improves
486	             reliability compared to a tight coupling. That's because the
487	             system can probably continue to operate after the
488	             failure of a single component. For example, if a TTS server
489	             fails during a session, an application server can use a
490	             server from a completely different provider, or it can use
491	             a media server instead, converting the text to VoiceXML
492	             scripts. Depending on the service, the TTS component could
493	             possibly be skipped altogether. Note, however, that the
494	             reliability is still not as good as a monolithic system.
495	             Having ten identical boxes each running a complete set of
496	             services is better than spreading the service across ten
497	             boxes, where losing some subset causes total failure.

499	5 Architecture for the Interfaces

501	   Up to now, we have been fairly vague about exactly how such an
502	   interface would work in practice. We have argued that it is SIP, but
503	   not described in detail how SIP is actually used for this function.

505	   SIP (along with SDP [6]) clearly provides the facilities for
506	   initiation and termination of the sessions between the controller and
507	   components, and for specification of the media addresses to and from
508	   which media is sent. However, SIP leaves a lot of flexibility in
509	   terms of naming, additional message content, session duration, and
510	   control. Here, we discuss each of these in turn.

512	5.1 Naming

514	   In any remote procedure call system, a key component is naming. The
515	   identified resource must be properly addressed so that the underlying
516	   message passing system can properly determine where the request
517	   should go.

519	   The same is true in SIP. Messages are routed based on the request
520	   URI, as it serves as the primary naming tool for routing messages. In
521	   its application to AS component interaction, the request URI serves
522	   as the primary tool to identify the resource to which the session is
523	   addressed. A critical piece of defining a session level service that
524	   can be accessed by SIP is defining the naming of the resources within
525	   that service. This point cannot be overstated.

527	   As an example, consider a conferencing service. In this case, the
528	   primary resource that is being accessed is a mixing service. We would
529	   like to have a way to identify which conference is being addressed by
530	   any given call. Calls for the same conference are all bridged
531	   together. By default, the bridging would operate in an N-1
532	   configuration (that is, each user receives a mixed media stream that
533	   represents all of the other users besides themselves). Conferences
534	   can be set up in two ways - ad-hoc, which are not pre-established at
535	   all and exist so long as there is a participant in them, and
536	   scheduled, where they exist for a certain period of time.

538	   One might imagine that a conferencing service breaks its URI
539	   namespace into two pieces - one piece that represents ad-hoc
540	   conferences, and another that represents scheduled conferences. Ad-
541	   hoc conferences are addressed using a URI of the form <conference
542	   ID>.adhoc@conferences.com. All users who initiate a call to the URI
543	   sip:as9dahas89.adhoc@conferences.com are bridged together. The
544	   conference state is established when the first call to a conference
545	   occurs, and destroyed when the last call terminates. In contrast,
546	   scheduled conferences might be named by <conference
547	   id>.scheduled@conferences.com, so that a call to
548	   sip:conference12.scheduled@conferences.com allows a user access to a
549	   pre-arranged conference.

551	   There are several benefits to naming ad-hoc conferences vs. scheduled
552	   ones in this fashion. The primary one is convenience; the name makes
553	   the type of conference apparent to any entities that are
554	   interested. Secondly, it can avoid certain misconfigurations. Let's
555	   say there are no conventions for naming of ad-hoc versus scheduled
556	   conferences. I am asked to join a scheduled conference
557	   (conf2321@conferences.com), but I mis-type the URL in my browser
558	   (conf2123@conferences.com). I don't want this to drop me into an ad-
559	   hoc conference where I sit for 15 minutes thinking others will
560	   eventually join. If ad-hoc conferences are named differently, a call
561	   to conf2123@conferences.com is never going to be an ad-hoc
562	   conference, and so my call will be rejected immediately.
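
   The naming convention above can be sketched in a few lines of
   Python. This is only an illustration: the "adhoc" and "scheduled"
   suffixes come from the example URIs in this section, while the
   function name classify_conference and its return value are
   hypothetical.

```python
def classify_conference(request_uri):
    """Split a conference request URI into (kind, conference_id).

    Returns None when the user part does not follow the
    <conference id>.adhoc / <conference id>.scheduled convention,
    in which case the call would be rejected rather than silently
    creating an ad-hoc conference.
    """
    if not request_uri.startswith("sip:"):
        return None
    user, sep, host = request_uri[4:].partition("@")
    if not sep or not host:
        return None
    # The conference type is whatever follows the last dot.
    conf_id, dot, kind = user.rpartition(".")
    if not dot or not conf_id or kind not in ("adhoc", "scheduled"):
        return None
    return kind, conf_id
```

   With this convention, a mis-typed scheduled conference name that
   lacks a recognized suffix is never interpreted as a request for a
   new ad-hoc conference.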

564	   For an application server to use a conferencing service as a
565	   component, the AS must know the URI namespace conventions used to
566	   identify the various conferences. The above information, for example,
567	   would be provided by the conferencing provider to its customers.

569	   This same concept of using the request URI as a service identifier
570	   has been described in detail for voicemail systems [7].

572	   The great advantage of using the request URI as a service identifier
573	   comes because of the combination of two facts. First, unlike in the
574	   PSTN, where numbers are limited, URIs come from an infinite space.
575	   They are plentiful, and they are free. Secondly, the primary function
576	   of SIP is call routing through manipulations of the request URI. In
577	   the traditional SIP application, this URI represents people. However,
578	   the URI can also represent services, as we propose here. This means
579	   we can apply the routing services SIP provides to routing of calls to
580	   services. The result - the problem of service invocation and service
581	   location becomes a routing problem, for which SIP provides a scalable
582	   and flexible solution. Since there is such a vast namespace of
583	   services, we can explicitly name each service in a finely granular
584	   way. This allows the distribution of services across the network. In
585	   the conferencing example above, since we have separated the names of
586	   ad-hoc conferences from scheduled conferences, we can program proxies
587	   to route calls for ad-hoc conferences to one set of servers, and
588	   calls for scheduled ones to another, possibly even in a different
589	   provider. In fact, since each conference itself is given a URI, we
590	   can distribute conferences across servers, and easily guarantee that
591	   calls for the same conference always get routed to the same server.
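
   As a rough sketch of how a proxy might exploit this, the following
   Python fragment picks a next hop from the request URI alone. The
   server pools, host names, and the choice of a hash over the user
   part are all assumptions made for illustration; any deterministic
   mapping from conference name to server would have the same effect.

```python
import hashlib

# Hypothetical server pools; the host names are placeholders.
POOLS = {
    "adhoc": ["adhoc1.example.com", "adhoc2.example.com"],
    "scheduled": ["sched1.example.net"],
}

def pick_server(request_uri):
    """Choose a next hop based only on the user part of the URI."""
    user = request_uri[len("sip:"):].split("@", 1)[0]
    kind = user.rpartition(".")[2]
    pool = POOLS.get(kind)
    if pool is None:
        return None
    # Hashing the full user part sends every call for the same
    # conference to the same server in the pool.
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(pool)
    return pool[index]
```

   Because the choice depends only on the name, no per-conference state
   needs to be shared between proxies to keep a conference on one
   server.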

593	   This is in stark contrast to conferences in the telephone network,
594	   where the equivalent of the URI - the phone number - is scarce. An
595	   entire conferencing provider generally has one or two numbers.
596	   Conference IDs must be obtained through IVR interactions with the
597	   caller, or through a human attendant. This makes it difficult to
598	   distribute conferences across servers all over the network, since the
599	   PSTN routing only knows about the dialed number.

601	   Care must be taken not to push this concept too far. Naming of
602	   services should not become so fine-grained that all parameters
603	   associated with the service simply become encoded into the request
604	   URI as well. The right level of granularity can be determined based
605	   on routing. If a service is represented by multiple URLs, but
606	   requests for each of those URLs are always routed in the same way,
607	   the naming is too fine-grained.

609	5.2 Additional Message Content

611	   Sometimes, connecting to a service requires the service to know
612	   additional information that is not appropriate for the request URI.
613	   As an example, the conferencing server might need to know the name,
614	   address, phone number, company, and email address of the
615	   participants, which it converts to speech and uses as an announcement
616	   when the user joins and leaves the bridge.

618	   This kind of content can easily be carried in the body of the SIP
619	   messages used to establish and manage the session with the service.
620	   For simple data, SIP headers may be appropriate. In the conferencing
621	   example above, the conferencing service might mandate that a vCard be
622	   attached to all INVITEs, in order to provide that information.

624	   When existing data formats (like a vCard) are not defined to provide
625	   the needed information, it can be encoded in an XML document, for
626	   example, and carried along in the INVITE.

628	   Each service would need to specify the content that it needs in order
629	   to process the session invitation.

631	5.3 Session Duration

633	   The duration of the session that is established with a server depends
634	   entirely on the nature of the service. For example, for a conference,
635	   the initiation of the call begins the mixing service for that user,
636	   and the termination of the call results in that user leaving the
637	   conference.

639	   For an IVR service, the INVITE request begins the interaction with
640	   the service. Once the INVITE transaction completes, the IVR would
641	   play out the initial prompt, and begin collecting data from the
642	   caller. How the IVR terminates depends on its usage. When the
643	   initiator of the service is an application server, we would argue
644	   that in almost all cases, it should be the responsibility of the
645	   controller to determine when the interaction is complete (and thus
646	   terminate the call with a BYE). However, when the initiator is an end
647	   user, the IVR will usually be the one to terminate the session. We
648	   discuss IVR interactions in more detail below in Section 6.1.

650	5.4 Third Party Call Control

652	   Third party call control, as defined in [5], plays an integral role
653	   in this architecture.

655	   In many cases, the controller orchestrating a service wishes to
656	   invoke the resources of an IVR or conferencing server. However, the
657	   AS is not the actual source of the media that drives the IVR. The
658	   source of the media is the end user that initiated the call to the
659	   controller. What is needed, then, is a way for the AS to call the IVR
660	   or conferencing server, and pass it the media information of the end
661	   user. Similarly, the media address of the IVR server (described in
662	   the SDP from the media server), needs to be passed to the end user
663	   that initiated the call. By using third party call control, an
664	   application server can direct the media of the end user to and from
665	   the components that it is using to provide the application. Once one
666	   service is complete, the controller can move the media to a different
667	   component. SIP re-INVITEs also allow the controller to request the
668	   caller to send multiple media streams, one, for example, containing
669	   only DTMF and tones. This allows for DTMF control of services without
670	   carrying DTMF in SIP itself.

672	   Figure 3 shows how we use a component server to collect DTMF input
673	   for a service; specifically, a simple (and perhaps useless) service
674	   that allows a caller to press '1' to indicate that they want to put
675	   the call on hold. The service is, in principle, useless, since hold
676	   is so common that the end user can do this themselves. However, it is
677	   useful for example purposes.

679	   The caller sends an INVITE request to the called party (1), which is
680	   routed to a server handling calls for the domain of the called party.
681	   In this case, the server is an application server. The AS decides
682	   that it would like to offer the caller advanced services based on
683	   DTMF events sent mid-call. As a result, it decides to invoke the
684	   services of a media server component. The AS will use third party
685	   call control mechanisms to have the caller send any DTMF related
686	   media to the media server, in addition to sending its media to the
687	   called party. To accomplish this, the AS sends an INVITE to the media
688	   server (2), with an indication that the media stream is send only
689	   (this is accomplished using the sendonly SDP attribute [6]). The
690	   request URI of this INVITE binds that session to a service that looks
691	   for any in-band DTMF, and reports it back to the AS through an HTTP
692	   GET or POST operation. In section 6.1, we show how this is easily
693	   done with a VoiceXML driven IVR server.

695	   The media server responds with a 200 OK (3) that contains SDP with
696	   the address where the media should be sent to. The application server
697	   ACKs this response (4), and holds on to that SDP. The AS then proxies
698	   the original INVITE request (5), and the called party answers the
699	   call (6). This acceptance is proxied upstream (7), and then
700	   acknowledged (8,9). At this point, media is flowing between the
701	   caller and called party (10). The next step for the AS is to get a
702	   stream of DTMF digits to flow from the caller to the media server. To
703	   do this, it sends a re-INVITE to the caller (11). This re-INVITE
704	   contains the same SDP as the response (6) from the called party, but
705	   with the addition of a new media line. This media line is audio, and
706	   contains a single codec, the RTP payload format for DTMF and tones
707	   [8]. The connection address and port are from the SDP returned from
708	   the media server. This tells the caller to send an additional media
709	   stream to the media server, using only the DTMF codec. The result is
710	   that RTP packets are sent only when the caller presses a button on
711	   the phone.
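
   The SDP manipulation performed for the re-INVITE (11) can be
   sketched as follows. This is a minimal illustration, not a complete
   SDP implementation: the function name add_dtmf_stream, the use of
   IPv4, and the dynamic payload type 101 are assumptions, while the
   "telephone-event" encoding name is the one used by the DTMF payload
   format [8].

```python
def add_dtmf_stream(callee_sdp, ms_address, ms_port, payload_type=101):
    """Append a DTMF-only audio stream aimed at the media server.

    callee_sdp   -- SDP text taken from the called party's 200 OK (6)
    ms_address   -- IPv4 connection address from the media server's SDP
    ms_port      -- RTP port from the media server's SDP
    payload_type -- dynamic payload type chosen for telephone-event
    """
    lines = [line for line in callee_sdp.splitlines() if line]
    lines += [
        # New media line carrying a single codec: DTMF and tones.
        "m=audio %d RTP/AVP %d" % (ms_port, payload_type),
        # Media-level connection address overrides the session-level one.
        "c=IN IP4 %s" % ms_address,
        "a=rtpmap:%d telephone-event/8000" % payload_type,
    ]
    return "\r\n".join(lines) + "\r\n"
```

   The original media lines from the called party are left untouched,
   so the caller continues to send its regular media to the called
   party.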

713	   The caller accepts this re-INVITE (12), and the AS acknowledges it
714	   (13). Now, DTMF only RTP is flowing between the caller and the media
715	   server (14). At some point later, the caller presses the 1 key
716	   (which, for example, might imply call hold). This is processed by the
717	   media server, and the result is an HTTP request being sent to the AS
718	   (15). The HTTP request contains the value of the collected digit. The
719	   AS receives this request, and knows that the user keyed in a 1.
720	   Recognizing this input as call hold, the AS sends a re-INVITE to the
721	   called party (17). The SDP in this re-INVITE is the same as the SDP
722	   in the original INVITE from the caller (1), except that the
723	   connection address is set to zero, indicating call hold. The called
724	   party accepts the re-INVITE (18), and this is ACKed by the AS (19).
725	   The called party is now on hold.
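
   The hold operation itself amounts to a small rewrite of the SDP. A
   minimal sketch, assuming IPv4 connection lines and a hypothetical
   helper named hold_sdp:

```python
def hold_sdp(sdp):
    """Return a copy of sdp with every IPv4 connection address zeroed."""
    out = []
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            # A connection address of zero indicates call hold.
            line = "c=IN IP4 0.0.0.0"
        out.append(line)
    return "\r\n".join(out) + "\r\n"
```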

727	   Note that the call flow remains unchanged if the stimulus were based
728	   on voice recognition instead of DTMF. The only difference would be
729	   that a general purpose codec, such as G.711, would be used instead of
730	   RFC 2833 for communications between the caller and the media server.
731	   This achieves an important unification. Independent of the type of
732	   stimulus - voice, DTMF, or, in fact, direct HTTP requests from the
733	   caller (if they were using a softphone), the service execution code
734	   is unchanged.

736	   Others have proposed that DTMF digits be carried in SIP directly from
737	   the caller to the AS [9,10].  However, this approach does not work
738	   for anything beyond DTMF, while our approach works for DTMF, speech,
739	   and web interfaces. Another drawback of the DTMF-in-SIP approach is
740	   that all entities on the call signaling path will receive any DTMF
741	   digits dialed by the called party. Furthermore, since the caller
742	   doesn't know if there is an entity interested in DTMF, it is required
743	   to send DTMF within SIP messages all the time, even if no entity is
744	   interested.

746	    Caller          Coordinator         Media Server       Callee
747	      |                |                  |                 |
748	      |(1) SIP INV     |                  |                 |
749	      |--------------->|(2) SIP INV       |                 |
750	      |                |----------------->|                 |
751	      |                |(3) 200 OK        |                 |
752	      |                |<-----------------|                 |
753	      |                |(4) SIP ACK       |                 |
754	      |                |----------------->|                 |
755	      |                |(5) SIP INV       |                 |
756	      |                |----------------------------------->|
757	      |                |(6) 200 OK        |                 |
758	      |(7) 200 OK      |<-----------------------------------|
759	      |<---------------|                  |                 |
760	      |(8) SIP ACK     |                  |                 |
761	      |--------------->|(9) SIP ACK       |                 |
762	      |                |----------------------------------->|
763	      |(10) RTP        |                  |                 |
764	      |.....................................................|
765	      |                |                  |                 |
766	      |(11) SIP INV    |                  |                 |
767	      |<---------------|                  |                 |
768	      |(12) 200 OK     |                  |                 |
769	      |--------------->|                  |                 |
770	      |(13) SIP ACK    |                  |                 |
771	      |<---------------|                  |                 |
772	      |(14) RTP        |                  |                 |
773	      |...................................|                 |
774	      |                |                  |                 |
775	      |                |(15) HTTP GET     |                 |
776	      |                |<-----------------|                 |
777	      |                |(16) 200 OK       |                 |
778	      |                |----------------->|                 |
779	      |                |                  |                 |
780	      |                |(17) SIP INV      |                 |
781	      |                |------------------+---------------->|
782	      |                |(18) 200 OK       |                 |
783	      |                |<-----------------+-----------------|
784	      |                |(19) SIP ACK      |                 |
785	      |                |------------------+---------------->|
786	      |                |                  |                 |
787	      |                |                  |                 |
788	      |                |                  |                 |
789	      |                |                  |                 |

791	   Figure 3: Call Flow for DTMF Enabled Hold Service

793	   There have been proposals for adding a subscription/notification
794	   mechanism on top of this to avoid this problem. However, this further
795	   complicates the system by adding a requirement for the caller to
796	   support a subscription and notification service just for DTMF.

798	   Our approach fits well within the existing SIP framework, and
799	   requires no additional work from the end users. Furthermore, it
800	   transparently supports multiple application server components
801	   receiving DTMF. This is because an AS is able to send a DTMF stream
802	   to a component by adding a new media line to the list of media
803	   streams being sent by the caller. The list of media streams being
804	   sent by the caller is observed by each AS through the initial INVITE,
805	   along with any subsequent re-INVITEs which might modify it. Consider
806	   the situation with two application servers, A and B, depicted in
807	   Figure 4. The original call setup starts with the caller, flows
808	   through A, then B, then the called party. At some point later, A
809	   sends a re-INVITE (10) to the caller, adding a media stream, just as
810	   described in Figure 3. The SDP in this INVITE will be the same as
811	   provided by the called party in message (5), plus the additional DTMF
812	   stream. Note that this re-INVITE does not pass through B. Now, B
813	   decides to add a media stream for DTMF. So, it sends a re-INVITE
814	   (13). This goes first to A. As far as A is concerned, this re-INVITE
815	   is from the called party. A computes the difference between what it
816	   believes the called party should perceive as the set of media
817	   streams, and what is in the re-INVITE (13). This difference (the
818	   additional DTMF stream added by B) is added to the SDP that A had
819	   sent to the caller previously (10), and the result is sent in a re-
820	   INVITE to the caller (14). This SDP now contains the media streams
821	   meant for the actual called party, along with two DTMF streams; one
822	   for A, and one for B. The caller thus sends DTMF to both servers.
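
   The difference computation performed by A can be sketched as
   follows. This is an illustration under simplifying assumptions:
   media descriptions are compared as exact text, and the function
   names media_blocks and merge_added_streams are hypothetical.

```python
def media_blocks(sdp):
    """Split SDP text into a list of media descriptions (lists of lines)."""
    blocks = []
    for line in sdp.splitlines():
        if line.startswith("m="):
            blocks.append([line])
        elif blocks:
            blocks[-1].append(line)
    return blocks

def merge_added_streams(expected_sdp, reinvite_sdp, sent_to_caller_sdp):
    """Fold streams added by a downstream server into A's own SDP.

    expected_sdp       -- what A expects to see from the called-party side
    reinvite_sdp       -- SDP actually received in the re-INVITE (13)
    sent_to_caller_sdp -- SDP that A previously sent to the caller (10)
    """
    known = media_blocks(expected_sdp)
    # Anything not already expected was added downstream (here, by B).
    added = [b for b in media_blocks(reinvite_sdp) if b not in known]
    lines = sent_to_caller_sdp.splitlines()
    for block in added:
        lines.extend(block)
    return "\r\n".join(lines) + "\r\n"
```

   The result keeps the stream A added for itself and appends the one
   added by B, which is the SDP carried in the re-INVITE (14).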

824	   A further advantage of our approach is that the DTMF can even be sent
825	   using multicast, since it is being sent in RTP rather than as part of
826	   SIP. This allows for tremendous scalability, if needed, in the number
827	   of entities receiving the DTMF streams.

829	5.5 Side Channels

831	   Side channels are used for passing of events from the application
832	   server components back to the client, and for passing control
833	   commands from the client to the application server component.

835	   Unfortunately, side channels complicate the simple session level
836	   interface between components. It is our belief, at least for the
837	   components described here, that only minimal side channels are
838	   needed. Specifically, the only service below that requires one to be
839	   effective is the IVR service, for which HTTP forms an ideal side
840	   channel. If the side channel becomes so complex as to introduce
841	   extensive synchronization, bandwidth, and transactional issues, the
842	   relationship between the components becomes tightly coupled once
843	   more, and the benefits we are espousing here begin to disappear.

845	   As such, we believe that a reasonable side channel for decoupled
846	   server interactions is defined as follows:

848	        o The event reporting and control components have no real time
849	          requirements.

851	        o Event reporting from the component back to the client
852	          accessing it is infrequent; specifically, the intervals are
853	          much larger than the round trip times between the client and
854	          the component.

856	        o Control from the client to the component is infrequent;
857	          specifically, the intervals are much larger than the round
858	          trip times between the client and component.

860	        o Event reporting is coarsely granular, so that the client does
861	          not need to explicitly subscribe to specific events in order
862	          to avoid being overwhelmed with data.

864	        o The amount of data passed in both the events and in the
865	          control is small.

867	        o There are no requirements for transaction support.

869	   Note that protocols like MGCP and Megaco do not meet these
870	   requirements, as they require tight timing, synchronization, and
871	   explicit subscriptions. HTTP, as used in VoiceXML, however, does meet
872	   these requirements.

874	6 Patterns for Accessing Components

876	   In this section, we propose a set of patterns that define the
877	   interaction of a controller with an application server component.
878	   These patterns manifest themselves in the description of the service
879	   invoked when a session is initiated, a discussion of the naming
880	   conventions of the service, and a description of any back channel
881	   used for control and data passing.

883	6.1 Interactive Voice Response Services
884	       Caller            A                B              Callee
885	         |               |                |                 |
886	         |(1) SIP INV    |                |                 |
887	         |-------------->|(2) SIP INV     |                 |
888	         |               |--------------->|(3) SIP INV      |
889	         |               |                |---------------->|
890	         |               |                |(4) 200 OK       |
891	         |               |(5) 200 OK      |<----------------|
892	         |(6) 200 OK     |<---------------|                 |
893	         |<--------------|                |                 |
894	         |(7) SIP ACK    |                |                 |
895	         |-------------->|(8) SIP ACK     |                 |
896	         |               |--------------->|(9) SIP ACK      |
897	         |               |                |---------------->|
898	         |(10) SIP INV   |                |                 |
899	         |<--------------|                |                 |
900	         |(11) 200 OK    |                |                 |
901	         |-------------->|                |                 |
902	         |(12) SIP ACK   |                |                 |
903	         |<--------------|                |                 |
904	         |               |                |                 |
905	         |               |(13) SIP INV    |                 |
906	         |(14) SIP INV   |<---------------|                 |
907	         |<--------------|                |                 |
908	         |(15) 200 OK    |                |                 |
909	         |-------------->|(16) 200 OK     |                 |
910	         |               |--------------->|                 |
911	         |               |(17) SIP ACK    |                 |
912	         |(18) SIP ACK   |<---------------|                 |
913	         |<--------------|                |                 |
914	         |               |                |                 |
915	         |               |                |                 |
916	         |               |                |                 |
917	         |               |                |                 |
918	         |               |                |                 |

920	   Figure 4: Multiple Application Servers and DTMF

922	   We have touched upon the basics of the interaction between a
923	   controller and an IVR server. The controller initiates a call to the
924	   server, the server executes some kind of IVR service, and data is
925	   returned. A number of questions still need to be answered, however:

927	        1.   How is the IVR service identified?

929	        2.   How can the controller specify the details of the dialog
930	             the IVR carries out with the user?

932	        3.   How does data from the IVR get passed back to the
933	             controller?

935	        4.   How is intermediate control performed (e.g., to interrupt
936	             or reset the IVR based on some event at the
937	             controller)?

939	   We believe that VoiceXML [11] represents the ideal partner for SIP in
940	   the development of distributed IVR servers. VoiceXML is an XML based
941	   scripting language for describing IVR services at an abstract level.
942	   VoiceXML supports DTMF recognition, speech recognition, text-to-
943	   speech, and playing out of recorded media files. The results of the
944	   data collected from the user are passed to a controlling entity
945	   through an HTTP form POST operation. The controller can then return
946	   another script, or terminate the interaction with the IVR server.

948	   From a naming perspective, the primary issue is how a request URI is
949	   associated with a script to invoke when the call is answered. We see
950	   three primary mechanisms:

952	        1.   There is a one-to-one binding of the address in the request
953	             URI to a script to execute. These bindings are published by
954	             the provider of the IVR service.

956	        2.   The initial script to execute is actually carried as
957	             content in the body of the SIP INVITE request. The request
958	             URI indicates that the desired service is execution of
959	             content in the request (i.e., sip:executebody@servers.com).

961	        3.   The initial script to execute is fetched by the VoiceXML
962	             server; the URL to fetch it from is passed in the SIP
963	             INVITE message that initiates the IVR session. This can be
964	             accomplished either with the application/uri MIME type as a
965	             body, or using the new *-Info headers [12] which provide
966	             references to content to fetch.

968	   We believe that the third approach is probably the best one. SIP is
969	   not the ideal transfer mechanism. Passing a URI allows a far better
970	   transfer tool, namely HTTP, to be used to actually fetch the script
971	   back from the controller.

973	   HTTP is then also used to pass back form data from the IVR to the
974	   controller. The results of the HTTP POST can also contain additional
975	   VoiceXML scripts to execute. HTTP thus serves as the side channel
976	   discussed in Section 5.5.

978	   Note that in some cases, there needs to be interactions between the
979	   HTTP server that receives the HTTP POST requests, and the controller
980	   that initiates and terminates the SIP sessions with the IVR. This is
981	   the case when the data collected by the VoiceXML server is used to
982	   guide signaling behavior. For example, a pre-paid calling application
983	   might use the IVR to collect the user's PIN code. The PIN code is
984	   looked up, and the number of minutes remaining is determined. This
985	   amount of time must be known to the SIP controller, as it will need
986	   to hang up the call once this time expires. Some kind of session
987	   sharing mechanism is needed between the SIP controller and the HTTP
988	   server in this case.
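
   One simple form of such session sharing is a store keyed by a call
   identifier, written by the HTTP side and read by the SIP side. The
   sketch below is only an illustration: the class and function names,
   the account table, the "pin" form field, and the assumption that the
   HTTP server can learn the call identifier (for example, from the URL
   of the script it served) are all hypothetical.

```python
import threading

class SessionStore:
    """In-memory store shared by the HTTP side and the SIP side."""

    def __init__(self):
        self._lock = threading.Lock()
        self._minutes = {}

    def record_minutes(self, call_id, minutes):
        # Called by the HTTP server after the PIN lookup.
        with self._lock:
            self._minutes[call_id] = minutes

    def seconds_until_hangup(self, call_id):
        # Called by the SIP controller; None means nothing is known yet.
        with self._lock:
            minutes = self._minutes.get(call_id)
        return None if minutes is None else minutes * 60

# Hypothetical account table: PIN -> minutes remaining.
ACCOUNTS = {"4321": 12}

def handle_pin_post(store, call_id, form):
    """Process form data posted by the IVR; True if the PIN is valid."""
    minutes = ACCOUNTS.get(form.get("pin", ""))
    if minutes is None:
        return False
    store.record_minutes(call_id, minutes)
    return True
```

   The SIP controller would consult seconds_until_hangup() when it
   completes the call, and send a BYE once that time expires.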

990	   Figure 5 shows the interaction between an application server acting
991	   in a coordinating role, and an IVR server component. In this example,
992	   consider an application where the user makes a call, but the system
993	   needs additional information to determine where to forward it to. The
994	   user is prompted for the info, and once the name of the desired
995	   called party is obtained and looked up, the call is completed to the
996	   requested destination.

998	   First, in step (1), the caller sends an INVITE to the controller. The
999	   controller then creates a brand new call to the IVR application
1000	   server (2), using the SDP from the INVITE in (1). The IVR accepts the
1001	   call (3), and the SDP from that acceptance is returned in a 183
1002	   response to the caller (4). The call to the IVR is acked (5), and now
1003	   a media stream exists between the caller and the IVR server. The IVR
1004	   server, in step (6), fetches the initial VoiceXML script to execute,
1005	   which is returned by the controller (7). The prompts are played to
1006	   the caller, and the identity of the called party is collected. This
1007	   is passed to the controller through another POST (8), which returns
1008	   an empty VoiceXML script (9)[1]. Once the dialog is complete,
1009	   the controller hangs up with the IVR (10 and 11). The information
1010	   the controller got in the POST (8) is used to determine the next
1011	   hop SIP server, and the initial INVITE is proxied there (12).

1013	   It is important to observe that all call control related to executing
1014	   the service lives within the controlling application server. The IVR
1019	   application server deals strictly with the media component. This
1020	   division of work, as we have discussed above, allows for independent
1021	   evolution of the call control and media components of services. For
1022	   example, if the desired called party did not have a reachable SIP
1023	   address, but they did have an email address, the call could be
1024	   redirected to a mailto URL. To support this twist, only the
1025	   controlling application server code need change. The media component
1026	   remains completely and totally unchanged.

1015	_________________________
1016	  [1] Note that it is unusual for an empty script to be returned; this
1017	  is because we want the AS to maintain control of the call
1018	  signaling.

1028	   Readers familiar with VoiceXML will observe that VoiceXML almost
1029	   achieves this perfect separation. It lacks any call control except
1030	   for two tags, for call transfer and call termination. These tags are
1031	   clearly not sufficient for many services. Our architecture would
1032	   argue that instead of adding call control to VoiceXML, all control
1033	   should be removed, so that call control can be left to other server
1034	   components.

1036	   The separation of the control from the media component also allows
1037	   the media component to change without affecting the control
1038	   component. In fact, because of the HTTP interface between the two,
1039	   the media server can be completely removed and replaced with a normal
1040	   web browser, with only a small effect on the call control component.
1041	   As an example, if the calling party was coming from a web-enabled SIP
1042	   client (known by the presence of the Accept header with text/html as
1043	   a value in the INVITE request), the controller could return an HTTP
1044	   URL in the 183 with an actual web form that gets filled out by the
1045	   caller. This would be instead of using an IVR server to collect the
1046	   data. Interestingly, the representation of the collected data is
1047	   identical in both cases. Both use an HTTP POST operation to send the
1048	   data to the controller. This allows the data collection code in the
1049	   controller to be unified across both voice access and web access.

1067	     |      INVITE (1)         |                          |
1068	     |------------------------>|                          |
1069	     |                         |        INVITE (2)        |
1070	     |                         |------------------------->|
1071	     |                         |       200 OK (3)         |
1072	     |                         |<-------------------------|
1073	     |      183 (4)            |                          |
1074	     |<------------------------|                          |
1075	     |                         |       ACK (5)            |
1076	     |                         |------------------------->|
1077	     |           MEDIA         |                          |
1078	     |----------------------------------------------------|
1079	     |                         |                          |
1080	     |                         |     HTTP GET (6)         |
1081	     |                         |<-------------------------|
1082	     |                         |     HTTP 200 OK (7)      |
1083	     |                         |------------------------->|
1084	     |                         |                          |
1085	     |                         |                          |
1086	     |                         |                          |
1087	     |                         |                          |
1088	     |                         |     HTTP POST (8)        |
1089	     |                         |<-------------------------|
1090	     |                         |                          |
1091	     |                         |     HTTP 200 OK (9)      |
1092	     |                         |------------------------->|
1093	     |                         |                          |
1094	     |                         |      BYE (10)            |
1095	     |                         |------------------------->|
1096	     |                         |       200 OK (11)        |
1097	     |                         |<-------------------------|
1098	     |                         |                          |
1099	     |                         |       INVITE (12)        |
1100	     |                         |--------------------------------------->
1101	     |                         |                          |
1102	     |                         |                          |
1103	     |                         |                          |
1104	     |                         |                          |
1105	     |                         |                          |
1106	     |                         |                          |
1107	     |                         |                          |

1109	  Caller                    Controller               IVR Server

1111	   Figure 5: Interaction of App Server and IVR Component

1051	6.2 Conferencing Servers

1053	   Conferencing servers today vary in type and complexity. Some are
1054	   dialup only, supporting IVR access. Others support ad-hoc
1055	   conferencing with web interfaces. Others still support three way
1056	   calling as part of a PBX system.

1058	   We observe once more that all of these conferencing "servers" are
1059	   really conferencing applications that are just bundled as a server.
1060	   These conferencing applications can be decomposed into components in
1061	   exactly the way we have described above. At the core of each of these
1062	   conferencing applications is a mixing service. This service is
1063	   responsible for taking N audio or video streams, mixing them
1064	   according to some matrix, and returning the mixed stream to each
1065	   participant. Issues such as conference policy, provisioning of
1066	   conferences, and authentication are all completely separate and
1112	   outside of this basic mixing component.

1114	   For this reason, we argue that a large variety of conferencing
1115	   applications can be easily constructed by having the mixing service
1116	   as a separate application server component.

1118	   What does the interface to such a mixing server look like? For the
1119	   call control interface, users would join a conference by calling the
1120	   server. The server would answer the call, thus appearing as a SIP
1121	   UAS. The media sent from the user is mixed with other users in the
1122	   conference, and the media sent back to the user is the mixed stream.
1123	   The user can leave the conference by sending a BYE to the server, and
1124	   the server can kick a user out of the conference by sending the user
1125	   a BYE.

1127	   Since the primary resource being accessed is a conference, it is no
1128	   surprise that we would argue that the request URI of an incoming call
1129	   defines the conference a user is mixed in to. In other words, all
1130	   users that call the server with the same request URI are all mixed
1131	   together. The conferences are not defined by Call-ID or other SIP
1132	   header fields. Using the request URI has tremendous advantages from a
1133	   routing and naming perspective, as we have discussed more generally
1134	   above.

1136	   It is not necessary (in fact, not even advisable) for the
1137	   conferencing server to require that the URIs that define the
1138	   conference be set up ahead of time. Conference lifecycles in the
1139	   mixing server are very simple. Conference state is created when the
1140	   first call arrives for a particular URI, and ends when the last user
1141	   with a call to that URI hangs up. This model allows the same mixing
1142	   server to support both ad-hoc conferences, and pre-arranged
1143	   conferences too. Pre-arranged conferences are handled through policy
1144	   and control in a coordinating server external to the mixing server.
1145	   This server lives entirely in the call control and signaling plane,
1146	   not in the media plane.
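
   The conference lifecycle just described can be sketched in a few
   lines. This is an illustration under stated assumptions, not a
   definition of the mixing server: the class Mixer, its method names,
   and the example URI are hypothetical. The call identifier is used
   only to tell call legs apart; the conference itself is selected
   solely by the request URI.

```python
class Mixer:
    """Hypothetical mixer state: conferences keyed by request URI."""

    def __init__(self):
        self._conferences = {}  # request URI -> set of call leg identifiers

    def on_invite(self, request_uri, call_id):
        # Conference state is created when the first call for a URI arrives.
        created = request_uri not in self._conferences
        self._conferences.setdefault(request_uri, set()).add(call_id)
        return created

    def on_bye(self, request_uri, call_id):
        # Conference state ends when the last call to that URI hangs up.
        members = self._conferences.get(request_uri)
        if members is None:
            return False
        members.discard(call_id)
        if not members:
            del self._conferences[request_uri]
            return True
        return False

    def active_conferences(self):
        return sorted(self._conferences)
```

   Nothing in this sketch requires the URI to be provisioned ahead of
   time, which is what lets one mixer serve both ad-hoc and pre-arranged
   conferences.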

1148	   SIP (and RTP, of course) alone is not sufficient for complete usage
1149	   of a conferencing server. Media mixing policies (effectively, the
1150	   matrix indicating which users hear which other users, and with what
1151	   relative volumes) need to be set. Information on the status of the
1152	   conference, such as the identity of the current speaker, number of
1153	   users currently being mixed, etc., may need to be reported back to
1154	   some control entity. These represent the requirements for the side
1155	   channel. In IVR servers, the side channel used HTTP. We argue that to
1156	   unify these concepts, HTTP is ideally suited here as well. Updates to
1157	   the mixing policy can be made through HTTP POST requests against the
1158	   mixing server, using well defined interfaces (possibly SOAP).
1159	   Similarly, information about the status of the conference can be
1160	   obtained through HTTP GET operations against the mixing server. The
1161	   side channel here meets the requirements outlined in Section 5.5; it
1162	   is not real time in nature, does not require transactional support,
1163	   and passes relatively infrequent data and control. In fact, such a
1164	   side channel will often not be needed at all. In 90% of cases the
1165	   default mixing policy (the N-1 matrix, where each user hears everyone
1166	   but themselves, all at equal volume, with no floor control) suffices.
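
   As an illustration of the default N-1 policy, the following sketch
   mixes one frame of audio. It assumes that each participant
   contributes a list of integer samples of equal length and that all
   participants are mixed at equal volume. The function name is
   hypothetical, and clipping and codec handling are deliberately left
   out.

```python
def n_minus_one_mix(frames):
    """Return, per participant, the sum of everyone else's samples.

    frames maps a participant name to a list of integer samples.
    All lists are assumed to have the same length.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum all inputs once, then subtract each participant's own input.
    total = [0] * length
    for samples in frames.values():
        for i, sample in enumerate(samples):
            total[i] += sample
    return {
        who: [total[i] - samples[i] for i in range(length)]
        for who, samples in frames.items()
    }
```

   Summing once and subtracting each participant's own contribution
   avoids computing N separate sums.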

1168	   Fans of the INFO method [13] will argue that instead of using HTTP
1169	   for the control, why not INFO? This would eliminate the need for an
1170	   additional protocol, after all. The answer is the same as to why SIP
1171	   should not simply replace HTTP - the two have different strengths and
1172	   weaknesses. SIP is a poor data transfer protocol. It has insufficient
1173	   support for transfer of medium to large data sets, which is important
1174	   here. Furthermore, we may want to allow an entity separate from the
1175	   one that initiated the session to control the session. Usage of INFO
1176	   would only work from the same device (because of the sequence
1177	   numbering).

1179	   In the next few sections, we show how this basic application server
1180	   component can be used, along with a controller and other components,
1181	   to build more complex conferencing applications.

1183	6.2.1 Web Scheduled Conference Services

1185	   In this application, we'd like a conferencing service where all
1186	   conferences must be pre-scheduled. The pre-scheduling is done through
1187	   a web page. At the page, the user will enter the start time (but not
1188	   mandatory stop time) of the conference, the maximum number of
1189	   attendees, and the identities of the attendees (if known). Once
1190	   entered in a form, the server returns a SIP URL representing the
1191	   conference.

1193	   To implement this, we use a coordinating application server that has
1194	   a SIP and HTTP interface, along with the mixing application server
1195	   just described.

1197	   Figure 6 shows a call flow for this service. A web client is first
1198	   used to submit the information. Let us suppose a simple case where
1199	   the conference can have up to two participants, and the conference
1200	   starts immediately. The HTTP POST representing the form data is sent
1201	   to the controller (1). It stores the information for the conference
1202	   in a local data store, and chooses a SIP URL for the conference. This
1203	   URL can be anything, so long as it is different from any URLs handed
1204	   out so far by the controller. The URL is returned to the web client
1205	   in step (2). As an additional convenience feature, the URL could be
1206	   emailed to the participants. This would require the controller to
1207	   have an SMTP interface, in addition to HTTP and SIP. Note that this
1208	   SIP URL points to the controller, NOT the mixing server.

1210	   A few moments later, the first participant calls in using a SIP
1211	   INVITE (3). The call is routed to the controller. It checks the
1212	   conference ID. It finds that the policy permits up to two
1213	   participants (not a practical example, but simplifies the call flow).
1214	   It stores data indicating that one participant has now joined, and
1215	   then proxies the INVITE request in step (4) to the mixer. The request
1216	   URI in this request will have the same user part as (3), but the host
1217	   part now represents the mixer. The mixer receives the INVITE, creates
1218	   the initial conference state (as this is the first call for that
1219	   URL), and returns a 200 OK (5), which is forwarded to the caller
1220	   (6), and then ACKed (7 and 8).

1222	   In step (9), the second caller calls in. The controller sees that
1223	   only one participant is on the call so far, so the second call is
1224	   accepted. The controller stores the fact that there are now 2
1225	   participants, and proxies the INVITE (10). The INVITE is accepted by
1226	   the mixer (11), and the response forwarded to the second caller (12),
1227	   and then ACKed (13 and 14). The two participants A and B can now hear
1228	   each other.

1230	   A third caller then calls in (15). The controller checks its records,
1231	   and notices that this conference is now full. So, it rejects the
1232	   INVITE (16), which is acknowledged (17).
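
   The controller's admission decision in this call flow can be
   sketched as follows. This is a simplified illustration: the class
   ConferenceController, its methods, and the host names are
   hypothetical, and the sketch does not track participants leaving.
   It keeps the user part of the request URI and swaps in the mixer's
   host part, as described for step (4).

```python
class ConferenceController:
    """Hypothetical controller that proxies admitted calls to a mixer."""

    def __init__(self, mixer_host):
        self._mixer_host = mixer_host
        self._limits = {}  # conference user part -> maximum participants
        self._joined = {}  # conference user part -> participants so far

    def schedule(self, conference_id, max_participants):
        # Corresponds to storing the web form data from the HTTP POST.
        self._limits[conference_id] = max_participants
        self._joined[conference_id] = 0

    def on_invite(self, request_uri):
        # Extract the user part, which identifies the conference.
        user = request_uri[len("sip:"):].split("@", 1)[0]
        if user not in self._limits:
            return ("reject", None)
        if self._joined[user] >= self._limits[user]:
            return ("reject", None)  # conference is full
        self._joined[user] += 1
        return ("proxy", "sip:%s@%s" % (user, self._mixer_host))
```

   With a limit of two, the first two INVITEs are proxied to the mixer
   and the third is rejected, matching steps (3) through (17).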

1234	   The astute reader will observe that, strictly speaking, the HTTP
1235	   server does not really need to be co-resident with the SIP server in
1236	   the controller. The initial conference setup can be stored in a
1237	   database by a web server, and the controller can simply read this
1238	   database. However, in more complex cases, we may wish to have web
1239	   access to learn dynamic information about the conference as it
1240	   progresses (for example, which users are in the conference). For this
1241	   kind of dynamic session state, using a shared database between
1242	   components is cumbersome. Rather, an integrated HTTP/SIP server is
1243	   much better suited, where integrated implies only that it has built
1244	   in mechanisms for session state sharing between the SIP and HTTP
1245	   components.

1247	   For this simple conferencing service, it was sufficient for the
1248	   controller to act as a proxy. That is because it does not need to
1249	   forcibly kick anyone out of the conference once they are in. To
1250	   support that kind of functionality, third party call control is
1251	   needed. Let us examine a more complex service in the next section.

1254	   |   |   |   | (1) HTTP POST  |                      |
1255	   |--------------------------->|                      |
1256	   |   |   |   | (2) 200 OK     |                      |
1257	   |<---------------------------|                      |
1258	   |   |   |   |                |                      |
1259	   |   |   |   | (3) INVITE     |                      |
1260	   |   |----------------------->|  (4) INVITE          |
1261	   |   |   |   |                |--------------------->|
1262	   |   |   |   |                |   (5) 200 OK         |
1263	   |   |   |   |  (6) 200 OK    |<---------------------|
1264	   |   |<-----------------------|                      |
1265	   |   |   |   | (7) ACK        |                      |
1266	   |   |----------------------->|   (8) ACK            |
1267	   |   |   |   |                |--------------------->|
1268	   |   |   |   |                |                      |
1269	   |   |   |   | (9) INVITE     |                      |
1270	   |   |   |------------------->|   (10) INVITE        |
1271	   |   |   |   |                |--------------------->|
1272	   |   |   |   |                |   (11) 200 OK        |
1273	   |   |   |   | (12) 200 OK    |<---------------------|
1274	   |   |   |<-------------------|                      |
1275	   |   |   |   |  (13) ACK      |                      |
1276	   |   |   |------------------->| (14) ACK             |
1277	   |   |   |   |                |--------------------->|
1278	   |   |   |   |                |                      |
1279	   |   |   |   | (15) INVITE    |                      |
1280	   |   |   |   |--------------->|                      |
1281	   |   |   |   |(16) 500 Full   |                      |
1282	   |   |   |   |<---------------|                      |
1283	   |   |   |   |(17) ACK        |                      |
1284	   |   |   |   |--------------->|                      |
1285	   |   |   |   |                |                      |
1286	   |   |   |   |                |                      |
1287	   |   |   |   |                |                      |

1289	  Web  A   B   C              Controller            Mixer

1291	   Figure 6: Web Scheduled Conference Services

1253	6.2.2 Web Scheduled, IVR supported, Time Limited Conference

1292	   In this more complex example, we once again wish to use a web
1293	   interface to set up the conferences. However, we wish to add a stop
1294	   time. If there are participants in the conference when the stop time
1295	   arrives, a warning announcement is played 10 minutes prior, and then
1296	   they are kicked off. In addition, when a user joins the conference,
1297	   before they are added, they hear an announcement that states the name
1298	   of the person that set up the conference, and what the start and stop
1299	   times are. They are then asked to speak their name. Then, they are
1300	   dropped in. The conference server then speaks their name, so that
1301	   everyone knows who just joined.

1303	   This seemingly complex service is very easily constructed by adding
1304	   an IVR server as described above. Now, we have a controller, a mixing
1305	   server, and an IVR server, all working together to build the service.
1306	   Each provides a specific component towards the overall solution, yet
1307	   each is an application server in its own right, with both signaling
1308	   and media interfaces.

1310	   We assume that the web setup is done as above. This time, the stop
1311	   time is provided, along with the name of the person setting up the
1312	   conference.

1314	   The call flow for the initial participant is shown in Figure 7.

1316	   The initial participant sends an INVITE, which is forwarded to the
1317	   controller. The controller matches the request URI against the
1318	   conference that the user wishes to join. The controller recognizes
1319	   that it needs to play an announcement. So, in step (2), it initiates
1320	   a call to an IVR server. This call is accepted in step (3), and the
1321	   resulting SDP is passed back to the UAC in step (4) in a provisional
1322	   response. After ACKing the call with the IVR in step (5), the
1323	   controller receives an HTTP GET to fetch the root VoiceXML script in
1324	   step (6). The controller dynamically generates the VoiceXML script,
1325	   whose content will cause the server to read out "Welcome to the
1326	   conference, Bob. The call will start at 10 am, and end at 11 am." The
1327	   name of the caller, Bob, is extracted from the INVITE (1).

1329	   Once the prompt has been played, the IVR server prompts the caller
1330	   for their name, and the result is recorded into a file. Then, the
1331	   VoiceXML server attempts to fetch the next VoiceXML script from the
1332	   controller (8). Before responding, the controller reconnects the
1333	   media stream from the media server into the conference bridge. To do
1334	   this, it first sends an INVITE to the conferencing server, using SDP
1335	   indicating send only (9). The server accepts (10), and the controller
1336	   ACKs (11). The SDP from the acceptance (10) is passed in a re-INVITE
1337	   (12) to the IVR server. The IVR server then accepts (13) and the
1338	   controller ACKs (14). Now, a unidirectional media stream from the IVR
1339	   server into the conference bridge is set up. The controller returns
1340	   the next VoiceXML script (15), which tells the IVR server to play the
1341	   previously recorded file into the conference, announcing the joining
1342	   user. Once this is done, the IVR server fetches the next script (16),
1343	   and gets back an empty response (17). The controller then disconnects
1344	   from the IVR server (18,19). Finally, the controller re-INVITEs the
1345	   conference server (20), updating the SDP to be that from the initial
1346	   INVITE (1).  The SDP from the acceptance (21) is passed on to the
1347	   caller (22). Now, the caller is connected to the mixer as the first
1348	   user in the conference.

1350	   The second user would join in much the same way.

1352	   Approximately 10 minutes before the end of the conference, a timer
1353	   fires inside of the controller. It is time to play a warning
1354	   announcement into the conference. The call flow for this is shown in
1355	   Figure 8.

1357	   The basic idea is to initiate a call to the IVR server and mixer,
1358	   connect them using third party call control, and then have the IVR
1359	   server play the announcement into the conference. The controller then
1360	   hangs up.

1362	   In step (1), the controller sends an INVITE to the mixer with a
1363	   single audio stream on hold (i.e., "empty"). The request URI of the
1364	   request is that of the conference. The mixer returns a 200 OK in step
1365	   (2), and an ACK is sent in (3). The SDP from (2) is then used in step
1366	   (4) to call the IVR server, which answers with its SDP in step (5).
1367	   This is used in a re-INVITE (7,8,9) to the mixer to update the IP
1368	   address and port as that of the IVR server. The IVR server then
1369	   fetches the root VoiceXML document from the controller (11). This
1370	   document instructs the server to read out some kind of conference
1371	   warning - "Warning, your conference will end in 10 minutes". Once
1372	   this is done, the IVR server fetches the next document (13), which is
1373	   empty. The controller then hangs up with both the mixer (17) and the
1374	   IVR server (15), disconnecting the IVR server from the conference.

1376	   These examples demonstrate the component model we are proposing. The
1377	   mixing component does not have application level intelligence. It has
1378	   a call control interface, allowing it to exist anywhere (and be
1379	   provided by any ASP service) and yet be a callable resource by other
1380	   application server components. By combining a controller with an IVR
1381	   server and the mixing server, complex and useful applications can be
1382	   constructed in a distributed fashion.

1385	  Caller          Controller         IVR Server          Mixing Server
1386	    |               |                  |                   |
1387	    | (1) INVITE    |                  |                   |
1388	    |-------------->| (2) INVITE       |                   |
1389	    |               |----------------->|                   |
1390	    |               | (3) 200 OK       |                   |
1391	    | (4) 183       |<-----------------|                   |
1392	    |<--------------|                  |                   |
1393	    |               | (5) ACK          |                   |
1394	    |               |----------------->|                   |
1395	    |               | (6) HTTP GET     |                   |
1396	    |               |<.................|                   |
1397	    |               | (7) 200 OK       |                   |
1398	    |               |.................>|                   |
1399	    |               |                  |                   |
1400	    |               | (8) HTTP GET     |                   |
1401	    |               |<.................|                   |
1402	    |               | (9) INVITE       |                   |
1403	    |               |------------------------------------->|
1404	    |               | (10) 200 OK      |                   |
1405	    |               |<-------------------------------------|
1406	    |               | (11) ACK         |                   |
1407	    |               |------------------------------------->|
1408	    |               | (12) INVITE      |                   |
1409	    |               |----------------->|                   |
1410	    |               | (13) 200 OK      |                   |
1411	    |               |<-----------------|                   |
1412	    |               | (14) ACK         |                   |
1413	    |               |----------------->|                   |
1414	    |               |                  |                   |
1415	    |               | (15) 200 OK      |                   |
1416	    |               |.................>|                   |
1417	    |               | (16) HTTP GET    |                   |
1418	    |               |<.................|                   |
1419	    |               | (17) 200 OK      |                   |
1420	    |               |.................>|                   |
1421	    |               | (18) BYE         |                   |
1422	    |               |----------------->|                   |
1423	    |               | (19) 200 OK      |                   |
1424	    |               |<-----------------|                   |
1425	    |               | (20) INVITE      |                   |
1426	    |               |------------------------------------->|
1427	    |               | (21) 200 OK      |                   |
1428	    | (22) 200 OK   |<-------------------------------------|
1429	    |<--------------|                  |                   |
1430	    | (23) ACK      |                  |                   |
1431	    |-------------->| (24) ACK         |                   |
1432	    |               |------------------------------------->|
1433	    |               |                  |                   |
1434	    |               |                  |                   |
1435	    |               |                  |                   |

1437	  Caller          Controller         IVR Server          Mixing Server

	   Figure 7: Call Flow for the Initial Participant

1439	      | (1) INVITE empty SDP  |                          |
1440	      |---------------------->|                          |
1441	      | (2) 200 OK SDP A      |                          |
1442	      |<----------------------|                          |
1443	      | (3) ACK               |                          |
1444	      |---------------------->|                          |
1445	      |                       |   (4) INV SDP A          |
1446	      |------------------------------------------------->|
1447	      | (5) 200 OK SDP B      |                          |
1448	      |<-------------------------------------------------|
1449	      |                       |   (6) ACK                |
1450	      |------------------------------------------------->|
1451	      | (7) INV SDP B         |                          |
1452	      |---------------------->|                          |
1453	      | (8) 200 OK SDP A      |                          |
1454	      |<----------------------|                          |
1455	      | (9) ACK               |                          |
1456	      |---------------------->|                          |
1457	      |                       |  (11) HTTP GET           |
1458	      |<-------------------------------------------------|
1459	      |                       |  (12) 200 OK             |
1460	      |------------------------------------------------->|
1461	      |                       |                          |
1462	      |                       |                          |
1463	      |                       |  (13) HTTP GET           |
1464	      |<-------------------------------------------------|
1465	      |                       |  (14) 200 OK             |
1466	      |------------------------------------------------->|
1467	      |                       |                          |
1468	      | (15) BYE              |                          |
1469	      |------------------------------------------------->|
1470	      |                       |  (16) 200 OK             |
1471	      |<-------------------------------------------------|
1472	      | (17) BYE              |                          |
1473	      |---------------------->|                          |
1474	      | (18) 200 OK           |                          |
1475	      |<----------------------|                          |
1476	      |                       |                          |
1477	      |                       |                          |

1479	   Controller               Mixer                      IVR Server

1481	   Figure 8: Advanced Web Scheduled Conference Service: Warning
1482	   Announcement

1384	6.3 Continuous Text-to-Speech

1484	   Another example of an application server component is a continuous
1485	   Text-to-Speech (TTS) converter. This kind of service allows a real
1486	   time text stream (encapsulated in RTP using the RTP payload format
1487	   for text [14]) to be received, which is then converted to speech and
1488	   returned as an audio stream encoded using a traditional speech codec,
1489	   be it G.723.1, G.711, or what have you.

1491	   Like the IVR server and mixing server, the TTS server acts as a user
1492	   agent server. It answers incoming calls, and basically mirrors
1493	   incoming text back as speech. It continues to do so until the call
1494	   is hung up by the initiating client.

1496	   A TTS service can be done using VoiceXML with an IVR server, as in
1497	   the examples above. However, the difference is that here, the text
1498	   stream to be converted is in the data path, not the control path. The
1499	   stream is likely to be generated by other entities in the system, not
1500	   the controller.

1502	6.3.1 Service Interface

1504	   It is likely that the text-to-speech conversion process differs
1505	   significantly depending on the language. As such, separate URIs
1506	   SHOULD be used for language specific TTS services. Specifically, the
1507	   convention sip:<server-specific-name>-<language-tag>@<domain> is
1508	   RECOMMENDED. The language tags SHOULD be selected from the set
1509	   defined in RFC1766 [15].
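
   The naming convention above can be illustrated with a small helper.
   This is a sketch only: the function names are hypothetical, and the
   parser assumes that the server-specific name contains no hyphen, so
   that the first hyphen separates it from the language tag. The
   convention itself does not state how to resolve that ambiguity.

```python
def tts_uri(server_name, language_tag, domain):
    """Build sip:<server-specific-name>-<language-tag>@<domain>."""
    return "sip:%s-%s@%s" % (server_name, language_tag, domain)


def split_tts_user(uri):
    """Split the user part into (server name, language tag).

    Assumes the server-specific name contains no hyphen.
    """
    user = uri[len("sip:"):].split("@", 1)[0]
    name, _, tag = user.partition("-")
    return name, tag
```

   For example, tts_uri("tts", "en-US", "example.com") yields
   sip:tts-en-US@example.com.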

1511	   One of the unfortunate limitations of SDP is that it is not currently
1512	   possible for a single media stream to be composed of separate media
1513	   formats in each direction. The text over RTP stream is, in fact,
1514	   based on the top level text MIME type (text/t140). As a result, two
1515	   media streams are needed for this service - a unidirectional audio
1516	   stream and a unidirectional text stream.

1518	   First, the client INVITEs the server. The SDP MUST indicate two
1519	   media streams. One stream MUST be of type audio. It SHOULD contain
1520	   the set of audio codecs acceptable to the client. The stream MUST be
1521	   marked as recv-only. The other stream MUST be of type text. It MUST
1522	   contain a single codec, which is a dynamic payload number bound to
1523	   text/t140. The stream MUST be marked as send-only. The 200 OK
1524	   response from the TTS server that accepts the call has SDP with two
1525	   media lines, one of type audio, and one of type text, in the same
1526	   order the streams appeared in the INVITE, as mandated by RFC2543. The
1527	   audio stream SHOULD contain a subset of the codecs listed in the
1528	   audio stream in the INVITE. The audio stream MUST be marked as send-
1529	   only. The text stream MUST contain a single codec, which is a dynamic
1530	   payload type number bound to text/t140. The stream MUST be marked as
1531	   receive-only.

1533	   The client then ACKs the request. The TTS server SHOULD attempt to
1534	   convert all text received on the incoming text stream to speech, and
1535	   return the resulting speech on the outgoing audio stream.
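
   The offer and answer rules above can be sketched in code. The
   following is a minimal illustration, not a normative encoding: the
   addresses, ports, origin and session fields, the static audio
   payload types 0 (PCMU) and 4 (G723), and the dynamic payload type 98
   are all assumptions chosen for the example. The direction markings
   are written with the SDP attributes "a=recvonly" and "a=sendonly",
   and the text codec is bound with "t140/1000" as in RFC 2793. The
   checker treats the SHOULD on the audio codec subset as a hard rule,
   which is stricter than the text requires.

```python
def build_tts_offer(client_ip, audio_port, text_port,
                    audio_payloads=(0, 4), text_payload=98):
    """Return the SDP a client might send in its INVITE to a TTS server."""
    lines = [
        "v=0",
        f"o=client 2890844526 2890844526 IN IP4 {client_ip}",
        "s=TTS session",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # Audio flows from the server to the client only.
        f"m=audio {audio_port} RTP/AVP "
        + " ".join(str(p) for p in audio_payloads),
        "a=recvonly",
        # Text flows from the client to the server only.
        f"m=text {text_port} RTP/AVP {text_payload}",
        f"a=rtpmap:{text_payload} t140/1000",
        "a=sendonly",
    ]
    return "\r\n".join(lines) + "\r\n"


def media_sections(sdp):
    """Split an SDP body into per-media dicts (type, formats, attributes)."""
    sections = []
    for line in sdp.splitlines():
        if line.startswith("m="):
            media, _port, _proto, *formats = line[2:].split()
            sections.append({"media": media, "formats": formats, "attrs": []})
        elif line.startswith("a=") and sections:
            sections[-1]["attrs"].append(line[2:])
    return sections


def answer_is_valid(offer, answer):
    """Check a TTS server answer against the rules in this section."""
    off, ans = media_sections(offer), media_sections(answer)
    # Two streams, audio and text, in the same order as in the offer.
    if [s["media"] for s in off] != ["audio", "text"]:
        return False
    if [s["media"] for s in ans] != ["audio", "text"]:
        return False
    audio, text = ans
    # Audio: subset of the offered codecs, marked send-only.
    if not set(audio["formats"]) <= set(off[0]["formats"]):
        return False
    if "sendonly" not in audio["attrs"]:
        return False
    # Text: exactly one payload type, bound to t140, marked receive-only.
    if len(text["formats"]) != 1:
        return False
    prefix = f"rtpmap:{text['formats'][0]} t140/"
    if not any(a.lower().startswith(prefix) for a in text["attrs"]):
        return False
    return "recvonly" in text["attrs"]
```

   A server-side implementation would run the mirror image of this
   check on the incoming offer before constructing its answer.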

1537	6.3.2 Hearing Impaired Service

1539	   The TTS server is extremely useful in supporting hearing impaired
1540	   services. An example is a service where a controller accesses a
1541	   TTS service.

1543	6.4 Messaging Servers

1545	   Another type of application server component is a messaging server.
1546	   Messaging servers allow for callers to record audio messages for
1547	   users on the system. Users can also call into the server to retrieve
1548	   these messages, delete them, and file them. The system operates
1549	   through the use of voice prompts combined with DTMF detection and/or
1550	   speech recognition. The prompts that are played are context
1551	   dependent. A messaging server can be viewed as a specialized version
1552	   of an IVR server with an application specific controller associated
1553	   with it. In fact, a messaging server can be implemented in this way
1554	   exactly. However, the combination is also usefully viewed as a
1555	   component in its own right, due to the frequent need for messaging
1556	   components in more complex applications.

1558	6.4.1 Service Interface

1560	   The service interface for communicating with a messaging server is
1561	   described in detail in [7]. The interface provides well known URIs
1562	   for the most common resources within a messaging server - user
1563	   specific message drops with a variety of drop conditions (called
1564	   party busy, called party not there, etc.), message retrievals using a
1565	   variety of authentication mechanisms (PIN, SIP level authentication),
1566	   and message drops that are not user specific, so that the target user
1567	   is queried for as part of the interface.

1569	6.4.2 Web Enabled Message Drops

1571	   An example usage of this application component is a web front end
1572	   that allows users to leave voicemail for company employees through
1573	   the company web page. The page has a URL for each company employee.
1574	   If some user A clicks on a URL for employee B, A's phone rings. When
1575	   A picks up, they hear a greeting to record a message for employee B.

1577	   The call flow for this application combines third party call
1578	   control with access to the messaging service. It is shown in
1579	   Figure 9.

1581	   The caller, from a web page, clicks on the URL for the user they wish
1582	   to leave a message for. The result is an HTTP request (1) to the
1583	   controller. The URI in this request would be some controller-specific
1584	   identifier that tells the controller what it needs to do. The
1585	   controller then calls the user (3) using an SDP with a single media
1586	   stream on hold initially. This is accepted (4), and the resulting SDP
1587	   is used in an INVITE to the messaging server (6). The URI of this
1588	   INVITE is that for message drop with standard greeting
1589	   (sip:sub-jdrosen-deposit@voiceserver.com). The call is accepted (7)
1590	   and the 200 OK is used in a re-INVITE to the caller (9) to set the
1591	   address of the media stream to that of the voicemail server. After
1592	   the call is accepted (10) and ACKed (11), the caller hears the voice
1593	   drop prompt for the messaging server, and can record their message.
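
   The message sequence just described can be written down as data,
   which makes it easy to see which SDP travels in which message. This
   is a sketch only: the class and function names are hypothetical, and
   the "sub-<user>-deposit" URI pattern is inferred from the single
   example URI above rather than taken from the interface defined in
   [7].

```python
from typing import List, NamedTuple


class Message(NamedTuple):
    step: int
    sender: str
    receiver: str
    summary: str


def deposit_uri(user: str, domain: str) -> str:
    """Message-drop URI, following the shape of the example in the text."""
    return f"sip:sub-{user}-deposit@{domain}"


def message_drop_flow(user: str, domain: str) -> List[Message]:
    """Return the eleven messages of the web enabled message drop."""
    web, caller = "Web", "SIP Caller"
    ctrl, msg = "Controller", "Messaging Server"
    target = deposit_uri(user, domain)
    return [
        Message(1, web, ctrl, "HTTP GET for the employee's URL"),
        Message(2, ctrl, web, "200 OK"),
        # The controller rings the caller with a held media stream.
        Message(3, ctrl, caller, "INVITE, SDP with one held media stream"),
        Message(4, caller, ctrl, "200 OK, caller's SDP"),
        Message(5, ctrl, caller, "ACK"),
        # The caller's SDP is handed to the messaging server.
        Message(6, ctrl, msg, f"INVITE {target}, caller's SDP"),
        Message(7, msg, ctrl, "200 OK, messaging server's SDP"),
        Message(8, ctrl, msg, "ACK"),
        # The messaging server's SDP is handed back to the caller.
        Message(9, ctrl, caller, "re-INVITE, messaging server's SDP"),
        Message(10, caller, ctrl, "200 OK"),
        Message(11, ctrl, caller, "ACK"),
    ]


if __name__ == "__main__":
    for m in message_drop_flow("jdrosen", "voiceserver.com"):
        print(f"({m.step}) {m.sender} -> {m.receiver}: {m.summary}")
```

   Note that every SIP message has the controller at one end; the
   caller and the messaging server never signal each other directly,
   and only the media flows between them.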

1595	7 Security Considerations

1597	   In many cases, a caller may need to be authorized before being
1598	   given access to a session level resource. Traditional SIP level
1599	   authentication mechanisms can be used to accomplish this. Note,
1600	   however, that in many cases the caller is the controller, which is
1601	   acting as a third party call controller. In these cases, a two level
1602	   trust model is really needed. The trust relationship in such
1603	   situations is really between the session level resource and the
1604	   controller (perhaps through an explicit business arrangement), and
1605	   then between the controller and the caller. Thus, controllers should
1606	   authenticate themselves to session resources they contact, rather
1607	   than trying to proxy credentials from the caller.

1609	8 Conclusion

1611	   In this paper, we have argued that rapid deployment of complex
1612	   communications applications will require a distributed model where
1613	   application components are spread across the network. These
1614	   components could be offered by separate providers, for example,
1615	   enabling an ASP component model to evolve. We have observed that many
1616	   of the components can be described as having some kind of session
1617	   level resource that can be communicated with, usually in an automated
1618	   fashion. Access to these resources is typically parameterized. As a
1619	   result, SIP access, using the request URI as a service indicator, is
1620	   an ideal way to communicate across these components.

1622	   To validate this model, we examined the specific service interfaces
1623	   that would be defined by IVR servers, conferencing servers, text-to-
1655	   speech servers and messaging servers. We gave call flows of complex
1656	   applications built up from these components using the specified
1657	   interfaces.

1624	     |      |              |                      |
1625	     |      | (1) HTTP GET |                      |
1626	     |-------------------->|                      |
1627	     |      | (2) 200 OK   |                      |
1628	     |<--------------------|                      |
1629	     |      | (3) INV      |                      |
1630	     |      |<-------------|                      |
1631	     |      | (4) 200 OK   |                      |
1632	     |      |------------->|                      |
1633	     |      | (5) ACK      |                      |
1634	     |      |<-------------|                      |
1635	     |      |              |  (6) INV             |
1636	     |      |              |--------------------->|
1637	     |      |              |  (7) 200 OK          |
1638	     |      |              |<---------------------|
1639	     |      |              |  (8) ACK             |
1640	     |      |              |--------------------->|
1641	     |      | (9) INV      |                      |
1642	     |      |<-------------|                      |
1643	     |      | (10) 200 OK  |                      |
1644	     |      |------------->|                      |
1645	     |      | (11) ACK     |                      |
1646	     |      |<-------------|                      |
1647	     |      |              |                      |
1648	     |      |              |                      |
1649	     |      |              |                      |

1651	    Web    SIP           Controller             Messaging
1652	      Caller                                     Server

1654	   Figure 9: Web Enabled Message Drops

1659	9 Authors' Addresses

1661	   Jonathan Rosenberg
1662	   dynamicsoft
1663	   72 Eagle Rock Avenue
1664	   First Floor
1665	   East Hanover, NJ 07936
1666	   email: jdrosen@dynamicsoft.com

1668	   Peter Mataga
1669	   dynamicsoft
1670	   72 Eagle Rock Avenue
1671	   First Floor
1672	   East Hanover, NJ 07936
1673	   email: jdrosen@dynamicsoft.com

1675	   Henning Schulzrinne
1676	   Columbia University
1677	   M/S 0401
1678	   1214 Amsterdam Ave.
1679	   New York, NY 10027-7003
1680	   email: schulzrinne@cs.columbia.edu

1682	10 Bibliography

1684	   [1] N. Greene, M. Ramalho, and B. Rosen, "Media gateway control
1685	   protocol architecture and requirements," Request for Comments 2805,
1686	   Internet Engineering Task Force, Apr. 2000.

1688	   [2] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett,
1689	   "Media gateway control protocol (MGCP) version 1.0," Request for
1690	   Comments 2705, Internet Engineering Task Force, Oct. 1999.

1692	   [3] F. Cuervo, N. Greene, C. Huitema, A. Rayhan, B. Rosen, and J.
1693	   Segers, "Megaco protocol 0.8," Request for Comments 2885, Internet
1694	   Engineering Task Force, Aug. 2000.

1696	   [4] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:
1698	   session initiation protocol," Request for Comments 2543, Internet
1699	   Engineering Task Force, Mar. 1999.

1701	   [5] J. Rosenberg, H. Schulzrinne, and J. Peterson, "Third party call
1702	   control in SIP," Internet Draft, Internet Engineering Task Force,
1703	   Mar. 2000.  Work in progress.

1705	   [6] M. Handley and V. Jacobson, "SDP: session description protocol,"
1706	   Request for Comments 2327, Internet Engineering Task Force, Apr.
1707	   1998.

1709	   [7] B. Campbell and R. Sparks, "Control of service context using SIP
1710	   Request-URI," Internet Draft, Internet Engineering Task Force, Oct.
1711	   2000.  Work in progress.

1713	   [8] H. Schulzrinne and S. Petrack, "RTP payload for DTMF digits,
1714	   telephony tones and telephony signals," Request for Comments 2833,
1715	   Internet Engineering Task Force, May 2000.

1717	   [9] V. Bharatia, E. Cave, and B. Culpepper, "SIP INFO method for
1718	   event reporting," Internet Draft, Internet Engineering Task Force,
1719	   Apr. 2000.  Work in progress.

1721	   [10] T. Choudhuri, C. Haun, P. Sollee, S. Orton, and S. Whynot, "SIP
1722	   INFO method for DTMF digit transport and collection," Internet Draft,
1723	   Internet Engineering Task Force, Apr. 2000.  Work in progress.

1725	   [11] VoiceXML Forum, "Voice extensible markup language (voicexml)
1726	   version 1.00," voicexml forum specification, VoiceXML Forum, Mar.
1727	   2000.

1729	   [12] M. Handley, H. Schulzrinne, E. Schooler, and J. Rosenberg, "SIP:
1730	   Session initiation protocol," Internet Draft, Internet Engineering
1731	   Task Force, Aug. 2000.  Work in progress.

1733	   [13] S. Donovan, "The SIP INFO method," Request for Comments 2976,
1734	   Internet Engineering Task Force, Oct. 2000.

1736	   [14] G. Hellstrom, "RTP payload for text conversation," Request for
1737	   Comments 2793, Internet Engineering Task Force, May 2000.

1739	   [15] H. Alvestrand, "Tags for the identification of languages,"
1740	   Request for Comments 1766, Internet Engineering Task Force, Mar.
1741	   1995.

1743	                           Table of Contents

1745	   1          Introduction ........................................    2
1746	   2          Why Decompose .......................................    2
1747	   3          Tightly Coupled Decomposition .......................    4
1748	   4          The Decoupled Model .................................    6
1749	   4.1        Architecture ........................................    7
1750	   4.2        Benefits of the Decoupling ..........................   10
1751	   5          Architecture for the Interfaces .....................   11
1752	   5.1        Naming ..............................................   12
1753	   5.2        Additional Message Content ..........................   14
1754	   5.3        Session Duration ....................................   14
1755	   5.4        Third Party Call Control ............................   15
1756	   5.5        Side Channels .......................................   18
1757	   6          Patterns for Accessing Components ...................   19
1758	   6.1        Interactive Voice Response Services .................   19
1759	   6.2        Conferencing Servers ................................   23
1760	   6.2.1      Web Scheduled Conference Services ...................   26
1761	   6.2.2      Web Scheduled, IVR supported, Time Limited
1762	   Conference .....................................................   27
1763	   6.3        Continuous Text-to-Speech ...........................   30
1764	   6.3.1      Service Interface ...................................   33
1765	   6.3.2      Hearing Impaired Service ............................   34
1766	   6.4        Messaging Servers ...................................   34
1767	   6.4.1      Service Interface ...................................   34
1768	   6.4.2      Web Enabled Message Drops ...........................   34
1769	   7          Security Considerations .............................   35
1770	   8          Conclusion ..........................................   35
1771	   9          Authors' Addresses ..................................   37
1772	   10         Bibliography ........................................   37