Network Working Group                                          E. Burger
Internet-Draft                                  Cantata Technology, Inc.
Expires: December 7, 2006                                   June 5, 2006

          Media Server Control Language and Protocol Thoughts
                     draft-burger-mscl-thoughts-01

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 7, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

IP multi-function Media Server control is a problem that has slowly bubbled up in importance over the past four years. One driver in the IETF is the set of requirements generated by the XCON framework. Many approaches have been proposed. Some of these proposals are device-control-oriented, such as H.248. Others are server-oriented, using SIP and application-oriented markup.

Before rushing headlong into a framework for a solution, it is time to step back and try to understand just what the scope of the problem is. Once consensus is reached, we can then move forward with a framework for a solution.
Burger                  Expires December 7, 2006                [Page 1]
Internet-Draft                MSCL Thoughts                    June 2006

This document describes a number of existing approaches and proposals to solve the Application Server - Media Server protocol problem, along with their characteristics, benefits, and drawbacks.

Table of Contents

1. Introduction
2. Factors
  2.1. Media Resource Model
  2.2. Number of Protocol Messages for a Given Operation
  2.3. Network Topology
  2.4. Protocol Layer Integrity
  2.5. Computer Science Issues
  2.6. Deployment Scale
  2.7. Compatibility with SIP Model
  2.8. Security Issues
3. Transport Protocols
  3.1. Pure Device Control
  3.2. Pure SIP
  3.3. SIP With TCP Side Channel
  3.4. SIP With INFO
  3.5. SIP With SUBSCRIBE/NOTIFY
  3.6. SIP With MEDIA
4. Models
  4.1. H.248
  4.2. MSCML
  4.3. MOML/MSML
5. Recommendations
6. Security Considerations
7. IANA Considerations
8. Informative References
Appendix A. Contributors
Appendix B. Acknowledgements
Author's Address
Intellectual Property and Copyright Statements

1. Introduction

An IP multi-function Media Server is a network server that provides media processing services to the network. There are two models for media resource servers. One models the media resource server as a box of low-level resources, such as RTP mixers, transcoders, audio play and record resources, video play and record resources, tone detection and generation resources, and resources to connect, or "plumb", the resources together. The other model is that of a server that offers announcement services, interactive voice response (IVR) services (including speech recognition and speech synthesis modalities), interactive voice and video response (IVVR) services, basic mixing services, and enhanced mixing services. In general, when we say "multi-function Media Server", we are referring to the server model.

As the IP Media Server evolved from a box of low-level resources into a first-class server in the Internet, the protocol interfaces to control the IP Media Server evolved as well.

When people thought of the media server as a box of low-level resources, device control protocols like H.248 [1] seemed appropriate. At the time, the primary model for control of a media server was from a "SoftSwitch", or Media Gateway Controller. The principal application was for playing announcements and collecting a small number of digits. The Media Gateway Controller already implemented a device control state machine to control the Media Gateways. Moreover, the Media Gateway Controller implemented some form of Gateway Control protocol to control the Media Gateways.
Thus it was logical to assume that a device control protocol, more specifically H.248 from the IETF perspective, would be appropriate for media resource control.

Although the "SoftSwitch" (traditional telephony) model (and market) was an early driver for the need for media resources, within two years it was clear that the primary consumer of media resources would be Internet-oriented applications. Developers create and deploy these applications on Internet Application Servers, using Internet and Web tools and protocols. These Application Servers have no need to control Media Gateways, and thus do not generally have implementations of device control protocols such as H.248. Moreover, Application Servers were much more likely to support HTTP [2] and SIP [3] and to use stimulus-markup, client-server application architectures.

RFC3087 [4] introduced the concept of addressing services as if they were users in SIP. This meant that it was possible to address specific resources from an application simply by sending the session to a "user" at a media server. However, RFC3087 did not provide any mechanism to achieve Internet-wide interoperability. What was needed was some sort of naming convention to address the various services available at the media server. The netann [5] specification provides such a naming convention.

Recalling the functions of a multi-function IP Media Server, the netann specification is directly sufficient for announcements and simple conferencing. For Interactive Voice Response (IVR), VoiceXML [6] provides a standard method for defining voice (and now video) dialogs. However, there is a need to inform the IP multi-function Media Server that the request is for the VoiceXML service and to give it the URI of the initial document. The netann specification provides this definition.

What is missing is a method for enhanced conference control.
By enhanced conference control, we mean facilities for creating sub-mixes, recording the mix or a leg, playing media into a mix or leg, altering the gain on a leg or the mix as a whole, defining which media is eligible for the mix, and so on.

To date, there have been several proposals, experimental protocols, and de facto standards to address the enhanced conference control problem. Factors influencing these protocols include the application's media resource model (raw resources versus service server), the desire to leverage existing protocol infrastructure (such as using SIP Registrars for resource discovery and SIP Proxies for resource location, scale, and availability), and the expectations of Internet-scale deployment sizing.

The following sections examine these factors and then look at the various proposals to address them.

As a side note, two XML-based, SIP-transported media server control markup languages command approximately 100% of the market: MSCML [16] and MOML [17].

2. Factors

2.1. Media Resource Model

As the Introduction indicated, many new applications use the Internet model for media resources. That is, applications request media services from an Internet-oriented, IP multi-function Media Server. However, some legacy applications, as well as application developers more comfortable with a telco-oriented approach, would like to model the media processing function as a set of low-level resources.

There is no question that with a low-level model, one has the full flexibility to address any possible requirement. For example, creating a sidebar conference is simply the manipulation of some mixer resources and plumbing the selected RTP streams (possibly through transcoders) to the mixer resources.
Likewise, one can accomplish playing a prompt to a leg by disconnecting the leg from the mixer, allocating a media player, plumbing the media player to the RTP port that represents the leg, directing the media player to play the prompt, deallocating the media player, and finally re-plumbing the RTP stream to the mixer.

Conversely, with an Internet server model, applications request media manipulation using protocols appropriate for applications. For example, media streams are addressed using application constructs, such as SIP dialog identifiers. Rather than specifying a sidebar by manipulating RTP streams directly, the application specifies which legs the Media Server is to place into a sidebar. In fact, as we will show below, one can specify complex topologies, such as Agent/Supervisor/Mark, with fewer messages than using a device control protocol.

2.2. Number of Protocol Messages for a Given Operation

The number of protocol messages required for a given set of operations is a factor that can potentially affect the scale of the deployment. Too many messages can result in bandwidth problems at the media server control interface, packet handling problems at either the media server or application server, and stack processing problems at either the media server or application server.

Conversely, optimizing for the number of messages can result in complex protocols with a very large number of verbs. This is often in conflict with engineering principles such as offering a simple protocol with a small number of verbs.

2.3. Network Topology

In determining the control mechanism, we need to examine the control topology. Namely, will there be a one-to-one mapping of Application Servers to Media Servers? Will there be a one-to-many mapping of Application Servers to Media Servers? Will there be a many-to-one mapping of Application Servers to Media Servers? Or can there be a many-to-many mapping of Application Servers to Media Servers?
The answers to these questions help determine whether there should be a single control channel per Media Server, a single control channel per Application Server, a single control channel per session, or a single control channel per leg. Since control channels consume operating system resources, fewer control channels use fewer operating system resources.

Of course, overall system resource utilization is more complex than simply how many channels there are at a given node. For example, on most operating systems, message routing is done in kernel space with pointer manipulation. However, once in application space, message routing is often done with buffer copying.

Another aspect influencing the cardinality of control channels is protocol layer integrity. We will examine this point in the next section.

2.4. Protocol Layer Integrity

There are many fundamental principles driving the IETF model of layered protocols. For example, a single TCP socket uses fewer system resources than ten thousand TCP sockets. Given that, why do we have FTP, TELNET, SMTP, NNTP, MGCP, etc.? It would appear to be much more efficient to establish a single TCP socket between the hosts and multiplex the different protocols over that socket. One of the reasons we do not do this is that while we would save on memory and kernel processing on the TCP socket, we end up spending memory and kernel processing resources on demultiplexing the TCP stream to direct the stream to the appropriate application process in user space.

Likewise, one could multiplex a given protocol over a single channel. In this case, the decision comes down to programming model. For example, in the FTP case, it is easier to manage the data and control separately over separate channels. Many implementations of FTP have the server FTP daemon spawning separate FTP server processes to handle requests.
In this way the FTP server process can be quite simple and straightforward.

Another approach has multiple requests physically multiplexed to a single port, but establishes separate logical sessions. One protocol that uses this model is SIP. All requests go to a single port (usually 5060), yet in the protocol data unit (PDU), we have a dialog identifier that identifies which dialog the message belongs to.

The control channel per session model maintains protocol layer integrity by allowing the kernel to do appropriate routing of requests to the application. Multiplexing the control channel requires special considerations.

If there is a limit of a single control channel at the Media Server, then, by definition, there can be only a single Application Server controlling it. This works in a device control model, such as H.248 [1], where a Media Gateway Controller controls an entire Media Gateway. In order to allow multiple clients to control the server, one must "virtualize" the server. That is, the server presents what looks to each client like an entire, self-contained server, while in fact those self-contained servers are logical partitions of the physical server. Depending on the server function, such partitioning may be easy or extremely complex.

Let us consider the case of a SIP Application Server. A SIP Application Server, or Back-to-Back User Agent (B2BUA), looks to the world like a large collection of SIP User Agent Servers. This is not too difficult to manage, as the SIP User Agent Servers all generally look alike. On the other hand, consider a SIP Media Server. The SIP Media Server often has a fixed number of different types of resources, such as announcement players, conference bridges, recorders, and so on. Partitioning these resources can be exceedingly complex.

Some applications benefit from a single control channel model.
For example, the classic SoftSwitch model and the current IMS model assume that all media processing requests go through a single network element that, in the words of TRON, is a "Master Control Program." While many from the telco world are comfortable with having a large, centralized system, many in the IETF have found time and time again that a single central server rarely meets the requirements for Internet scale. Other methods, such as server farms and alternate return contact addresses, enable theoretically infinite scale.

2.5. Computer Science Issues

Two issues to consider when using a device control protocol are how long it takes to create an application and the quality of the work product. Two factors influencing these issues are the program length and cyclomatic complexity.

There is an interesting result from 30 years of programmer productivity studies. It turns out that, with the exception of the introduction of compilers, visual editors, and visual debuggers, programmer productivity has been relatively constant, at 10 to 50 lines of code delivered per day. Thus, reducing the number of lines of code required for a given function is an important tactic to achieve the goal of improving either the time-to-market or robustness of an application. This is one of the reasons why we code in Java, C++, VB, etc., instead of assembly language.

Cyclomatic complexity measures the number of independent paths through a given application, which grows with the number of branches. Again, 15 years of research have shown a strong correlation between cyclomatic complexity and both the difficulty of test and the likelihood of bugs in fielded code. This is an intuitive result: more branches means more test cases, or, as a corollary, more branches means more code that testing will miss. However, the empirical results are more impressive: the higher the cyclomatic complexity, the more errors found in the field.
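To make the complexity argument concrete, the following is a minimal sketch that estimates cyclomatic complexity for Python source text, assuming the common McCabe formulation (decision points plus one). The set of node types counted, and both `play_prompt` variants, are hypothetical illustrations and are not taken from any media server API.

```python
import ast

# Node types treated as decision points in this sketch. This choice is an
# assumption; real tools differ in exactly which constructs they count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Return (number of decision points + 1) for the given source text."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra short-circuit branches.
            decisions += len(node.values) - 1
    return decisions + 1


# Hypothetical device-control style: the application plumbs resources itself.
LOW_LEVEL = """
def play_prompt(leg, mixer, pool):
    if leg.connected:
        mixer.disconnect(leg)
    player = pool.allocate()
    if player is None:
        return False
    for chunk in leg.prompt:
        if not player.send(chunk):
            break
    pool.release(player)
    mixer.connect(leg)
    return True
"""

# Hypothetical server style: the application asks for a service.
SERVICE_LEVEL = """
def play_prompt(dialog, prompt_uri):
    return dialog.request_play(prompt_uri)
"""
```

Under these assumptions, `cyclomatic_complexity(LOW_LEVEL)` evaluates to 5 and `cyclomatic_complexity(SERVICE_LEVEL)` to 1. The numbers only illustrate the trend the studies describe; they say nothing about any particular product.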
Here is a concrete example of how this plays out in practice. iSCSI [7] defines how one can, over IP, read and write blocks on a disk. One could then ask, "Why do we access data bases using data base-oriented protocols, like TDS [8]?" After all, one can do all the manipulation one needs for a data base application at the disk block level. Moreover, one can virtualize the target disk, so the application does not have to have direct control over physical disk blocks.

We would offer that the answer is obvious. Data base application developers think and operate at the table access level. They don't care about disk blocks, B-Trees, indices, and so on.

One could argue that supplying a client library that hides the data base-centric operations from an application would hide the low-level nature of a disk access protocol from the application. That is, it would present an application-layer interface to the application. We offer that protocol layer integrity comes into play here as well. In particular, embedding data base code in the client means that one cannot have any data base innovation at the server. Everything occurs at, and is bound to, the client.

Clearly there is a need for a low-level disk access protocol. That is what drove the iSCSI effort. However, application developers need a file access protocol like NFS [10]; data base application developers need a high-level data base access protocol; mail application developers need a mail transfer protocol like SMTP [11]; and so on.

A similar situation exists in the media processing milieu. The IETF, with the ITU-T, has created a media gateway control protocol, H.248 [1]. Although designed for the media gateway control problem, H.248 has capabilities for controlling arbitrary media functions, albeit at a very low level. H.248, and THE MODEL IT REPRESENTS, assumes a master/slave, low-level device control programming model. This is analogous to direct disk block manipulation for data access, as represented by iSCSI.
Features accessible via H.248 or protocols in the style of H.248 include audio players, audio recorders, RTP termination and origination, mixers, tone detectors and generators, and plumbing primitives.

High-level media processing protocols have been proposed, modeling a media resource server as just that, a server that offers multimedia processing functions. Services offered by media servers include IVR, conference mixing, announcements, interactive video, and so on.

Consider the choice of terms: an H.248 device offers "features" while a media server offers "services".

Section 3 examines the different protocol proposals in detail.

2.6. Deployment Scale

Just how many sessions do we need at any given Media Server? First, let us consider a Media Server that would handle ALL calls on the globe. Take a population of seven billion people. Let us assume that every person calls one other person, on average, once every week. That means we are looking at 1 billion calls per day.

Calculating the maximum number of simultaneous calls, let us assume that in any given populated time zone, up to 1/12th of the population of the world is actively making calls. The assumption here is that the time zones dividing the Pacific and Atlantic Oceans are essentially unpopulated (sorry Greenland and Alaska), while the time zones covering Europe have a relatively high teledensity. We make this assumption because we assume that the busy hour will rotate around the Earth for a given application.

With these assumptions, there are about 83 million calls per day in a given time zone. Since, for most applications, 15% of calls occur during the busy hour, we are looking at 12.5 million simultaneous calls if we pessimistically treat every busy-hour call as concurrent.

Now it is time for a reality check. Just how many simultaneous sessions will any given Application Server or Media Server really need to handle?
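The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in the text (seven billion people, one call per person per week, a 1/12 time zone share, and a 15% busy hour). The function and constant names are illustrative, and, as in the text, every busy-hour call is assumed to be concurrent, which makes the result an upper bound.

```python
from fractions import Fraction

WORLD_POPULATION = 7_000_000_000      # from the text
CALLS_PER_PERSON_PER_WEEK = 1         # from the text
TIME_ZONE_SHARE = Fraction(1, 12)     # share of calls in one populated zone
BUSY_HOUR_SHARE = Fraction(15, 100)   # share of a day's calls in the busy hour


def busy_hour_upper_bound(population: int) -> dict:
    """Walk through the Section 2.6 estimate step by step."""
    calls_per_day = Fraction(population * CALLS_PER_PERSON_PER_WEEK, 7)
    zone_calls_per_day = calls_per_day * TIME_ZONE_SHARE
    # Assumption carried over from the text: all busy-hour calls overlap.
    busy_hour_calls = zone_calls_per_day * BUSY_HOUR_SHARE
    return {
        "calls_per_day": calls_per_day,
        "zone_calls_per_day": zone_calls_per_day,
        "busy_hour_calls": busy_hour_calls,
    }


estimate = busy_hour_upper_bound(WORLD_POPULATION)
for name, value in estimate.items():
    print(f"{name}: {float(value):,.0f}")
```

Exact fractions are used so that the intermediate value (one billion divided by twelve) does not accumulate rounding error before the busy-hour share is applied.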
In the above example, we found an upper limit of 12.5 million simultaneous sessions ASSUMING ALL CALLS IN THE WORLD GO THROUGH THE APPLICATION. That is a pretty hefty assumption.

What if we worked it backward? Let us assume that a single Application Server and Media Server provided voice messaging to the entire world. Again, let us start with a population of seven billion people. With a ratio of 200 subscribers per session, we get 35 million sessions. Taking time zones into account (1/12th of the total in any one zone), we would be looking at about 3 million simultaneous sessions.

What is the point of these calculations? It is that the argument that one must have a single control channel to effectively scale services is a bit disingenuous. Namely, if an Application Server will be handling, say, 100 million users, only a small percentage will be using the service at any given time. Moreover, if one architected the Application Server to be a single node, it will have to handle hundreds of thousands of inbound connections anyway. If you can handle a few hundred thousand simultaneous connections, you can probably handle two or three times as many. To put this into perspective, 100,000 inbound connections already represent more than one and a half times the 65,536-port space of a single IP address.

2.7. Compatibility with SIP Model

Various proposals offer to use SIP in some way. The question is, will one use SIP within the acceptable use of SIP, or will one use it "because it is there"? For example, does a given protocol proposal leverage the SIP routing infrastructure, or is it intended for a point-to-point deployment? Does the server offer SIP-level services, or is it simply using SIP to transport, or tunnel, device control commands? Does the protocol preserve layer integrity by using references in the SIP domain, or does it require references to the SDP [9] or IP domain?
One measure of how compatible a given proposal is with the SIP model is its compatibility with SIP Proxies, as defined by RFC3261 [3]. For example, does the proposal require SDP manipulation? If so, how deep does the manipulation need to be? Clearly, any SDP manipulation makes the protocol incompatible with SIP Proxies, because SDP modification requires the use of a Back-to-Back User Agent (B2BUA). Is the B2BUA simply inserting an m-line in the SDP to plumb a control channel? Is the B2BUA parsing the SDP to determine RTP addresses and media types? The best would be pure proxies, as this will have the highest chance of avoiding compatibility issues in the future.

2.8. Security Issues

One issue is who is allowed to manipulate what at the Media Server. For services like announcements, IVR, and IVVR, a straightforward security model is to have commands come on the same SIP dialog as the one that established the media connection. Clearly, if you can create the connection, you have some kind of relationship with the end point, if you are not the requesting end point itself.

Other relationships get more complicated. For example, if we have a single control pipe from the Application Server, everything is OK if there is only one Application Server. This is the model for H.248. However, if we have more than one Application Server, then we have to separate the resources of one Application Server from those of another.

One solution for this problem is to partition the Media Server into multiple virtual Media Servers, each one dedicated to a given Application Server. This is a suggested model in H.248. However, as mentioned above in Section 2.4, this may be difficult for server-centric Media Servers.

3. Transport Protocols

3.1. Pure Device Control

H.248 [1] is the IETF/ITU-T media gateway control protocol.
H.248 provides generic session establishment machinery and gateway internal resource interconnection. H.248 packages define various resources, including tone detectors, tone generators, audio recorders, and fixed-function audio prompt resources. H.248 uses SDP for session negotiation, but its use is considerably different from SIP's SDP offer/answer [12] protocol.

H.248 assumes a single media gateway controller per media gateway. H.248 uses a single TCP, UDP, or SCTP pipe between the controller and gateway.

Most H.248 implementations use text encoding over the wire. For those who are enamored with XML PDUs, H.248 does have an ASN.1 [13] encoding. This means one can use XER [14] to have an XML wire protocol.

3.2. Pure SIP

Using the netann [5] convention, one can perform basic media services, such as announcements and basic mixing. However, SIP does not provide the necessary controls for enhanced conferencing, such as gain control, identification of preferred speakers (if they speak, they have priority in the mix, even if they are not the loudest), creating sidebar and other topologies (such as Coach/Agent/Mark), and so on.

Note that Pure SIP uses a single TCP or SCTP socket. However, there is a separate SIP session per leg.

3.3. SIP With TCP Side Channel

MRCPv2 [15] is an example of a media processing protocol that uses a TCP side channel. In MRCPv2, the client uses SIP to route to a speech server, uses SIP's SDP offer/answer [12] protocol to negotiate the media codecs, and specifies the protocol machinery for establishing a side channel transfer protocol, such as TCP or TLS, for the actual MRCPv2 PDUs. The MRCPv2 server hands back a unique session identifier to the client. All subsequent messages relating to a given MRCPv2 session include the session identifier. This means one can share the side channel between multiple client instances on the requesting node.
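The sharing described above amounts to demultiplexing one channel by session identifier. The following is a minimal sketch of that idea only. It does not model the MRCPv2 wire format, and the class name, method names, and session identifiers are all hypothetical.

```python
from typing import Callable, Dict

# A handler receives the payload of one message for its session.
Handler = Callable[[str], None]


class SharedSideChannel:
    """Routes messages arriving on one shared channel to per-session handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def open_session(self, session_id: str, handler: Handler) -> None:
        """Associate a server-issued session identifier with a client instance."""
        if session_id in self._handlers:
            raise ValueError(f"session {session_id!r} is already open")
        self._handlers[session_id] = handler

    def close_session(self, session_id: str) -> None:
        """Forget a session; later messages for it are reported as unroutable."""
        self._handlers.pop(session_id, None)

    def deliver(self, session_id: str, payload: str) -> bool:
        """Hand a message to its session; return False if no session matches."""
        handler = self._handlers.get(session_id)
        if handler is None:
            return False
        handler(payload)
        return True


# Two client instances on the same node share the one channel.
channel = SharedSideChannel()
recognizer_inbox: list = []
synthesizer_inbox: list = []
channel.open_session("session-1", recognizer_inbox.append)
channel.open_session("session-2", synthesizer_inbox.append)
channel.deliver("session-1", "message for the first client")
channel.deliver("session-2", "message for the second client")
```

The lookup table is the cost of sharing: the demultiplexing that the kernel would otherwise do per socket now happens in application space, which is the trade-off Section 2.4 discusses.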
MRCPv2 allows the client to request channel reuse or to request a new channel at session establishment time. Correspondingly, the MRCPv2 server can insist on a side channel per session, rather than sharing the side channel amongst sessions.

The MRCPv2 model has the benefit of using the SIP protocol machinery for session establishment. This includes using the SIP security mechanisms to authorize the association of the side channel with the media channel.

MRCPv2 itself has the drawback of having a totally different state machine. The MRCPv2 state machine is optimized for speech services like speech recognition and speech synthesis. Moreover, the methods are incompatible with the needs of conference control. In addition, the MRCPv2 approach rules out the use of the protocol through pure SIP Proxies, as a B2BUA must modify the SDP to insert the SDP m-line for the control channel.

One might ask, "If all we are doing is establishing a TCP connection to control the media server, what do we need SIP for?" This is a reasonable question. The key is to be using SIP for media session establishment. If we are using SIP for media session establishment, then we need to ensure the URI used for session establishment resolves to the same node as the node for session control. Using the SIP routing mechanism, and having the server initiate the TCP connection back, ensures this works.

For example, the URI sip:myserver.example.com may resolve to sip:server21.farm12.northeast.example.net, whereas the URI http://myserver.example.com may resolve to http://server41.httpfarm.central.example.net. That is, the host part is NOT NECESSARILY unambiguous.

3.4. SIP With INFO

Two proposals have been put forward that use the SIP dialog for the side channel. Both use the INFO method. They are MSCML [16] and MSML [18].
MSCML uses the SIP Require and Content-Type headers to ensure interoperability and preservation of SIP semantics. MSCML correlates the commands received on the dialog with the dialog's media streams. In the case of enhanced conferences, where there are global commands such as conference size, playing to the entire conference, or recording the entire conference, MSCML has the concept of a Conference Control Leg. The Conference Control Leg is not associated with any media dialog. However, it is a SIP dialog in the normal sense.

MSML relies on a private (non-Internet) agreement between the Application Server and Media Server to know the context of the INFO messages. MSML tunnels SDP-layer information over the established dialog; in the case of media processing, it uses a secondary markup, MOML [17]. MOML is a device control protocol, with primitives similar to H.248. Deployed versions of MOML/MSML do not use SIP constructs, such as referencing entities with SIP dialog properties, using SIP semantics for control, or transparently correlating SIP dialogs with RTP streams. However, the current version of the MSML specification does suggest using the SIP Dialog identifier to identify media sessions.

We will touch upon the content of what goes over the side channel in Section 4.

Using the SIP dialog for the side channel has the benefit of using the SIP routing network to get the messages to locate and follow (in the mobility case) the UAS and UAC. In particular, proxies that are important for routing can Record-Route, while proxies that are not needed other than for session establishment can choose not to Record-Route. Thus the transport of side channel commands places only a small burden on the SIP routing network.

Note that there are a few problems resulting from the use of INFO. First, there are no throttling mechanisms, other than that provided by the underlying transport mechanism (TCP or Connection-Mode SCTP). If you are using UDP, you are out of luck.
Second, even in the case of MSCML, which is well behaved in that the SIP protocol machinery guarantees that both the UAS and UAC will interoperate and understand the semantics of the MSCML INFO messages, the stacks can still get other, ill-behaved INFO messages that they may not understand. Third, even though this has never happened in the real world, there is a theoretical problem that INFO message handling may overwhelm a proxy. In practice, one sizes one's proxies to the total traffic they need to handle. Moreover, only active element proxies, such as Edge Proxies, need to Record-Route. That said, this might be a problem in the future.

The following sections explore alternatives that use the SIP Dialog.

3.5. SIP With SUBSCRIBE/NOTIFY

As outlined in the expired draft, INFO Considered Harmful [19], the events framework (SUBSCRIBE/NOTIFY) addresses all of the problems with INFO. Namely, event packages must offer throttling mechanisms, all event packages identify themselves and thus globally interoperate, and even stupid proxies that Record-Route everything often decide not to Record-Route SUBSCRIBE and NOTIFY messages.

Of course, SUBSCRIBE/NOTIFY really, really, really should not (actually, most of us, including me, say "MUST NOT") reuse the SIP dialog directly associated with the media session. This means we lose the auto-correlation feature that we have by using the INFO method.

There is a subtler, yet arguably more important, problem with using SUBSCRIBE/NOTIFY. Namely, the semantics of SUBSCRIBE are, "tell me (monitor) what is going on at the device." Typical uses for SUBSCRIBE are for presence [20] (what is the state of the user?), MWI [21] (what is the state of the message store?), and KPML [22] (what is the state of the key press buffer?). No package changes the state of the UAS.
Using SUBSCRIBE, for example, to play a prompt or to change the
configuration of a mixer, most definitely changes the state of the
UAS.

3.6. SIP With MEDIA

Another approach outlined in INFO Considered Harmful [19] is to
introduce a new method. This was the route taken by PUBLISH [23], as
it was not quite NOTIFY. Properly defined, a new method can safely
share the SIP dialog. Moreover, it would satisfy the auto-correlation
properties used by, for example, MSCML. Lastly, the semantics would
be well defined, addressing the issues raised by INFO Considered
Harmful.

4. Models

4.1. H.248

H.248 [1] provides:

1. A single control channel between Application Server and Media
   Server.
2. The possibility for an XML transport encoding.
3. Total control of media resources, at the assembly language level.

The first item is of use to those who want a single control channel
and socket per Application Server. The second item is of use to those
who love XML. The third item ensures a measure of extensibility.
That is, since the Application explicitly defines the
application-level semantics of media processing at the media layer,
future Applications can define future, unanticipated topologies.

The drawbacks of H.248 are:

1. Layer violation et al.
2. Market adoption

The first item touches upon virtually every issue raised in Section 2.
By definition, H.248 is a low-level device control protocol. That
means more lines of code for a given function, higher complexity for a
given function, no compatibility with the SIP model (everything
becomes an MGC), and an Application Server that must dive deep into
SDP and the media layer to do basic operations. The second item,
while not in itself a determining factor in the IETF, is important to
note as a leading indicator.
For many of the reasons noted above, neither Application Server
developers nor Media Server developers desire H.248 as an Application
Server - Media Server protocol. Moreover, none of the major media
server manufacturers offers or plans to offer H.248-based media
servers. In a sense, the market has spoken about this option, even in
light of the 1999 declaration (well before there were any enhanced
media services) by 3GPP that H.248 would be the media server (MRFP)
interface.

4.2. MSCML

MSCML [16] provides:

1. Automatic correlation, including security associations, between
   the control channel and the media session.
2. Preservation of SIP semantics, including being SIP Proxy friendly.
3. Operations and all semantics are at the SIP dialog layer.
4. Application Servers can be relatively simple, as addressing of
   media processing commands is straightforward: send the command
   down the associated SIP media dialog.
5. Establishing a media session is straightforward: INVITE the Media
   Server to a session.
6. Strict adherence to the philosophy espoused by, among other places,
   the Application Interaction Framework [24].

The drawbacks of MSCML include:

1. Even though MSCML properly uses INFO, using INFO in itself has
   theoretical problems with non-interoperating devices.
2. By relying on SIP dialogs, the Application Server uses multiple SIP
   dialogs to control, for example, an enhanced conference on the
   Media Server.
3. By taking the application layer approach, MSCML requires one to two
   more protocol messages than a device control approach.

The first issue is a result of using INFO.

The second issue is more interesting. In the enhanced conference
case, that is, where one needs to play into or record the entire
conference, one has to set up an additional SIP dialog, the Conference
Control Dialog, per conference.
In the extreme case of two-party conferences, this increases the
number of SIP dialogs by 50%. Of course, few two-party scenarios
require the enhanced conferencing features, so most would not see any
increase in the number of dialogs. However, if one did need those
features, then the dialog expansion would occur.

The third issue refers to the situation where the Application Server
wants to place the caller into a conference, but the application needs
to interact with the caller before it knows which conference to place
the caller into. In the MSCML model, the application has to INVITE
the caller into a dialog (VoiceXML) or IVR session, determine the
address of the conference, and then re-INVITE or REFER the caller into
the conference.

Of course, if one uses a low-level device control markup rather than
an application-level markup like VoiceXML, then the number of protocol
messages to implement a voice dialog will swamp the extra redirect
message. Interestingly, MSML and MSCML exchange the same number of
messages to do the same task.

The re-INVITE model offers total flexibility, in that the application
never has to change if the modality of the IVR step changes. For
example, the IVR step could use a low-cost audio media resource, which
then places the caller into a high-cost, 30fps, continuous presence
video bridge.
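
The 50% figure quoted earlier for the Conference Control Dialog can be
checked with a few lines of arithmetic. The following is only a
sketch: the function name is hypothetical, and it assumes exactly one
media dialog per participant plus one control dialog per conference,
which is how the text describes the enhanced conference case.

```python
def control_leg_overhead(participants: int) -> float:
    """Fractional increase in SIP dialogs caused by one control dialog.

    Assumption (not taken from the MSCML specification): each
    participant has exactly one media dialog, and the enhanced
    conference adds exactly one Conference Control Dialog.
    """
    if participants < 1:
        raise ValueError("a conference needs at least one participant")
    media_dialogs = participants
    control_dialogs = 1
    return control_dialogs / media_dialogs


if __name__ == "__main__":
    # Print the relative overhead for a few conference sizes.
    for n in (2, 3, 10, 100):
        print(f"{n:>3} participants: +{control_leg_overhead(n):.0%} dialogs")
```

Under these assumptions the overhead is 1/N for an N-party conference,
so it is largest in the two-party case and shrinks quickly as the
conference grows.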

Application Server                           Media Server
        |                                          |
        |INVITE sip:dialog@ms.example.net          |
        |;voicexml=http://as.example.net/get-id    |
        |----------------------------------------->|
        |                                          |
        |200 OK                                    |
        |<-----------------------------------------|
        |                                          |
        |ACK                                       |
        |----------------------------------------->|
        |                                          |
        |GET http://as.example.net/cgi-bin/get-id  |
        |<-----------------------------------------|
        |                                          |
        |(VoiceXML script)                         |
        |..........................................|
        |                                          |
        |POST (result)                             |
        |<-----------------------------------------|
        |                                          |
        |REFER sip:conf=12345@ms.example.net       |
        |----------------------------------------->|
        |                                          |
        |202 ACCEPTED                              |
        |<-----------------------------------------|
        |                                          |
        |NOTIFY                                    |
        |<-----------------------------------------|
        |                                          |
        |200 OK (NOTIFY)                           |
        |----------------------------------------->|
        |                                          |
        |                                          |

The downside of the re-INVITE model is that it involves the endpoint
in the SDP renegotiation. This puts an additional burden on the
Application Server and caller device to relay and act upon the
messages. The REFER model does not involve the calling endpoint.
However, it does have one additional protocol message.

Application Server                           Media Server
        |                                          |
        |INVITE sip:dialog@ms.example.net          |
        |;voicexml=http://as.example.net/get-id    |
        |----------------------------------------->|
        |                                          |
        |200 OK                                    |
        |<-----------------------------------------|
        |                                          |
        |ACK                                       |
        |----------------------------------------->|
        |                                          |
        |GET http://as.example.net/cgi-bin/get-id  |
        |<-----------------------------------------|
        |                                          |
        |(VoiceXML script)                         |
        |..........................................|
        |                                          |
        |POST (result)                             |
        |<-----------------------------------------|
        |                                          |
        |REFER sip:conf=12345@ms.example.net       |
        |----------------------------------------->|
        |                                          |
        |202 ACCEPTED                              |
        |<-----------------------------------------|
        |                                          |
        |NOTIFY                                    |
        |<-----------------------------------------|
        |                                          |
        |200 OK (NOTIFY)                           |
        |----------------------------------------->|
        |                                          |
        |                                          |

4.3. MOML/MSML

MSML [18] provides:

1. As of the -04 draft, a SIP Dialog addressing scheme.
2. Arbitrarily complex mixing topologies, on a par with H.248.
3. With MOML [17], the audio prompt, record, DTMF detection, and other
   functions of H.248, with the addition of access to speech
   resources.
4. Switching between IVR and conferencing can be done without a
   re-INVITE or REFER.

The drawbacks of MSML include:

1. The application has to be aware of and manipulate the media
   resource plumbing.
2. With most operations on a par with H.248, why not use H.248?
3. The MSML model assumes everything resides in a single server,
   especially with respect to the audio/video example given above.

Application Server                           Media Server
        |                                          |
        |                                          |
        |INVITE sip:dialog@ms.example.net          |
        |;moml=cid:foobratz12@ms.example.net *     |
        |----------------------------------------->|
        |                                          |
        |200 OK                                    |
        |<-----------------------------------------|
        |                                          |
        |ACK                                       |
        |----------------------------------------->|
        |                                          |
        |GET http://as.example.net/cgi-bin/get-id  |
        |<-----------------------------------------|
        |                                          |
        |(VoiceXML script)                         |
        |..........................................|
        |                                          |
        |POST (result)                             |
        |<-----------------------------------------|
        |                                          |
        |INFO (MSML )                              |
        |<-----------------------------------------|
        |                                          |
        |200 OK                                    |
        |----------------------------------------->|
        |                                          |
        |INFO (MSML )                              |
        |----------------------------------------->|
        |                                          |
        |200 OK                                    |
        |<-----------------------------------------|
        |                                          |
        |                                          |

* The MSML specification does not state how to start a session. We
  assume that one starts a MOML session and then sends a document.

The URI of the VoiceXML script, and the programming logic necessary to
start that script, are embedded in the MSML document sent to the Media
Server.

5. Recommendations

This section is in the spirit of getting a conversation started.
Everything here is opinion. Feel free to argue.

First of all, it is clear there is interest in a standard for the
Application Server - Media Server protocol in the Internet community.
The adoption of MOML/MSML in the developer community and MSCML in the
developer and vendor community is an existence proof of the utility
of, and need for, such a protocol.

The official impetus for this work is the XCON Media Server
Requirements [26]. However, in spite of the fact we have VoiceXML for
application-level IVR specification and H.248 for low-level IVR
specification, people keep asking for IVR with conferencing, as
evidenced by the XCON requirements. The problem is this IVR
functionality bleeds out, and thus we need to ensure it is well
thought out before just tossing something in there.

There is a desire to leverage the SIP protocol machinery for media
session establishment, namely the SIP Offer/Answer protocol.

Application developers want to see the Media Server as a server that
offers application-level media processing. That is, they model the
Media Server as a server that offers IVR, conference mixing, and
other application-level media processing services.

If application developers want low-level, DSP-level media
manipulation, they already have an IETF protocol, H.248.

If application developers want a single control channel (total,
including session establishment) from the Application Server to the
Media Server, they already have an IETF protocol, H.248.

If application developers want an XML transport encoding for a
low-level protocol or a single control channel, they already have an
IETF protocol, H.248.

Assuming developers do not want H.248, what are the options?

INFO probably isn't it. That leaves two directions to go. The first
is to stick with the SIP Dialog model of MSCML and the other is to
stick with the side channel model of MRCPv2. The former would
indicate a new method, such as MEDIA. The latter would indicate a new
establishment procedure, such as described in the other MSRP [25].

What does all this mean?

WHAT GOES DOWN THE PIPE IS AS IMPORTANT AS THE PIPE ITSELF.

It is easy to identify protocol abuse in the determination of the
control channel. However, even if we have a decent control channel
establishment mechanism, sending the wrong kind of messages down that
channel can render the protocol less than useful.

For example, it is great to use SIP to route messages to a media
server. However, if those messages emulate H.248, but encoded in XML,
it would be much more efficient and cleaner, and would avoid the layer
violation, to simply use H.248. You can even get H.248 in XML! Just
please, please, please, don't transport it in SIP or a SIP side
channel.

NOTE: This is one of the reasons I pulled out of [25] at the last
minute. What goes into the pipe is as important as the pipe itself.

6. Security Considerations

One issue is who is allowed to manipulate what at the Media Server.
For services like announcements, IVR, and IVVR, a straightforward
security model is to have commands come on the same SIP dialog as the
one that established the media connection. Clearly, if you can create
the connection, you have some kind of relationship with the end point,
if you are not the requesting end point itself.

Other relationships get more complicated. For example, if we have a
single control pipe from the Application Server, everything is OK if
there is only one Application Server. This is the model for H.248.
However, if we have more than one Application Server, then we have to
ensure a separation of the resources of one Application Server from
those of another.

One solution for this problem is to partition the Media Server into
multiple virtual Media Servers, each one dedicated to a given
Application Server. This is a suggested model in H.248. However, as
mentioned above in Section 2.4, this may be difficult for
server-centric Media Servers.

7. IANA Considerations

As this is an Informative exploration, there are no IANA
Considerations.

8. Informative References

[1]  Groves, C., Pantaleo, M., Anderson, T., and T. Taylor, "Gateway
     Control Protocol Version 1", RFC 3525, June 2003.

[2]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
     Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
     HTTP/1.1", RFC 2616, June 1999.

[3]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A.,
     Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP:
     Session Initiation Protocol", RFC 3261, June 2002.

[4]  Campbell, B. and R. Sparks, "Control of Service Context using
     SIP Request-URI", RFC 3087, April 2001.

[5]  Burger, E., Van Dyke, J., and A.
     Spitzer, "Basic Network Media Services with SIP", RFC 4240,
     December 2005.

[6]  Burnett, D., Hunt, A., McGlashan, S., Porter, B., Lucas, B.,
     Ferrans, J., Rehor, K., Carter, J., Danielsen, P., and S.
     Tryphonas, "Voice Extensible Markup Language (VoiceXML) Version
     2.0", W3C REC REC-voicexml20-20040316, March 2004.

[7]  Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E.
     Zeidner, "Internet Small Computer Systems Interface (iSCSI)",
     RFC 3720, April 2004.

[8]  Sybase, Inc., "TDS 5.0 Functional Specification Version 3.4",
     URL http://www.sybase.com/content/1013412/tds34.pdf,
     August 1999.

[9]  Handley, M. and V. Jacobson, "SDP: Session Description
     Protocol", RFC 2327, April 1998.

[10] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
     C., Eisler, M., and D. Noveck, "Network File System (NFS)
     version 4 Protocol", RFC 3530, April 2003.

[11] Klensin, J., "Simple Mail Transfer Protocol", RFC 2821,
     April 2001.

[12] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with
     Session Description Protocol (SDP)", RFC 3264, June 2002.

[13] Telecommunication Standardization Sector of International
     Telecommunication Union, "Abstract Syntax Notation One (ASN.1):
     Specification of basic notation", ITU-T Recommendation X.680,
     July 2002.

[14] Telecommunication Standardization Sector of International
     Telecommunication Union, "ASN.1 encoding rules: XML Encoding
     Rules (XER)", ITU-T Recommendation X.693, December 2001.

[15] Burnett, D. and S. Shanmugham, "Media Resource Control Protocol
     Version 2 (MRCPv2)", draft-ietf-speechsc-mrcpv2-09 (work in
     progress), December 2005.

[16] Van Dyke, J., "Media Server Control Markup Language (MSCML) and
     Protocol", draft-vandyke-mscml-08 (work in progress), May 2006.

[17] Saleem, A. and G. Sharratt, "Media Objects Markup Language
     (MOML)", draft-melanchuk-sipping-moml-06 (work in progress),
     October 2005.

[18] Melanchuk, T. and G.
     Sharratt, "Media Sessions Markup Language (MSML)",
     draft-melanchuk-sipping-msml-05 (work in progress), March 2006.

[19] Rosenberg, J., "The Session Initiation Protocol (SIP) INFO
     Method Considered Harmful", draft-rosenberg-sip-info-harmful-00
     (work in progress), January 2003.

[20] Rosenberg, J., "A Presence Event Package for the Session
     Initiation Protocol (SIP)", RFC 3856, August 2004.

[21] Mahy, R., "A Message Summary and Message Waiting Indication
     Event Package for the Session Initiation Protocol (SIP)",
     RFC 3842, August 2004.

[22] Burger, E., "A Session Initiation Protocol (SIP) Event Package
     for Key Press Stimulus (KPML)", draft-ietf-sipping-kpml-07 (work
     in progress), December 2004.

[23] Niemi, A., "Session Initiation Protocol (SIP) Extension for
     Event State Publication", RFC 3903, October 2004.

[24] Rosenberg, J., "A Framework for Application Interaction in the
     Session Initiation Protocol (SIP)",
     draft-ietf-sipping-app-interaction-framework-05 (work in
     progress), July 2005.

[25] Boulton, C. and T. Melanchuk, "Media Server Request Protocol",
     draft-boulton-media-server-control-00 (work in progress),
     June 2005.

[26] Even, R., "Requirements for a media server control protocol",
     draft-even-media-server-req-00 (work in progress), January 2005.

Appendix A. Contributors

I cannot share blame with anyone on this one.

Appendix B. Acknowledgements

Brooks Gelfand, in 1985, made the remark, "If you cannot do it in
assembly language, you cannot do it at all," during an argument I was
having with another engineer about the relative merits of C versus
Lisp.

The catalyst for this document was the very hard and dedicated work of
Chris Boulton, Tim Melanchuk, and me to bang out and argue over the
other MSRP draft, starting in April of 2005 and lasting through the
very end of June.

Author's Address

Eric Burger
Cantata Technology, Inc.
18 Keewaydin Dr.
Salem, NH 03079-2839
USA

Phone: +1 603 890 7587
Fax: +1 603 457 5944
Email: eburger@cantata.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be found
in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this specification
can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the
Internet Society.