Internet Engineering Task Force                              SIPPING WG
Internet Draft                                             J. Rosenberg
                                                            dynamicsoft
draft-rosenberg-sipping-conferencing-framework-00.txt
October 28, 2002
Expires: April 2003

    A Framework for Conferencing with the Session Initiation Protocol

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

To view the list of Internet-Draft Shadow Directories, see http://www.ietf.org/shadow.html.

Abstract

The Session Initiation Protocol (SIP) supports the initiation, modification, and termination of media sessions between user agents. These sessions are managed by SIP dialogs, which represent a SIP relationship between a pair of user agents. Because dialogs are between pairs of user agents, SIP's usage for two-party communications (such as a phone call) is obvious. Communications sessions with multiple participants, generally known as conferencing, are more complicated. This document defines a framework for how such conferencing can occur. This framework describes the overall architecture, terminology, and protocol components needed for multi-party conferencing.

J. Rosenberg                                                   [Page 1]
Internet Draft          Conferencing Framework         October 28, 2002

Table of Contents

1      Introduction ........................................    3
2      Terminology .........................................    3
3      Basic Architecture ..................................    7
4      Usage of URIs .......................................   11
5      Functions of the Elements ...........................   12
5.1    Focus ...............................................   12
5.2    Conference Policy Server ............................   13
5.3    Mixers ..............................................   14
5.4    Media Policy Server .................................   14
5.5    Conference Notification Service .....................   15
5.6    Participants ........................................   16
5.7    Conference Policy ...................................   16
5.8    Media Policy ........................................   17
6      Physical Realization ................................   17
6.1    Centralized Server ..................................   17
6.2    Endpoint Server .....................................   17
6.3    Media Server Component ..............................   18
6.4    Distributed Mixing ..................................   21
6.5    Cascaded Mixers .....................................   22
7      Common Operations ...................................   22
7.1    Creating Conferences ................................   22
7.2    Adding Participants .................................   25
7.3    Removing Participants ...............................   27
7.4    Approving Policy Changes ............................   27
7.5    Creating Sidebars ...................................   28
8      Security Considerations .............................   28
9      Contributors ........................................   29
10     Authors Addresses ...................................   29
11     Normative References ................................   29
12     Informative References ..............................   29

1 Introduction

The Session Initiation Protocol (SIP) [1] supports the initiation, modification, and termination of media sessions between user agents. These sessions are managed by SIP dialogs, which represent a SIP relationship between a pair of user agents.
Because dialogs are between pairs of user agents, SIP's usage for two-party communications (such as a phone call) is obvious. Communications sessions with multiple participants, however, are more complicated.

SIP can support many models of multi-party communications. One, referred to as loosely coupled conferences, makes use of multicast media groups. In the loosely coupled model, there is no signaling relationship between participants in the conference. There is no central point of control or conference server. Participation is gradually learned through control information that is passed as part of the conference (using the RTP Control Protocol (RTCP) [2], for example). Loosely coupled conferences are easily supported in SIP by using multicast addresses within its session descriptions.

In another model, referred to as fully distributed multiparty conferencing, each participant maintains a signaling relationship with each other participant, using SIP. There is no central point of control; control is completely distributed amongst the participants. SIP does not yet support this model.

In another model, sometimes referred to as the tightly coupled conference, there is a central point of control. Each participant connects to this central point. It provides a variety of conference functions, and may perform media mixing functions as well. Tightly coupled conferences are not directly addressed by the SIP specification, although basic ones are possible without any additional protocol support.

This document is one of a series of specifications that discuss tightly coupled conferences. Here, we present the overall framework for tightly coupled conferencing, referred to simply as "conferencing" from this point forward. This framework presents a general architectural model for these conferences, presents terminology used to discuss such conferences, and describes the sets of protocols involved in a conference.
The aim of the framework is to meet the general requirements for conferencing that are outlined in [3].

2 Terminology

Conference: Conference is an overused term which has different meanings in different contexts. In SIP, a conference is an instance of a multi-party conversation. Within the context of this specification, a conference is always a tightly coupled conference.

Loosely Coupled Conference: A loosely coupled conference is a conference without coordinated signaling relationships amongst participants. Loosely coupled conferences use multicast for distribution of conference memberships.

Tightly Coupled Conference: A tightly coupled conference is a conference in which a single user agent, referred to as a focus, maintains a dialog with each participant. The focus plays the role of the centralized manager of the conference, and is addressed by a conference URI.

Focus: The focus is a SIP user agent that is addressed by a conference URI. The focus maintains a SIP signaling relationship with each participant in the conference. The focus is responsible for ensuring, in some way, that each participant receives the media that make up the conference. The focus also implements conference policies. The focus is a logical role.

Conference URI: A URI, usually a SIP URI, which identifies the focus of a conference.

Participants: The set of user agents, each identified by a URI, which are connected to the focus for a particular conference.

Conference Notification Service: A conference notification service is a logical function provided by the focus. The focus can act as a notifier [4], accepting subscriptions to the conference state, and notifying subscribers about changes to that state. The state includes the state maintained by the focus itself, the conference policy, and the media policy.
Conference Policy Server: A conference policy server is a logical function which can store and manipulate rules associated with participation in a conference. These rules include directives on the lifespan of the conference, who can and cannot join the conference, definitions of roles available in the conference and the responsibilities associated with those roles, and policies on who is allowed to request which roles. The conference policy server is a logical role.

Media Policy Server: A media policy server is a logical function which can store and manipulate rules associated with the media distribution of the conference. These rules can specify which participants receive media from which other participants, and the ways in which that media is combined for each participant. In the case of audio, these rules can include the relative volumes at which each participant is mixed. In the case of video, these rules can indicate whether the video is tiled, whether the video indicates the loudest speaker, and so on.

Conference Policy: The set of rules manipulated by the conference policy server.

Conference Policy Control Protocol: The client-server protocol used by clients to manipulate the conference policy.

Media Policy: The set of rules manipulated by the media policy server. The media policy is used by the focus to determine the mixing characteristics for the conference.

Media Policy Control Protocol: The client-server protocol used by clients to manipulate the media policy.

Mixer: As defined in the Real-time Transport Protocol (RTP) [2], a mixer receives a set of media streams, and combines their media in a type-specific manner, redistributing the result to each participant. We use the term here to include combining of non-RTP media streams as well, such as instant messaging sessions [5].
Basic Conference: A basic conference is one where there is no conference policy server, media policy server, or conference notification service - only a focus.

Basic Participant: A basic participant is a participant in a conference that is not aware that it is actually in a conference. As far as the UA is concerned, it is a point-to-point call.

Cascaded Conference: A conference in which a participant is the focus of another conference.

Complex Conference: A complex conference includes at least one of a conference policy server, media policy server, or conference notification service, in addition to the focus.

Complex Participant: A complex participant is a participant in a conference that has learned, through automated means, that it is in a conference, and that can use a conference policy control protocol, media policy control protocol, or conference subscription, to implement advanced functionality.

Conference Server: A conference server is a physical server which contains, at a minimum, the focus. It may also include a media policy server, a conference policy server, and a mixer.

Singleton: In this context, a singleton is a conference participant that is not a focus. A singleton represents a single user in a conference.

Conference Topology: The conference topology is a graph that defines the connectivity amongst participants connected through conferences. Each node in the graph represents a user agent, whether it is a focus or a singleton. Each leaf node represents a singleton, and each internal node represents a focus. An edge between two nodes implies that there is a SIP dialog between them. Ideally, conference topologies are trees, not arbitrary graphs.

Conversation Space: For each conference URI, there is a unique conversation space. The conversation space is defined as the set of singletons in the conference topology associated with that URI.
The conference topology associated with a conference URI is the one that is constructed by starting with the focus for that URI. Under normal circumstances, the singletons in a conversation space will all receive each other's media.

Instant Conference: A conference in which the focus is constructed the instant the first INVITE for a URI is received, and then destroyed when the last participant has left.

Mass Invitation: A conference policy control protocol request to invite a large number of users into the conference.

Mass Ejection: A conference policy control protocol request to remove a large number of users from the conference.

Sidebar: A sidebar appears to the users as a "conference within the conference". It is a discussion amongst a subset of the participants, not heard by the remaining participants in the conference.

Anonymous Participant: An anonymous participant is one that is known to other participants (through the conference notification service), but whose identity is being withheld.

Invisible Participant: An invisible participant is one that is not known to other participants in the conference. They may be known to the moderator, depending on conference policy.

3 Basic Architecture

A SIP conference is represented by a URI. This URI identifies the focus, which is the user agent at the center of the conference. Any participant that is involved in the conference is connected to the focus by a SIP dialog. The result is a star topology, shown in Figure 1. The focus has access to a conference policy and a media policy; one instance of each exists for each focus.

In a basic SIP conference, these policies are administratively defined. Users join the conference by sending an INVITE to the conference URI. As long as the conference policy allows, the INVITE is accepted by the focus and the user is brought into the conference.
Users can leave the conference by sending a BYE, as they would in a normal call. Indeed, a participant in a basic conference does not need to know that the focus is anything other than a normal SIP user agent.

Similarly, the focus can terminate a dialog with a participant, should the conference policy change to indicate that the participant is no longer allowed in the conference. A focus can also initiate an INVITE, should the conference policy indicate that the focus needs to bring a participant into the conference.

The focus is responsible for making sure that the media streams which constitute the conference are available to the participants in the conference. It does that through the use of one or more mixers, each of which combines a number of input media streams to produce one or more output media streams. The focus uses the media policy to determine the proper configuration of the mixers.

With these basic capabilities, a large number of common conferencing applications can be built. None of them require any extensions to SIP; they merely require that the focus is aware of its role and responsibilities in maintaining the conference. However, basic conferences do not allow for the participants to control the way in which the conference operates.

                         +-----------+
                         |           |
                         |           |
                         |Participant|
                         |           |
                         |           |
                         +-----------+
                               |
                               |SIP
                               |Dialog
                               |
                               |
   +-----------+         +-----------+          +-----------+
   |           |         |           |          |           |
   |           |         |           |          |           |
   |Participant|---------|   Focus   |----------|Participant|
   |           |   SIP   |           |   SIP    |           |
   |           |  Dialog |           |  Dialog  |           |
   +-----------+         +-----------+          +-----------+
                               |
                               |
                               |SIP
                               |Dialog
                               |
                               |
                         +-----------+
                         |           |
                         |           |
                         |Participant|
                         |           |
                         |           |
                         +-----------+

                  Figure 1: Basic SIP Conference

A complex SIP conference is one in which additional interfaces are exposed, allowing for a richer set of controls and information on the conference. In particular, a complex SIP conference can include a conference policy server and a media policy server, and the focus can expose a conference notification service. The model for these conferences is shown in Figure 2.

This figure shows the view from one participant. The conference now encompasses an additional set of functions. In addition to maintaining the dialog with the focus, the participant now has access to these other functions. It can, using a conference event package [6], SUBSCRIBE to the conference URI, and be connected to the conference notification service provided by the focus. Through this package, it can learn about changes in participants (effectively, the state of the dialogs), the media policy, and the conference policy.

The participant can also communicate with the conference policy server, using a conference policy control protocol. This is a strictly client-server transactional protocol. It need not be a protocol at all; the same function can be provided through a web interface, in which case no standardized protocols or policies are needed. However, a web interface can only be manipulated by humans, not automata. For automated use, the participant can instead use a protocol designed specifically for this purpose.

The participant can also communicate with the media policy server, using a media policy control protocol. This is a strictly client-server transactional operation. This can also be through a web interface, or through an explicit protocol.

The focus will access the media and conference policies. There is a tight coupling between these policies and the focus. Not only does the focus need read access to these policies, but it needs to know when they have changed. Such changes might result in SIP signaling (for example, the ejection of a user from the conference using BYE), and most changes will require a notification to be sent to subscribers to the conference notification service.
The conference policy and media policy servers need not be available in any particular conference. Even when available, they need not be used by all participants. A participant in a conference that does not access any of these functions, and which doesn't even know that the focus is a focus, is called a basic participant. A conference participant that can discover and access these additional functions is a complex participant. Any conference can include basic and complex participants.

The interfaces between (1) the focus and the media policy, (2) the focus and the conference policy, (3) the conference policy server and the conference policy, and (4) the media policy server and the media policy are not subject to standardization at the time of this writing. They are shown primarily to identify the logical roles, to encourage clarity in the requirements, and to allow individual implementations the flexibility to compose a conferencing system in a scalable and robust manner.

[Figure 2: Complex SIP Conference. Within the box labeled "Conference Functions", the participant reaches the conference policy server through the conference policy control protocol, the media policy server through the media policy control protocol, the focus through a SIP dialog, and the conference notification service through a subscription. The policy servers and the focus are all connected to the conference and media policy.]
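The basic conference described in this section can be illustrated with a short sketch. The Python below is not part of the framework: the class name BasicFocus, its fields, and the choice of a 403 response for a rejected INVITE are assumptions made for the example. The framework itself says only that the focus accepts the INVITE as long as the conference policy allows it, and that a BYE ends the participant's dialog.

```python
from dataclasses import dataclass, field


@dataclass
class BasicFocus:
    """Toy model of a focus in a basic conference (hypothetical names)."""

    conference_uri: str
    # Administratively defined conference policy: who may join.
    allowed: set[str]
    # One SIP dialog per participant, keyed here by the participant URI.
    dialogs: set[str] = field(default_factory=set)

    def on_invite(self, from_uri: str) -> int:
        """Return a status code for an INVITE sent to the conference URI."""
        if from_uri not in self.allowed:
            return 403  # assumption: rejection is signaled with 403
        self.dialogs.add(from_uri)
        return 200

    def on_bye(self, from_uri: str) -> None:
        """A participant leaves exactly as it would leave a normal call."""
        self.dialogs.discard(from_uri)


focus = BasicFocus("sip:conf@example.com", {"sip:alice@example.com"})
focus.on_invite("sip:alice@example.com")    # accepted, dialog added
focus.on_invite("sip:mallory@example.com")  # rejected by policy
```

The participant needs no conference-specific logic here, which is the point of the basic model: all conference awareness lives in the focus.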
4 Usage of URIs

It is fundamental to this framework that a conference is uniquely identified by a URI, and that this URI identifies the focus which is responsible for the conference. This URI is always a SIP or SIPS URI. The conference URI is opaque to any participants which might use it. There is no way to look at the URI, and know for certain whether it identifies a focus, as opposed to a user or an interface on a PSTN gateway. This is in line with the general philosophy of URI usage [7]. However, contextual information surrounding the URI (for example, SIP header parameters) may indicate that the URI represents a conference.

The conference URI can represent a long-lived conference or interest group, such as "sip:discussion-on-dogs@example.com". The focus identified by this URI would always exist, and always be managing the conference for whatever participants are currently joined. The conference URI can also represent an "instant" conference, for example, "sip:a8sd9998as-9s8daa@example.com". An instant conference is one where the focus is instantiated when the first INVITE for it arrives, and then destroyed when the last participant leaves. Both of these represent variations in the policies implemented by the focus, and cannot be determined from inspection of the URI.

Ideally, a conference URI is never constructed or guessed by a user. Rather, conference URIs are learned through many mechanisms. A conference URI can be emailed or sent in an instant message. A conference URI can be linked on a web page. A conference URI can be obtained from a conference policy control protocol, which can be used to create conferences and the policies associated with them.

To determine that a SIP URI does represent a focus, standard techniques for URI capability discovery can be used. First, a participant can send an OPTIONS to a SIP URI, and if it represents a focus, the response will indicate such [TBD].
The response will also indicate whether or not the focus has implemented the conference notification service. This is known by the presence of an Allow header in the response, indicating support for the SUBSCRIBE method, along with an Allow-Events header, indicating support for the conferencing package. A second method for determining that a URI represents a focus is through a refresh request. The Allow and Allow-Events headers, along with the caller preferences specification [8], can indicate the same information that would be learned through an OPTIONS query.

The other functions in a conference are also represented by URIs. If the conference policy and media policy servers are implemented through web pages, these servers are identified by regular HTTP URIs. If they are accessed using an explicit protocol, they are identified by the URIs defined for those protocols.

Starting with the conference URI, the URIs for the other logical entities in the conference can be learned using [TBD].

OPEN ISSUE: I suppose we cannot say more until the protocol work is done. But, we have a requirement here - that there be a way to learn these URIs starting only with the conference URI.

5 Functions of the Elements

This section gives a more detailed description of the functions typically implemented in each of the elements.

5.1 Focus

As its name implies, the focus is the center of the conference. All participants in the conference are connected to it using a SIP dialog. The focus is responsible for maintaining the dialogs connected to it. It ensures that the dialogs are connected to a set of participants who are allowed to participate in the conference, as defined by the conference policy. The focus also uses SIP to manipulate the media sessions, in order to make sure each participant obtains all the media for the conference. To do that, the focus makes use of the services of a mixer.
When a focus receives an INVITE, it checks the conference policy. The conference policy might indicate that this participant is not allowed to join, in which case the call can be rejected. It might indicate that another participant, acting as a moderator, needs to approve this new participant. In that case, the INVITE might be parked on a music-on-hold server, or a 183 response might be sent to indicate progress. A notification, using the conference notification service, would be sent to the moderator. The moderator then has the ability to manipulate the policies using the conference policy control protocol. If the policies are changed to allow this new participant, the focus can accept the INVITE (or unpark it from the music-on-hold server). The interpretation of the conference policy by the focus is, itself, a matter of local policy, and not subject to standardization.

If a participant manipulated the conference policy to indicate that a certain other participant was no longer allowed in the conference, the focus would send a BYE to that other participant to remove them. This is often referred to as "ejecting" a user from the conference. The process of ejecting fundamentally constitutes these two steps - the establishment of the policy through the conference policy control protocol, and the implementation of that policy (using a BYE) by the focus.

Similarly, if a participant manipulated the conference policy to indicate that a number of users need to be added to the conference, the focus would send an INVITE to those participants. This is often referred to as the "mass invitation" function. As with ejection, it is fundamentally composed of the policy functions that specify the participants which should be present, and the implementation of those functions using SIP.
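The two-step pattern behind ejection and mass invitation (a policy change, followed by SIP signaling from the focus) can be pictured as a comparison between the policy and the current set of dialogs. The following Python is a hedged sketch only: the function plan_signaling and its parameter names are hypothetical, and a real focus is free to interpret its conference policy differently, since that interpretation is a matter of local policy.

```python
def plan_signaling(
    dialogs: set[str],
    no_longer_allowed: set[str],
    to_be_added: set[str],
) -> tuple[list[str], list[str]]:
    """Work out which SIP requests would bring the dialogs in line with policy.

    dialogs            -- URIs of users that currently have a dialog with the focus
    no_longer_allowed  -- URIs the policy now forbids (candidates for ejection)
    to_be_added        -- URIs the policy says should be present (mass invitation)

    Returns (bye_targets, invite_targets), each sorted for stable output.
    """
    # Ejection: only users who actually have a dialog can be sent a BYE.
    bye_targets = sorted(dialogs & no_longer_allowed)
    # Invitation: skip users who are already present or who are forbidden.
    invite_targets = sorted(to_be_added - dialogs - no_longer_allowed)
    return bye_targets, invite_targets


bye, invite = plan_signaling(
    dialogs={"sip:alice@example.com", "sip:bob@example.com"},
    no_longer_allowed={"sip:bob@example.com"},
    to_be_added={"sip:alice@example.com", "sip:carol@example.com"},
)
```

In this example only Bob would receive a BYE and only Carol an INVITE; Alice already has a dialog with the focus, so no signaling is needed for her.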
A policy request to add a set of users might not require an INVITE to execute it; those users might already be participants in the conference.

A similar model exists for media policy. If the media policy indicates that a participant should not receive any video, the focus might implement that policy by sending a re-INVITE, removing the media stream to that participant. Alternatively, if the video is being centrally mixed, it could inform the mixer to send a black screen to that participant. The means by which the policy is implemented are not subject to specification.

5.2 Conference Policy Server

The conference policy server allows clients to manipulate and interact with the conference policy. The conference policy is used by the focus to make authorization decisions and guide its overall behavior. Logically speaking, there is a one-to-one mapping between a conference policy and a focus.

The conference policy is represented by a URI. There is a unique conference policy for each focus. The conference policy URI points to a conference policy server which can manipulate that conference policy. A conference policy server also has a "top level" URI which can be used to access functions that are independent of any conference. Perhaps the most important of these functions is the creation of a new conference. This will result in the construction of a new conference URI, which can then be used to join the conference itself.

The conference policy server is accessed using a client-server transactional protocol. The client can be a participant in the conference, or it can be a third party. Access control lists for who can modify a conference policy are themselves part of the conference policy.

The conference policy server also allows clients to create new conferences.
This would result in the instantiation of a focus (and therefore, a conference URI associated with that focus), a conference policy, and a media policy. The conference policy server will also have rules about who can create conferences. The conference policy also includes per-participant policies that specify how the focus is to handle a particular participant. These include whether or not the participant is anonymous, for example. 5.3 Mixers A mixer is responsible for combining the media streams that make up the conference, and generating one or more output streams that are distributed to recipients (which could be participants or other mixers). The combination process is specific to the media type, and is directed by the focus, under the guidance of the rules described in the media policy. A mixer is not aware of a "conference" as an entity, per se. A mixer receives media streams as inputs, and based on directions provided by the focus, generates media streams as outputs. There is no grouping of media streams beyond the policies that describe the ways in which the streams are mixed. A mixer is always under the control of a focus. The focus is responsible for interpreting the media policy, and then installing the appropriate rules in the mixer. If the focus is directly controlling a mixer, the mixer can either be co-resident with the focus, or can be controlled through a protocol like Megaco [9]. However, a focus need not directly control a mixer. Rather, a focus can delegate the mixing to the participants, each of which has their own mixer. This is described in Section 6.4. 5.4 Media Policy Server The media policy server is similar to the conference policy server. It is accessed using a transactional client-server protocol. It manipulates a media policy, identified by a URI. The focus has the responsibility of acting on that media policy, implementing it through direct or indirect control of mixers. 
The media policy describes the way in which the set of inputs to the mixer are combined to generate the set of outputs. Media policies can span media types. In other words, the policy on how one media stream is mixed can be based on characteristics of other media streams. Media policies can be based on any quantifiable characteristic of the media stream (its source, volume, codecs, speaking/silence, etc.), and they can be based on internal or external variables accessible by the media policy.

The media policy server is responsible for reconciliation of potentially conflicting requests regarding the media policy for the conference.

The client of the media policy protocol can be any entity interested in manipulating media policies. Clearly, participants might be interested in manipulating them. A participant might want to raise or lower the volume for one of the other participants it is hearing. Or, a participant might want to switch from a tiled video view to just viewing the active speaker.

A client of the media policy protocol could also be another server whose job is to determine the media policy. As an example, a floor control server is responsible for determining which participant(s) in a conference are allowed to speak at any given time, based on participant requests and access rules. The floor control server would act as a client of the media policy server, and inform the media policy server about who is allowed to speak.

The client of the media policy protocol could also be another media policy server, as described in Section 6.4.

Some examples of media policies include:

o The video output is the picture of the loudest speaker (video follows audio).

o The audio from each participant will be mixed with equal weight, and distributed to all other participants.

o The audio and video that are distributed are those selected by the floor control server.
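The first two example policies above can be written down as small functions, which may help make the idea of a media policy concrete. This is a minimal sketch under stated assumptions: the function names, the use of a weight of 1.0 for "equal weight", and the representation of audio level as a single number per participant are all choices made for illustration, not anything defined by this framework.

```python
def equal_weight_mix(participants: list[str]) -> dict[str, dict[str, float]]:
    """For each listener, give every *other* participant the same weight.

    The result maps listener -> {source: weight}. A listener never
    receives its own audio back.
    """
    return {
        listener: {src: 1.0 for src in participants if src != listener}
        for listener in participants
    }


def video_follows_audio(audio_levels: dict[str, float]) -> str:
    """Select the participant whose video is shown: the loudest speaker.

    audio_levels maps participant -> current audio level (any monotone
    loudness measure; the unit is an assumption of this example).
    """
    return max(audio_levels, key=lambda p: audio_levels[p])


mix = equal_weight_mix(["alice", "bob", "carol"])
speaker = video_follows_audio({"alice": 0.2, "bob": 0.9, "carol": 0.1})
```

In the terms of this framework, a focus would interpret rules like these and install the resulting configuration in the mixer; the mixer itself only sees input and output streams.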
5.5 Conference Notification Service

The focus can provide a conference notification service. In this role, it acts as a notifier, as defined in RFC 3265 [4]. It accepts subscriptions from clients for the conference URI, and generates notifications to them as the state of the conference changes. This state is composed of three separate pieces. The first is the state of the focus, the second is the conference policy, and the third is the media policy.

The state of the focus includes the participants connected to the focus, and information about the dialogs associated with them. As new participants join, this state would change, allowing subscribers to learn about them. Similarly, when someone leaves, this state also changes, allowing subscribers to learn about this fact.

The state of the conference policy includes the set of participants that are allowed, or not allowed, to join the conference, and the set of participants who are to be explicitly added to the conference. It includes the roles which are assigned to each participant, such as whether they are a moderator. If there was a change in role, for example, a new moderator was selected, the focus would inform subscribers.

The state of the media policy includes the media streams being received by each participant, the audio or video modalities, and so on.

5.6 Participants

A participant in a conference is any SIP user agent that has a dialog with the focus. This SIP user agent can be a PC application, a SIP hardphone, or a PSTN gateway. It can also be another focus. A conference which has a participant that is the focus of another conference is called a cascaded conference. Cascaded conferences can also be used to provide scalable conferences where there are regional sub-conferences, each of which is connected to the main conference.
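For a cascaded conference such as the regional arrangement just described, the conversation space defined in Section 2 can be computed by walking the dialogs outward from a focus and collecting the singletons. The sketch below is illustrative only: the function name, the representation of dialogs as URI pairs, and the explicit set of focus URIs are assumptions of this example.

```python
from collections import defaultdict


def conversation_space(
    start_focus: str,
    dialogs: list[tuple[str, str]],
    foci: set[str],
) -> set[str]:
    """Return the singletons reachable from start_focus.

    dialogs -- one (ua, ua) pair per SIP dialog; treated as undirected edges
    foci    -- the user agents that act as a focus; all others are singletons
    """
    neighbours: dict[str, set[str]] = defaultdict(set)
    for a, b in dialogs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    singletons: set[str] = set()
    seen = {start_focus}
    stack = [start_focus]
    while stack:
        node = stack.pop()
        for peer in neighbours[node]:
            if peer in seen:
                continue
            seen.add(peer)
            if peer in foci:
                stack.append(peer)    # keep walking through cascaded foci
            else:
                singletons.add(peer)  # leaf: a single user
    return singletons


dialogs = [
    ("main", "regional"),   # the regional focus is a participant of main
    ("main", "alice"),
    ("regional", "bob"),
    ("regional", "carol"),
]
space = conversation_space("main", dialogs, foci={"main", "regional"})
```

With a tree-shaped topology, starting from either focus yields the same three singletons, which matches the expectation that all of them receive each other's media.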
A conference topology refers to a graph which shows each focus and each participant as a vertex, with a connection between each participant and its focus.

5.7 Conference Policy

The conference policy contains the rules that guide the operation of the focus. These rules can be simple, such as an access list that defines the set of allowed participants in a conference. The rules can also be incredibly complex, specifying time-of-day based rules on participation conditional on the presence of other participants. It is important to understand that there is no restriction on the type of rules that can be encapsulated in a conference policy.

However, there is a protocol by which a client can request a change in the conference policy. This is done by communicating with the conference policy server, which manipulates the conference policy. By the nature of conference policies, not all aspects of the policy can be manipulated with the conference policy control protocol. It is the responsibility of the conference policy server to reconcile the various requests with the conference policy.

5.8 Media Policy

The media policy contains the rules that guide the operation of the mixer. The focus uses these rules to interact with the mixer to implement them. These rules can be simple (mix all media from all participants), or they can be incredibly complex. It is important to understand that there is no restriction on the type of rules that can be encapsulated in a media policy.

However, there is a protocol by which a client can request a change in the media policy. This is done by communicating with the media policy server, which manipulates the media policy. By the nature of media policies, not all aspects of the policy can be manipulated with the media policy control protocol.
It is the responsibility of the media policy server to reconcile the various requests with the media policy.

6 Physical Realization

In this section, we present several physical instantiations of these components, to show how these basic functions can be combined to solve a variety of problems.

6.1 Centralized Server

In the simplest realization of this framework, there is a single physical server in the network which implements the focus, the conference policy server, the media policy server, and the mixer. This is the classic "one box" solution, shown in Figure 3.

[Figure 3: Centralized server architecture. A single conference server contains the media policy server, the conference notification server, the conference policy server, the focus, and the mixer. Each participant has a SIP connection to the focus and an RTP connection to the mixer.]

6.2 Endpoint Server

Another important model is that of a locally-mixed ad-hoc conference. In this scenario, two users (A and B) are in a regular point-to-point call. One of the participants (A) decides to conference in a third participant, C. To do this, A begins acting as a focus. Its existing dialog with B becomes the first dialog attached to the focus. A would re-INVITE B on that dialog, changing its Contact URI to a new value which identifies the focus. In essence, A "mutates" from a single-user UA to a focus plus a single user UA, and in the process of such a mutation, its URI changes. Then, the focus makes an outbound INVITE to C. When C accepts, A mixes the media from B and C together, redistributing the results. The mixed media is also played locally. Figure 4 shows a diagram of this transition.

[Figure 4: Transition from two-party call to conference. Before the transition, the UAs of A and B share one SIP dialog and one RTP stream. After the transition, A hosts the focus, the media policy server, the conference policy server, and the mixer, together with its own participant, which is attached over an internal interface. B and C each have a SIP dialog and an RTP stream with A.]

It is important to note that the external interfaces in this model, between A and B, and between A and C, are exactly the same as those that would be used in a centralized server model. A could also include a media policy server and a conference notification service, allowing the participants to have access to them if they so desired. Just because the focus is co-resident with a participant does not mean any aspect of the behaviors and external interfaces will change.

6.3 Media Server Component

[Figure 5: Media server component model. An application server (focus, conference policy, media policy, notification) is connected to a conferencing component (focus, media policy, mixer) by SIP, a conference protocol, and a media protocol. Each participant has a SIP connection to the application server and an RTP connection to the conferencing component.]

In this model, shown in Figure 5, each conference involves two centralized servers.
One of these servers, referred to as the "application server", owns and manages the conference and media policies, and maintains a dialog with each participant. As a result, it represents the focus seen by all participants in a conference. However, this server doesn't provide any media support. To perform the actual media mixing function, it makes use of a second server, called the "mixing server". This server includes a focus, but has no conference policy server or conference notification service. It has a default conference policy, which accepts all invitations from the top-level focus. Its media policy server accepts any controls made by the application server. The focus in the application server uses third party call control to connect the media streams of each user to the mixing server, as needed. If the focus in the application server receives a media policy control command from a client, it delegates that to the mixing server by making the same media policy control command to it.

This model allows for the mixing server to be used as a resource for a variety of different conferencing applications. This is because it is unaware of any conference or media policies; it is merely a "slave" to the top-level server, doing whatever it asks. This is consistent with the SIP Application Server Component Model [10].

6.4 Distributed Mixing

In a distributed mixed conference, there is still a centralized server which implements the focus, conference policy server, and media policy server. However, there is no centralized mixer. Rather, there is a mixer in each endpoint, along with a media policy server. The focus distributes the media by using third party call control [11] to move a media stream between each participant and each other participant.
As a result, if there are N participants in the conference, there will be a single dialog between each participant and the focus, but the session description associated with that dialog will be constructed to allow media to be distributed amongst the participants. This is shown in Figure 6.

There are several ways in which the media can be distributed to each participant for mixing. In a multi-unicast model, each participant sends a copy of its media to each other participant. In this case, the session description manages N-1 media streams. In a multicast model, each participant joins a common multicast group, and each participant sends a single copy of its media stream to that group. The underlying multicast infrastructure then distributes the media, so that each participant gets a copy. In a single-source multicast model (SSM), each participant sends its media stream to a central point, using unicast. The central point then redistributes the media to all participants using multicast. The focus is responsible for selecting the modality of media distribution, and for handling any hybrids that would be necessitated from clients with mixed capabilities.

When a new participant joins or is added, the focus will perform the necessary third party call control to distribute the media from the new participant to all the other participants, and vice versa.

The central conference server also includes a media policy server. Of course, the central conference server cannot implement any of the media policies directly. Rather, it would delegate the implementation to the media policy servers co-resident with a participant. As an example, if a participant decides to switch the overall conference mode from "video follows audio" to "tiled video", they would communicate with the central media policy server.
This media policy server, in turn, would communicate with the media policy servers co-resident with each participant, using the same media policy control protocol, and instruct them to use "tiled video".

This model requires additional functionality in user agents, which may or may not be present. The participants, therefore, must be able to advertise this capability to the focus.

[Figure 6: Dialog and media streams in a distributed mixed conference. A central conference server (focus, media policy server, conference policy server) has one dialog with each of three participants. Each participant contains a mixer and a media policy server, and media flows directly between each pair of participants.]

6.5 Cascaded Mixers

In very large conferences, it may not be possible to have a single mixer that can handle all of the media. A solution to this is to use cascaded mixers. In this architecture, there is a centralized focus, but the mixing function is implemented by a multiplicity of mixers, scattered throughout the network. Each participant is connected to one, and only one, of the mixers. The focus uses some kind of control protocol (such as MEGACO [9]) to connect the mixers together, so that all of the participants can hear each other.

This architecture is shown in Figure 7.

[Figure 7: Cascaded Mixers. A central focus has a SIP dialog with each participant and a control protocol connection to each mixer. Media flows between the mixers, and between each mixer and the participants attached to it. Legend: ------- SIP Dialog, ....... Media Flow, +++++++ Control Protocol.]

7 Common Operations

There are a large number of ways in which users can interact with a conference. They can join, leave, set policies, approve members, and so on. This section is meant as an overview of the basic primitives, summarizing how they operate. More detailed examples with complete call flows can be found in [12].

7.1 Creating Conferences

There are many ways in which a conference can be created. Ultimately, all of them result in the establishment of a conference URI which identifies a focus. In all cases, a conference URI must be created by the focus itself, or an element which is responsible for managing URIs that are used by the focus. Otherwise, the uniqueness of conference URIs could not be guaranteed.

Using the conference policy control protocol, a client can instruct the conference policy server to create a new conference. The result of this operation is a conference URI, which is returned to the client.

Another way to obtain a conference URI is simply to guess one.
In an instant conferencing server, there is an effectively unlimited number of conference URIs which can be used. Each of them is a valid conference URI, since it identifies a focus, and an INVITE sent to it will join the user into that conference. As a result, a client can simply choose one of them at random, so long as it is configured with the domain portion of the URI and any naming conventions in use by the instant conferencing server.

OPEN ISSUE: Do we need to specify standards for this?

The previous two approaches are used to obtain conference URIs for focuses that are hosted within centralized servers. Creation of conferences where the focus resides in an endpoint operates differently. There, the endpoint itself creates the conference URI, and hands it out to other endpoints which are to be the participants. What differs from case to case is how the endpoint decides to create a conference.

One important case is the ad-hoc conference described in Section 6.2. There, an endpoint unilaterally decides to create the conference based on local policy. The dialogs that were connected to the UA are migrated to the endpoint-hosted focus, using a re-INVITE to pass the conference URI to the newly joined participants.

Alternatively, one UA can ask another UA to create an endpoint-hosted conference. This is accomplished with the SIP Join header [13]. The UA which receives the Join header in an invitation may need to create a new conference URI (a new one is not needed if the dialog that is being joined is already part of a conference). The conference URI is then handed to the recently joined participants through a re-INVITE.

7.2 Adding Participants

There are two modes for adding participants to a conference: first party additions and third party additions. In a first party addition, the participant that wishes to join makes a direct attempt to join.
In a third party addition, some other participant takes action with the aim of causing a third party to be added to the conference.

First party additions are trivially accomplished with a standard INVITE. A participant can send an INVITE request to the conference URI, and if the conference policy allows them to join, they are added to the conference.

If a UA does not know the conference URI, but has learned about a dialog which is connected to a conference (by using the dialog event package, for example [14]), the UA can join the conference by using the Join header to join the dialog.

Third party additions can be done in one of several ways. The first approach is for the user to ask the third party to send an INVITE to the conference URI. This can be done automatically through the usage of REFER [15]. The participant would send a REFER request to the third party. The Refer-To header field in that request would contain the conference URI.

There are countless non-automated means for asking a participant to send an INVITE to the conference URI. A user can send an instant message [16] to the third party, containing an HTML document which requests the user to click on the hyperlink to join the conference:

   Hey, would you like to join the conference now?

The second approach for third party additions is for the participant to ask the focus to add the third party to the conference. In this case, however, a REFER cannot be used. REFER would have the effect of telling the focus to send an INVITE to the new potential participant. However, just sending this INVITE is not sufficient for adding the new member. In more complex realizations, such as the distributed mixing scenario of Section 6.4, a multiplicity of invitations will need to be sent. This would require the focus to attach additional meaning to REFER; it would have to be interpreted as a request to add a participant to the conference.
However, it is fundamental to the concept of REFER that the recipient not attach specific application semantics to it. Therefore, it cannot be used. Rather, the user would use the conference policy control protocol to request that the focus add the new participant.

The conference policy control protocol can also be used to add a multiplicity of new users. This is referred to as mass invitation.

In many cases, a new participant will not wish to join the conference unless they can join with a particular set of policies. As an example, a participant may want to join anonymously, so that other participants know that someone has joined, but not who. To accomplish this, the conference policy control protocol is used to establish these policies prior to the generation or acceptance of an invitation to the conference. For example, if a user wishes to join a conference with a known conference URI, the user would obtain the URI for the conference policy, manipulate the policy to set themselves as an anonymous participant, and then actually join the conference by sending an INVITE request to the conference URI.

OPEN ISSUE: Will this always work? Are there cases where the conference policy cannot be manipulated until the INVITE has been sent? This would require a preconditions-style solution.

7.3 Removing Participants

As with additions, there are two modalities for departures: first person, in which a user explicitly leaves, and third person, in which they are removed by a different user.

First person departures are trivially accomplished by terminating the dialog that the participant is using to connect to the focus.

Third person departures can be done in one of two ways. First, a user can make use of the REFER method to instruct the third party to send a BYE to the conference server on the dialog that connects them to the focus.
This requires the user to have knowledge of the dialog identifiers used by that participant. The second mechanism, which is much cleaner, is to use the conference policy control protocol to inform the focus that the participant is explicitly barred from the conference. This will cause the focus to eject the user, sending them a BYE in addition to whatever other signaling is needed to remove them.

The conference policy control protocol can also be used to remove a large number of users. This is generally referred to as mass ejection.

7.4 Approving Policy Changes

A conference policy for a particular conference may designate one or more users as moderators for some set of media policy or conference policy change requests. This means that those moderators need to approve the specific policy change. Typically, moderators are used to approve member additions and removals. However, the framework allows for moderators to be associated with any policy change that can be made.

The general model to support moderator approval is through the conference notification service. The moderator subscribes to the notification service. They are authenticated by the focus, which determines that they are a moderator for the conference. Whenever a policy change request is made by a client that requires moderator approval, the policy change is not actually committed. Rather, it is marked as pending by the conference policy server. Any moderators for that specific policy request who are subscribed to the conference notification service will receive a notification of the pending change. The moderators, using the conference policy control protocol, can approve the specific change. This commits the new policy. All participants are then notified of the new policy through the notification service.
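The approval model of Section 7.4 can be sketched as a small state machine. The following is a minimal illustration, not a definition of any protocol: the class and method names are hypothetical, the conference policy server and the notification service are folded into one object for brevity, and notifying every subscriber on commit is an assumption standing in for "all participants".

```python
from dataclasses import dataclass, field


@dataclass
class PolicyChange:
    change_id: int
    requester: str
    description: str
    status: str = "pending"  # becomes "committed" once a moderator approves


@dataclass
class ConferencePolicyServer:
    """Hypothetical combination of policy server and notification service."""
    moderators: set
    subscribers: set = field(default_factory=set)
    changes: dict = field(default_factory=dict)
    notifications: list = field(default_factory=list)  # (recipient, message) pairs
    next_id: int = 1

    def subscribe(self, user):
        self.subscribers.add(user)

    def request_change(self, requester, description):
        # The change is recorded as pending; it is not committed yet.
        change = PolicyChange(self.next_id, requester, description)
        self.next_id += 1
        self.changes[change.change_id] = change
        # Only moderators who are subscribed learn about the pending change.
        for user in sorted(self.moderators & self.subscribers):
            self.notifications.append((user, f"pending:{change.change_id}"))
        return change.change_id

    def approve(self, moderator, change_id):
        if moderator not in self.moderators:
            raise PermissionError(f"{moderator} is not a moderator")
        self.changes[change_id].status = "committed"
        # Once committed, every subscriber is told about the new policy.
        for user in sorted(self.subscribers):
            self.notifications.append((user, f"committed:{change_id}"))
```

In this sketch a request from an ordinary participant stays pending until a moderator calls approve, which mirrors the pending and commit steps described above.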
7.5 Creating Sidebars

A sidebar is a "conference within a conference", allowing a subset of the participants to converse amongst themselves. Frequently, participants in a sidebar will still receive media from the main conference, but "in the background". For audio, this may mean that the volume of the media is reduced, for example.

There are two ways to represent a sidebar in this framework. The first is to treat it as a specific kind of media policy. It is a media policy which would request that sidebar participants be "in the foreground", and others "in the background". There are no additional dialogs or conferences established. The media policy control protocol would allow a user to explicitly request sidebars. The server would alert users (through the notification service) that they have been invited to the sidebar. They would use the media policy control protocol to approve their participation in it.

An alternative view is that a sidebar truly is a conference within a conference, and would be implemented that way. There would be a new conference URI associated with the sidebar. Standard techniques would be used to add users to the sidebar, approve their membership, and so on. The sidebar would itself be a participant in the main conference. Users would continue to receive their media stream only through the main conference. They would have a dialog with the sidebar focus, but no media would be exchanged on this dialog.

OPEN ISSUE: It is still unclear as to which model is preferable. We should pick one.

8 Security Considerations

Conferences frequently require security features in order to properly operate. The conference policy may dictate that only certain participants can join, or that certain participants can create new policies. Generally speaking, conference applications are very concerned about authorization decisions.
Mechanisms for establishing and enforcing such authorization rules are a central concept throughout this document.

Of course, authorization rules require authentication. Normal SIP authentication mechanisms should suffice for the conference authorization mechanisms described here.

9 Contributors

This document is the result of discussions amongst the conferencing design team. The members of this team include:

   Brian Rosen
   Rohan Mahy
   Henning Schulzrinne
   Orit Levin
   Roni Even
   Tom Taylor
   Petri Koskelainen
   Nermeen Ismail
   Andy Zmolek
   Joerg Ott
   Dan Petrie

10 Authors Addresses

   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com

11 Normative References

12 Informative References

[1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler, "SIP: session initiation protocol," RFC 3261, Internet Engineering Task Force, June 2002.

[2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: a transport protocol for real-time applications," RFC 1889, Internet Engineering Task Force, Jan. 1996.

[3] O. Levin et al., "Requirements for tightly coupled SIP conferencing," Internet Draft, Internet Engineering Task Force, July 2002. Work in progress.

[4] A. B. Roach, "Session initiation protocol (sip)-specific event notification," RFC 3265, Internet Engineering Task Force, June 2002.

[5] B. Campbell and J. Rosenberg, "Instant message sessions in simple," Internet Draft, Internet Engineering Task Force, Oct. 2002. Work in progress.

[6] J. Rosenberg and H. Schulzrinne, "A session initiation protocol (SIP) event package for conference state," Internet Draft, Internet Engineering Task Force, June 2002. Work in progress.

[7] T. Berners-Lee, R. Fielding, and L. Masinter, "Uniform resource identifiers (URI): generic syntax," RFC 2396, Internet Engineering Task Force, Aug. 1998.

[8] H. Schulzrinne and J. Rosenberg, "Session initiation protocol (SIP) caller preferences and callee capabilities," Internet Draft, Internet Engineering Task Force, July 2002. Work in progress.

[9] F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J. Segers, "Megaco protocol version 1.0," RFC 3015, Internet Engineering Task Force, Nov. 2000.

[10] J. Rosenberg, P. Mataga, and H. Schulzrinne, "An application server component architecture for SIP," Internet Draft, Internet Engineering Task Force, Mar. 2001. Work in progress.

[11] J. Rosenberg, J. Peterson, H. Schulzrinne, and G. Camarillo, "Best current practices for third party call control in the session initiation protocol," Internet Draft, Internet Engineering Task Force, June 2002. Work in progress.

[12] A. Johnston and O. Levin, "Session initiation call control - conferencing for user agents," Internet Draft, Internet Engineering Task Force, Oct. 2002. Work in progress.

[13] R. Mahy and D. Petrie, "The session initiation protocol (sip) join header," Internet Draft, Internet Engineering Task Force, Oct. 2002. Work in progress.

[14] J. Rosenberg and H. Schulzrinne, "A session initiation protocol (SIP) event package for dialog state," Internet Draft, Internet Engineering Task Force, June 2002. Work in progress.

[15] R. Sparks, "The SIP refer method," Internet Draft, Internet Engineering Task Force, July 2002. Work in progress.

[16] B. Campbell and J. Rosenberg, "Session initiation protocol extension for instant messaging," Internet Draft, Internet Engineering Task Force, Sept. 2002. Work in progress.

Full Copyright Statement

Copyright (c) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.