Internet Engineering Task Force                              SIPPING WG
Internet Draft                                             J. Rosenberg
                                                            dynamicsoft
                                                         H. Schulzrinne
                                                            Columbia U.
draft-ietf-sipping-conferencing-models-01.txt
July 1, 2002
Expires: January 2003

             Models for Multi Party Conferencing in SIP

STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.
   Note that other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   To view the list of Internet-Draft Shadow Directories, see
   http://www.ietf.org/shadow.html.

Abstract

   The Session Initiation Protocol (SIP) can support multi-party
   conferencing in many different ways. In this draft, we define the
   various multi-party conferencing models, and for each, discuss how
   they are used and then analyze their relative benefits and drawbacks.

Table of Contents

   1        Introduction
   2        End System Mixing
   2.1      Inviting Users to Join
   2.2      Users Joining
   2.3      Scalability
   2.4      Location of Service Logic
   2.5      Discovering Participant Identities
   3        Large-Scale Multicast Conferences
   3.1      Inviting Users to Join
   3.2      Users Joining
   3.3      Scalability
   3.4      Location of Service Logic
   3.5      Discovering Participant Identities
   4        Dial-In Conference Servers
   4.1      Inviting Users to Join
   4.2      Users Joining
   4.3      Scalability
   4.4      Location of Service Logic
   4.5      Discovering Participant Identities
   5        Ad-hoc Centralized Conferences
   5.1      Inviting Users to Join
   5.2      Users Joining
   5.3      Scalability
   5.4      Location of Service Logic
   5.5      Discovering Participant Identities
   6        Dial-Out Conferences
   6.1      Inviting Users to Join
   6.2      Users Joining
   6.3      Scalability
   6.4      Location of Service Logic
   6.5      Discovering Participant Identities
   7        Centralized Signaling, Distributed Media
   7.1      Inviting Users to Join
   7.2      Users Joining
   7.3      Scalability
   7.4      Location of Service Logic
   7.5      Discovering Participant Identities
   8        Summary of Models
   9        Security Considerations
   10       Conclusion
   11       Acknowledgements
   12       Authors' Addresses
   13       Normative References
   14       Informative References

1 Introduction

   The Session Initiation Protocol (SIP) [1] has been defined for the
   establishment, maintenance, and termination of calls between one or
   more users. However, despite its origins as a large scale multiparty
   conferencing protocol, SIP is used today primarily for point to point
   calls.
   This configuration is the focus of the SIP specification and most of
   its extensions. As a result, there is a lot of confusion about how
   SIP supports multi-party conferencing.

   We seek to remedy this problem by describing, in a consistent and
   complete fashion, the various multi-party conferencing models
   supported by standard SIP. For each model, we discuss:

      o  How the model works.

      o  How users are invited to join.

      o  How users can join an existing conference without being
         invited.

      o  How well the model scales.

      o  Which entities need to be aware of the model.

      o  How participants learn about each other.

   We also identify missing pieces and recommend standards activity to
   fill them in. This document itself does not define any new extensions
   of any kind. However, several scenarios discussed in the draft make
   use of existing extensions to SIP.

2 End System Mixing

   The first model we call "end system mixing". In this model, user A
   calls user B, and they have a conversation. At some point later, A
   decides to conference in user C. To do this, A calls C, using a
   completely separate SIP call. This call uses a different Call-ID,
   different tags, etc. There is no call set up directly between B and
   C. A receives media streams from both B and C, and mixes them. A
   sends a stream containing A's and C's streams to B, and a stream
   containing A's and B's streams to C.

   This model is depicted graphically in Figure 1.

   Basically, user A handles both signaling and media mixing. B and C
   are unaware of the multi-party call, from a SIP perspective at least.

                                             +----------+
                                             |          |
                                       --    |          |
                                    ---      |    B     |
                   SIP call      ---         |          |
                              ---        ..  |          |
                           ---        ..     +----------+
                         --        ...
                                ...
             +----------+     ..   RTP
             |          |   ..
             |          |
             |    A     |   ..
             |          |     ..
             |          |       ..  RTP
             +----------+         ..
                         --         ..
                           --         ..
                             ---        .
                                             +----------+
                                --           |          |
                                  --         |          |
                   SIP call         --       |          |
                                             |    C     |
                                             |          |
                                             +----------+

         Figure 1: Three Way Calling using End System Mixing

   From an RTP perspective, A is a mixer, and so the RTCP reports from A
   will contain SDES information that indicates the existence of an
   additional party in the media stream.

   Note that this model has the serious drawback that the conference
   ends when the mixing UA leaves the call.

        OPEN ISSUE: Another problem with this approach is that
        there is no specific way for A to determine when a
        signaling message it receives was meant just for it, or for
        the entire conference. For example, if B sends a REFER to
        A, pointing to user D, was this REFER meant for A alone, or
        for A and C? If it was meant for A and C, presumably A
        would act upon the REFER and send it to C as well. C too
        would act on the REFER. This would cause two separate
        REFER-triggered INVITEs to get routed to D. How would D
        know that both INVITEs need to be mixed together as a
        conference? What if it cannot support this capability?

   Because the three-way calling approach works only for the most basic
   case, we do not recommend it as a general solution.

2.1 Inviting Users to Join

   Any user in the conference can invite another user to join, so long
   as they are capable of performing the required mixing and signaling
   functions. To invite a new user to join, a user in the conference
   simply calls them using normal SIP procedures. The only difference is
   that the stream sent to that new user contains the streams received
   from the other parties in the call.

   In fact, it is acceptable for complex connectivity graphs to be
   constructed, as a result of different users inviting other users to
   join. For example, take our case of A calling B, and then calling C.
   If, later on, C calls D, C will perform the mixing of the streams
   it gets from A (which actually contain media from A and B), along
   with its own stream, and send that to D. This results in a
   connectivity graph that looks like Figure 2.

                              A------B
                              |
                              |
                              C------D

                     Figure 2: Connectivity Graph

   Note, however, that there is a possibility of loops. From here, if D
   calls B, and brings that stream into the conference, a loop is
   created. This loop can be detected using the mechanisms described in
   the RTP specification [2]. However, we expect these conditions to be
   extremely rare. Presumably, D knows B is in the conference already,
   and so would not likely call B and invite them in.

   A serious problem with the more complex topologies is that the
   departure of a participant might cause a partition of the conference
   into several sub-conferences which cannot easily be healed.

2.2 Users Joining

   In this model, there is no explicit conference "identifier" that
   can be used to join. This conference model, by its nature, is built
   around ad-hoc conferences. However, it is still possible for a user
   to join in the following way.

   Let's say a new user, E, simply calls B, unaware even that B is in a
   conference (E might actually be aware, but the SIP messaging is no
   different). B's softphone, recognizing that B is already in a
   conference, asks B if E should be brought into the conference right
   away. If B clicks "yes", the call from E is answered. The media
   stream sent to E contains media from B, along with the media B is
   already receiving from A.

   If B had instead clicked "no", E can easily be added to the
   conference later. No SIP signaling at all is needed to do this. B
   simply starts sending the mixed media to E.

2.3 Scalability

   A drawback of this model is its scalability.
   Viewing the conference from a graph perspective, if the number of
   edges touching a vertex (its degree) equals N, the user corresponding
   to that vertex has to perform up to N separate media stream
   encodings. We say "up to", as it depends on the number of
   participants who are talking at once. If only one participant is
   talking, the non-talking "mixer" endpoints don't need to do any
   additional encoding. If everyone is talking, it is N encodes. Since
   encoding is generally a complex process, a typical workstation these
   days can handle two or three simultaneous encodes using a low rate
   codec like G.723.1. The problem can be mitigated somewhat by
   distributing the mixing responsibilities (making the graph deep
   rather than wide). However, this requires a conscious effort of the
   participants regarding who is to make the call to add a new user.
   This is unlikely to happen in practice.

   Another limitation to scalability is bandwidth. If the degree of a
   vertex is N, the user needs enough bandwidth to send and receive up
   to N streams, for a total of 2N. On a 56K modem, using a G.723.1
   codec, this limits the degree to two (remember RTP overheads). This
   limitation exists even if only one user is talking. In this case, a
   mixing host receives the encoded packet stream, and needs to send a
   copy to each participant it is connected to.

   For these reasons, this conferencing model is ideal for three-way
   conferences (i.e., degrees of two), but doesn't scale up much higher.

2.4 Location of Service Logic

   This model does not require any extension to SIP in order to work. It
   does require knowledge of this mechanism within the UA performing the
   mixing. Non-mixing participants do not need to know anything special.

2.5 Discovering Participant Identities

   The identities of other participants in the conference are NOT known
   through SIP. Rather, they are learned through RTP.
   UAs with degrees greater than one are RTP mixers. As such, they take
   the RTCP SDES of the streams they mix, and aggregate them into the
   RTCP stream sent out. Since RTCP messages are sent infrequently,
   there may be a delay between when a user joins, and when their
   presence is known to the other participants.

3 Large-Scale Multicast Conferences

   Large-scale multicast conferences were the original motivation for
   both the Session Description Protocol (SDP) [3] and SIP. In a
   large-scale multicast conference, one or more multicast addresses are
   allocated to the conference (more than one may be needed if layered
   encodings are in use). Each participant joins those multicast groups,
   and sends their media to those groups. Signaling is not sent to the
   multicast groups. The sole purpose of the signaling is to inform
   participants of which multicast groups to join.

   Large-scale multicast conferences are usually pre-arranged, with
   specific start and stop times (which is why this information exists
   in SDP). Protocols such as the Session Announcement Protocol (SAP)
   [4] are used to announce these conferences. However, multicast
   conferences do not need to be pre-arranged, so long as a mechanism
   exists to dynamically obtain a multicast address. SAP itself was
   originally used for this purpose; this has been supplanted by the
   malloc architecture [5], still under development.

   So, if there are N participants, there will be point-to-point SIP
   relationships between pairs of participants. Each participant sends a
   single media stream to the group, and receives up to N-1 streams at
   any time. Note that the number of streams that a user will receive
   depends on who is actually sending at any given time. If the stream
   is audio, and silence suppression is utilized, the number of streams
   a user will receive at any given time is equal to the number of users
   talking at any given time.
   Even for very large conferences, this is usually just a small number
   of users.

3.1 Inviting Users to Join

   Inviting users to join is simple. Any user may invite any other user
   to join. The SIP INVITE request contains SDP that indicates multicast
   addresses for each media line. The SDP in the 200 OK response may
   actually be empty. From Section B.3 of RFC2543:

        For multicast, receive and send multicast addresses are the
        same and all parties use the same port numbers to receive
        media data. If the session description provided by the
        caller is acceptable to the callee, the callee can choose
        not to include a session description or MAY echo the
        description in the response.

   The called party then joins the multicast groups indicated in the
   SDP, using multicast protocols such as IGMP [6]. Note that it is not
   even necessary for users to send each other BYE messages when the
   conference is over, especially for large-scale, pre-arranged
   conferences that have explicit end times indicated in SDP.

        OPEN ISSUE: Do we need to specify a SIP mechanism for
        indicating that no BYE is needed?

   SDP aside, a participant can simply leave the conference at any time
   by leaving the multicast groups. No SIP signaling is needed to
   accomplish this.

3.2 Users Joining

   Users can join a conference of this type without being invited. All
   they need is the multicast addresses, ports, and codecs being used.
   These can be obtained through any number of means, including SAP. SDP
   conference descriptions can even be obtained from web pages, for
   example.

   Once the addresses are obtained, the user simply joins the
   appropriate multicast groups. Note that absolutely no SIP signaling
   is required in this case.

3.3 Scalability

   The scalability of conferences of this type can be excellent,
   especially for audio conferences.
   However, it is scalable only under the assumption that multicast
   itself can scale to very large groups. Indeed, in local networks,
   protocols like DVMRP [7] and PIM-DM have tremendous scalability for
   conferences with very large numbers of members (the so-called dense
   modes). Given the existence of scalable multicast, the primary
   bottleneck to scalability of this conference type is the periodicity
   of RTCP reporting. Work has been done on improving the problematic
   cases [8] so that conferences with well over a million members are
   possible.

   Scaling is a bit harder for video conferences. Unlike voice, where
   silence suppression allows for no data to be sent during periods of
   inactivity, the same is not the case for video. This makes it hard to
   scale without flooding users with lots of video packets.

   Security is also hard for multicast conferences. Group key
   management, especially when users leave the group, is very complex.

   Unfortunately, multicast has not been widely deployed across
   backbones (some, like Internet2, do support it, but they are the
   exception rather than the rule). The MBone has collapsed, for all
   intents and purposes. Very few ISPs support multicast. As a result,
   wide area conferences are not really viable using multicast. However,
   these conferences are very suitable for LAN or enterprise
   conferences, where multicast is often deployed.

3.4 Location of Service Logic

   This conferencing model does not require any SIP extensions. It does
   require that SIP UAs are prepared to receive SIP invitations with
   multicast addresses in the SDP. These UAs need to be prepared to
   mirror the SDP in the response. They should also be prepared to never
   receive a BYE for the conference.

3.5 Discovering Participant Identities

   The identity of the participants in the session is learned entirely
   through RTCP.
   Each user in a group multicasts RTCP packets with their name, email
   address, and so on. Note, however, that in large conferences, there
   may be significant amounts of time between a participant joining, and
   the sending of their first RTCP SDES packet (this is for receivers
   only; senders will become known much faster).

4 Dial-In Conference Servers

   Dial-In conference servers closely mirror dial-in conference bridges
   in the traditional PSTN.

   A dial-in conference server acts as a normal SIP UA. Users call it,
   and the server maintains point to point SIP relationships with each
   user that calls in. The server takes the media from the users who
   dial into the same conference, mixes them, and sends out the
   appropriate mixed stream to each participant separately.

                              +-----+
                              |     |
                              |  A  |
                              |     |
                              +-----+
                                | .
                                | .
                                | .
                                | .
                                | .
                            +---------+
            +-----+         |         |         +-----+
            |     |---------|  Conf.  |---------|     |
            |  D  |         | Server  |         |  B  |
            |     |.........|         |.........|     |
            +-----+         |         |         +-----+
                            +---------+
                                | .
                                | .
                                | .
                                | .
                              +-----+
                              |     |
                              |  C  |
                              |     |
                              +-----+

                 Figure 3: Dial-In Conference Servers

   The model is depicted in Figure 3. Note that each UA (A, B, C, D) has
   a point to point SIP and RTP relationship with the conference server.
   Each call has a different Call-ID. Each user sends their own media to
   the server. The media delivered to user A by the server is the media
   mixed from users B, C and D. The media delivered to user B by the
   server is the media mixed from users A, C and D. The media delivered
   to user C by the server is the media mixed from users A, B and D. The
   media delivered to user D is the media mixed from users A, B and C
   (this is also known as a mix-minus configuration).

   The conference is identified by the request URI of the calls from
   each participant.
   This provides numerous advantages from a services and routing point
   of view [9]. For example, one conference on the server might be known
   as sip:conference34@servers.com. All users who call
   sip:conference34@servers.com are mixed together.

   Dial-In conference servers are usually associated with pre-arranged
   conferences. However, the same model applies to ad-hoc conferences.
   An ad-hoc conference server creates the conference state when the
   first user joins, and destroys it when the last one leaves. The SIP
   and RTP interfaces are identical to the pre-arranged case.

   Since conferencing servers are nothing more than SIP UASes, they can
   use any of the procedures SIP allows a UAS to use. This includes
   authentication. So, for example, a specific conference may have a
   password associated with it. Users who join are challenged (with a
   401) using digest authentication. The realm, in this case, would
   identify the conference. The INVITE that comes back would have an
   Authorization header that includes the response to the challenge -
   the name of the user trying to join the conference, and the
   conference password, hashed as defined in [10].

   Conferences can also limit the number of participants. When a new
   user tries to join, but the conference is full, the conference server
   can just reject the request with a "500 Conference Full" response.

4.1 Inviting Users to Join

   Inviting users to join is done using the SIP REFER message [11].
   If user A wishes to ask user B to join, A would send B a REFER that
   looks like:

      REFER sip:B@example.com SIP/2.0
      From: sip:A@example.com
      To: sip:B@example.com
      Refer-To: sip:conference34@servers.com

   This would cause B to send an INVITE message to the conference
   server:

      INVITE sip:conference34@servers.com SIP/2.0
      From: sip:B@example.com
      To: sip:conference34@servers.com
      Referred-By: sip:A@example.com

   Since the request URI identifies the conference, this will cause B to
   get added to conference 34.

   An additional mechanism for inviting a user to join is to send a
   REFER from A to the conference server, with a Refer-To containing the
   address of B. This REFER would look like:

      REFER sip:conference34@servers.com SIP/2.0
      From: sip:A@example.com
      To: sip:B@example.com
      Refer-To: sip:B@example.com

   This approach has the advantage that it doesn't require REFER support
   from B, only from the conference server.

        OPEN ISSUE: A problem with the mechanisms for adding a user
        is that they assume that the UA for user A (the one who
        adds another user to the conference) knows that it is
        indeed talking to a conference server. If the mechanisms in
        this section were applied to a UA which was not a
        conference server, the result would be the creation of
        additional call legs, but not a conference. This means that
        we require some mechanism for identifying that a URL is a
        conference URL.

4.2 Users Joining

   It is easy for users to join the conference. The participant that
   wishes to join simply sends an INVITE to the conference server, with
   the conference ID in the request URI. The conference ID (which is a
   SIP URL) can be learned by any number of means, including having it
   on a web page, receiving it in an email, etc.
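
   The way a dial-in server groups callers by request URI can be
   sketched in a few lines of Python. This is only an illustration
   under stated assumptions: the class name ConferenceServer, its
   methods, and the max_participants parameter are hypothetical and
   not part of SIP. It models bookkeeping only (no SIP parsing,
   authentication, or media), creating conference state when the first
   user joins and destroying it when the last one leaves, and it uses
   the "500 Conference Full" rejection mentioned earlier in this
   section.

```python
class ConferenceServer:
    """Toy model of a dial-in conference server (hypothetical API).

    It only tracks which callers belong to which conference; real SIP
    parsing, authentication, and media mixing are out of scope.
    """

    def __init__(self, max_participants=None):
        # Optional per-conference limit; None means unlimited.
        self.max_participants = max_participants
        # Maps a conference request URI to the set of caller URIs in it.
        self.conferences = {}

    def handle_invite(self, request_uri, from_uri):
        """Add the caller to the conference named by the request URI."""
        members = self.conferences.get(request_uri, set())
        full = (self.max_participants is not None
                and len(members) >= self.max_participants)
        if full and from_uri not in members:
            return 500  # "500 Conference Full", as in the text
        # Conference state is created when the first user joins.
        self.conferences.setdefault(request_uri, members).add(from_uri)
        return 200

    def handle_bye(self, request_uri, from_uri):
        """Remove the caller; drop the conference once it is empty."""
        members = self.conferences.get(request_uri)
        if members is None:
            return
        members.discard(from_uri)
        if not members:
            del self.conferences[request_uri]


server = ConferenceServer(max_participants=2)
conf = "sip:conference34@servers.com"
for caller in ("sip:A@example.com", "sip:B@example.com",
               "sip:C@example.com"):
    print(caller, server.handle_invite(conf, caller))
```

   Keying the state on the request URI alone is what lets a caller join
   without any prior signaling relationship to the other participants.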

   For example, if B wishes to join sip:conference34@servers.com, B
   would send the following request:

      INVITE sip:conference34@servers.com SIP/2.0
      From: sip:B@example.com
      To: sip:conference34@servers.com

4.3 Scalability

   The scalability of this model is limited by the bandwidth and
   processing power of the conference server. If there are N
   participants in a conference, M of which are sending media streams,
   the server will need to manage N signaling relationships, perform M
   RTP stream decodes, and N RTP stream encodes (assuming M > 0). The
   encoding is the primary processing bottleneck, and the sending of the
   N media streams is the primary bandwidth bottleneck. However,
   conference servers can be built using heavy duty hardware, and have
   high bandwidth access.

   Furthermore, since we are using the request URI to name the
   conferences, we can use standard SIP techniques for distributing
   conferences across servers [9].

4.4 Location of Service Logic

   The SIP UA of the conference participants does not require any
   special processing. The RTP implementation in those clients, however,
   should support RTCP and be prepared to receive contributing sources.

   All of the new logic for providing this service resides in the
   conferencing server. No SIP extensions are needed, simply logic that
   resides above the SIP stack to manage the conferencing service.

4.5 Discovering Participant Identities

   The identities of other participants in the conference are NOT known
   through SIP. Rather, they are learned through RTP. The conference
   server is an RTP mixer. As such, it takes the RTCP SDES of the
   streams it mixes, and aggregates them into the RTCP stream sent out.
   This will allow participants to gradually (over a few seconds) learn
   the identities of the other participants.
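
   The aggregation step can be illustrated with a small Python sketch.
   It is not an RTCP implementation: real SDES items travel in binary
   RTCP packets, whereas here they are plain dictionaries. The function
   name sdes_for_recipient and the sample values are hypothetical, and
   leaving the recipient's own entry out of what it receives is our
   assumption, chosen to mirror the mix-minus media configuration.

```python
def sdes_for_recipient(recipient, reports):
    """Return the SDES entries a mixer would relay to one recipient.

    `reports` maps a participant URI to the SDES items last received
    from that participant (plain dicts here; real RTCP is binary).
    The recipient's own entry is left out (an assumption of this
    sketch, mirroring mix-minus).
    """
    return {
        who: dict(reports[who])
        for who in sorted(reports)
        if who != recipient
    }


reports = {
    "sip:A@example.com": {"cname": "a@host-a.example.com", "name": "A"},
    "sip:B@example.com": {"cname": "b@host-b.example.com", "name": "B"},
    "sip:C@example.com": {"cname": "c@host-c.example.com", "name": "C"},
}
for who in sorted(reports):
    others = sdes_for_recipient(who, reports)
    print(who, "learns about", [item["name"] for item in others.values()])
```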

   As an implementation choice, the conference server can generate the
   RTCP SDES of its participants, rather than using those provided by
   the participants. The reason for this is authenticity. A conference
   server can use SIP authentication mechanisms to identify the
   participants in the conference. This may allow it to validate the
   RTCP SDES provided by the participants. A conference server could
   remove any false information, and regenerate the SDES using the
   correct user identity as validated through SIP.

5 Ad-hoc Centralized Conferences

   In an ad-hoc centralized conference, two users A and B start with a
   normal SIP call. At some point later, they decide to add a third
   party. Instead of using end system mixing, they would prefer to use a
   conference server, as defined in Section 4.

   The call flow for starting this kind of conference is shown in
   Figure 4. Initially, A calls B (1-3). At some point, B decides to add
   a user, C, to the call, and begins the transition to a conference
   server. The first step in this process is the discovery of a
   conference server that supports ad-hoc conferences. This can be done
   through static configuration, or through any of a number of standard
   service discovery protocols, such as the Service Location Protocol
   [12].

   Once the server is discovered, a conference ID is chosen. This ID
   must be globally unique. The conference ID is then prepended to the
   server's host name, and a SIP URL for the ad-hoc conference is
   formed. For example, if the server "a.servers.com" is used, and the
   unique ID is "a7hytaskp09878a", the SIP URL for this conference is
   sip:a7hytaskp09878a@a.servers.com.

   B then sends an INVITE to this URL (4). This creates the initial
   conference state in the server. The conference server accepts the
   call (5) and B sends an ACK (6). B then sends a REFER to A (7),
   referring them to sip:a7hytaskp09878a@a.servers.com.
   A accepts the referral (8) and this triggers an INVITE to this
   address (9). This causes A to be added to the conference. The
   conference server accepts the INVITE (10), and an ACK is generated
   (11). Once the NOTIFY request (indicating successful completion of
   the referred call) is sent from A to B (12), B responds with a 200 OK
   (13). Since B is now assured that A is connected through the
   conference server, B hangs up its call with A by sending a BYE (14).

        OPEN ISSUE: It's not clear that this is the best flow. An
        alternative flow is for B to REFER the conference server to
        A, using a call replacement mechanism. This is probably
        more correct, since this is not so much a transfer as a
        call leg replacement.

   Finally, B can add C to the call. This is identical to the procedures
   described in Section 4 for adding users to the conference. First, B
   generates a REFER (16) to C. The Refer-To header contains the
   conference URL, sip:a7hytaskp09878a@a.servers.com. C responds to the
   referral with a 200 OK (17). C then INVITEs itself to the conference
   (18-20). C then generates a NOTIFY informing B that the REFER has
   completed (21).

   It is also possible to transition from an end system mixed conference
   (even one with a complex connection topology), to a centralized
   conference server. Consider an end-system mixed conference with the
   topology of Figure 2. User A wishes to transition to a centralized
   conference server in order to add another participant. The transition
   is shown in Figures 5 and 6.

   First, user A discovers a conference server, and creates a new
   conference by sending an INVITE to it (1-3). A then REFERs the two
   end systems it is connected to (B and C), to the server (4-5 and 6-7
   respectively). This causes B to INVITE itself to the conference
   server (8-10), and C to do the same (11-13).
   Since C had gotten a REFER from A, it "passes it on" to D by sending
   a REFER to it (14-15). This causes D to join the conference server
   by sending it an INVITE (16-18).

   Once the REFER-triggered INVITEs complete, notifications start to get
   sent. Since B completed first, it will be the first to send a NOTIFY
   to A (19), followed by C (21). At this point, A can terminate its
   legs to B and C (23-24 and 25-26 respectively). Since D completed its
   REFER-triggered INVITE next, it generates a NOTIFY to C (27). This
   causes C to terminate its leg with D (29). The call has now
   transitioned to a centralized server.

      OPEN ISSUE: There is no way for A to know that the entire
      conference has transitioned. Also, as above, it's not clear
      that a REFER from the conference server wouldn't be better.

   Once the conference has been formed, further operation is identical
   to the dial-in conferencing model of Section 4. The only difference
   in the conferences is that the conference identifier is dynamic in
   this case, and static in Section 4. Because the identifier is not
   known in advance, it is nearly impossible for users to join
   asynchronously.

5.1 Inviting Users to Join

   Once the ad-hoc conference has been created on the server, inviting
   users proceeds as defined in Section 4.1.

5.2 Users Joining

   Once the ad-hoc conference has been created on the server, joining
   proceeds as defined in Section 4.2.

5.3 Scalability

   The scalability of this conference model is identical to that of
   dial-in conference servers, as described in Section 4.3.
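
   The formation of the ad-hoc conference URL described earlier in this
   section can be sketched as follows. This is only an illustration
   under stated assumptions: the function names are hypothetical, and a
   random UUID is assumed to be an acceptable stand-in for the "globally
   unique" conference ID; the draft does not say how the ID is chosen.

```python
import uuid


def make_conference_url(server_host: str) -> str:
    """Form a SIP URL for a new ad-hoc conference on server_host.

    Assumption: a random UUID (hex form) is unique enough to serve as
    the conference ID. The draft only requires global uniqueness.
    """
    conference_id = uuid.uuid4().hex
    # The ID is placed in front of the server's host name.
    return f"sip:{conference_id}@{server_host}"


def refer_to_header(conference_url: str) -> str:
    """Build the Refer-To header line carried in the REFER request."""
    return f"Refer-To: <{conference_url}>"


if __name__ == "__main__":
    url = make_conference_url("a.servers.com")
    print(url)
    print(refer_to_header(url))
```

   The same URL is used three times in the flow: as the target of B's
   INVITE, and in the Refer-To header of the REFERs sent to A and to C.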

   A               B          Conference          C
                                Server
   |(1) INVITE     |               |               |
   |-------------->|               |               |
   |(2) 200 OK     |               |               |
   |<--------------|               |               |
   |(3) ACK        |               |               |
   |-------------->|               |               |
   |               |(4) INVITE     |               |
   |               |-------------->|               |
   |               |(5) 200 OK     |               |
   |               |<--------------|               |
   |               |(6) ACK        |               |
   |(7) REFER      |-------------->|               |
   |<--------------|               |               |
   |(8) 200 OK     |               |               |
   |-------------->|               |               |
   |(9) INVITE     |               |               |
   |------------------------------>|               |
   |(10) 200 OK    |               |               |
   |<------------------------------|               |
   |(11) ACK       |               |               |
   |------------------------------>|               |
   |(12) NOTIFY    |               |               |
   |-------------->|               |               |
   |(13) 200 OK    |               |               |
   |<--------------|               |               |
   |(14) BYE       |               |               |
   |<--------------|               |               |
   |(15) 200 OK    |               |               |
   |-------------->|(16) REFER     |               |
   |               |------------------------------>|
   |               |(17) 200 OK    |               |
   |               |<------------------------------|
   |               |               |(18) INVITE    |
   |               |               |<--------------|
   |               |               |(19) 200 OK    |
   |               |               |-------------->|
   |               |               |(20) ACK       |
   |               |               |<--------------|
   |               |(21) NOTIFY    |               |
   |               |<------------------------------|
   |               |(22) 200 OK    |               |
   |               |------------------------------>|
   |               |               |               |

   Figure 4: Transitioning to ad-hoc

   |(1) INVITE    |              |              |              |
   |---------------------------------------------------------->|
   |(2) 200 OK    |              |              |              |
   |<----------------------------------------------------------|
   |(3) ACK       |              |              |              |
   |---------------------------------------------------------->|
   |(4) REFER     |              |              |              |
   |------------->|              |              |              |
   |(5) 200 OK    |              |              |              |
   |<-------------|              |              |              |
   |(6) REFER     |              |              |              |
   |---------------------------->|              |              |
   |(7) 200 OK    |              |              |              |
   |<----------------------------|              |              |
   |              |(8) INVITE    |              |              |
   |              |------------------------------------------->|
   |              |(9) 200 OK    |              |              |
   |              |<-------------------------------------------|
   |              |(10) ACK      |              |              |
   |              |------------------------------------------->|
   |              |              |(11) INVITE   |              |
   |              |              |---------------------------->|
   |              |              |(12) 200 OK   |              |
   |              |              |<----------------------------|
   |              |              |(13) ACK      |              |
   |              |              |---------------------------->|
   |              |              |(14) REFER    |              |
   |              |              |------------->|              |
   |              |              |(15) 200 OK   |              |
   |              |              |<-------------|(16) INVITE   |
   |              |              |              |------------->|
   |              |              |              |(17) 200 OK   |
   |              |              |              |<-------------|
   |              |              |              |(18) ACK      |
   |              |              |              |------------->|
   |              |              |              |              |

   A              B              C              D            Conf.
                                                            Server

   Figure 5: Adhoc transition from end-system mixed: part I

5.4 Location of Service Logic

   The logic for handling the transition process must be located in at
   least one UA in the conference. All UAs that are mixers in an end
   system mixed conference must know to propagate the REFER requests
   they receive during the transition.

5.5 Discovering Participant Identities

   Once the ad-hoc conference is established, conference identities are
   determined through RTCP, as in the dial-in case.

6 Dial-Out Conferences

   Dial-out conferences are a simple variation on dial-in conferences.
   Instead of the users joining the conference by sending an INVITE to
   the server, the server chooses the users who are to be members of the
   conference, and then sends them the INVITE. Typically, dial-out
   conferences are pre-arranged, with specific start times and an
   initial group membership list. However, there are other means for the
   dial-out server to determine the list of participants, including user
   presence [13]. The model in no way limits the means by which the
   server determines the set of users.

   Once the users accept or reject the call from the dial-out server,
   the behavior of this system is identical to the dial-in server case
   of Section 4. Thus, a dial-out conference server will generally need
   to support dial-in access for the same conference, if it wishes to
   allow joining after the conference begins.
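
   The dial-out behavior described above can be sketched as a small
   simulation. It is not tied to any real SIP stack: the class name,
   the send_invite callback, and the example addresses are hypothetical
   and stand in for whatever mechanism the server uses to send an
   INVITE and observe the final response code.

```python
from typing import Callable, Iterable, List, Set


class DialOutConference:
    """Toy model of a dial-out conference server (illustrative only)."""

    def __init__(self, url: str, members: Iterable[str]) -> None:
        self.url = url                      # conference URL
        self.members: List[str] = list(members)  # initial membership list
        self.participants: Set[str] = set()

    def start(self, send_invite: Callable[[str, str], int]) -> None:
        """INVITE every listed member; keep those that answer 2xx.

        send_invite(user, conference_url) is assumed to return the
        final SIP status code of the INVITE transaction.
        """
        for user in self.members:
            status = send_invite(user, self.url)
            if 200 <= status < 300:
                self.participants.add(user)

    def dial_in(self, user: str) -> None:
        """A late joiner sends its own INVITE, as in the dial-in model."""
        self.participants.add(user)


if __name__ == "__main__":
    def fake_send(user: str, url: str) -> int:
        # Pretend that bob is busy (486) and everyone else accepts.
        return 486 if "bob" in user else 200

    conf = DialOutConference(
        "sip:conf1@a.servers.com",
        ["sip:alice@example.com", "sip:bob@example.com"],
    )
    conf.start(fake_send)
    conf.dial_in("sip:bob@example.com")  # bob joins later by dialing in
    print(sorted(conf.participants))
```

   The dial_in method reflects the observation that a dial-out server
   needs dial-in access as well if users are to join after the
   conference has begun.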

   Note that, from the participants' perspective, they will learn the
   conference identity (the URL) from the From field in the INVITE
   messages received from the server.

      OPEN ISSUE: Or is the Contact more appropriate?

6.1 Inviting Users to Join

   Once the conference is established, inviting users to join is
   identical to the scenario described in Section 4.1. Note that the URL
   to be used in the REFER is obtained from the From field of the INVITE
   received from the dial-out server.

6.2 Users Joining

   Once the conference is established, joining is identical to the
   scenario described in Section 4.2. Note that the URL to be used in
   the INVITE of new participants is obtained from the From field of the
   INVITE received from the dial-out server by the initial participants.

   |(19) NOTIFY   |              |              |              |
   |<-------------|              |              |              |
   |(20) 200 OK   |              |              |              |
   |------------->|              |              |              |
   |(21) NOTIFY   |              |              |              |
   |<----------------------------|              |              |
   |(22) 200 OK   |              |              |              |
   |---------------------------->|              |              |
   |(23) BYE      |              |              |              |
   |------------->|              |              |              |
   |(24) 200 OK   |              |              |              |
   |<-------------|              |              |              |
   |(25) BYE      |              |              |              |
   |---------------------------->|              |              |
   |(26) 200 OK   |              |              |              |
   |<----------------------------|(27) NOTIFY   |              |
   |              |              |<-------------|              |
   |              |              |(28) 200 OK   |              |
   |              |              |------------->|              |
   |              |              |(29) BYE      |              |
   |              |              |------------->|              |
   |              |              |(30) 200 OK   |              |
   |              |              |<-------------|              |
   |              |              |              |              |

   A              B              C              D            Conf.
                                                            Server

   Figure 6: Adhoc transition from end-system mixed: part II

6.3 Scalability

   The scalability of this conference model is identical to that of
   dial-in conference servers, as described in Section 4.3.
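
   Sections 6.1 and 6.2 have participants take the conference URL from
   the From field of the INVITE sent by the dial-out server. A minimal
   sketch of that extraction is shown below. The function name and the
   sample header values are hypothetical, and the parsing handles only
   the two common forms (a URL in angle brackets, or a bare URL followed
   by parameters); it is not a complete SIP header parser. Whether From
   or Contact is the right source remains the open issue noted above.

```python
import re


def conference_url_from_header(from_header: str) -> str:
    """Extract the URL from a From header line.

    Handles '"Name" <sip:...>;tag=x' and 'sip:...;tag=x'. This is a
    simplified illustration, not a full implementation of the SIP
    header grammar.
    """
    # Drop the header name ("From" or the compact form "f").
    value = from_header.split(":", 1)[1].strip()
    bracketed = re.search(r"<([^>]+)>", value)
    if bracketed:
        return bracketed.group(1)
    # Bare URL: everything up to the first header parameter.
    return value.split(";", 1)[0].strip()


if __name__ == "__main__":
    header = 'From: "Weekly call" <sip:conf42@a.servers.com>;tag=9fxced'
    print(conference_url_from_header(header))
```

   The URL obtained this way is what a participant would place in the
   Refer-To header when inviting another user, or hand to a new
   participant as the target of its INVITE.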

6.4 Location of Service Logic

   The SIP UA of the conference participants does not require any
   special processing. The RTP implementation in those clients, however,
   should support RTCP and be prepared to receive contributing sources.

   All of the new logic for providing this service resides in the
   conferencing server. No SIP extensions are needed, simply logic that
   resides above the SIP stack to manage the conferencing service.

6.5 Discovering Participant Identities

   Once the conference is established, conference identities are
   determined through RTCP, as in the dial-in case.

7 Centralized Signaling, Distributed Media

   In this conferencing model, there is a centralized controller, as in
   the dial-in and dial-out cases. However, the centralized server
   handles signaling only. The media is still sent directly between
   participants, using either multicast or multi-unicast. Multi-unicast
   is when a user sends multiple packets (one for each recipient,
   addressed to that recipient). This is referred to as a "Decentralized
   Multipoint Conference" in H.323. Interestingly, this conference model
   is possible with baseline SIP.

   It works through third party call control [14]. The conference server
   sends a re-INVITE to each existing participant when a new one joins.
   The re-INVITEs add a media stream that gets sent to the new
   participant (and similarly in the reverse direction).

   Let us assume for the moment that a conference already exists with
   three participants. In this state, each participant is sending media
   directly to each of the others. This is because the SDP that the
   conference server has given to each participant contains three media
   lines, each of type audio, with connection addresses and ports
   corresponding to each of the three users.

   The call flow from here is shown in Figure 7.
   In the figure, the word after the INV or SIP response code refers to
   the connection address(es) in the SDP in the message. +X means the
   addition of a stream with X as the recipient address.

   A new participant joins the conference. It does so by sending an
   INVITE (1) to the server, with the conference ID in the request URI.
   The SDP in the INVITE contains a single media stream, with an IP
   address and port where it would like to receive media (D). The 200
   response from the conference server (2) contains a single media line
   with an IP address of 0.0.0.0 and a random port, indicating hold.

   The next step is for the server to obtain two more addresses where
   the new participant will be receiving media (it already has one from
   the original INVITE). To do this, it sends a re-INVITE to the new
   participant (4). This re-INVITE contains two additional media streams
   (for three total), all three of which are on hold. The 200 response
   to the re-INVITE (5) contains two additional IP addresses and ports
   where the user is willing to receive media.

   Now the server needs to inform the other parties that they should
   begin sending media to the new user. It first sends a re-INVITE to
   user C (7). This re-INVITE adds an additional media stream to the two
   that C has already been sending. This new media stream uses one of
   the three connection addresses and ports returned by D in message
   (5). Call this address/port D1. The other two are D2 and D3. The 200
   OK response from user C (8) contains the address and port where C is
   willing to receive a new, third media stream. Call this port C3. The
   server holds on to this port, as it will later send it to D so that
   D sends media there. At this point, however, C can begin sending
   media to D.

   This re-INVITE process happens for B and for A as well.
   In the re-INVITE to B (10), the server adds an additional media line
   (above the two already in use by B) using address/port D2. The
   response (11) contains a new address/port to send media to B. Call
   this port B3. In the re-INVITE to A (13), the server adds an
   additional media line using address/port D3. The response (14)
   contains a new address/port to send media to A. Call this port A3.

   Finally, the server sends a re-INVITE (16) to the new party. This
   re-INVITE takes all three streams off hold, and updates their
   connection addresses and ports with C3, B3, and A3, respectively. The
   200 OK response (17) returns the same ports and addresses returned in
   message (5) (as noted in [14], these addresses/ports MUST NOT
   change). Now, D can send media to A, B, and C.

   The result of these manipulations is, indeed, a full mesh of unicast
   RTP streams between all participants. Unlike the case of end system
   mixing, the stream sent by any participant to all of the others is
   identical. Each participant needs to mix, but it mixes the media it
   receives, and plays the result through its speakers. This is normal
   behavior for multiple streams of the same type. Note that the SIP
   relationship is still point-to-point.
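
   The join procedure walked through above can be summarized in a short
   simulation. It only tracks which INVITE requests the server and the
   newcomer exchange and which addresses end up being used; no SDP or
   SIP messages are actually built. The function name and the address
   labels (D1, C3, and so on, following the naming used in the text)
   are illustrative assumptions.

```python
from typing import Dict, List, Tuple, Union

Message = Tuple[str, str, str]  # (sender, receiver, description)


def join_full_mesh(existing: List[str], newcomer: str):
    """Simulate the INVITE requests sent when newcomer joins.

    Returns the list of INVITE/re-INVITE requests and a map from each
    participant to the new address(es) it starts sending media to.
    """
    n = len(existing)
    invites: List[Message] = []
    sends_to: Dict[str, Union[str, List[str]]] = {}

    # The newcomer's INVITE offers one receive address; the server
    # answers with a held stream.
    invites.append((newcomer, "server", f"INVITE {newcomer}"))

    # The server re-INVITEs the newcomer with n held streams so that it
    # learns n receive addresses, labelled <newcomer>1..<newcomer>n.
    invites.append(("server", newcomer, f"re-INVITE {n} held"))
    newcomer_addrs = [f"{newcomer}{i}" for i in range(1, n + 1)]

    # One re-INVITE per existing participant, last participant first as
    # in the flow above. Each answer carries a new receive port.
    peer_ports: List[str] = []
    for addr, user in zip(newcomer_addrs, reversed(existing)):
        invites.append(("server", user, f"re-INVITE +{addr}"))
        sends_to[user] = addr
        peer_ports.append(f"{user}{n}")

    # The final re-INVITE takes the newcomer's streams off hold and
    # points them at the ports collected from the other participants.
    invites.append(
        ("server", newcomer, "re-INVITE " + ",".join(peer_ports))
    )
    sends_to[newcomer] = peer_ports
    return invites, sends_to


if __name__ == "__main__":
    invites, sends_to = join_full_mesh(["A", "B", "C"], "D")
    for sender, receiver, what in invites:
        print(f"{sender} -> {receiver}: {what}")
    print(sends_to)
```

   For three existing participants the simulation produces six INVITE
   requests: one from the newcomer, two re-INVITEs to the newcomer, and
   one re-INVITE to each of A, B, and C.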
   There are four calls at the end of Figure 7, one from each
   participant to the server, each with a different Call-ID.

   |              |              |              |(1) INV D        |
   |              |              |              |---------------->|
   |              |              |              |(2) 200 hold     |
   |              |              |              |<----------------|
   |              |              |              |(3) ACK          |
   |              |              |              |---------------->|
   |              |              |              |(4) INV 3held    |
   |              |              |              |<----------------|
   |              |              |              |(5) 200 3recv    |
   |              |              |              |---------------->|
   |              |              |              |(6) ACK          |
   |              |              |              |<----------------|
   |              |              |(7) INV +D1   |                 |
   |              |              |<-------------------------------|
   |              |              |(8) 200 +C3   |                 |
   |              |              |------------------------------->|
   |              |              |(9) ACK       |                 |
   |              |              |<-------------------------------|
   |              |(10) INV +D2  |              |                 |
   |              |<----------------------------------------------|
   |              |(11) 200 +B3  |              |                 |
   |              |---------------------------------------------->|
   |              |(12) ACK      |              |                 |
   |              |<----------------------------------------------|
   |(13) INV +D3  |              |              |                 |
   |<-------------------------------------------------------------|
   |(14) 200 +A3  |              |              |                 |
   |------------------------------------------------------------->|
   |(15) ACK      |              |              |                 |
   |<-------------------------------------------------------------|
   |              |              |              |(16) INV A3,B3,C3|
   |              |              |              |<----------------|
   |              |              |              |(17) 200         |
   |              |              |              |---------------->|
   |              |              |              |(18) ACK         |
   |              |              |              |<----------------|
   |              |              |              |                 |

   A              B              C              D              Server

   Figure 7: Centralized Signaling, Decentralized Media

   Note that hybrids are easily possible. Certain users can instead be
   mixed (sending audio to the conference server), while others are set
   to send audio to each other.

7.1 Inviting Users to Join

   Inviting users to join works identically to the dial-in conference
   bridge scenario of Section 4.

7.2 Users Joining

   A user joins in the same way described in Section 4.

7.3 Scalability

   The scalability of this conferencing model depends on many factors.
   From a media perspective, the conference server never even touches a
   single media stream. However, for N participants, each participant
   needs to be able to receive, decode, and mix N-1 media streams. For
   users accessing the server through dial-in modems, this will severely
   limit the sizes of these conferences. However, the processing burden
   is much less than that of the end system mixing model. This is
   because each end user needs to decode N-1 streams, but only encode 1.
   Decoding is much, much cheaper than encoding, so supporting many
   decodes is not necessarily a problem. This is especially the case
   when silence suppression is in use. In that case, streams are only
   sent by talking users. This means any given user only needs to decode
   (and receive) as many streams at a time as there are users talking.
   This can vastly improve scalability of the conference.

   There is a signaling burden on the server, however. If there are N
   users in the conference, addition of a new user (the N+1th) requires
   N+3 INVITE transactions, each of which has three messages. Similarly,
   departure of a user requires N BYE transactions, each of which has 2
   messages. For large N, and highly dynamic conferences, this can
   represent a potential burden. However, we believe this bottleneck is
   much farther out than the processing and bandwidth bottlenecks at the
   end users.

   For these reasons, we believe this conference model is ideal in
   corporate enterprises, where bandwidth is more plentiful and PCs are
   generally faster.

7.4 Location of Service Logic

   Nearly all of the logic for implementing this conferencing service
   lives in the server itself.

   The only requirement from the end users is that they support
   multiple, parallel media streams of the same type, and that they be
   prepared to mix those streams together.
   They must also support the third party control primitives [14], which
   don't require anything beyond baseline SIP, but are unlikely to be
   supported unless implementations take explicit steps to do so.

   It is this combination, no need for media processing in the server
   together with no need for specialized SIP processing in the end
   systems, that makes this model attractive.

7.5 Discovering Participant Identities

   Conference identities are discovered through RTCP. Each user will
   receive N-1 RTP streams, each of which has its own RTCP channel that
   carries the participant identification.

8 Summary of Models

   Table 1 shows a summary of the differences between the various
   models.

   Table 1: Summary of Models

  Name       signaling media    inviting  joining     discovering scale
  End-Mixing tree      tree     normal    normal      RTCP        small
                                invite    invite
  Multicast  pairs     m-cast   normal    multicast   RTCP        large
                                invite    join
  Dial-In    star      star     refer     normal      RTCP        medium
                                          invite
  Ad-Hoc     star      star     refer     normal      RTCP        medium
                                          invite
  Dial-Out   star      star     refer     normal      RTCP        medium
                                          invite
  Decentral  star      fullmesh refer +   normal      RTCP        medium
                                server    invite and
                                messaging server msg.

9 Security Considerations

   The use of a server that performs the mixing on behalf of other
   users, which is the case for all but one of the conference models
   described here, introduces security risks. That entity must be
   trusted by the others to properly mix the media - not omitting a
   stream, for example. As such, it is recommended that participants in
   a conference authenticate the identity of the server. In the dial-in,
   dial-out, and decentralized conferences, this will require
   authentication of responses by participants.

   Mixing also eliminates the privacy possible with end-to-end media
   transport with mixing in the receivers.
   Such privacy is still possible in the large-scale multicast
   conferences, but requires shared keying material for the conference.
   Doing this for highly dynamic groups is still an open research
   problem.

10 Conclusion

   In this draft, we have shown how to use baseline SIP (assuming
   endpoints that support the mixing and/or third party call control
   feature sets) to construct several multiparty conferencing models.
   These include end system mixing, large-scale multicast conferences,
   dial-in conference servers, dial-out conferences, ad-hoc centralized
   conferences, and centralized signaling, distributed media
   conferences.

11 Acknowledgements

   We would like to thank Mary Barnes for her comments and input.

12 Authors' Addresses

   Jonathan Rosenberg
   dynamicsoft
   72 Eagle Rock Avenue
   First Floor
   East Hanover, NJ 07936
   email: jdrosen@dynamicsoft.com

   Henning Schulzrinne
   Columbia University
   M/S 0401
   1214 Amsterdam Ave.
   New York, NY 10027-7003
   email: schulzrinne@cs.columbia.edu

13 Normative References

14 Informative References

   [1] J. Rosenberg, H. Schulzrinne, et al., "SIP: Session initiation
   protocol," Internet Draft, Internet Engineering Task Force, Feb.
   2002. Work in progress.

   [2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: a
   transport protocol for real-time applications," RFC 1889, Internet
   Engineering Task Force, Jan. 1996.

   [3] M. Handley and V. Jacobson, "SDP: session description protocol,"
   RFC 2327, Internet Engineering Task Force, Apr. 1998.

   [4] M. Handley, C. Perkins, and E. Whelan, "Session announcement
   protocol," RFC 2974, Internet Engineering Task Force, Oct. 2000.

   [5] D. Thaler, M. Handley, and D. Estrin, "The internet multicast
   address allocation architecture," RFC 2908, Internet Engineering Task
   Force, Sept. 2000.

   [6] W. Fenner, "Internet group management protocol, version 2," RFC
   2236, Internet Engineering Task Force, Nov. 1997.

   [7] D. Waitzman, C. Partridge, and S. E. Deering, "Distance vector
   multicast routing protocol," RFC 1075, Internet Engineering Task
   Force, Nov. 1988.

   [8] J. Rosenberg and H. Schulzrinne, "Timer reconsideration for
   enhanced RTP scalability," in Proceedings of the Conference on
   Computer Communications (IEEE Infocom), (San Francisco, California),
   March/April 1998.

   [9] J. Rosenberg, P. Mataga, and H. Schulzrinne, "An application
   server component architecture for SIP," Internet Draft, Internet
   Engineering Task Force, Mar. 2001. Work in progress.

   [10] J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach,
   A. Luotonen, and L. Stewart, "HTTP authentication: Basic and digest
   access authentication," RFC 2617, Internet Engineering Task Force,
   June 1999.

   [11] R. Sparks, "The SIP refer method," Internet Draft, Internet
   Engineering Task Force, June 2002. Work in progress.

   [12] E. Guttman, C. Perkins, J. Veizades, and M. Day, "Service
   location protocol, version 2," RFC 2608, Internet Engineering Task
   Force, June 1999.

   [13] J. Rosenberg, "Session initiation protocol (SIP) extensions for
   presence," Internet Draft, Internet Engineering Task Force, May 2002.
   Work in progress.

   [14] J. Rosenberg, J. Peterson, H. Schulzrinne, and G. Camarillo,
   "Third party call control in SIP," Internet Draft, Internet
   Engineering Task Force, Nov. 2001. Work in progress.

Full Copyright Statement

   Copyright (c) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.