Network Working Group                                          B. Burman
Internet-Draft                                             M. Westerlund
Intended status: Informational                                  Ericsson
Expires: August 4, 2013                                 January 31, 2013

                    Multi-Media Concepts and Relations
             draft-burman-rtcweb-mmusic-media-structure-00

Abstract

There are currently significant efforts ongoing in IETF regarding more advanced multi-media functionalities, such as the work related to RTCWEB and CLUE.  This work includes use cases for both multi-party communication and multiple media streams from an individual end-point.  The usage of scalable encoding or simulcast encoding as well as different types of transport mechanisms have created additional needs to correctly identify different types of resources and describe their relations to achieve intended functionalities.

The different usages have both commonalities and differences in needs and behavior.  This document attempts to review some usages and identify commonalities and needs.  It then continues to highlight important aspects that need to be considered in the definition of these usages.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 4, 2013.
Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Motivation . . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Use Cases  . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.1.  Existing RTP Usages  . . . . . . . . . . . . . . . . . . .  4
       3.1.1.  Basic VoIP call  . . . . . . . . . . . . . . . . . . .  4
       3.1.2.  Audio and Video Conference . . . . . . . . . . . . . .  5
       3.1.3.  Audio and Video Switched Conference  . . . . . . . . .  7
     3.2.  WebRTC . . . . . . . . . . . . . . . . . . . . . . . . . .  8
       3.2.1.  Mesh-based Multi-party . . . . . . . . . . . . . . . .  9
       3.2.2.  Multi-source Endpoints . . . . . . . . . . . . . . . . 10
       3.2.3.  Media Relaying . . . . . . . . . . . . . . . . . . . . 11
       3.2.4.  Usage of Simulcast . . . . . . . . . . . . . . . . . . 11
     3.3.  CLUE Telepresence  . . . . . . . . . . . . . . . . . . . . 13
       3.3.1.  Telepresence Functionality . . . . . . . . . . . . . . 13
       3.3.2.  Distributed Endpoint . . . . . . . . . . . . . . . . . 14
   4.  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     4.1.  Commonalities in Use Cases . . . . . . . . . . . . . . . . 14
       4.1.1.  Media Source . . . . . . . . . . . . . . . . . . . . . 14
       4.1.2.  Encodings  . . . . . . . . . . . . . . . . . . . . . . 16
       4.1.3.  Synchronization contexts . . . . . . . . . . . . . . . 17
       4.1.4.  Distributed Endpoints  . . . . . . . . . . . . . . . . 18
     4.2.  Identified WebRTC issues . . . . . . . . . . . . . . . . . 18
     4.3.  Relevant to SDP evolution  . . . . . . . . . . . . . . . . 19
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 20
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 21
   7.  Informative References . . . . . . . . . . . . . . . . . . . . 21
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22

1.  Introduction

This document concerns itself with the conceptual structures that can be found in different logical levels of a multi-media communication, from transport aspects to high-level needs of the communication application.  The intention is to provide considerations and guidance that can be used when discussing how to resolve issues in the RTCWEB- and CLUE-related standardization.  Typical use cases for those WGs have commonalities that should likely be addressed similarly and in a way that allows them to be aligned.

The document starts by going deeper into the motivation for why this has become an important problem at this time.  This is followed by studies of some use cases and the concepts they contain, and concludes with a discussion of observed commonalities and important aspects to consider.

2.  Motivation

A number of new needs and requirements have arisen lately from work such as WebRTC/RTCWEB [I-D.ietf-rtcweb-overview] and CLUE [I-D.ietf-clue-framework].  The applications considered in those WGs have surfaced new requirements on the usage of both RTP [RFC3550] and existing signalling solutions.

The main application aspects that have created new needs are:

o  Multiple Media Streams from an end-point.  An end-point may have multiple media capture devices, such as cameras or microphone mixes.

o  Group communications involving multiple end-points.  This is realized using both mesh-based connections and centralized conference nodes.  These create a need for dealing with multiple endpoints and/or multiple streams with different origins from a transport peer.

o  Media Stream Adaptation, both to adjust network resource consumption and to handle varying end-point capabilities in group communication.

o  Transport mechanisms, including both higher levels of aggregation [I-D.ietf-mmusic-sdp-bundle-negotiation] [I-D.ietf-avtcore-multi-media-rtp-session] and the use of application-level transport repair mechanisms such as forward error correction (FEC) and/or retransmission.

The presence of multiple media resources or components creates a need to identify, handle and group those resources across multiple different instantiations or alternatives.

3.  Use Cases

3.1.  Existing RTP Usages

There are many different existing RTP usages.  This section brings up some that we deem interesting in comparison to the other use cases.

3.1.1.  Basic VoIP call

This use case is intended to function as a base-line to contrast against the rest of the use cases.

The communication context is an audio-only bi-directional communication between two users, Alice and Bob.  This communication uses a single multi-media session that can be established in a number of ways, but let's assume SIP/SDP [RFC3261][RFC3264].  This multi-media session contains two end-points, one for Alice and one for Bob.  Each end-point has an audio capture device that is used to create a single audio media source at each end-point.

   +-------+         +-------+
   | Alice |<------->|  Bob  |
   +-------+         +-------+

   Figure 1: Point-to-point Audio

The session establishment (SIP/SDP) negotiates the intent to communicate over RTP using only the audio media type.  Inherent in the application is an assumption of only a single media source in each direction.  The boundaries for the encodings are represented using RTP Payload types in conjunction with the SDP bandwidth parameter (b=).  The session establishment is also used to negotiate that RTP will be used, thus resulting in an RTP session being created for the audio.  The underlying transport flows, in this case one bi-directional UDP flow for RTP and another for RTCP, are configured by each end-point providing its IP address and port, which become source or destination depending on the direction in which the packet is sent.

The RTP session will have two RTP media streams, one in each direction, each carrying the encoding of the media source that the sending implementation has chosen based on the boundaries established by the RTP payload types and other SDP parameters, e.g. codec and bit-rates.  In the RTP context, the streams are identified by their SSRCs.

3.1.2.  Audio and Video Conference

This use case is a multi-party use case with a central conference node performing media mixing.  It also includes two media types, audio and video.  The high-level topology of the communication session is the following:

   +-------+         +------------+           +-------+
   |       |<-Audio->|            |<--Audio-->|       |
   | Alice |         |            |           |  Bob  |
   |       |<-Video->|            |<--Video-->|       |
   +-------+         |            |           +-------+
                     |   Mixer    |
   +-------+         |            |           +-------+
   |       |<-Audio->|            |<--Audio-->|       |
   |Charlie|         |            |           | David |
   |       |<-Video->|            |<--Video-->|       |
   +-------+         +------------+           +-------+

   Figure 2: Audio and Video Conference with Centralized Mixing

The communication session is a multi-party conference including the four users Alice, Bob, Charlie, and David.  This communication session contains four end-points and one middlebox (the Mixer).  The communication session is established using four different multi-media sessions; one between each user's endpoint and the middlebox.  Each of these multi-media sessions uses a session establishment method, like SIP/SDP.

Looking at a single multi-media session between a user, e.g. Alice, and the Mixer, there exist two media types, audio and video.  Alice has two capture devices, one video camera giving her a video media source, and an audio capture device giving an audio media source.  These two media sources are captured in the same room by the same end-point and thus have a strong timing relationship, requiring inter-media synchronization at playback to provide the correct fidelity.  Thus Alice's endpoint has a synchronization context that both her media sources use.

These two media sources are encoded using encoding parameters within the boundaries that have been agreed between the end-point and the Mixer using the session establishment.  As has been common practice, each media type will use its own RTP session between the end-point and the mixer.  Thus a single audio stream using a single SSRC will flow from Alice to the Mixer in the Audio RTP session, and a single video stream will flow in the Video RTP session.  Using this division into separate RTP sessions, the bandwidth of both audio and video can be unambiguously and separately negotiated by the SDP bandwidth attributes exchanged between the end-points and the mixer.  Each RTP session uses its own Transport Flows.  The common synchronization context across Alice's two media streams is identified by binding both streams to the same CNAME, generated by Alice's endpoint.

The mixer does not have any physical capture devices; instead it creates conceptual media sources.  It provides two media sources towards Alice: one audio source being a mix of the audio from Bob, Charlie and David, and the second being a conceptual video source that contains a selection of one of the other video sources received from Bob, Charlie, or David, depending on who is speaking.  The Mixer's audio and video sources are provided in an encoding using a codec that is supported by both Alice's endpoint and the mixer.  These streams are identified by a single SSRC in the respective RTP session.

The mixer will have its own synchronization context and it will inject the media from Bob, Charlie and David in a synchronized way into the mixer's synchronization context to maintain the inter-media synchronization of the original media sources.
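As a rough illustration of the relationship described above, the following sketch (TypeScript syntax; the per-SSRC record type is purely hypothetical and not part of any specification) groups the streams a receiver gets on their RTCP CNAME, which is exactly how the shared synchronization context can be recovered:

   // Hypothetical per-SSRC state kept by a receiver; illustration only.
   interface ReceivedStream {
     ssrc: number;                    // RTP stream identifier
     cname: string;                   // RTCP SDES CNAME reported for this SSRC
     mediaType: "audio" | "video";
   }

   // Streams that share a CNAME belong to the same synchronization context
   // and can be played back lip-synced against a common reference clock.
   function groupBySyncContext(streams: ReceivedStream[]): Map<string, ReceivedStream[]> {
     const contexts = new Map<string, ReceivedStream[]>();
     for (const s of streams) {
       const group = contexts.get(s.cname) ?? [];
       group.push(s);
       contexts.set(s.cname, group);
     }
     return contexts;
   }

   // Example: Alice's audio and video share her CNAME; the mixer's streams
   // carry the mixer's own CNAME and form a separate context.
   const contexts = groupBySyncContext([
     { ssrc: 0x1111, cname: "alice@example", mediaType: "audio" },
     { ssrc: 0x2222, cname: "alice@example", mediaType: "video" },
     { ssrc: 0x3333, cname: "mixer@example", mediaType: "audio" },
   ]);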
The mixer establishes independent multimedia sessions with each of the participants' endpoints.  The mixer will in most cases also have unique conceptual media sources for each of the endpoints.  This is because audio mixes and video selections typically exclude media sources originating from the receiving end-point.  For example, Bob's audio mix will be a mix of Alice, Charlie and David, and will not contain Bob's own audio.

This use case may need unique user identities across the whole communication session.  An example of such functionality is a participant list that includes audio energy levels, showing who is speaking within the audio mix.  If that information is carried in RTP using the RTP header extension for Mixer-to-Client Audio Level Indication [RFC6465], then contributing source identities in the form of CSRCs need to be bound to the other end-points' media sources or user identities.  This is needed despite the fact that each RTP session towards a particular user's endpoint is terminated in the RTP mixer.  This points out the need for identifiers that exist in multiple multi-media session contexts.  In most cases this can easily be solved by the application having identities tailored specifically for its own needs, but some applications will benefit from having access to some commonly defined structure for media source identities.

3.1.3.  Audio and Video Switched Conference

This use case is similar to the one above (Section 3.1.2), with the difference that the mixer does not mix media streams by decoding, mixing and re-encoding them, but rather switches a selection of received media more or less unmodified towards receiving end-points.  This difference may not be very apparent to the end-user, but the main motivations to eliminate the mixing operation and switch rather than mix are:

o  Lower processing requirements in the mixer.

o  Lower complexity in the mixer.

o  Higher media quality at the receiver given a certain media bitrate.

o  Lower end-to-end media delay.

Without the mixing operation, the mixer has limited ability to create conceptual media sources that are customized for each receiver.  The reasons for such customizations come from sender and receiver differences in available resources and preferences:

o  Presenting multiple conference users simultaneously, like in a video mosaic.

o  Alignment of sent media quality to the receiver's presentation needs.

o  Alignment of codec type and configuration between sender and receiver.

o  Alignment of encoded bitrate to the available end-to-end link bandwidth.

To enable elimination of the mixing operation, media sent to the mixer must sufficiently well meet the above constraints for all intended receivers.  There are several ways to achieve this.  One way is to, by some system-wide design, ensure that all senders and receivers are basically identical in all the above aspects.  This may however prove unrealistic when variations in conditions and end-points are too large.  Another way is to let a sender provide a (small) set of alternative representations for each sent media source, enough to sufficiently well cover the expected range of variation.  If those media source representations, encodings, are independent from one another, they constitute a Simulcast of the media source.
If an encoding is instead dependent on, and thus requires reception of, one or more other encodings, the representation of the media source jointly achieved by all dependent encodings is said to be Scalable.  Simulcast and Scalable encoding can also be combined.

Both Simulcast and Scalable encodings result in a single media source generating multiple RTP media streams of the same media type.  The division of bandwidth between the Simulcast or Scalable streams for a single media source is application specific and will vary.  The total bandwidth for a Simulcast or Scalable source is the sum of the bandwidths of all included RTP media streams.  Since all streams in a Simulcast or Scalable source originate from the same capture device, they are closely related and should thus share a synchronization context.

The first and second customizations listed above, presenting multiple conference users simultaneously and aligned with the presentation needs of the receiver, can also be achieved without a mixing operation by simply sending appropriate-quality media from those users individually to each receiver.  The total bandwidth of this user presentation aggregate is the sum of the bandwidths of all included RTP media streams.  Audio and video from a single user share a synchronization context and can be synchronized.  Streams that originate from different users do not have the same synchronization context, which is acceptable since they do not need to be synchronized, but just presented jointly.

An actual mixer device need not be either mixing-only or switching-only, but may implement both mixing and switching and may also choose dynamically what to do for a specific media stream and a specific receiving user, on a case-by-case basis or based on some policy.

3.2.  WebRTC

This section brings up two different instantiations of WebRTC [ref-webrtc10] that stress different aspects.  But let us start by reviewing some important aspects of WebRTC and the MediaStream [ref-media-capture] API.

In WebRTC, an application gets access to a media source by calling getUserMedia(), which creates a MediaStream [ref-media-capture] (note the capitalization).  A MediaStream consists of zero or more MediaStreamTracks, where each MediaStreamTrack is associated with a media source.  These locally generated MediaStreams and their tracks are connected to local media sources, which can be media devices such as video cameras or microphones, but can also be files.

A WebRTC PeerConnection (PC) is an association between two endpoints that is capable of communicating media from one end to the other.  The PC concept includes establishment procedures, including media negotiation.  Thus a PC is an instantiation of a Multimedia Session.

When one end-point adds a MediaStream to a PC, the other endpoint will by default receive an encoded representation of the MediaStream and the active MediaStreamTracks.

3.2.1.  Mesh-based Multi-party

This is a WebRTC use case that establishes a multi-party communication session by establishing an individual PC with each participant in the communication session.

   +---+        +---+
   | A |<------>| B |
   +---+        +---+
     ^            ^
      \          /
       \        /
        v      v
         +---+
         | C |
         +---+

   Figure 3: WebRTC Mesh-based Multi-party

Users A, B and C want to have a joint communication session.
This 398 communication session is created using a Web-application without any 399 central conference functionality. Instead, it uses a mesh of 400 PeerConnections to connect each participant's endpoint with the other 401 endpoints. In this example, three double-ended connections are 402 required to connect the three participants, and each endpoint has two 403 PCs. 405 This is an audio and video communication and each end-point has one 406 video camera and one microphone as media sources. Each endpoint 407 creates its own MediaStream with one video MediaStreamTrack and one 408 audio MediaStreamTrack. The endpoints add their MediaStream to both 409 of their PCs. 411 Let's now focus on a single PC; in this case the one established 412 between A and B. During the establishment of this PC, the two 413 endpoints agree to use only a single transport flow for all media 414 types, thus a single RTP session is created between A and B. A's 415 MediaStream has one audio media source that is encoded according to 416 the boundaries established by the PeerConnection establishment 417 signalling, which includes the RTP payload types and thus Codecs 418 supported as well as bit-rate boundaries. The encoding of A's media 419 source is then sent in an RTP stream identified by a unique SSRC. In 420 this case, as there are two media sources at A, two encodings will be 421 created which will be transmitted using two different RTP streams 422 with their respective SSRC. Both these streams will reference the 423 same synchronization context through a common CNAME identifier used 424 by A. B will have the same configuration, thus resulting in at least 425 four SSRC being used in the RTP session part of the A-B PC. 427 Depending on the configuration of the two PCs that A has, i.e. the 428 A-B and the A-C ones, A could potentially reuse the encoding of a 429 media source in both contexts, under certain conditions. First, a 430 common codec and configuration needs to exist and the boundaries for 431 these configurations must allow a common work point. In addition, 432 the required bandwidth capacity needs to be available over the paths 433 used by the different PCs. Both of those conditions are not always 434 true. Thus it is quite likely that the endpoint will sometimes 435 instead be required to produce two different encodings of the same 436 media source. 438 If an application needs to reference the media from a particular 439 endpoint, it can use the MediaStream and MediaStreamTrack as they 440 point back to the media sources at a particular endpoint. This as 441 the MediaStream has a scope that is not PeerConnection specific. 443 The programmer can however implement this differently while 444 supporting the same use case. In this case the programmer creates 445 two MediaStreams that each have MediaStreamTracks that share common 446 media sources. This can be done either by calling getUserMedia() 447 twice, or by cloning the MediaStream obtained by the only 448 getUserMedia() call. In this example the result is two MediaStreams 449 that are connected to different PCs. From an identity perspective, 450 the two MediaStreams are different but share common media sources. 451 This fact is currently not made explicit in the API. 453 3.2.2. Multi-source Endpoints 455 This section concerns itself with endpoints that have more than one 456 media source for a particular media type. 
3.2.2.  Multi-source Endpoints

This section concerns itself with endpoints that have more than one media source for a particular media type.  A straightforward example would be a laptop with a built-in video camera used to capture the user, and a second video camera, for example attached by USB, that is used to capture something else the user wants to show.  Both these cameras are typically present in the same sound field, so it will be common to have only a single audio media source.

A possible way of representing this is to have two MediaStreams, one with the built-in camera and the audio, and a second one with the USB camera and the audio.  Each MediaStream is intended to be played with audio and video synchronized, but the user (local or remote) or application is likely to switch between the two captures.

It is important that a receiving endpoint can determine that the audio in the two MediaStreams has the same synchronization context.  Otherwise a receiver may play back the same media source twice, with some time overlap, when switching between playing the two MediaStreams.  Being able to determine that they are the same media source further allows for removing redundancy by using a single encoding for both MediaStreamTracks, if appropriate.

3.2.3.  Media Relaying

WebRTC endpoints can relay a received MediaStream from one PC to another by the simple API-level maneuver of adding the received MediaStream to the other PC.  Realizing this in the implementation is more complex.  It can also cause some issues from a media perspective.  If an application spanning multiple endpoints that relay media between each other makes a mistake, a media loop can be created.  Media loops could become a significant issue.  For example, an audio echo could be created, i.e. an endpoint receives its own media without detecting that it is its own, and plays it back with some delay.  If a WebRTC endpoint produces a conceptual media source by mixing incoming MediaStreams and there is no loop detection, a feedback loop can be created.

RTP has loop detection to detect and handle such cases within a single RTP session.  However, in the context of WebRTC, the RTP session is local to the PC, so the RTP-level loop detection cannot be relied upon.  Instead, if this protection is needed on the WebRTC MediaStream level, it could for example be achieved by having media source identifiers that can be preserved between the different MediaStreams in the PCs.

When relaying media, it is also beneficial to know when one receives multiple encodings of the same source.  For example, if one encoding arrives with a delay of 80 ms and another with 450 ms, being able to choose the one with 80 ms, rather than being forced to delay all media sources from the same synchronization context to the most delayed source, improves performance.

3.2.4.  Usage of Simulcast

In this section we look at a use case applying simulcast from each user's endpoint to a central conference node, to avoid the need for an individual encoding to each receiving endpoint.  Instead, the central node chooses which of the available encodings is forwarded to a particular receiver, as in Section 3.1.3.
   +-----------+       +------------+  Enc2  +---+
   | A   +-Enc1|------>|            |------->| B |
   |     |     |       |            |        +---+
   | Src-+-Enc2|------>|            |  Enc1  +---+
   +-----------+       |   Mixer    |------->| C |
                       |            |        +---+
                       |            |  Enc2  +---+
                       |            |------->| D |
                       +------------+        +---+

   Figure 4

In this Communication Session there are four users with endpoints and one middlebox (the Mixer).  This is an audio and video communication session.  The audio source is not simulcasted and the endpoint only needs to produce a single encoding.  For the video source, each endpoint will produce multiple encodings (Enc1 and Enc2 in Figure 4) and transfer them simultaneously to the mixer.  The mixer picks the most appropriate encoding for the path from the mixer to each receiving client.

Currently there exists no specified way in WebRTC to realise the above, although use cases and requirements discuss simulcast functionality.  The authors believe there exist two possible solution alternatives in the WebRTC context:

Multiple Encodings within a PeerConnection:  The endpoint that wants to provide a simulcast creates one or more MediaStreams with the media sources it wants to transmit over a particular PC.  The WebRTC API then provides functionality to enable multiple encodings to be produced for a particular MediaStreamTrack, with the possibility to configure the desired quality levels and/or differences for each of the encodings.

Using Multiple PeerConnections:  There exist capabilities to both negotiate and control the codec, bit-rate, video resolution, frame-rate, etc. of a particular MediaStreamTrack in the context of one PeerConnection.  Thus one method to provide multiple encodings is to establish multiple PeerConnections between A and the Mixer, where each PC is configured to provide the desired quality.  Note that this solution comes in two flavors from an application perspective.  One is that the same MediaStream object is added to the two PeerConnections.  The second is that two different MediaStream objects, with the same number of MediaStreamTracks and representing the same sources, are created (e.g. by cloning), with one of them added to the first PeerConnection and the other to the second PeerConnection.

Both of these solutions share a common requirement: the need to separate the received RTP streams not only based on media source, but also on the encoding.  However, on an API level the solutions appear different.  For Multiple Encodings within the context of a PC, the receiver will need new access methods for accessing and manipulating the different encodings.  Using multiple PCs instead requires that one can easily determine the shared (simulcasted) media source despite receiving it in multiple MediaStreams on different PCs.  If the same MediaStream is added to both PCs, the ids of the MediaStream and MediaStreamTracks will be the same, while they will be different if different MediaStreams (representing the same sources) are added to the two PCs.
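For illustration only: the first alternative roughly corresponds to the sendEncodings mechanism that was later added to the W3C RTCPeerConnection API, well after this draft was written.  The sketch below uses that later API (TypeScript syntax); the rid names and bitrate values are arbitrary examples, not taken from any specification:

   // Sketch of "multiple encodings within a PeerConnection" using the
   // RTCRtpTransceiver API that was standardized after this draft.
   async function sendSimulcastVideo(pc: RTCPeerConnection): Promise<void> {
     const stream = await navigator.mediaDevices.getUserMedia({ video: true });
     const [videoTrack] = stream.getVideoTracks();

     // Two independent encodings (cf. Enc1 and Enc2 in Figure 4) of one source.
     pc.addTransceiver(videoTrack, {
       direction: "sendonly",
       streams: [stream],
       sendEncodings: [
         { rid: "enc1", maxBitrate: 1500000 },                          // full resolution
         { rid: "enc2", maxBitrate: 300000, scaleResolutionDownBy: 4 }, // thumbnail
       ],
     });
     // The mixer can then forward either encoding towards each receiver.
   }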
3.3.  CLUE Telepresence

The CLUE framework [I-D.ietf-clue-framework] and use case [I-D.ietf-clue-telepresence-use-cases] documents make use of most, if not all, media concepts that were already discussed in previous sections, and add a few more.

3.3.1.  Telepresence Functionality

A communicating CLUE Endpoint can, compared to other types of Endpoints, be characterized by its use of multiple media resources:

o  Multiple capture devices, such as cameras or microphones, generating the media for a media source.

o  Multiple render devices, such as displays or speakers.

o  Multiple Media Types, such as audio, video and presentation streams.

o  Multiple remote Endpoints, since conferencing is a typical use case.

o  Multiple Encodings (encoded representations) of a media source.

o  Multiple Media Streams representing multiple media sources.

To make the multitude of resources more manageable, CLUE introduces some additional structures.  For example, related media sources in a multimedia session are grouped into Scenes, which can generally be represented in different ways, described by alternative Scene Entries.  CLUE explicitly separates the concept of a media source from its encoded representations, and a single media source can be used to create multiple Encodings.  It is also possible in CLUE to account for constraints in resource handling, like limitations in possible Encoding combinations due to physical device implementation.

The number of media resources typically differs between Endpoints.  Specifically, the number of available media resources of a certain type used for sending at the sender side typically does not match the number of corresponding media resources used for receiving at the receiver side.  Some selection process must thus be applied either at the sender or the receiver to select a subset of resources to be used.  Hence, each resource that needs to be part of that selection process must have some identification and characterization that can be understood by the selecting party.  In the CLUE model, the sender (Provider) announces available resources and the receiver (Consumer) chooses what to receive.  This choice is made independently in the two directions of a bi-directional communication.

3.3.2.  Distributed Endpoint

The definition of a single CLUE Endpoint in the framework [I-D.ietf-clue-framework] says it can consist of several physical devices with source and sink media streams.  This means that each logical node of such a distributed Endpoint can have a separate transport interface, and thus that media sources originating from the same Endpoint can have different transport addresses.

4.  Discussion

This section discusses some conclusions the authors make based on the use cases.  First we will discuss commonalities between the use cases.  Secondly we will provide a summary of issues we see affecting WebRTC.  Lastly we consider aspects that need to be taken into account in the SDP evolution that is ongoing.

4.1.  Commonalities in Use Cases

The above use cases illustrate a couple of concepts that are not well defined, nor do they have fully specified standard mechanisms or behaviors.  This section contains a discussion of such concepts, which the authors believe are useful in more than one context and thus should be defined to provide a common function when needed by multi-media communication applications.

4.1.1.  Media Source

In several of the above use cases there exists a need for a separation between the media source, the particular encoding, and its transport stream.  In vanilla RTP there exists a one-to-one mapping between these three: one media source is encoded in one particular way and transported as one RTP stream, using a single SSRC in a particular RTP session.
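The separation can be pictured as a small data model.  The interfaces below (TypeScript syntax) are purely illustrative; none of these names exist in any specification, and they simply make the one-to-many relationships explicit:

   // Illustrative data model only: one media source can have several
   // encodings, and each encoding is carried as one or more RTP streams.
   interface RtpStream {
     ssrc: number;            // identifies the stream within an RTP session
     rtpSessionId: string;    // which RTP session / transport flow carries it
   }

   interface Encoding {
     payloadType: number;     // negotiated RTP payload type
     maxBitrate?: number;     // example boundary parameters
     maxWidth?: number;
     maxHeight?: number;
     streams: RtpStream[];    // one stream, or several for scalable layers
   }

   interface MediaSource {
     id: string;              // identifier usable across multimedia sessions
     kind: "audio" | "video";
     syncContext: string;     // e.g. the CNAME of the originating endpoint
     encodings: Encoding[];   // one in vanilla RTP; several for simulcast
   }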
The reason for not keeping a strict one-to-one mapping, allowing the media source to be identified separately from the RTP media stream (SSRC), varies depending on the application's needs and the desired functionalities:

Simulcast:  Simulcast is a functionality to provide multiple simultaneous encodings of the same media source.  As each encoding is independent of the others, in contrast to scalable encoding, an independent transport stream is needed for each encoding.  The receiver of a simulcast stream will need to be able to explicitly identify each encoding upon reception, as well as which media source it is an encoding of.  This is especially important in a context of multiple media sources being provided from the same endpoint.

Mesh-based communication:  When a communication application implements multi-party communication through a mesh of transport flows, there exists a need for tracking the original media source, especially when relaying between nodes is possible.  It is likely that the encodings provided over the different transports are different.  If an application uses relaying between different transports, an endpoint may, intentionally or not, receive multiple encodings of the same media source over the same or different transports.  Some applications can handle the needed identification, but some can benefit from a standardized method to identify sources.

The second argument above can be generalized into a common need in applications that utilize multiple multimedia sessions, such as multiple PeerConnections or multiple SIP/SDP-established RTP sessions, to form a larger communication session between multiple endpoints.  These applications commonly need to track media sources that occur in more than one multimedia session.

Looking at both CLUE and WebRTC, each appears to contain its own variant of the concept that was denoted a media source above.  In CLUE it is called Media Capture.  In WebRTC, each MediaStreamTrack is identifiable; however, several MediaStreamTracks can share the actual source, and there is currently no way for the application to realize this.  The identification of sources is being discussed, and there is a proposal [ref-leithead] that introduces the concept 'Track Source'.  Thus, in this document we see the media source as the generalized commonality between these two concepts.  Giving each media source a unique identifier in the communication session/context that is reused in all the PeerConnections or SIP/SDP-established RTP sessions would enable loop detection, allow alternative encodings to be correctly associated, and provide a common name across the endpoints for application logic to reference the actual media source rather than a particular encoding or transport stream.

It is arguable whether the application should really know a long-term persistent source identification, such as one based on hardware identities, for example due to fingerprinting issues; it would likely be better to use an anonymous identification that is still unique in a sufficiently wide context, for example within the communication application instance.

Encodings 719 An Encoding is a particular encoded representation of a particular 720 media source. In the context of RTP and Signalling, a particular 721 encoding must fit the established parameters, such as RTP payload 722 types, media bandwidths, and other more or less codec-specific media 723 constraints such as resolution, frame-rate, fidelity, audio 724 bandwidth, etc. 726 In the context of an application, it appears that there are primarily 727 two considerations around the use of multiple encodings. 729 The first is how many and what their defining parameters are. This 730 may require to be negotiated, something the existing signalling 731 solutions, like SDP, currently lack support for. For example in SDP, 732 there exist no way to express that you would like to receive three 733 different encodings of a particular video source. In addition, if 734 you for example prefer these three encodings to be 720p/25 Hz, 735 360p/25 Hz and 180p/12.5 Hz, and even if you could define RTP payload 736 types with these constraints, they must be linked to RTP streams 737 carrying the encodings of the particular source. Also, for some RTP 738 payload types there exist difficulties to express encoding 739 characteristics with the desired granularity. The number of RTP 740 payload types that can be used for a particular potential encoding 741 can also be a constraint, especially as a single RTP payload type 742 could well be used for all three target resolutions and frame rates 743 in the example. Using multiple encodings might even be desirable for 744 multi-party conferences that switches video, rather than composites 745 and re-encodes it. It might be that SDP is not the most suitable 746 place to negotiate this. From an application perspective, utilizing 747 clients that have standardized APIs or protocols to control them, 748 there exist a need for the application to express what it prefers in 749 number of encodings as well as what their primary target parameters 750 are. 752 Secondly, some applications may need explicit indication of what 753 encoding a particular stream represents. In some cases this can be 754 deduced based on information such as RTP payload types and parameters 755 received in the media stream, but such implicit information will not 756 always be detailed enough and it may also be time-consuming to 757 extract. For example, in SDP there is currently limitations for 758 binding the relevant information about a particular encoding to the 759 corresponding RTP stream, unless only a single RTP stream is defined 760 per media description (m= line). 762 The CLUE framework explicitly discusses encodings as constraints that 763 are applied when transforming a media source (capture) into what CLUE 764 calls a capture encoding. This includes both explicit identification 765 as well as a set of boundary parameters such as maximum width, 766 height, frame rate as well as bandwidth. In WebRTC nothing related 767 has yet been defined, and we note this as an issue that needs to be 768 resolved. This as the authors expect that support for multiple 769 encodings will be required to enable simulcast and scalability. 771 4.1.3. Synchronization contexts 773 The shortcomings around synchronization contexts appears rather 774 limited. In RTP, each RTP media stream is associated with a 775 particular synchronization context through the CNAME session 776 description item. The main concerns here are likely twofold. 
The first concern is to avoid unnecessary creation of new contexts, and rather to correctly associate with the contexts that actually exist.  For example, WebRTC MediaStreams are defined so that all MediaStreamTracks within a particular MediaStream shall be synchronized.  An easy method for meeting this would be to assign a new CNAME for each MediaStream.  However, that would ignore the fact that several media sources from the same synchronization context may appear in different combinations across several MediaStreams.  Thus all these MediaStreams should share a synchronization context to avoid playback glitches, like playing back different instantiations of a single media source out of sync because the media source was shared between two different MediaStreams.

The second problem is that the synchronization context identification in RTP, i.e. the CNAME, is overloaded as an endpoint identifier.  As an example, consider an endpoint that has two synchronization contexts; one for audio and video in the room and another for an audio and video presentation stream, like the output of a DVD player.  Relying on an endpoint having only a single synchronization context and CNAME may be incorrect and could create issues that application designers as well as RTP and signalling extension specifications need to watch out for.

CLUE so far says rather little about synchronization, but clearly intends to enable lip synchronization between captures that have that relation.  The second issue is, however, quite likely to be encountered in CLUE due to the explicit inclusion of the Scene concept, where different Scenes are not required to share the same synchronization context; the concept is rather intended for situations where Scenes cannot share a synchronization context.

4.1.4.  Distributed Endpoints

When an endpoint consists of multiple nodes, the added complexity is often local to that endpoint, which is appropriate.  However, a few properties of distributed endpoints need to be tolerated by all entities in a multimedia communication session.  The main item is to not assume that a single endpoint will only use a single network address.  This is a dangerous assumption even for non-distributed endpoints due to multi-homing and the common deployment of NATs, especially large-scale NATs, which in the worst case use multiple addresses for a single endpoint's transport flows.

Distributed endpoints are brought up in the CLUE context.  They are not specifically discussed in the WebRTC context; instead, the desire for transport-level aggregation makes such endpoints problematic.  However, WebRTC does allow for fallback to media-type-specific transport flows and can thus support distributed endpoints without issues.

4.2.  Identified WebRTC issues

In the process of identifying commonalities and differences between the different use cases, we have identified what to us appear to be issues in the current specification of WebRTC that need to be reviewed.

1.  If simulcast or scalability is to be supported at all, the WebRTC API will need to find a method to deal more explicitly with the existence of different encodings and how these are configured, accessed and referenced.
For simulcast, the authors see a quite straightforward solution where each PeerConnection is only allowed to contain a single encoding for a specific media source and the desired quality level can be negotiated for the full PeerConnection.  When multiple encodings are desired, multiple PeerConnections with differences in configuration are established.  That would only require that the underlying media source can be explicitly indicated and tracked by the receiver.

2.  The current API structure allows multiple MediaStreams with fully or partially overlapping media sources.  Combined with multiple PeerConnections and the likely possibility of relaying, there appears to be a significant need to determine the underlying media source, despite receiving different MediaStreams with particular media sources encoded in different ways.  It is proposed that media sources be made uniquely identifiable across multiple PeerConnections in the context of the communication application.  It is however likely that, while being unique in a sufficiently large context, the identification should also be anonymous to avoid fingerprinting issues, similar to the situation discussed in Section 4.1.1.

3.  Implementations of the MediaStream API must be careful in how they name and deal with synchronization contexts, so that the actual underlying synchronization context is preserved when possible.  It should be noted that this cannot be done when a MediaStream is created that contains media sources from multiple synchronization contexts.  This will instead require resynchronization of the contributing sources, creation of a new synchronization context, and insertion of the sources into that synchronization context.

These issues need to be discussed and an appropriate way to resolve them must be chosen.

4.3.  Relevant to SDP evolution

The joint MMUSIC/RTCWEB interim meeting in February 2013 will discuss a number of SDP-related issues: the handling of multiple sources; the aggregation of multiple media types over the same RTP session; as well as RTP sharing its transport flow not only with ICE/STUN but also with the WebRTC data channel using SCTP/DTLS/UDP.  These issues will potentially result in a significant impact on SDP.  They may also impact other ongoing work as well as existing usages and applications, making these discussions difficult.

The above use cases and discussion point to the existence of a number of commonalities between WebRTC and CLUE, and suggest that a solution should preferably be usable by both.  It is a very open question how much functionality CLUE requires from SDP, as the CLUE WG plans to develop a protocol with a different usage model.  The appropriate division of functionality between SDP and this protocol is currently unknown.

Based on this document, it is possible to express some protocol requirements when negotiating multimedia sessions and their media configurations.  Note that these are written as requirements to consider, assuming that one believes this functionality is needed in SDP.

The Requirements:

Encoding negotiation:  For Simulcast and Scalability in applications, it must be possible to negotiate the number of, and the boundary conditions for, the desired encodings created from a particular media source.
Media Resource Identification:  SDP-based applications that need explicit information about media sources, multiple encodings and their related RTP media streams could benefit from a common way of providing this information.  This need can result in multiple different actual requirements.  Some applications require a common, explicit identification of media sources across multiple signalling contexts.  Some may require an explicit indication of which set of encodings has the same media source, and thus which sets of RTP media streams (SSRCs) are related to a particular media source.

RTP media stream parameters:  With a greater heterogeneity of the possible encodings and their boundary conditions, situations may arise where some RTP media streams, or sets of them, will need to have specific sets of parameters associated with them, compared to other (sets of) RTP media streams.

The above are general requirements, and in some cases the appropriate point to address a requirement may not even be SDP.  For example, media source identification could primarily be put in an RTCP Session Description (SDES) item, and be included in the signalling only when required by the application.

The discussion in this document has an impact on the high-level decision regarding how to relate RTP media streams to SDP media descriptions.  However, as it currently presents concepts rather than concrete proposals on how to enable these concepts as extensions to SDP or other protocols, it is difficult to determine the actual impact that a high-level solution will have.  The authors are nevertheless convinced that neither of the directions will prevent the definition of suitable concepts in SDP.

5.  IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as an RFC.

6.  Security Considerations

The realization of the proposed concepts and their resolution will have security considerations.  However, at this stage it is unclear whether any of them go beyond the already common considerations of preserving privacy and confidentiality and ensuring integrity to prevent denial of service or quality degradation.

7.  Informative References

[I-D.ietf-avtcore-multi-media-rtp-session]
           Westerlund, M., Perkins, C., and J. Lennox, "Multiple Media Types in an RTP Session", draft-ietf-avtcore-multi-media-rtp-session-01 (work in progress), October 2012.

[I-D.ietf-clue-framework]
           Duckworth, M., Pepperell, A., and S. Wenger, "Framework for Telepresence Multi-Streams", draft-ietf-clue-framework-08 (work in progress), December 2012.

[I-D.ietf-clue-telepresence-use-cases]
           Romanow, A., Botzko, S., Duckworth, M., Even, R., and I. Communications, "Use Cases for Telepresence Multi-streams", draft-ietf-clue-telepresence-use-cases-04 (work in progress), August 2012.

[I-D.ietf-mmusic-sdp-bundle-negotiation]
           Holmberg, C. and H. Alvestrand, "Multiplexing Negotiation Using Session Description Protocol (SDP) Port Numbers", draft-ietf-mmusic-sdp-bundle-negotiation-01 (work in progress), August 2012.

[I-D.ietf-rtcweb-overview]
           Alvestrand, H., "Overview: Real Time Protocols for Browser-based Applications", draft-ietf-rtcweb-overview-05 (work in progress), December 2012.
984 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, 985 A., Peterson, J., Sparks, R., Handley, M., and E. 986 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 987 June 2002. 989 [RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model 990 with Session Description Protocol (SDP)", RFC 3264, 991 June 2002. 993 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 994 Jacobson, "RTP: A Transport Protocol for Real-Time 995 Applications", STD 64, RFC 3550, July 2003. 997 [RFC6465] Ivov, E., Marocco, E., and J. Lennox, "A Real-time 998 Transport Protocol (RTP) Header Extension for Mixer-to- 999 Client Audio Level Indication", RFC 6465, December 2011. 1001 [ref-leithead] 1002 Microsoft, "Proposal: Media Capture and Streams Settings 1003 API v6, https://dvcs.w3.org/hg/dap/raw-file/tip/ 1004 media-stream-capture/proposals/ 1005 SettingsAPI_proposal_v6.html", December 2012. 1007 [ref-media-capture] 1008 "Media Capture and Streams, 1009 http://dev.w3.org/2011/webrtc/editor/getusermedia.html", 1010 December 2012. 1012 [ref-webrtc10] 1013 "WebRTC 1.0: Real-time Communication Between Browsers, 1014 http://dev.w3.org/2011/webrtc/editor/webrtc.html", 1015 January 2013. 1017 Authors' Addresses 1019 Bo Burman 1020 Ericsson 1021 Farogatan 6 1022 SE-164 80 Kista 1023 Sweden 1025 Phone: +46 10 714 13 11 1026 Email: bo.burman@ericsson.com 1028 Magnus Westerlund 1029 Ericsson 1030 Farogatan 6 1031 SE-164 80 Kista 1032 Sweden 1034 Phone: +46 10 714 82 87 1035 Email: magnus.westerlund@ericsson.com