Network Working Group                                          B. Burman
Internet-Draft                                             M. Westerlund
Intended status: Informational                                  Ericsson
Expires: August 4, 2013                                 January 31, 2013

                    Multi-Media Concepts and Relations
             draft-burman-rtcweb-mmusic-media-structure-00

Abstract

There are currently significant efforts ongoing in IETF regarding more advanced multi-media functionalities, such as the work related to RTCWEB and CLUE.  This work includes use cases for both multi-party communication and multiple media streams from an individual end-point.  The usage of scalable encoding or simulcast encoding as well as different types of transport mechanisms have created additional needs to correctly identify different types of resources and describe their relations to achieve intended functionalities.

The different usages have both commonalities and differences in needs and behavior.  This document attempts to review some usages and identify commonalities and needs.  It then continues to highlight important aspects that need to be considered in the definition of these usages.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 4, 2013.
Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Motivation . . . . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Use Cases  . . . . . . . . . . . . . . . . . . . . . . . . . .  4
     3.1.  Existing RTP Usages  . . . . . . . . . . . . . . . . . . .  4
       3.1.1.  Basic VoIP call  . . . . . . . . . . . . . . . . . . .  4
       3.1.2.  Audio and Video Conference . . . . . . . . . . . . . .  5
       3.1.3.  Audio and Video Switched Conference  . . . . . . . . .  7
     3.2.  WebRTC . . . . . . . . . . . . . . . . . . . . . . . . . .  8
       3.2.1.  Mesh-based Multi-party . . . . . . . . . . . . . . . .  9
       3.2.2.  Multi-source Endpoints . . . . . . . . . . . . . . . . 10
       3.2.3.  Media Relaying . . . . . . . . . . . . . . . . . . . . 11
       3.2.4.  Usage of Simulcast . . . . . . . . . . . . . . . . . . 11
     3.3.  CLUE Telepresence  . . . . . . . . . . . . . . . . . . . . 13
       3.3.1.  Telepresence Functionality . . . . . . . . . . . . . . 13
       3.3.2.  Distributed Endpoint . . . . . . . . . . . . . . . . . 14
   4.  Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 14
     4.1.  Commonalities in Use Cases . . . . . . . . . . . . . . . . 14
       4.1.1.  Media Source . . . . . . . . . . . . . . . . . . . . . 14
       4.1.2.  Encodings  . . . . . . . . . . . . . . . . . . . . . . 16
       4.1.3.  Synchronization contexts . . . . . . . . . . . . . . . 17
       4.1.4.  Distributed Endpoints  . . . . . . . . . . . . . . . . 18
     4.2.  Identified WebRTC issues . . . . . . . . . . . . . . . . . 18
     4.3.  Relevant to SDP evolution  . . . . . . . . . . . . . . . . 19
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 20
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 21
   7.  Informative References . . . . . . . . . . . . . . . . . . . . 21
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22

1.  Introduction

This document concerns itself with the conceptual structures that can be found in different logical levels of a multi-media communication, from transport aspects to high-level needs of the communication application.  The intention is to provide considerations and guidance that can be used when discussing how to resolve issues in the RTCWEB- and CLUE-related standardization.  Typical use cases for those WGs have commonalities that should likely be addressed similarly and in a way that allows them to be aligned.

The document starts by going deeper into the motivation for why this has become an important problem at this time.  This is followed by studies of some use cases and the concepts they contain, and concludes with a discussion of observed commonalities and important aspects to consider.

2.  Motivation

A number of new needs and requirements have arisen lately from work such as WebRTC/RTCWEB [I-D.ietf-rtcweb-overview] and CLUE [I-D.ietf-clue-framework].  The applications considered in those WGs have surfaced new requirements on the usage of both RTP [RFC3550] and existing signalling solutions.

The main application aspects that have created new needs are:

o  Multiple Media Streams from an end-point.  An end-point may have multiple media capture devices, such as cameras or microphone mixes.

o  Group communications involving multiple end-points.  This is realized using both mesh-based connections and centralized conference nodes.  These create a need for dealing with multiple endpoints and/or multiple streams with different origins from a transport peer.

o  Media Stream Adaptation, both to adjust network resource consumption and to handle varying end-point capabilities in group communication.

o  Transport mechanisms, including both higher levels of aggregation [I-D.ietf-mmusic-sdp-bundle-negotiation] [I-D.ietf-avtcore-multi-media-rtp-session] and the use of application-level transport repair mechanisms such as forward error correction (FEC) and/or retransmission.

The presence of multiple media resources or components creates a need to identify, handle and group those resources across multiple different instantiations or alternatives.

3.  Use Cases

3.1.  Existing RTP Usages

There are many different existing RTP usages.  This section brings up some that we deem interesting in comparison to the other use cases.

3.1.1.  Basic VoIP call

This use case is intended to function as a base-line to contrast against the rest of the use cases.

The communication context is an audio-only bi-directional communication between two users, Alice and Bob.  This communication uses a single multi-media session that can be established in a number of ways, but let's assume SIP/SDP [RFC3261][RFC3264].  This multi-media session contains two end-points, one for Alice and one for Bob.  Each end-point has an audio capture device that is used to create a single audio media source at each end-point.

   +-------+         +-------+
   | Alice |<------->|  Bob  |
   +-------+         +-------+

   Figure 1: Point-to-point Audio

The session establishment (SIP/SDP) negotiates the intent to communicate over RTP using only the audio media type.  Inherent in the application is an assumption of only a single media source in each direction.  The boundaries for the encodings are represented using RTP Payload types in conjunction with the SDP bandwidth parameter (b=).  The session establishment is also used to negotiate that RTP will be used, thus resulting in an RTP session being created for the audio.  The underlying transport flows, in this case one bi-directional UDP flow for RTP and another for RTCP, are configured by each end-point providing its IP address and port, which become source or destination depending on the direction in which the packet is sent.

The RTP session will have two RTP media streams, one in each direction, each carrying the encoding of the media source that the sending implementation has chosen based on the boundaries established by the RTP payload types and other SDP parameters, e.g. codec and bit-rates.  In the RTP context, the streams are identified by their SSRCs.

3.1.2.  Audio and Video Conference

This use case is a multi-party use case with a central conference node performing media mixing.  It also includes two media types, audio and video.  The high-level topology of the communication session is the following:

   +-------+         +------------+           +-------+
   |       |<-Audio->|            |<--Audio-->|       |
   | Alice |         |            |           |  Bob  |
   |       |<-Video->|            |<--Video-->|       |
   +-------+         |            |           +-------+
                     |   Mixer    |
   +-------+         |            |           +-------+
   |       |<-Audio->|            |<--Audio-->|       |
   |Charlie|         |            |           | David |
   |       |<-Video->|            |<--Video-->|       |
   +-------+         +------------+           +-------+

   Figure 2: Audio and Video Conference with Centralized Mixing

The communication session is a multi-party conference including the four users Alice, Bob, Charlie, and David.  This communication session contains four end-points and one middlebox (the Mixer).  The communication session is established using four different multi-media sessions; one between each user's endpoint and the middlebox.  Each of these multi-media sessions uses a session establishment method, like SIP/SDP.

Looking at a single multi-media session between a user, e.g. Alice, and the Mixer, there exist two media types, audio and video.  Alice has two capture devices, one video camera giving her a video media source, and an audio capture device giving an audio media source.  These two media sources are captured in the same room by the same end-point and thus have a strong timing relationship, requiring inter-media synchronization at playback to provide the correct fidelity.  Thus Alice's endpoint has a synchronization context that both her media sources use.

These two media sources are encoded using encoding parameters within the boundaries that have been agreed between the end-point and the Mixer using the session establishment.  As has been common practice, each media type will use its own RTP session between the end-point and the mixer.  Thus a single audio stream using a single SSRC will flow from Alice to the Mixer in the Audio RTP session, and a single video stream will flow in the Video RTP session.  Using this division into separate RTP sessions, the bandwidth of both audio and video can be unambiguously and separately negotiated by the SDP bandwidth attributes exchanged between the end-points and the mixer.  Each RTP session uses its own Transport Flows.  The common synchronization context across Alice's two media streams is identified by binding both streams to the same CNAME, generated by Alice's endpoint.

The mixer does not have any physical capture devices; instead it creates conceptual media sources.  It provides two media sources towards Alice: one audio source being a mix of the audio from Bob, Charlie and David, and the second being a conceptual video source that contains a selection of one of the other video sources received from Bob, Charlie, or David, depending on who is speaking.  The Mixer's audio and video sources are provided in an encoding using a codec that is supported by both Alice's endpoint and the mixer.  These streams are identified by a single SSRC in the respective RTP session.

The mixer will have its own synchronization context and it will inject the media from Bob, Charlie and David in a synchronized way into the mixer's synchronization context to maintain the inter-media synchronization of the original media sources.
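As a rough illustration of the relationship described above, the following sketch (TypeScript syntax; the per-SSRC record type is purely hypothetical and not part of any specification) groups the streams a receiver gets on their RTCP CNAME, which is exactly how the shared synchronization context can be recovered:

   // Hypothetical per-SSRC state kept by a receiver; illustration only.
   interface ReceivedStream {
     ssrc: number;                    // RTP stream identifier
     cname: string;                   // RTCP SDES CNAME reported for this SSRC
     mediaType: "audio" | "video";
   }

   // Streams that share a CNAME belong to the same synchronization context
   // and can be played back lip-synced against a common reference clock.
   function groupBySyncContext(streams: ReceivedStream[]): Map<string, ReceivedStream[]> {
     const contexts = new Map<string, ReceivedStream[]>();
     for (const s of streams) {
       const group = contexts.get(s.cname) ?? [];
       group.push(s);
       contexts.set(s.cname, group);
     }
     return contexts;
   }

   // Example: Alice's audio and video share her CNAME; the mixer's streams
   // carry the mixer's own CNAME and form a separate context.
   const contexts = groupBySyncContext([
     { ssrc: 0x1111, cname: "alice@example", mediaType: "audio" },
     { ssrc: 0x2222, cname: "alice@example", mediaType: "video" },
     { ssrc: 0x3333, cname: "mixer@example", mediaType: "audio" },
   ]);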
The mixer establishes independent multimedia sessions with each of the participants' endpoints.  The mixer will in most cases also have unique conceptual media sources for each of the endpoints.  This is because audio mixes and video selections typically exclude media sources originating from the receiving end-point.  For example, Bob's audio mix will be a mix of Alice, Charlie and David, and will not contain Bob's own audio.

This use case may need unique user identities across the whole communication session.  An example of such functionality is a participant list that includes audio energy levels, showing who is speaking within the audio mix.  If that information is carried in RTP using the RTP header extension for Mixer-to-Client Audio Level Indication [RFC6465], then contributing source identities in the form of CSRCs need to be bound to the other end-points' media sources or user identities.  This is needed despite the fact that each RTP session towards a particular user's endpoint is terminated in the RTP mixer.  This points out the need for identifiers that exist in multiple multi-media session contexts.  In most cases this can easily be solved by the application having identities tailored specifically for its own needs, but some applications will benefit from having access to some commonly defined structure for media source identities.

3.1.3.  Audio and Video Switched Conference

This use case is similar to the one above (Section 3.1.2), with the difference that the mixer does not mix media streams by decoding, mixing and re-encoding them, but rather switches a selection of received media more or less unmodified towards receiving end-points.  This difference may not be very apparent to the end-user, but the main motivations to eliminate the mixing operation and switch rather than mix are:

o  Lower processing requirements in the mixer.

o  Lower complexity in the mixer.

o  Higher media quality at the receiver given a certain media bitrate.

o  Lower end-to-end media delay.

Without the mixing operation, the mixer has limited ability to create conceptual media sources that are customized for each receiver.  The reasons for such customizations come from sender and receiver differences in available resources and preferences:

o  Presenting multiple conference users simultaneously, like in a video mosaic.

o  Alignment of sent media quality to the receiver's presentation needs.

o  Alignment of codec type and configuration between sender and receiver.

o  Alignment of encoded bitrate to the available end-to-end link bandwidth.

To enable elimination of the mixing operation, media sent to the mixer must sufficiently well meet the above constraints for all intended receivers.  There are several ways to achieve this.  One way is to, by some system-wide design, ensure that all senders and receivers are basically identical in all the above aspects.  This may however prove unrealistic when variations in conditions and end-points are too large.  Another way is to let a sender provide a (small) set of alternative representations for each sent media source, enough to sufficiently well cover the expected range of variation.  If those media source representations, encodings, are independent from one another, they constitute a Simulcast of the media source.
If an encoding is instead dependent on, and thus requires reception of, one or more other encodings, the representation of the media source jointly achieved by all dependent encodings is said to be Scalable.  Simulcast and Scalable encoding can also be combined.

Both Simulcast and Scalable encodings result in a single media source generating multiple RTP media streams of the same media type.  The division of bandwidth between the Simulcast or Scalable streams for a single media source is application specific and will vary.  The total bandwidth for a Simulcast or Scalable source is the sum of the bandwidths of all included RTP media streams.  Since all streams in a Simulcast or Scalable source originate from the same capture device, they are closely related and should thus share a synchronization context.

The first and second customizations listed above, presenting multiple conference users simultaneously and aligned with the presentation needs of the receiver, can also be achieved without a mixing operation by simply sending appropriate-quality media from those users individually to each receiver.  The total bandwidth of this user presentation aggregate is the sum of the bandwidths of all included RTP media streams.  Audio and video from a single user share a synchronization context and can be synchronized.  Streams that originate from different users do not have the same synchronization context, which is acceptable since they do not need to be synchronized, but just presented jointly.

An actual mixer device need not be either mixing-only or switching-only, but may implement both mixing and switching and may also choose dynamically what to do for a specific media stream and a specific receiving user, on a case-by-case basis or based on some policy.

3.2.  WebRTC

This section brings up two different instantiations of WebRTC [ref-webrtc10] that stress different aspects.  But let us start by reviewing some important aspects of WebRTC and the MediaStream [ref-media-capture] API.

In WebRTC, an application gets access to a media source by calling getUserMedia(), which creates a MediaStream [ref-media-capture] (note the capitalization).  A MediaStream consists of zero or more MediaStreamTracks, where each MediaStreamTrack is associated with a media source.  These locally generated MediaStreams and their tracks are connected to local media sources, which can be media devices such as video cameras or microphones, but can also be files.

A WebRTC PeerConnection (PC) is an association between two endpoints that is capable of communicating media from one end to the other.  The PC concept includes establishment procedures, including media negotiation.  Thus a PC is an instantiation of a Multimedia Session.

When one end-point adds a MediaStream to a PC, the other endpoint will by default receive an encoded representation of the MediaStream and the active MediaStreamTracks.

3.2.1.  Mesh-based Multi-party

This is a WebRTC use case that establishes a multi-party communication session by establishing an individual PC with each participant in the communication session.

   +---+        +---+
   | A |<------>| B |
   +---+        +---+
     ^            ^
      \          /
       \        /
        v      v
         +---+
         | C |
         +---+

   Figure 3: WebRTC Mesh-based Multi-party

Users A, B and C want to have a joint communication session.
This 398 communication session is created using a Web-application without any 399 central conference functionality. Instead, it uses a mesh of 400 PeerConnections to connect each participant's endpoint with the other 401 endpoints. In this example, three double-ended connections are 402 required to connect the three participants, and each endpoint has two 403 PCs. 405 This is an audio and video communication and each end-point has one 406 video camera and one microphone as media sources. Each endpoint 407 creates its own MediaStream with one video MediaStreamTrack and one 408 audio MediaStreamTrack. The endpoints add their MediaStream to both 409 of their PCs. 411 Let's now focus on a single PC; in this case the one established 412 between A and B. During the establishment of this PC, the two 413 endpoints agree to use only a single transport flow for all media 414 types, thus a single RTP session is created between A and B. A's 415 MediaStream has one audio media source that is encoded according to 416 the boundaries established by the PeerConnection establishment 417 signalling, which includes the RTP payload types and thus Codecs 418 supported as well as bit-rate boundaries. The encoding of A's media 419 source is then sent in an RTP stream identified by a unique SSRC. In 420 this case, as there are two media sources at A, two encodings will be 421 created which will be transmitted using two different RTP streams 422 with their respective SSRC. Both these streams will reference the 423 same synchronization context through a common CNAME identifier used 424 by A. B will have the same configuration, thus resulting in at least 425 four SSRC being used in the RTP session part of the A-B PC. 427 Depending on the configuration of the two PCs that A has, i.e. the 428 A-B and the A-C ones, A could potentially reuse the encoding of a 429 media source in both contexts, under certain conditions. First, a 430 common codec and configuration needs to exist and the boundaries for 431 these configurations must allow a common work point. In addition, 432 the required bandwidth capacity needs to be available over the paths 433 used by the different PCs. Both of those conditions are not always 434 true. Thus it is quite likely that the endpoint will sometimes 435 instead be required to produce two different encodings of the same 436 media source. 438 If an application needs to reference the media from a particular 439 endpoint, it can use the MediaStream and MediaStreamTrack as they 440 point back to the media sources at a particular endpoint. This as 441 the MediaStream has a scope that is not PeerConnection specific. 443 The programmer can however implement this differently while 444 supporting the same use case. In this case the programmer creates 445 two MediaStreams that each have MediaStreamTracks that share common 446 media sources. This can be done either by calling getUserMedia() 447 twice, or by cloning the MediaStream obtained by the only 448 getUserMedia() call. In this example the result is two MediaStreams 449 that are connected to different PCs. From an identity perspective, 450 the two MediaStreams are different but share common media sources. 451 This fact is currently not made explicit in the API. 453 3.2.2. Multi-source Endpoints 455 This section concerns itself with endpoints that have more than one 456 media source for a particular media type. 
3.2.2.  Multi-source Endpoints

This section concerns itself with endpoints that have more than one media source for a particular media type.  A straightforward example would be a laptop with a built-in video camera used to capture the user, and a second video camera, for example attached by USB, that is used to capture something else the user wants to show.  Both these cameras are typically present in the same sound field, so it will be common to have only a single audio media source.

A possible way of representing this is to have two MediaStreams, one with the built-in camera and the audio, and a second one with the USB camera and the audio.  Each MediaStream is intended to be played with audio and video synchronized, but the user (local or remote) or application is likely to switch between the two captures.

It is important that a receiving endpoint can determine that the audio in the two MediaStreams has the same synchronization context.  Otherwise a receiver may play back the same media source twice, with some time overlap, when switching between playing the two MediaStreams.  Being able to determine that they are the same media source further allows for removing redundancy by using a single encoding for both MediaStreamTracks, if appropriate.

3.2.3.  Media Relaying

WebRTC endpoints can relay a received MediaStream from one PC to another by the simple API-level maneuver of adding the received MediaStream to the other PC.  Realizing this in the implementation is more complex.  It can also cause some issues from a media perspective.  If an application spanning multiple endpoints that relay media between each other makes a mistake, a media loop can be created.  Media loops could become a significant issue.  For example, an audio echo could be created, i.e. an endpoint receives its own media without detecting that it is its own, and plays it back with some delay.  If a WebRTC endpoint produces a conceptual media source by mixing incoming MediaStreams and there is no loop detection, a feedback loop can be created.

RTP has loop detection to detect and handle such cases within a single RTP session.  However, in the context of WebRTC, the RTP session is local to the PC, so the RTP-level loop detection cannot be relied upon.  Instead, if this protection is needed on the WebRTC MediaStream level, it could for example be achieved by having media source identifiers that can be preserved between the different MediaStreams in the PCs.

When relaying media, it is also beneficial to know when one receives multiple encodings of the same source.  For example, if one encoding arrives with a delay of 80 ms and another with 450 ms, being able to choose the one with 80 ms, rather than being forced to delay all media sources from the same synchronization context to the most delayed source, improves performance.

3.2.4.  Usage of Simulcast

In this section we look at a use case applying simulcast from each user's endpoint to a central conference node, to avoid the need for an individual encoding to each receiving endpoint.  Instead, the central node chooses which of the available encodings is forwarded to a particular receiver, as in Section 3.1.3.
   +-----------+       +------------+  Enc2  +---+
   | A   +-Enc1|------>|            |------->| B |
   |     |     |       |            |        +---+
   | Src-+-Enc2|------>|            |  Enc1  +---+
   +-----------+       |   Mixer    |------->| C |
                       |            |        +---+
                       |            |  Enc2  +---+
                       |            |------->| D |
                       +------------+        +---+

   Figure 4

In this Communication Session there are four users with endpoints and one middlebox (the Mixer).  This is an audio and video communication session.  The audio source is not simulcasted and the endpoint only needs to produce a single encoding.  For the video source, each endpoint will produce multiple encodings (Enc1 and Enc2 in Figure 4) and transfer them simultaneously to the mixer.  The mixer picks the most appropriate encoding for the path from the mixer to each receiving client.

Currently there exists no specified way in WebRTC to realise the above, although use cases and requirements discuss simulcast functionality.  The authors believe there exist two possible solution alternatives in the WebRTC context:

Multiple Encodings within a PeerConnection:  The endpoint that wants to provide a simulcast creates one or more MediaStreams with the media sources it wants to transmit over a particular PC.  The WebRTC API then provides functionality to enable multiple encodings to be produced for a particular MediaStreamTrack, with the possibility to configure the desired quality levels and/or differences for each of the encodings.

Using Multiple PeerConnections:  There exist capabilities to both negotiate and control the codec, bit-rate, video resolution, frame-rate, etc. of a particular MediaStreamTrack in the context of one PeerConnection.  Thus one method to provide multiple encodings is to establish multiple PeerConnections between A and the Mixer, where each PC is configured to provide the desired quality.  Note that this solution comes in two flavors from an application perspective.  One is that the same MediaStream object is added to the two PeerConnections.  The second is that two different MediaStream objects, with the same number of MediaStreamTracks and representing the same sources, are created (e.g. by cloning), with one of them added to the first PeerConnection and the other to the second PeerConnection.

Both of these solutions share a common requirement: the need to separate the received RTP streams not only based on media source, but also on the encoding.  However, on an API level the solutions appear different.  For Multiple Encodings within the context of a PC, the receiver will need new access methods for accessing and manipulating the different encodings.  Using multiple PCs instead requires that one can easily determine the shared (simulcasted) media source despite receiving it in multiple MediaStreams on different PCs.  If the same MediaStream is added to both PCs, the ids of the MediaStream and MediaStreamTracks will be the same, while they will be different if different MediaStreams (representing the same sources) are added to the two PCs.
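For illustration only: the first alternative roughly corresponds to the sendEncodings mechanism that was later added to the W3C RTCPeerConnection API, well after this draft was written.  The sketch below uses that later API (TypeScript syntax); the rid names and bitrate values are arbitrary examples, not taken from any specification:

   // Sketch of "multiple encodings within a PeerConnection" using the
   // RTCRtpTransceiver API that was standardized after this draft.
   async function sendSimulcastVideo(pc: RTCPeerConnection): Promise<void> {
     const stream = await navigator.mediaDevices.getUserMedia({ video: true });
     const [videoTrack] = stream.getVideoTracks();

     // Two independent encodings (cf. Enc1 and Enc2 in Figure 4) of one source.
     pc.addTransceiver(videoTrack, {
       direction: "sendonly",
       streams: [stream],
       sendEncodings: [
         { rid: "enc1", maxBitrate: 1500000 },                          // full resolution
         { rid: "enc2", maxBitrate: 300000, scaleResolutionDownBy: 4 }, // thumbnail
       ],
     });
     // The mixer can then forward either encoding towards each receiver.
   }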
3.3.  CLUE Telepresence

The CLUE framework [I-D.ietf-clue-framework] and use case [I-D.ietf-clue-telepresence-use-cases] documents make use of most, if not all, media concepts that were already discussed in previous sections, and add a few more.

3.3.1.  Telepresence Functionality

A communicating CLUE Endpoint can, compared to other types of Endpoints, be characterized by its use of multiple media resources:

o  Multiple capture devices, such as cameras or microphones, generating the media for a media source.

o  Multiple render devices, such as displays or speakers.

o  Multiple Media Types, such as audio, video and presentation streams.

o  Multiple remote Endpoints, since conferencing is a typical use case.

o  Multiple Encodings (encoded representations) of a media source.

o  Multiple Media Streams representing multiple media sources.

To make the multitude of resources more manageable, CLUE introduces some additional structures.  For example, related media sources in a multimedia session are grouped into Scenes, which can generally be represented in different ways, described by alternative Scene Entries.  CLUE explicitly separates the concept of a media source from its encoded representations, and a single media source can be used to create multiple Encodings.  It is also possible in CLUE to account for constraints in resource handling, like limitations in possible Encoding combinations due to physical device implementation.

The number of media resources typically differs between Endpoints.  Specifically, the number of available media resources of a certain type used for sending at the sender side typically does not match the number of corresponding media resources used for receiving at the receiver side.  Some selection process must thus be applied either at the sender or the receiver to select a subset of resources to be used.  Hence, each resource that needs to be part of that selection process must have some identification and characterization that can be understood by the selecting party.  In the CLUE model, the sender (Provider) announces available resources and the receiver (Consumer) chooses what to receive.  This choice is made independently in the two directions of a bi-directional communication.

3.3.2.  Distributed Endpoint

The definition of a single CLUE Endpoint in the framework [I-D.ietf-clue-framework] says it can consist of several physical devices with source and sink media streams.  This means that each logical node of such a distributed Endpoint can have a separate transport interface, and thus that media sources originating from the same Endpoint can have different transport addresses.

4.  Discussion

This section discusses some conclusions the authors make based on the use cases.  First we will discuss commonalities between the use cases.  Secondly we will provide a summary of issues we see affecting WebRTC.  Lastly we consider aspects that need to be taken into account in the SDP evolution that is ongoing.

4.1.  Commonalities in Use Cases

The above use cases illustrate a couple of concepts that are not well defined, nor do they have fully specified standard mechanisms or behaviors.  This section contains a discussion of such concepts, which the authors believe are useful in more than one context and thus should be defined to provide a common function when needed by multi-media communication applications.

4.1.1.  Media Source

In several of the above use cases there exists a need for a separation between the media source, the particular encoding, and its transport stream.  In vanilla RTP there exists a one-to-one mapping between these three: one media source is encoded in one particular way and transported as one RTP stream, using a single SSRC in a particular RTP session.
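The separation can be pictured as a small data model.  The interfaces below (TypeScript syntax) are purely illustrative; none of these names exist in any specification, and they simply make the one-to-many relationships explicit:

   // Illustrative data model only: one media source can have several
   // encodings, and each encoding is carried as one or more RTP streams.
   interface RtpStream {
     ssrc: number;            // identifies the stream within an RTP session
     rtpSessionId: string;    // which RTP session / transport flow carries it
   }

   interface Encoding {
     payloadType: number;     // negotiated RTP payload type
     maxBitrate?: number;     // example boundary parameters
     maxWidth?: number;
     maxHeight?: number;
     streams: RtpStream[];    // one stream, or several for scalable layers
   }

   interface MediaSource {
     id: string;              // identifier usable across multimedia sessions
     kind: "audio" | "video";
     syncContext: string;     // e.g. the CNAME of the originating endpoint
     encodings: Encoding[];   // one in vanilla RTP; several for simulcast
   }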
The reason for not keeping a strict one-to-one mapping, allowing the media source to be identified separately from the RTP media stream (SSRC), varies depending on the application's needs and the desired functionalities:

Simulcast:  Simulcast is a functionality to provide multiple simultaneous encodings of the same media source.  As each encoding is independent of the others, in contrast to scalable encoding, an independent transport stream is needed for each encoding.  The receiver of a simulcast stream will need to be able to explicitly identify each encoding upon reception, as well as which media source it is an encoding of.  This is especially important in a context of multiple media sources being provided from the same endpoint.

Mesh-based communication:  When a communication application implements multi-party communication through a mesh of transport flows, there exists a need for tracking the original media source, especially when relaying between nodes is possible.  It is likely that the encodings provided over the different transports are different.  If an application uses relaying between different transports, an endpoint may, intentionally or not, receive multiple encodings of the same media source over the same or different transports.  Some applications can handle the needed identification, but some can benefit from a standardized method to identify sources.

The second argument above can be generalized into a common need in applications that utilize multiple multimedia sessions, such as multiple PeerConnections or multiple SIP/SDP-established RTP sessions, to form a larger communication session between multiple endpoints.  These applications commonly need to track media sources that occur in more than one multimedia session.

Looking at both CLUE and WebRTC, each appears to contain its own variant of the concept that was denoted a media source above.  In CLUE it is called Media Capture.  In WebRTC, each MediaStreamTrack is identifiable; however, several MediaStreamTracks can share the actual source, and there is currently no way for the application to realize this.  The identification of sources is being discussed, and there is a proposal [ref-leithead] that introduces the concept 'Track Source'.  Thus, in this document we see the media source as the generalized commonality between these two concepts.  Giving each media source a unique identifier in the communication session/context that is reused in all the PeerConnections or SIP/SDP-established RTP sessions would enable loop detection, allow alternative encodings to be correctly associated, and provide a common name across the endpoints for application logic to reference the actual media source rather than a particular encoding or transport stream.

It is arguable whether the application should really know a long-term persistent source identification, such as one based on hardware identities, for example due to fingerprinting issues; it would likely be better to use an anonymous identification that is still unique in a sufficiently wide context, for example within the communication application instance.

Encodings 719 An Encoding is a particular encoded representation of a particular 720 media source. In the context of RTP and Signalling, a particular 721 encoding must fit the established parameters, such as RTP payload 722 types, media bandwidths, and other more or less codec-specific media 723 constraints such as resolution, frame-rate, fidelity, audio 724 bandwidth, etc. 726 In the context of an application, it appears that there are primarily 727 two considerations around the use of multiple encodings. 729 The first is how many and what their defining parameters are. This 730 may require to be negotiated, something the existing signalling 731 solutions, like SDP, currently lack support for. For example in SDP, 732 there exist no way to express that you would like to receive three 733 different encodings of a particular video source. In addition, if 734 you for example prefer these three encodings to be 720p/25 Hz, 735 360p/25 Hz and 180p/12.5 Hz, and even if you could define RTP payload 736 types with these constraints, they must be linked to RTP streams 737 carrying the encodings of the particular source. Also, for some RTP 738 payload types there exist difficulties to express encoding 739 characteristics with the desired granularity. The number of RTP 740 payload types that can be used for a particular potential encoding 741 can also be a constraint, especially as a single RTP payload type 742 could well be used for all three target resolutions and frame rates 743 in the example. Using multiple encodings might even be desirable for 744 multi-party conferences that switches video, rather than composites 745 and re-encodes it. It might be that SDP is not the most suitable 746 place to negotiate this. From an application perspective, utilizing 747 clients that have standardized APIs or protocols to control them, 748 there exist a need for the application to express what it prefers in 749 number of encodings as well as what their primary target parameters 750 are. 752 Secondly, some applications may need explicit indication of what 753 encoding a particular stream represents. In some cases this can be 754 deduced based on information such as RTP payload types and parameters 755 received in the media stream, but such implicit information will not 756 always be detailed enough and it may also be time-consuming to 757 extract. For example, in SDP there is currently limitations for 758 binding the relevant information about a particular encoding to the 759 corresponding RTP stream, unless only a single RTP stream is defined 760 per media description (m= line). 762 The CLUE framework explicitly discusses encodings as constraints that 763 are applied when transforming a media source (capture) into what CLUE 764 calls a capture encoding. This includes both explicit identification 765 as well as a set of boundary parameters such as maximum width, 766 height, frame rate as well as bandwidth. In WebRTC nothing related 767 has yet been defined, and we note this as an issue that needs to be 768 resolved. This as the authors expect that support for multiple 769 encodings will be required to enable simulcast and scalability. 771 4.1.3. Synchronization contexts 773 The shortcomings around synchronization contexts appears rather 774 limited. In RTP, each RTP media stream is associated with a 775 particular synchronization context through the CNAME session 776 description item. The main concerns here are likely twofold. 
The first concern is to avoid unnecessary creation of new contexts, and rather to correctly associate with the contexts that actually exist.  For example, WebRTC MediaStreams are defined so that all MediaStreamTracks within a particular MediaStream shall be synchronized.  An easy method for meeting this would be to assign a new CNAME for each MediaStream.  However, that would ignore the fact that several media sources from the same synchronization context may appear in different combinations across several MediaStreams.  Thus all these MediaStreams should share a synchronization context to avoid playback glitches, like playing back different instantiations of a single media source out of sync because the media source was shared between two different MediaStreams.

The second problem is that the synchronization context identification in RTP, i.e. the CNAME, is overloaded as an endpoint identifier.  As an example, consider an endpoint that has two synchronization contexts; one for audio and video in the room and another for an audio and video presentation stream, like the output of a DVD player.  Relying on an endpoint having only a single synchronization context and CNAME may be incorrect and could create issues that application designers as well as RTP and signalling extension specifications need to watch out for.

CLUE so far says rather little about synchronization, but clearly intends to enable lip synchronization between captures that have that relation.  The second issue is, however, quite likely to be encountered in CLUE due to the explicit inclusion of the Scene concept, where different Scenes are not required to share the same synchronization context; the concept is rather intended for situations where Scenes cannot share a synchronization context.

4.1.4.  Distributed Endpoints

When an endpoint consists of multiple nodes, the added complexity is often local to that endpoint, which is appropriate.  However, a few properties of distributed endpoints need to be tolerated by all entities in a multimedia communication session.  The main item is to not assume that a single endpoint will only use a single network address.  This is a dangerous assumption even for non-distributed endpoints due to multi-homing and the common deployment of NATs, especially large-scale NATs, which in the worst case use multiple addresses for a single endpoint's transport flows.

Distributed endpoints are brought up in the CLUE context.  They are not specifically discussed in the WebRTC context; instead, the desire for transport-level aggregation makes such endpoints problematic.  However, WebRTC does allow for fallback to media-type-specific transport flows and can thus support distributed endpoints without issues.

4.2.  Identified WebRTC issues

In the process of identifying commonalities and differences between the different use cases, we have identified what to us appear to be issues in the current specification of WebRTC that need to be reviewed.

1.  If simulcast or scalability is to be supported at all, the WebRTC API will need to find a method to deal more explicitly with the existence of different encodings and how these are configured, accessed and referenced.
For simulcast, the authors see a quite straightforward solution where each PeerConnection is only allowed to contain a single encoding for a specific media source and the desired quality level can be negotiated for the full PeerConnection.  When multiple encodings are desired, multiple PeerConnections with differences in configuration are established.  That would only require that the underlying media source can be explicitly indicated and tracked by the receiver.

2.  The current API structure allows multiple MediaStreams with fully or partially overlapping media sources.  Combined with multiple PeerConnections and the likely possibility of relaying, there appears to be a significant need to determine the underlying media source, despite receiving different MediaStreams with particular media sources encoded in different ways.  It is proposed that media sources be made uniquely identifiable across multiple PeerConnections in the context of the communication application.  It is however likely that, while being unique in a sufficiently large context, the identification should also be anonymous to avoid fingerprinting issues, similar to the situation discussed in Section 4.1.1.

3.  Implementations of the MediaStream API must be careful in how they name and deal with synchronization contexts, so that the actual underlying synchronization context is preserved when possible.  It should be noted that this cannot be done when a MediaStream is created that contains media sources from multiple synchronization contexts.  This will instead require resynchronization of the contributing sources, creation of a new synchronization context, and insertion of the sources into that synchronization context.

These issues need to be discussed and an appropriate way to resolve them must be chosen.

4.3.  Relevant to SDP evolution

The joint MMUSIC/RTCWEB interim meeting in February 2013 will discuss a number of SDP-related issues: the handling of multiple sources; the aggregation of multiple media types over the same RTP session; as well as RTP sharing its transport flow not only with ICE/STUN but also with the WebRTC data channel using SCTP/DTLS/UDP.  These issues will potentially result in a significant impact on SDP.  They may also impact other ongoing work as well as existing usages and applications, making these discussions difficult.

The above use cases and discussion point to the existence of a number of commonalities between WebRTC and CLUE, and suggest that a solution should preferably be usable by both.  It is a very open question how much functionality CLUE requires from SDP, as the CLUE WG plans to develop a protocol with a different usage model.  The appropriate division of functionality between SDP and this protocol is currently unknown.

Based on this document, it is possible to express some protocol requirements when negotiating multimedia sessions and their media configurations.  Note that these are written as requirements to consider, assuming that one believes this functionality is needed in SDP.

The Requirements:

Encoding negotiation:  For Simulcast and Scalability in applications, it must be possible to negotiate the number of, and the boundary conditions for, the desired encodings created from a particular media source.
Media Resource Identification:  SDP-based applications that need explicit information about media sources, multiple encodings and their related RTP media streams could benefit from a common way of providing this information.  This need can result in multiple different actual requirements.  Some applications require a common, explicit identification of media sources across multiple signalling contexts.  Some may require an explicit indication of which set of encodings has the same media source, and thus which sets of RTP media streams (SSRCs) are related to a particular media source.

RTP media stream parameters:  With a greater heterogeneity of the possible encodings and their boundary conditions, situations may arise where some RTP media streams, or sets of them, will need to have specific sets of parameters associated with them, compared to other (sets of) RTP media streams.

The above are general requirements, and in some cases the appropriate point to address a requirement may not even be SDP.  For example, media source identification could primarily be put in an RTCP Session Description (SDES) item, and be included in the signalling only when required by the application.

The discussion in this document has an impact on the high-level decision regarding how to relate RTP media streams to SDP media descriptions.  However, as it currently presents concepts rather than concrete proposals on how to enable these concepts as extensions to SDP or other protocols, it is difficult to determine the actual impact that a high-level solution will have.  The authors are nevertheless convinced that neither of the directions will prevent the definition of suitable concepts in SDP.

5.  IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as an RFC.

6.  Security Considerations

The realization of the proposed concepts and their resolution will have security considerations.  However, at this stage it is unclear whether any of them go beyond the already common considerations of preserving privacy and confidentiality and ensuring integrity to prevent denial of service or quality degradation.

7.  Informative References

[I-D.ietf-avtcore-multi-media-rtp-session]
           Westerlund, M., Perkins, C., and J. Lennox, "Multiple Media Types in an RTP Session", draft-ietf-avtcore-multi-media-rtp-session-01 (work in progress), October 2012.

[I-D.ietf-clue-framework]
           Duckworth, M., Pepperell, A., and S. Wenger, "Framework for Telepresence Multi-Streams", draft-ietf-clue-framework-08 (work in progress), December 2012.

[I-D.ietf-clue-telepresence-use-cases]
           Romanow, A., Botzko, S., Duckworth, M., Even, R., and I. Communications, "Use Cases for Telepresence Multi-streams", draft-ietf-clue-telepresence-use-cases-04 (work in progress), August 2012.

[I-D.ietf-mmusic-sdp-bundle-negotiation]
           Holmberg, C. and H. Alvestrand, "Multiplexing Negotiation Using Session Description Protocol (SDP) Port Numbers", draft-ietf-mmusic-sdp-bundle-negotiation-01 (work in progress), August 2012.

[I-D.ietf-rtcweb-overview]
           Alvestrand, H., "Overview: Real Time Protocols for Browser-based Applications", draft-ietf-rtcweb-overview-05 (work in progress), December 2012.
984 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, 985 A., Peterson, J., Sparks, R., Handley, M., and E. 986 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 987 June 2002. 989 [RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model 990 with Session Description Protocol (SDP)", RFC 3264, 991 June 2002. 993 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 994 Jacobson, "RTP: A Transport Protocol for Real-Time 995 Applications", STD 64, RFC 3550, July 2003. 997 [RFC6465] Ivov, E., Marocco, E., and J. Lennox, "A Real-time 998 Transport Protocol (RTP) Header Extension for Mixer-to- 999 Client Audio Level Indication", RFC 6465, December 2011. 1001 [ref-leithead] 1002 Microsoft, "Proposal: Media Capture and Streams Settings 1003 API v6, https://dvcs.w3.org/hg/dap/raw-file/tip/ 1004 media-stream-capture/proposals/ 1005 SettingsAPI_proposal_v6.html", December 2012. 1007 [ref-media-capture] 1008 "Media Capture and Streams, 1009 http://dev.w3.org/2011/webrtc/editor/getusermedia.html", 1010 December 2012. 1012 [ref-webrtc10] 1013 "WebRTC 1.0: Real-time Communication Between Browsers, 1014 http://dev.w3.org/2011/webrtc/editor/webrtc.html", 1015 January 2013. 1017 Authors' Addresses 1019 Bo Burman 1020 Ericsson 1021 Farogatan 6 1022 SE-164 80 Kista 1023 Sweden 1025 Phone: +46 10 714 13 11 1026 Email: bo.burman@ericsson.com 1028 Magnus Westerlund 1029 Ericsson 1030 Farogatan 6 1031 SE-164 80 Kista 1032 Sweden 1034 Phone: +46 10 714 82 87 1035 Email: magnus.westerlund@ericsson.com