CLUE WG                                                        A. Romanow
Internet-Draft                                              Cisco Systems
Intended status: Informational                          M. Duckworth, Ed.
Expires: August 7, 2012                                            Polycom
                                                              A. Pepperell
                                                                B. Baldino
                                                             Cisco Systems
                                                          February 4, 2012

              Framework for Telepresence Multi-Streams
                  draft-ietf-clue-framework-03.txt

Abstract

This memo offers a framework for a protocol that enables devices in a telepresence conference to interoperate by specifying the relationships between multiple media streams.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 7, 2012.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . 
. . . . . 3 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 6 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 8 60 6. Media Captures and Capture Sets . . . . . . . . . . . . . . . 8 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 9 63 6.2. Capture Set . . . . . . . . . . . . . . . . . . . . . . . 11 64 6.2.1. Capture set attributes . . . . . . . . . . . . . . . . 12 65 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 13 66 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 14 67 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 14 68 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 15 69 8. Associating Media Captures with Encoding Groups . . . . . . . 16 70 9. Consumer's Choice of Streams to Receive from the Provider . . 17 71 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 17 72 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 18 73 9.3. Encoding and encoding group limits . . . . . . . . . . . . 18 74 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 18 75 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 19 76 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 20 77 11.1. Three screen endpoint media provider . . . . . . . . . . . 20 78 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 26 79 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 27 80 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 27 81 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 28 82 11.4.2. Two screen consumer configuring the example . . . . . 28 83 11.4.3. Three screen consumer configuring the example . . . . 29 84 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 29 85 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 86 14. Security Considerations . . . . . . . . . . . . . . . . . . . 29 87 15. Informative References . . . . . . . . . . . . . . . . . . . . 29 88 Appendix A. Open Issues . . . . . . . . . . . . . . . . . . . . . 30 89 A.1. Video layout arrangements and centralized composition . . 30 90 A.2. Source is selectable . . . . . . . . . . . . . . . . . . . 30 91 A.3. Media Source Selection . . . . . . . . . . . . . . . . . . 30 92 A.4. Endpoint requesting many streams from MCU . . . . . . . . 31 93 A.5. VAD (voice activity detection) tagging of audio streams . 31 94 A.6. Private Information . . . . . . . . . . . . . . . . . . . 31 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 31 97 1. Introduction 99 Current telepresence systems, though based on open standards such as 100 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 101 other. A major factor limiting the interoperability of telepresence 102 systems is the lack of a standardized way to describe and negotiate 103 the use of the multiple streams of audio and video comprising the 104 media flows. This draft provides a framework for a protocol to 105 enable interoperability by handling multiple streams in a 106 standardized way. It is intended to support the use cases described 107 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 108 requirements in draft-ietf-clue-telepresence-requirements-01. 
The solution described here is strongly focused on what is being done today, rather than on a vision of future conferencing. At the same time, the highest priority has been given to creating an extensible framework to make it easy to accommodate future conferencing functionality as it evolves.

The purpose of this effort is to make it possible to handle multiple streams of media in such a way that a satisfactory user experience is possible even when participants are using different vendor equipment, and also when they are using devices with different types of communication capabilities. Information about the relationship of media streams at the provider's end must be communicated so that streams can be chosen and audio/video rendering can be done in the best possible manner.

There is no attempt here to dictate to the renderer what it should do. What the renderer does is up to the renderer.

After the following Definitions, a short section introduces key concepts. The body of the text comprises several sections about the key elements of the framework, how a consumer chooses streams to receive, and some examples. The appendix describes topics that are under discussion for adding to the document.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

The definitions marked with an "*" are new; all the others are from

*Audio Capture: Media Capture for audio. Denoted as ACn.

Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media. They are the opposite of stage-left and stage-right.

Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder. Cameras and microphones are examples of capture devices.

*Capture Scene: the scene that is captured by a collection of Capture Devices. A Capture Scene may be represented by more than one type of Media. A Capture Scene may include more than one Media Capture of the same type. An example of a Capture Scene is the video image of a group of people seated next to each other, along with the sound of their voices, which could be represented by some number of VCs and ACs. A middle box may also express Capture Scenes that it constructs from Media streams it receives.

*Capture Set: A Capture Set includes media captures that are arranged by the provider to help the consumer choose which captures it wants. The entries in a Capture Set represent different alternatives for representing the same Capture Scene.

Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

*Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding. The attributes include: maximum bandwidth and, for video, maximum macroblocks (for H.264), maximum width, maximum height, and maximum frame rate.

*Encoding Group: A set of encoding parameters representing a media provider's encoding capabilities.
Media stream providers formed of 178 multiple physical units, in each of which resides some encoding 179 capability, would typically advertise themselves to the remote media 180 stream consumer using multiple encoding groups. Within each encoding 181 group, multiple potential encodings are possible, with the sum of the 182 chosen encodings' characteristics constrained to being less than or 183 equal to the group-wide constraints. 185 Endpoint: The logical point of final termination through receiving, 186 decoding and rendering, and/or initiation through capturing, 187 encoding, and sending of media streams. An endpoint consists of one 188 or more physical devices which source and sink media streams, and 189 exactly one [RFC4353] Participant (which, in turn, includes exactly 190 one SIP User Agent). In contrast to an endpoint, an MCU may also 191 send and receive media streams, but it is not the initiator nor the 192 final terminator in the sense that Media is Captured or Rendered. 193 Endpoints can be anything from multiscreen/multicamera rooms to 194 handheld devices. 196 Front: the portion of the room closest to the cameras. In going 197 towards back you move away from the cameras. 199 MCU: Multipoint Control Unit (MCU) - a device that connects two or 200 more endpoints together into one single multimedia conference 201 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 202 tardy in requiring that media from the mixer be sent to EACH 203 participant. I think we have practical use cases where this is not 204 the case. But the bug (if it is one) is in 4353 and not herein.] 206 Media: Any data that, after suitable encoding, can be conveyed over 207 RTP, including audio, video or timed text. 209 *Media Capture: a source of Media, such as from one or more Capture 210 Devices. A Media Capture (MC) may be the source of one or more Media 211 streams. A Media Capture may also be constructed from other Media 212 streams. A middle box can express Media Captures that it constructs 213 from Media streams it receives. 215 *Media Consumer: an Endpoint or middle box that receives media 216 streams 218 *Media Provider: an Endpoint or middle box that sends Media streams 220 Model: a set of assumptions a telepresence system of a given vendor 221 adheres to and expects the remote telepresence system(s) also to 222 adhere to. 224 *Plane of Interest: The spatial plane containing the most relevant 225 subject matter. 227 Render: the process of generating a representation from a media, such 228 as displayed motion video or sound emitted from loudspeakers. 230 *Simultaneous Transmission Set: a set of media captures that can be 231 transmitted simultaneously from a Media Provider. 233 Spatial Relation: The arrangement in space of two objects, in 234 contrast to relation in time or other relationships. See also 235 Camera-Left and Right. 237 Stage-Left and Right: For media captures, stage-left and stage-right 238 are the opposite of camera-left and camera-right. For the case of a 239 person facing (and captured by) a camera, stage-left and stage-right 240 are from the point of view of that person. 242 *Stream: RTP stream as in [RFC3550]. 244 Stream Characteristics: the media stream attributes commonly used in 245 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 246 resolution, profile/level etc.) as well as CLUE specific attributes, 247 such as the ID of a capture or a spatial location. 
249 Telepresence: an environment that gives non co-located users or user 250 groups a feeling of (co-located) presence - the feeling that a Local 251 user is in the same room with other Local users and the Remote 252 parties. The inclusion of Remote parties is achieved through 253 multimedia communication including at least audio and video signals 254 of high fidelity. 256 *Video Capture: Media Capture for video. Denoted as VCn. 258 Video composite: A single image that is formed from combining visual 259 elements from separate sources. 261 4. Overview of the Framework/Model 263 The CLUE framework specifies how multiple media streams are to be 264 handled in a telepresence conference. 266 The main goals include: 268 o Interoperability 270 o Extensibility 272 o Flexibility 274 Interoperability is achieved by the media provider describing the 275 relationships between media streams in constructs that are understood 276 by the consumer, who can then render the media. Extensibility is 277 achieved through abstractions and the generality of the model, making 278 it easy to add new parameters. Flexibility is achieved largely by 279 having the consumer choose what content and format it wants to 280 receive from what the provider is capable of sending. This 281 constitutes a significant change from previous video conferencing 282 systems in which transmission of content was determined primarily by 283 the sender. 285 A transmitting endpoint or MCU describes specific aspects of the 286 content of the media and the formatting of the media streams it can 287 send (advertisement); and the receiving end responds to the provider 288 by specifying which content and media streams it wants to receive 289 (configuration). The provider then transmits the asked for content 290 in the specified streams. 292 This advertisement and configuration occurs at call initiation but 293 may also happen at any time throughout the conference, whenever there 294 is a change in what the consumer wants or the provider can send. 296 An endpoint or MCU typically acts as both provider and consumer at 297 the same time, sending advertisements and sending configurations in 298 response to receiving advertisements. (It is possible to be just one 299 or the other.) 301 The data model is based around two main concepts: a capture and an 302 encoding. A media capture (MC), such as audio or video, describes 303 the content a provider can send. Media captures are described in 304 terms of CLUE-defined attributes, such as spatial relationships and 305 purpose of the capture. Providers tell consumers which media 306 captures they can provide, described in terms of the media capture 307 attributes. 309 A provider organizes its media captures that represent the same scene 310 into capture sets. A consumer chooses which media captures it wants 311 to receive according to the capture sets sent by the provider. 313 In addition, the provider sends the consumer a description of the 314 streams it can send in terms of the media attributes of the stream, 315 in particular, well-known audio and video parameters such as 316 bandwidth, frame rate, macroblocks per second. 318 The provider also specifies constraints on its ability to provide 319 media, and the consumer must take these into account in choosing the 320 content and streams it wants. Some constraints are due to the 321 physical limitations of devices - for example, a camera may not be 322 able to provide zoom and non-zoom views simultaneously. 
Other 323 constraints are system based constraints, such as maximum bandwidth 324 and maximum macroblocks/second. 326 The following sections discuss these constructs and processes in 327 detail, followed by use cases showing how the framework specification 328 can be used. 330 5. Spatial Relationships 332 In order for a consumer to perform a proper rendering, it is often 333 necessary to provide spatial information about the streams it is 334 receiving. CLUE defines a coordinate system that allows producers to 335 describe the spatial relationships of their Media Captures to enable 336 proper scaling and spatial rendering of their streams. The 337 coordinate system is based on a few principles: 339 o Simple systems which do not have multiple Media Captures to 340 associate spatially need not use the coordinate model. 342 o Coordinates can either be in real, physical units (millimeters), 343 have an unknown scale or have no physical scale. Systems which 344 know their physical dimensions should always provide those real- 345 world measurements. Systems which don't know specific physical 346 dimensions but still know relative distances should use 'unknown 347 scale'. 'No scale' is intended to be used where Media Captures 348 from different devices (with potentially different scales) will be 349 forwarded alongside one another (e.g. in the case of a middle 350 box). 352 * "millimeters" means the scale is in millimeters 354 * "Unknown" means the scale is not necessarily millimeters, but 355 the scale is the same for every capture in the capture set. 357 * "No Scale" means the scale could be different for each capture 358 - an MCU provider that advertises two adjacent captures and 359 picks sources (which can change quickly) from different 360 endpoints might use this value; the scale could be different 361 and changing for each capture. But the areas of capture still 362 represent a spatial relation between captures. 364 o The coordinate system is Cartesian X, Y, Z with the origin at a 365 spot of the provider's choosing. The provider must use the same 366 origin for all coordinates within the same capture set. 368 The direction of increasing coordinate values is: 369 X increases from camera left to camera right 370 Y increases from front to back 371 Z increases from low to high 373 6. Media Captures and Capture Sets 375 This section describes how media providers can describe the content 376 of media to consumers. 378 6.1. Media Captures 380 Media captures are the fundamental representations of streams that a 381 device can transmit. What a Media Capture actually represents is 382 flexible: 384 o It can represent the immediate output of a physical source (e.g. 385 camera, microphone) or 'synthetic' source (e.g. laptop computer, 386 DVD player). 388 o It can represent the output of an audio mixer or video composer 390 o It can represent a concept such as 'the loudest speaker' 392 o It can represent a conceptual position such as 'the leftmost 393 stream' 395 To distinguish between multiple instances, video and audio captures 396 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 397 two different video captures and AC1 and AC2 refer to two different 398 audio captures. 400 Each Media Capture can be associated with attributes to describe what 401 it represents. 403 6.1.1. Media Capture Attributes 405 Media Capture Attributes describe static information about the 406 captures that can be used by the consumer to help decide which Media 407 Captures should be requested. 
Attributes are defined by a variable and its value. The currently defined attributes and their values are:

Purpose: {main, presentation}

A field with enumerated values which describes the role of the Media Capture and can be applied to any media type.

A value of 'main' describes the primary content of the room (such as participant media).

A value of 'presentation' describes the secondary content of the room (such as media coming from a laptop).

Composed: {true, false}

A field with a Boolean value which indicates whether or not the Media Capture is a mix (audio) or composition (video) of streams.

This attribute is not intended to describe the layout used when compositing video streams.

Audio Channel Format: {mono, stereo}

A field with enumerated values which describes the method of encoding used for audio.

A value of 'mono' means the Audio Capture has one channel.

A value of 'stereo' means the Audio Capture has two audio channels, left and right.

This attribute applies only to Audio Captures.

Switched: {true, false}

A field with a Boolean value which indicates whether or not the Media Capture represents the (dynamic) most appropriate subset of a 'whole'. What is 'most appropriate' is up to the producer and could be the active speaker, a lecturer or a VIP.

Point of Capture: {(X, Y, Z)}

A field with a single Cartesian (X, Y, Z) point value which describes the spatial location, virtual or physical, of the capturing device (such as camera).

When the Point of Capture attribute is specified, it must include X, Y and Z coordinates.

Area of Capture:

{bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, Z3), top right(X4, Y4, Z4)}

A field with a set of four (X, Y, Z) points as a value which describes the spatial location of what is being "captured". By comparing the Area of Capture for different Media Captures within the same capture set a consumer can determine the spatial relationships between them and render them correctly.

The four points should be co-planar. The four points form a quadrilateral, not necessarily a rectangle.

The quadrilateral described by the four (X, Y, Z) points defines the plane of interest for the particular media capture.

If the area of capture attribute is specified, it must include X, Y and Z coordinates for all four points.

For a switched capture that switches between different sections within a larger area, the area of capture should use coordinates for the larger potential area.

EncodingGroup: {encodeGroupID}

A field with a value equal to the encodeGroupID of the encoding group associated with the media capture.

6.2. Capture Set

In order for a provider's individual media captures to be used effectively by a consumer, the provider organizes the media captures into capture sets, with the structure and contents of these sets being sent from the provider to the consumer.

A provider may advertise multiple capture sets or just a single capture set. A capture set can be said to correspond to a provided "scene", and a media provider might typically use one capture set for main participant media and another capture set for a computer generated presentation. Capture sets will commonly include media captures of different types, for instance, audio captures and video captures.
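This framework does not define a concrete syntax for these constructs. Purely as a non-normative illustration of how media captures, their attributes, and capture sets fit together, a provider-side data model could be sketched roughly as follows (the class and field names are hypothetical; capture set "entries" are described in the paragraphs below):

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # (X, Y, Z) in the coordinate system of Section 5

   @dataclass
   class MediaCapture:
       # Illustrative model of a Media Capture and the attributes of Section 6.1.1.
       name: str                                      # e.g. "VC0" or "AC0"
       media_type: str                                # "video" or "audio"
       purpose: str = "main"                          # {main, presentation}
       composed: bool = False                         # mix (audio) or composition (video)
       switched: bool = False                         # dynamic "most appropriate" subset
       audio_channel_format: Optional[str] = None     # {mono, stereo}, audio captures only
       point_of_capture: Optional[Point] = None
       area_of_capture: Optional[List[Point]] = None  # four co-planar (X, Y, Z) points
       encoding_group: Optional[str] = None           # encodeGroupID of the associated group

   @dataclass
   class CaptureSet:
       # Each entry is one alternative representation of the same scene; all
       # captures within an entry are of the same media type.
       entries: List[List[MediaCapture]] = field(default_factory=list)

For instance, the three-screen endpoint of Section 11.1 would populate one such capture set for its people media and a second capture set for presentation media.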
498 A provider can express spatial relationships between media captures 499 that are included in the same capture set. But there is no spatial 500 relationship between media captures that are in different capture 501 sets. 503 A capture set is most usefully thought of as being a collection of 504 entries, with each entry being a list of media captures. In grouping 505 multiple media captures together within a capture set entry, the 506 provider is signaling that those captures together form a 507 representation of that capture set's scene. Media captures within 508 the same capture set entry must be of the same media type - it is not 509 possible to mix audio and video captures in the same capture set 510 entry, for instance. The provider must be capable of encoding and 511 sending all media captures in a single entry simultaneously. 513 When a provider advertises a capture set with multiple entries, it is 514 essentially signaling that there are multiple representations of the 515 same scene available. In some cases, these multiple representations 516 would typically be used simultaneously (for instance a "video entry" 517 and an "audio entry"). In some cases the entries would conceptually 518 be alternatives (for instance an entry consisting of 3 video captures 519 versus an entry consisting of just a single video capture). In this 520 latter example, the provider would in the simple case end up 521 providing to the consumer the entry containing the number of video 522 captures that most closely matched the media consumer's number of 523 display devices. 525 The following is an example of 4 potential capture set entries for an 526 endpoint-style media provider: 528 1. (VC0, VC1, VC2) - left, center and right camera video captures 530 2. (VC3) - video capture associated with loudest room segment 532 3. (VC4) - video capture zoomed out view of all people in the room 534 4. (AC0) - main audio 536 The first entry in this capture set example is a list of video 537 captures with a spatial relationship to each other. Determination of 538 the order of these captures (VC0, VC1 and VC2) for rendering purposes 539 is accomplished through use of their Area of Capture attributes. The 540 second entry (VC3) and the third entry (VC4) are additional 541 alternatives of how to capture the same room in different ways. The 542 inclusion of the audio capture in the same capture set indicates that 543 AC0 is associated with those video captures, meaning it comes from 544 the same scene. The audio should be rendered in conjunction with any 545 rendered video captures from the same capture set (for instance, the 546 consumer should attempt to perform lip sync between all audio and 547 video captures from the same capture set). 549 6.2.1. Capture set attributes 551 Attributes can be applied to capture sets as well as to individual 552 media captures. Attributes specified at this level apply to all 553 constituent media captures. 555 Area of Scene attribute 557 The area of scene attribute for a capture set has the same format as 558 the area of capture attribute for a media capture. The area of scene 559 is for the entire scene, which is captured by the one or more media 560 captures in the capture set entries. 562 Scale attribute 564 An optional attribute indicating if the numbers used for area of 565 scene, area of capture and point of capture are in terms of 566 millimeters, unknown scale factor, or not any scale, as described in 567 Section 5. 
If any media captures have an area of capture attribute or point of capture attribute, then this scale attribute must also be defined. The possible values for this attribute are:

"millimeters"
"unknown"
"no scale"

6.3. Simultaneous Transmission Set Constraints

The provider may have constraints or limitations on its ability to send media captures. One type is caused by the physical limitations of capture mechanisms; these constraints are represented by a simultaneous transmission set. The second type of limitation reflects the encoding resources available - bandwidth and macroblocks/second. This type of constraint is captured by encoding groups, discussed below.

An endpoint or MCU can send multiple captures simultaneously, however sometimes there are constraints that limit which captures can be sent simultaneously with other captures. A device may not be able to be used in different ways at the same time. Provider advertisements are made so that the consumer will choose one of several possible mutually exclusive usages of the device. This type of constraint is expressed in a Simultaneous Transmission Set, which lists all the media captures that can be sent at the same time. This is easier to show in an example.

Consider the example of a room system where there are 3 cameras each of which can send a separate capture covering 2 persons each - VC0, VC1, VC2. The middle camera can also zoom out and show all 6 persons, VC3. But the middle camera cannot be used in both modes at the same time - it has to either show the space where 2 participants sit or the whole 6 seats, but not both at the same time.

Simultaneous transmission sets are expressed as sets of the MCs that could physically be transmitted at the same time (though it may not make sense to do so). In this example the two simultaneous sets are shown in Table 1. The consumer must make sure that it chooses one and not more of the mutually exclusive sets.

   +-------------------+
   | Simultaneous Sets |
   +-------------------+
   | {VC0, VC1, VC2}   |
   | {VC0, VC3, VC2}   |
   +-------------------+

   Table 1: Two Simultaneous Transmission Sets

The Simultaneous Transmission Sets MUST allow all the Media Captures in a particular capture set entry to be used simultaneously.

7. Encodings

We have considered how providers can describe the content of media to consumers. We will now consider how the providers communicate information about their abilities to send streams. We introduce two constructs - individual encodings and encoding groups. Consumers will then map the media captures they want onto the encodings with encoding parameters they want. This process is described below.

7.1. Individual Encodings

An individual encoding represents a way to encode a media capture to become an encoded media stream sent from the media provider to the media consumer. An individual encoding has a set of parameters characterizing how the media is encoded. Different media types have different parameters, and different encoding algorithms may have different parameters. An individual encoding can be used for only one actual encoded media stream at a time.

The parameters of an individual encoding represent the maximum values for certain aspects of the encoding.
A particular 641 instantiation into an encoded stream might use lower values than 642 these maximums. 644 The following tables show the variables for audio and video encoding. 646 +--------------+----------------------------------------------------+ 647 | Name | Description | 648 +--------------+----------------------------------------------------+ 649 | encodeID | A unique identifier for the individual encoding | 650 | maxBandwidth | Maximum number of bits per second | 651 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 652 | | + 15) / 16) * ((height + 15) / 16) * | 653 | | framesPerSecond | 654 | maxWidth | Video resolution's maximum supported width, | 655 | | expressed in pixels | 656 | maxHeight | Video resolution's maximum supported height, | 657 | | expressed in pixels | 658 | maxFrameRate | Maximum supported frame rate | 659 +--------------+----------------------------------------------------+ 661 Table 2: Individual Video Encoding Parameters 663 +--------------+-----------------------------------+ 664 | Name | Description | 665 +--------------+-----------------------------------+ 666 | maxBandwidth | Maximum number of bits per second | 667 +--------------+-----------------------------------+ 669 Table 3: Individual Audio Encoding Parameters 671 7.2. Encoding Group 673 An encoding group includes a set of one or more individual encodings, 674 plus some parameters that apply to the group as a whole. By grouping 675 multiple individual encodings together, an encoding group describes 676 additional constraints on bandwidth and other parameters for the 677 group. Table 4 shows the parameters and individual encoding sets 678 that are part of an encoding group. 680 +-------------------+-----------------------------------------------+ 681 | Name | Description | 682 +-------------------+-----------------------------------------------+ 683 | encodeGroupID | A unique identifier for the encoding group | 684 | maxGroupBandwidth | Maximum number of bits per second relating to | 685 | | all encodings combined | 686 | maxGroupH264Mbps | Maximum number of macroblocks per second | 687 | | relating to all video encodings combined | 688 | videoEncodings[] | Set of potential encodings (list of | 689 | | encodeIDs) | 690 | audioEncodings[] | Set of potential encodings (list of | 691 | | encodeIDs) | 692 +-------------------+-----------------------------------------------+ 694 Table 4: Encoding Group 696 When the individual encodings in a group are instantiated into actual 697 encoded media streams, each stream has a bandwidth that must be less 698 than or equal to the maxBandwidth for the particular individual 699 encoding. The maxGroupBandwidth parameter gives the additional 700 restriction that the sum of all the individual instantiated 701 bandwidths must be less than or equal to the maxGroupBandwidth value. 703 Likewise, the sum of the macroblocks per second of each instantiated 704 encoding in the group must not exceed the maxGroupH264Mbps value. 706 The following diagram illustrates the structure of a media provider's 707 Encoding Groups and their contents. 709 ,-------------------------------------------------. 710 | Media Provider | 711 | | 712 | ,--------------------------------------. | 713 | | ,--------------------------------------. | 714 | | | ,--------------------------------------. | 715 | | | | Encoding Group | | 716 | | | | ,-----------. | | 717 | | | | | | ,---------. 
| | 718 | | | | | | | | ,---------.| | 719 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 720 | `.| | | | | | `---------'| | 721 | `.| `-----------' `---------' | | 722 | `--------------------------------------' | 723 `-------------------------------------------------' 725 Figure 1: Encoding Group Structure 727 A media provider advertises one or more encoding groups. Each 728 encoding group includes one or more individual encodings. Each 729 individual encoding can represent a different way of encoding media. 730 For example one individual encoding may be 1080p60 video, another 731 could be 720p30, with a third being CIF. 733 While a typical 3 codec/display system might have one encoding group 734 per "codec box", there are many possibilities for the number of 735 encoding groups a provider may be able to offer and for the encoding 736 values in each encoding group. 738 There is no requirement for all encodings within an encoding group to 739 be instantiated at once. 741 8. Associating Media Captures with Encoding Groups 743 Every media capture is associated with an encoding group, which is 744 used to instantiate that media capture into one or more encoded 745 streams. Each media capture has an encoding group attribute. The 746 value of this attribute is the encodeGroupID for the encoding group 747 with which it is associated. More than one media capture may use the 748 same encoding group. 750 The maximum number of streams that can result from a particular 751 encoding group constraint is equal to the number of individual 752 encodings in the group. The actual number of streams used at any 753 time may be less than this maximum. Any of the media captures that 754 use a particular encoding group can be encoded according to any of 755 the individual encodings in the group. If there are multiple 756 individual encodings in the group, then a single media capture can be 757 encoded into multiple different streams at the same time, with each 758 stream following the constraints of a different individual encoding. 760 The Encoding Groups MUST allow all the media captures in a particular 761 capture set entry to be used simultaneously. 763 9. Consumer's Choice of Streams to Receive from the Provider 765 After receiving the provider's advertised media captures and 766 associated constraints, the consumer must choose which media captures 767 it wishes to receive, and which individual encodings from the 768 provider it wants to use to encode the capture. Each media capture 769 has an encoding group ID attribute which specifies which individual 770 encodings are available to be used for that media capture. 772 For each media capture the consumer wants to receive, it configures 773 one or more of the encodings in that capture's encoding group. The 774 consumer does this by telling the provider the resolution, frame 775 rate, bandwidth, etc. when asking for streams for its chosen 776 captures. Upon receipt of this configuration command from the 777 consumer, the provider generates streams for each such configured 778 encoding and sends those streams to the consumer. 780 The consumer must have received at least one capture advertisement 781 from the provider to be able to configure the provider's generation 782 of media streams. 784 The consumer is able to change its configuration of the provider's 785 encodings any number of times during the call, either in response to 786 a new capture advertisement from the provider or autonomously. 
The consumer need not send a new configure message to the provider when it receives a new capture advertisement from the provider unless the contents of the new capture advertisement cause the consumer's current configure message to become invalid.

When choosing which streams to receive from the provider, and the encoding characteristics of those streams, the consumer needs to take several things into account: its local preference, simultaneity restrictions, and encoding limits.

9.1. Local preference

A variety of local factors will influence the consumer's choice of streams to be received from the provider:

o if the consumer is an endpoint, it is likely that it would choose, where possible, to receive video and audio captures that match the number of display devices and audio system it has

o if the consumer is a middle box such as an MCU, it may choose to receive loudest speaker streams (in order to perform its own media composition) and avoid pre-composed video captures

o user choice (for instance, selection of a new layout) may result in a different set of media captures, or different encoding characteristics, being required by the consumer

9.2. Physical simultaneity restrictions

There may be physical simultaneity constraints imposed by the provider that affect the provider's ability to simultaneously send all of the captures the consumer would wish to receive. For instance, a middle box such as an MCU, when connected to a multi-camera room system, might prefer to receive both individual camera streams of the people present in the room and an overall view of the room from a single camera. Some endpoint systems might be able to provide both of these sets of streams simultaneously, whereas others may not (if the overall room view were produced by changing the zoom level on the center camera, for instance).

9.3. Encoding and encoding group limits

Each of the provider's encoding groups has limits on bandwidth and macroblocks per second, and the constituent potential encodings have limits on the bandwidth, macroblocks per second, video frame rate, and resolution that can be provided. When choosing the media captures to be received from a provider, a consumer device must ensure that the encoding characteristics requested for each individual media capture fit within the capability of the encoding it is being configured to use, as well as ensuring that the combined encoding characteristics for media captures fit within the capabilities of their associated encoding groups. In some cases, this could cause an otherwise "preferred" choice of streams to be passed over in favor of different streams - for instance, if a set of 3 media captures could only be provided at a low resolution then a 3 screen device could switch to favoring a single, higher quality, stream.

9.4. Message Flow

The following diagram shows the basic flow of messages between a media provider and a media consumer. The usage of the "capture advertisement" and "configure encodings" messages is described above.

The consumer also sends its own capability message to the provider which may contain information about its own capabilities or restrictions.
Diagram for Message Flow

   Media Consumer                        Media Provider
   --------------                        --------------
         |                                     |
         |----- Consumer Capability ---------->|
         |                                     |
         |                                     |
         |<---- Capture advertisement ---------|
         |                                     |
         |                                     |
         |------ Configure encodings --------->|
         |                                     |

In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, there is a potential use for the consumer, at the start of CLUE, to be able to inform the provider of its capabilities. One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand and so the provider would be able to reduce the capture set it advertises to be tailored to the consumer.

TBD - the content of this message needs to be better defined. The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

One of the most important characteristics of the Framework is its extensibility. Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop. The standard for interoperability and handling multiple streams must be future-proof.

The framework itself is inherently extensible through expanding the data model types. For example:

o Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

o Adding new functionality, such as 3-D, will require additional attributes describing the captures.

o Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

The infrastructure is designed to be extended rather than requiring new infrastructure elements. Extension comes through adding to defined types.

Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

This section shows in more detail some examples of how to use the framework to represent a typical case for telepresence rooms. First an endpoint is illustrated, then an MCU case is shown.

11.1. Three screen endpoint media provider

Consider an endpoint with the following description:

o 3 cameras, 3 displays, a 6 person table

o Each video device can provide one capture for each 1/3 section of the table

o A single capture representing the active speaker can be provided

o A single capture representing the active speaker with the other 2 captures shown picture in picture within the stream can be provided

o A capture showing a zoomed out view of all 6 seats in the room can be provided

The audio and video captures for this endpoint can be described as follows.
937 Video Captures: 939 o VC0- (the camera-left camera stream), encoding group=EG0, 940 purpose=main;auto-switched:no 942 o VC1- (the center camera stream), encoding group=EG1, purpose=main; 943 auto-switched:no 945 o VC2- (the camera-right camera stream), encoding group=EG2, 946 purpose=main;auto-switched:no 948 o VC3- (the loudest panel stream), encoding group=EG1, 949 purpose=main;auto-switched:yes 951 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 952 purpose=main; composed=true; auto-switched:yes 954 o VC5- (the zoomed out view of all people in the room), encoding 955 group=EG1, purpose=main; composed=no; auto-switched:no 957 o VC6- (presentation stream), encoding group=EG1, 958 purpose=presentation;auto-switched:no 960 The following diagram is a top view of the room with 3 cameras, 3 961 displays, and 6 seats. Each camera is capturing 2 people. The six 962 seats are not all in a straight line. 964 ,-. d 965 ( )`--.__ +---+ 966 `-' / `--.__ | | 967 ,-. | `-.._ |_-+Camera 2 (VC2) 968 ( ).' ___..-+-''`+-+ 969 `-' |_...---'' | | 970 ,-.c+-..__ +---+ 971 ( )| ``--..__ | | 972 `-' | ``+-..|_-+Camera 1 (VC1) 973 ,-. | __..--'|+-+ 974 ( )| __..--' | | 975 `-'b|..--' +---+ 976 ,-. |``---..___ | | 977 ( )\ ```--..._|_-+Camera 0 (VC0) 978 `-' \ _..-''`-+ 979 ,-. \ __.--'' | | 980 ( ) |..-'' +---+ 981 `-' a 983 The two points labeled b and c are intended to be at the midpoint 984 between the seating positions, and where the fields of view of the 985 cameras intersect. 986 The plane of interest for VC0 is a vertical plane that intersects 987 points 'a' and 'b'. 988 The plane of interest for VC1 intersects points 'b' and 'c'. 989 The plane of interest for VC2 intersects points 'c' and 'd'. 990 This example uses an area scale of millimeters. 992 Areas of capture: 993 bottom left bottom right top left top right 994 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 995 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 996 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 997 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 998 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 999 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1000 VC6 none 1002 Points of capture: 1003 VC0 (-1678,0,800) 1004 VC1 (0,0,800) 1005 VC2 (1678,0,800) 1006 VC3 none 1007 VC4 none 1008 VC5 (0,0,800) 1009 VC6 none 1011 In this example, the right edge of the VC0 area lines up with the 1012 left edge of the VC1 area. It doesn't have to be this way. There 1013 could be a gap or an overlap. One additional thing to note for this 1014 example is the distance from a to b is equal to the distance from b 1015 to c and the distance from c to d. All these distances are 1346 mm. 1016 This is the planar width of each area of capture for VC0, VC1, and 1017 VC2. 1019 Note the text in parentheses (e.g. "the camera-left camera stream") 1020 is not explicitly part of the model, it is just explanatory text for 1021 this example, and is not included in the model with the media 1022 captures and attributes. 
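The area of capture values above are what let a consumer recover the camera-left to camera-right ordering of VC0, VC1 and VC2. Purely as a non-normative illustration (the framework defines no concrete syntax, so the data layout here is an assumption), a consumer might derive that ordering as follows:

   # Area of Capture corners from the example above, scale = millimeters.
   # Each capture maps to (bottom left, bottom right, top left, top right).
   areas_of_capture = {
       "VC0": ((-2011, 2850, 0), (-673, 3000, 0), (-2011, 2850, 757), (-673, 3000, 757)),
       "VC1": ((-673, 3000, 0), (673, 3000, 0), (-673, 3000, 757), (673, 3000, 757)),
       "VC2": ((673, 3000, 0), (2011, 2850, 0), (673, 3000, 757), (2011, 3000, 757)),
   }

   # X increases from camera-left to camera-right (Section 5), so sorting by the
   # X coordinate of the bottom left corner gives the left-to-right render order.
   ordered = sorted(areas_of_capture, key=lambda vc: areas_of_capture[vc][0][0])
   print(ordered)   # ['VC0', 'VC1', 'VC2']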
Audio Captures:

o AC0 (camera-left), encoding group=EG3, purpose=main, channel format=mono

o AC1 (camera-right), encoding group=EG3, purpose=main, channel format=mono

o AC2 (center), encoding group=EG3, purpose=main, channel format=mono

o AC3 being a simple pre-mixed audio stream from the room (mono), encoding group=EG3, purpose=main, channel format=mono

o AC4 audio stream associated with the presentation video (mono), encoding group=EG3, purpose=presentation, channel format=mono

Areas of capture:
        bottom left       bottom right      top left            top right
   AC0  (-2011,2850,0)    (-673,3000,0)     (-2011,2850,757)    (-673,3000,757)
   AC1  ( 673,3000,0)     (2011,2850,0)     ( 673,3000,757)     (2011,3000,757)
   AC2  ( -673,3000,0)    ( 673,3000,0)     ( -673,3000,757)    ( 673,3000,757)
   AC3  (-2011,2850,0)    (2011,2850,0)     (-2011,2850,757)    (2011,3000,757)
   AC4  none

The physical simultaneity information is:

{VC0, VC1, VC2, VC3, VC4, VC6}

{VC0, VC2, VC5, VC6}

This constraint indicates it is not possible to use all the VCs at the same time. VC5 cannot be used at the same time as VC1 or VC3 or VC4. Also, using every member in the set simultaneously may not make sense - for example VC3 (loudest) and VC4 (loudest with PIP). (In addition, there are encoding constraints that make choosing all of the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1 has only 3 ENCs. This constraint shows up in the encoding groups, not in the simultaneous transmission sets.)

In this example there are no restrictions on which audio captures can be sent simultaneously.

Encoding Groups:

This example has three encoding groups associated with the video captures. Each group can have 3 encodings, but with each potential encoding having a progressively lower specification. In this example, 1080p60 transmission is possible (as ENC0 has a maxH264Mbps value compatible with that: by the formula in Table 2, 1920x1088 at 60 fps gives ((1920+15)/16) * ((1088+15)/16) * 60 = 120 * 68 * 60 = 489600 macroblocks per second) as long as it is the only active encoding in the group (as maxGroupH264Mbps for the entire encoding group is also 489600). Significantly, as up to 3 encodings are available per group, it is possible to transmit some video captures simultaneously that are not in the same entry in the capture set. For example VC1 and VC3 at the same time.

It is also possible to transmit multiple encodings of a single video capture. For example VC0 can be encoded using ENC0 and ENC1 at the same time, as long as the encoding parameters satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30.
encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

Figure 2: Example Encoding Groups for Video

For audio, there are five potential encodings available, so all five audio captures can be encoded at the same time.

encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
   encodeID=ENC9, maxBandwidth=64000
   encodeID=ENC10, maxBandwidth=64000
   encodeID=ENC11, maxBandwidth=64000
   encodeID=ENC12, maxBandwidth=64000
   encodeID=ENC13, maxBandwidth=64000

Figure 3: Example Encoding Group for Audio

Capture Sets:

The following table represents the capture sets for this provider. Recall that a capture set is composed of alternative captures covering the same scene. Capture Set #1 is for the main people captures, and Capture Set #2 is for presentation.

Each row in the table is a separate entry in the capture set.

   +----------------+
   | Capture Set #1 |
   +----------------+
   | VC0, VC1, VC2  |
   | VC3            |
   | VC4            |
   | VC5            |
   | AC0, AC1, AC2  |
   | AC3            |
   +----------------+

   +----------------+
   | Capture Set #2 |
   +----------------+
   | VC6            |
   | AC4            |
   +----------------+

Different capture sets are distinct from each other and non-overlapping. A consumer can choose an entry from each capture set. In this case the three captures VC0, VC1, and VC2 are one way of representing the video from the endpoint. These three captures should appear adjacent to each other. Alternatively, another way of representing the Capture Scene is with the capture VC3, which automatically shows the person who is talking. Similarly for the VC4 and VC5 alternatives.

As in the video case, the different entries of audio in Capture Set #1 represent the "same thing", in that one way to receive the audio is with the 3 audio captures (AC0, AC1, AC2), and another way is with the mixed AC3. The Media Consumer can choose an audio capture entry it is capable of receiving.

The spatial ordering is conveyed by the area of capture and point of capture media capture attributes.

A Media Consumer would likely want to choose a capture set entry to receive based in part on how many streams it can simultaneously receive.
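As an illustration only (the selection policy and all names below are assumptions, not part of the framework), a consumer given this advertisement might pick a video entry to match its screen count and then check its chosen encodings against the group-wide limits of Section 7.2:

   # Entries of Capture Set #1 that contain video captures.
   video_entries = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]

   def pick_entry(entries, num_screens):
       # Prefer the largest entry that does not exceed the number of screens;
       # Section 11.4 discusses more elaborate consumer policies.
       candidates = [e for e in entries if len(e) <= num_screens]
       return max(candidates, key=len) if candidates else min(entries, key=len)

   def fits_group(chosen, max_group_bw, max_group_mbps):
       # chosen: (bandwidth, macroblocks per second) for each instantiated encoding.
       # This checks only the group-wide limits; per-encoding limits such as
       # maxH264Mbps and maxBandwidth must also be respected individually.
       return (sum(bw for bw, _ in chosen) <= max_group_bw and
               sum(mbps for _, mbps in chosen) <= max_group_mbps)

   print(pick_entry(video_entries, 3))   # ['VC0', 'VC1', 'VC2']
   print(pick_entry(video_entries, 1))   # ['VC3']
   # One 1080p30 stream (244800 macroblocks/s) plus one 720p30 stream (108000)
   # checked against the EG0/EG1/EG2 group limits of Figure 2:
   print(fits_group([(4000000, 244800), (2000000, 108000)], 6000000, 489600))   # True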
A consumer that can receive three people streams would 1169 probably prefer to receive the first entry of Capture Set #1 (VC0, 1170 VC1, VC2) and not receive the other entries. A consumer that can 1171 receive only one people stream would probably choose one of the other 1172 entries. 1174 If the consumer can receive a presentation stream too, it would also 1175 choose to receive the only entry from Capture Set #2 (VC6). 1177 11.2. Encoding Group Example 1179 This is an example of an encoding group to illustrate how it can 1180 express dependencies between encodings. 1182 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1183 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1184 maxH264Mbps=244800, maxBandwidth=4000000 1185 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1186 maxH264Mbps=244800, maxBandwidth=4000000 1187 encodeID=AUDENC0, maxBandwidth=96000 1188 encodeID=AUDENC1, maxBandwidth=96000 1189 encodeID=AUDENC2, maxBandwidth=96000 1191 Here, the encoding group is EG0. It can transmit up to two 1080p30 1192 encodings (Mbps for 1080p = 244800), but it is capable of 1193 transmitting a maxFrameRate of 60 frames per second (fps). To 1194 achieve the maximum resolution (1920 x 1088) the frame rate is 1195 limited to 30 fps. However 60 fps can be achieved at a lower 1196 resolution if required by the consumer. Although the encoding group 1197 is capable of transmitting up to 6Mbit/s, no individual video 1198 encoding can exceed 4Mbit/s. 1200 This encoding group also allows up to 3 audio encodings, AUDENC<0-2>. 1201 It is not required that audio and video encodings reside within the 1202 same encoding group, but if so then the group's overall maxBandwidth 1203 value is a limit on the sum of all audio and video encodings 1204 configured by the consumer. A system that does not wish or need to 1205 combine bandwidth limitations in this way should instead use separate 1206 encoding groups for audio and video in order for the bandwidth 1207 limitations on audio and video to not interact. 1209 Audio and video can be expressed in separate encoding groups, as in 1210 this illustration. 1212 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1213 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1214 maxH264Mbps=244800, maxBandwidth=4000000 1215 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1216 maxH264Mbps=244800, maxBandwidth=4000000 1218 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000 1219 encodeID=AUDENC0, maxBandwidth=96000 1220 encodeID=AUDENC1, maxBandwidth=96000 1221 encodeID=AUDENC2, maxBandwidth=96000 1223 11.3. The MCU Case 1225 This section shows how an MCU might express its Capture Sets, 1226 intending to offer different choices for consumers that can handle 1227 different numbers of streams. A single audio capture stream is 1228 provided for all single and multi-screen configurations that can be 1229 associated (e.g. lip-synced) with any combination of video captures 1230 at the consumer. 

11.3. The MCU Case

This section shows how an MCU might express its Capture Sets,
intending to offer different choices for consumers that can handle
different numbers of streams.  A single audio capture stream is
provided for all single- and multi-screen configurations; it can be
associated (e.g., lip-synced) with any combination of the video
captures at the consumer.

+--------------------+---------------------------------------------+
| Capture Set #1     | note                                        |
+--------------------+---------------------------------------------+
| VC0                | video capture for single screen consumer   |
| VC1, VC2           | video capture for 2 screen consumer        |
| VC3, VC4, VC5      | video capture for 3 screen consumer        |
| VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
| AC0                | audio capture representing all participants|
+--------------------+---------------------------------------------+

If/when a presentation stream becomes active within the conference,
the MCU might re-advertise the available media as:

+----------------+--------------------------------------+
| Capture Set #2 | note                                 |
+----------------+--------------------------------------+
| VC10           | video capture for presentation       |
| AC1            | presentation audio to accompany VC10 |
+----------------+--------------------------------------+

11.4. Media Consumer Behavior

This section gives an example of how a media consumer might behave
when deciding how to request streams from the three screen endpoint
described above.

The receive side of a call needs to balance its requirements (based
on its number of screens and speakers, its decoding capabilities,
and the available bandwidth) against the provider's capabilities in
order to optimally configure the provider's streams.  Typically it
would want to receive and decode media from each capture set
advertised by the provider.

A sane, basic algorithm might be for the consumer to go through each
capture set in turn and find the collection of video captures that
best matches the number of screens it has (this might include
consideration of screens dedicated to presentation video display
rather than "people" video) and then decide between alternative
entries in the video capture sets based either on hard-coded
preferences or user choice.  Once this choice has been made, the
consumer would then decide how to configure the provider's encoding
groups in order to make best use of the available network bandwidth
and its own decoding capabilities.
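
As a rough, non-normative illustration of that last step, the
following Python sketch divides one encoding group's capability
evenly across the captures the consumer has chosen.  The names, the
even split, and the derived figures are assumptions made purely for
illustration and are not part of the framework.

# Non-normative sketch: split an encoding group's capability evenly
# across the N captures chosen from a capture set entry.
def per_stream_budget(group_mbps, group_bandwidth, n_streams, call_bandwidth):
    """Return (macroblocks per second, bits per second) per stream."""
    usable_bandwidth = min(group_bandwidth, call_bandwidth)
    return group_mbps // n_streams, usable_bandwidth // n_streams

# A three-screen consumer on a 6 Mbit/s call, using the example group
# with maxGroupH264Mbps=489600 and maxGroupBandwidth=6000000:
mbps, bw = per_stream_budget(489600, 6000000, 3, 6000000)
print(mbps, bw)   # 163200 macroblocks/s and 2000000 bit/s per stream
# 163200 macroblocks/s is enough for, e.g., 1280x720 at 30 fps
# (80 x 45 = 3600 macroblocks per frame; 3600 * 30 = 108000 <= 163200).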

11.4.1. One screen consumer

VC3, VC4 and VC5 are all different entries by themselves, not
grouped together in a single entry, so the receiving device should
choose between one of those.  The choice would come down to whether
to see the greatest number of participants simultaneously at roughly
equal precedence (VC5), a switched view of just the loudest region
(VC3), or a switched view with PiPs (VC4).  An endpoint device with
a small amount of knowledge of these differences could offer a
dynamic choice of these options, in-call, to the user.

11.4.2. Two screen consumer configuring the example

Combining systems that have an even number of screens ("2n") with
systems that have an odd number of cameras ("2n+1"), and vice versa,
is always likely to be the problematic case.  In this instance, the
behavior is likely to be determined by whether a "2 screen" system
is really a "2 decoder" system, i.e., whether only one received
stream can be displayed per screen or whether more than 2 streams
can be received and spread across the available screen area.  To
enumerate 3 possible behaviors here for the 2 screen system when it
learns that the far end is "ideally" expressed via 3 capture
streams:

1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
    per the 1 screen consumer case above) and either leave one
    screen blank or use it for presentation if/when a presentation
    becomes active.

2.  Receive 3 streams (VC0, VC1 and VC2) and display them across the
    2 screens, either with each capture scaled to 2/3 of a screen
    and the centre capture split across the 2 screens, or, as would
    be necessary if the screens had large bezels, with each stream
    scaled to 1/2 the screen width and height and a 4th "blank"
    panel left over.  This 4th panel could potentially be used for
    any presentation that became active during the call.

3.  Receive 3 streams, decode all 3, and use control information
    indicating which was the most active to switch between showing
    the left and centre streams (one per screen) and the centre and
    right streams.

For an endpoint capable of all 3 methods of working described above,
it might again be appropriate to offer the user the choice of
display mode.

11.4.3. Three screen consumer configuring the example

This is the most straightforward case: the consumer would look to
identify a set of streams to receive that best matches its available
screens, so the entry VC0, VC1, VC2 should match optimally.  The
spatial ordering would give sufficient information for the correct
video capture to be shown on the correct screen.  The consumer would
then either divide a single encoding group's capability by 3 to
determine what resolution and frame rate to configure the provider
with, or configure the individual video captures' encoding groups
with whatever makes most sense, taking into account the receive side
decode capabilities, the overall call bandwidth, the resolution of
the screens, and any user preferences such as motion versus
sharpness.

12. Acknowledgements

Mark Gorzynski contributed much to the approach.  We want to thank
Stephen Botzko for helpful discussions on audio.

13. IANA Considerations

TBD

14. Security Considerations

TBD

15. Informative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
           A., Peterson, J., Sparks, R., Handley, M., and E.
           Schooler, "SIP: Session Initiation Protocol", RFC 3261,
           June 2002.

[RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
           Jacobson, "RTP: A Transport Protocol for Real-Time
           Applications", STD 64, RFC 3550, July 2003.

[RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
           Session Initiation Protocol (SIP)", RFC 4353,
           February 2006.

[RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
           January 2008.

Appendix A. Open Issues

A.1. Video layout arrangements and centralized composition

In the context of a conference with a central MCU, there has been
discussion about a consumer requesting the provider to provide a
certain type of layout arrangement or to perform a certain
composition algorithm, such as combining some number of most recent
talkers, or producing a video layout using a 2x2 grid or 1 large
cell with 5 smaller cells around it.  The current framework does not
address this.
It isn't clear whether this topic should be addressed in this
framework, in a different part of CLUE, or outside of CLUE
altogether.

A.2. Source is selectable

A Boolean variable.  True indicates that the media consumer can
request that a particular media source be mapped to a media capture.
The default is false.

TBD - how does the consumer make the request for a particular
source?  How does the consumer know what is available?  It needs to
be explained better how multiple media captures differ from a single
media capture with a choice of sources, and when each concept should
be used.

A.3. Media Source Selection

The use cases include a case where the person at a receiving
endpoint can request to receive media from a particular other
endpoint, for example in a multipoint call to request to receive the
video from a certain section of a certain room, whether or not
people there are talking.

TBD - this framework should address this case.  Maybe a roster list
of rooms or people in the conference is needed, with a mechanism to
select from the roster and associate the selection with media
captures.  This is different from selecting a particular media
capture from a capture set.  The mechanism to do this will probably
need to be different from selecting media captures based on capture
sets and attributes.

A.4. Endpoint requesting many streams from MCU

TBD - how to do VC (video capture) selection for a system where the
endpoint media consumers want to receive many streams and do their
own composition, rather than having the MCU do the transcoding and
composing.  An example is a 3 screen consumer that wants 3 large
loudest-speaker streams plus a number of small ones to render as
PiP.  It is unclear how the small ones are chosen; the choice could
potentially be made by either the endpoint or the MCU.  There are
other, more complicated examples as well.  Is the current framework
adequate to support this?

A.5. VAD (voice activity detection) tagging of audio streams

TBD - do we want VAD to be mandatory?  All audio streams originating
from a media provider must be tagged with VAD information.  This
tagging would include an overall energy value for the stream plus
information on which sections of the capture scene are "active".

Each audio stream which forms a constituent of an entry within a
capture set should include this tagging, and the energy value within
it should be calculated using a fixed, consistent algorithm.

When a system determines the most active area of a capture scene
(either "loudest", or determined by other means such as a button
press), it should convey that information to the corresponding media
stream consumer via any audio streams being sent within that capture
set.  Specifically, there should be a list of active coordinates and
their VAD characteristics within the audio stream, in addition to
the overall VAD information for the capture set.  This is to ensure
that all media stream consumers receive the same, consistent audio
energy information whichever audio capture or captures they choose
to receive for a capture set.  Additionally, coordinate information
can be mapped to video captures by a media stream consumer so that
it can perform "panel switching" if required.
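
Purely as an illustration of the kind of tagging being discussed
here, and not as a proposal for concrete syntax, per-stream VAD
information of this sort might be modelled along the following lines
(a hypothetical Python structure; all field names and units are
invented for this sketch):

# Illustration only: one possible shape for the VAD tagging described
# above.  Field names and units are hypothetical.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActiveRegion:
    point: Tuple[float, float, float]  # coordinates within the capture scene
    energy: int                        # energy value for that region

@dataclass
class VadTag:
    overall_energy: int                # overall energy for the stream
    active_regions: List[ActiveRegion] = field(default_factory=list)

# Example: a single region of the scene is currently active.
tag = VadTag(overall_energy=42,
             active_regions=[ActiveRegion(point=(0.0, 1.0, 2.0), energy=42)])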

A.6. Private Information

Do we want a way to include private information?

Authors' Addresses

Allyn Romanow
Cisco Systems
San Jose, CA 95134
USA

Email: allyn@cisco.com


Mark Duckworth (editor)
Polycom
Andover, MA 01810
US

Email: mark.duckworth@polycom.com


Andrew Pepperell
Langley, England
UK

Email: apeppere@gmail.com


Brian Baldino
Cisco Systems
San Jose, CA 95134
US

Email: bbaldino@cisco.com