idnits 2.17.1 draft-ietf-clue-framework-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1084 has weird spacing: '...om left bot...' == Line 1135 has weird spacing: '...om left bot...' -- The document date (May 25, 2012) is 4354 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 CLUE WG A. Romanow 3 Internet-Draft Cisco Systems 4 Intended status: Informational M. Duckworth, Ed. 5 Expires: November 26, 2012 Polycom 6 A. Pepperell 8 B. Baldino 9 Cisco Systems 10 May 25, 2012 12 Framework for Telepresence Multi-Streams 13 draft-ietf-clue-framework-05.txt 15 Abstract 17 This memo offers a framework for a protocol that enables devices in a 18 telepresence conference to interoperate by specifying the 19 relationships between multiple media streams. 21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on November 26, 2012. 38 Copyright Notice 40 Copyright (c) 2012 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . 
. . . . . 4 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 7 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 8 60 6. Media Captures and Capture Scenes . . . . . . . . . . . . . . 9 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 10 63 6.2. Capture Scene . . . . . . . . . . . . . . . . . . . . . . 12 64 6.2.1. Capture scene attributes . . . . . . . . . . . . . . . 14 65 6.2.2. Capture scene entry attributes . . . . . . . . . . . . 14 66 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 15 67 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 16 68 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 17 69 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 18 70 8. Associating Media Captures with Encoding Groups . . . . . . . 19 71 9. Consumer's Choice of Streams to Receive from the Provider . . 20 72 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 20 73 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 21 74 9.3. Encoding and encoding group limits . . . . . . . . . . . . 21 75 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 21 76 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 22 77 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 23 78 11.1. Three screen endpoint media provider . . . . . . . . . . . 23 79 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 29 80 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 30 81 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 30 82 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 31 83 11.4.2. Two screen consumer configuring the example . . . . . 31 84 11.4.3. Three screen consumer configuring the example . . . . 32 85 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 86 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 87 14. Security Considerations . . . . . . . . . . . . . . . . . . . 32 88 15. Changes Since Last Version . . . . . . . . . . . . . . . . . . 32 89 16. Informative References . . . . . . . . . . . . . . . . . . . . 33 90 Appendix A. Open Issues . . . . . . . . . . . . . . . . . . . . . 34 91 A.1. Video layout arrangements and centralized composition . . 34 92 A.2. Source is selectable . . . . . . . . . . . . . . . . . . . 34 93 A.3. Media Source Selection . . . . . . . . . . . . . . . . . . 35 94 A.4. Endpoint requesting many streams from MCU . . . . . . . . 35 95 A.5. VAD (voice activity detection) tagging of audio streams . 35 96 A.6. Private Information . . . . . . . . . . . . . . . . . . . 36 97 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 36 99 1. Introduction 101 Current telepresence systems, though based on open standards such as 102 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 103 other. A major factor limiting the interoperability of telepresence 104 systems is the lack of a standardized way to describe and negotiate 105 the use of the multiple streams of audio and video comprising the 106 media flows. This draft provides a framework for a protocol to 107 enable interoperability by handling multiple streams in a 108 standardized way. 
It is intended to support the use cases described 109 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 110 requirements in draft-ietf-clue-telepresence-requirements-01. 112 The solution described here is strongly focused on what is being done 113 today, rather than on a vision of future conferencing. At the same 114 time, the highest priority has been given to creating an extensible 115 framework to make it easy to accommodate future conferencing 116 functionality as it evolves. 118 The purpose of this effort is to make it possible to handle multiple 119 streams of media in such a way that a satisfactory user experience is 120 possible even when participants are using different vendor equipment, 121 and also when they are using devices with different types of 122 communication capabilities. Information about the relationship of 123 media streams at the provider's end must be communicated so that 124 streams can be chosen and audio/video rendering can be done in the 125 best possible manner. 127 There is no attempt here to dictate to the renderer what it should 128 do. What the renderer does is up to the renderer. 130 After the following Definitions, a short section introduces key 131 concepts. The body of the text comprises several sections about the 132 key elements of the framework, how a consumer chooses streams to 133 receive, and some examples. The appendix describe topics that are 134 under discussion for adding to the document. 136 2. Terminology 138 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 139 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 140 document are to be interpreted as described in RFC 2119 [RFC2119]. 142 3. Definitions 144 The definitions marked with an "*" are new; all the others are from 145 *Audio Capture: Media Capture for audio. Denoted as ACn. 147 Camera-Left and Right: For media captures, camera-left and camera- 148 right are from the point of view of a person observing the rendered 149 media. They are the opposite of stage-left and stage-right. 151 Capture Device: A device that converts audio and video input into an 152 electrical signal, in most cases to be fed into a media encoder. 153 Cameras and microphones are examples for capture devices. 155 *Capture Scene: a structure representing the scene that is captured 156 by a collection of capture devices. A capture scene includes 157 attributes and one or more capture scene entries, with each entry 158 including one or more media captures. 160 *Capture Scene Entry: a list of media captures of the same media type 161 that together form one way to represent the capture scene. 163 Conference: used as defined in [RFC4353], A Framework for 164 Conferencing within the Session Initiation Protocol (SIP). 166 *Individual Encoding: A variable with a set of attributes that 167 describes the maximum values of a single audio or video capture 168 encoding. The attributes include: maximum bandwidth- and for video 169 maximum macroblocks (for H.264), maximum width, maximum height, 170 maximum frame rate. 172 *Encoding Group: A set of encoding parameters representing a media 173 provider's encoding capabilities. Media stream providers formed of 174 multiple physical units, in each of which resides some encoding 175 capability, would typically advertise themselves to the remote media 176 stream consumer using multiple encoding groups. 
Within each encoding 177 group, multiple potential encodings are possible, with the sum of the 178 chosen encodings' characteristics constrained to being less than or 179 equal to the group-wide constraints. 181 Endpoint: The logical point of final termination through receiving, 182 decoding and rendering, and/or initiation through capturing, 183 encoding, and sending of media streams. An endpoint consists of one 184 or more physical devices which source and sink media streams, and 185 exactly one [RFC4353] Participant (which, in turn, includes exactly 186 one SIP User Agent). In contrast to an endpoint, an MCU may also 187 send and receive media streams, but it is not the initiator nor the 188 final terminator in the sense that Media is Captured or Rendered. 189 Endpoints can be anything from multiscreen/multicamera rooms to 190 handheld devices. 192 Front: the portion of the room closest to the cameras. In going 193 towards back you move away from the cameras. 195 MCU: Multipoint Control Unit (MCU) - a device that connects two or 196 more endpoints together into one single multimedia conference 197 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 198 tardy in requiring that media from the mixer be sent to EACH 199 participant. I think we have practical use cases where this is not 200 the case. But the bug (if it is one) is in 4353 and not herein.] 202 Media: Any data that, after suitable encoding, can be conveyed over 203 RTP, including audio, video or timed text. 205 *Media Capture: a source of Media, such as from one or more Capture 206 Devices. A Media Capture (MC) may be the source of one or more Media 207 streams. A Media Capture may also be constructed from other Media 208 streams. A middle box can express Media Captures that it constructs 209 from Media streams it receives. 211 *Media Consumer: an Endpoint or middle box that receives media 212 streams 214 *Media Provider: an Endpoint or middle box that sends Media streams 216 Model: a set of assumptions a telepresence system of a given vendor 217 adheres to and expects the remote telepresence system(s) also to 218 adhere to. 220 *Plane of Interest: The spatial plane containing the most relevant 221 subject matter. 223 Render: the process of generating a representation from a media, such 224 as displayed motion video or sound emitted from loudspeakers. 226 *Simultaneous Transmission Set: a set of media captures that can be 227 transmitted simultaneously from a Media Provider. 229 Spatial Relation: The arrangement in space of two objects, in 230 contrast to relation in time or other relationships. See also 231 Camera-Left and Right. 233 Stage-Left and Right: For media captures, stage-left and stage-right 234 are the opposite of camera-left and camera-right. For the case of a 235 person facing (and captured by) a camera, stage-left and stage-right 236 are from the point of view of that person. 238 *Stream: RTP stream as in [RFC3550]. 240 Stream Characteristics: the media stream attributes commonly used in 241 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 242 resolution, profile/level etc.) as well as CLUE specific attributes, 243 such as the ID of a capture or a spatial location. 245 Telepresence: an environment that gives non co-located users or user 246 groups a feeling of (co-located) presence - the feeling that a Local 247 user is in the same room with other Local users and the Remote 248 parties. 
The inclusion of Remote parties is achieved through 249 multimedia communication including at least audio and video signals 250 of high fidelity. 252 *Video Capture: Media Capture for video. Denoted as VCn. 254 Video composite: A single image that is formed from combining visual 255 elements from separate sources. 257 4. Overview of the Framework/Model 259 The CLUE framework specifies how multiple media streams are to be 260 handled in a telepresence conference. 262 The main goals include: 264 o Interoperability 266 o Extensibility 268 o Flexibility 270 Interoperability is achieved by the media provider describing the 271 relationships between media streams in constructs that are understood 272 by the consumer, who can then render the media. Extensibility is 273 achieved through abstractions and the generality of the model, making 274 it easy to add new parameters. Flexibility is achieved largely by 275 having the consumer choose what content and format it wants to 276 receive from what the provider is capable of sending. 278 A transmitting endpoint or MCU describes specific aspects of the 279 content of the media and the formatting of the media streams it can 280 send (advertisement); and the receiving end responds to the provider 281 by specifying which content and media streams it wants to receive 282 (configuration). The provider then transmits the asked for content 283 in the specified streams. 285 This advertisement and configuration occurs at call initiation but 286 may also happen at any time throughout the conference, whenever there 287 is a change in what the consumer wants or the provider can send. 289 An endpoint or MCU typically acts as both provider and consumer at 290 the same time, sending advertisements and sending configurations in 291 response to receiving advertisements. (It is possible to be just one 292 or the other.) 294 The data model is based around two main concepts: a capture and an 295 encoding. A media capture (MC), such as audio or video, describes 296 the content a provider can send. Media captures are described in 297 terms of CLUE-defined attributes, such as spatial relationships and 298 purpose of the capture. Providers tell consumers which media 299 captures they can provide, described in terms of the media capture 300 attributes. 302 A provider organizes its media captures that represent the same scene 303 into capture scenes. A consumer chooses which media captures it 304 wants to receive according to the capture scenes sent by the 305 provider. 307 In addition, the provider sends the consumer a description of the 308 streams it can send in terms of the media attributes of the stream, 309 in particular, well-known audio and video parameters such as 310 bandwidth, frame rate, macroblocks per second. 312 The provider also specifies constraints on its ability to provide 313 media, and the consumer must take these into account in choosing the 314 content and streams it wants. Some constraints are due to the 315 physical limitations of devices - for example, a camera may not be 316 able to provide zoom and non-zoom views simultaneously. Other 317 constraints are system based constraints, such as maximum bandwidth 318 and maximum macroblocks/second. 320 The following sections discuss these constructs and processes in 321 detail, followed by use cases showing how the framework specification 322 can be used. 324 5. 
Spatial Relationships 326 In order for a consumer to perform a proper rendering, it is often 327 necessary to provide spatial information about the streams it is 328 receiving. CLUE defines a coordinate system that allows media 329 providers to describe the spatial relationships of their media 330 captures to enable proper scaling and spatial rendering of their 331 streams. The coordinate system is based on a few principles: 333 o Simple systems which do not have multiple Media Captures to 334 associate spatially need not use the coordinate model. 336 o Coordinates can either be in real, physical units (millimeters), 337 have an unknown scale or have no physical scale. Systems which 338 know their physical dimensions should always provide those real- 339 world measurements. Systems which don't know specific physical 340 dimensions but still know relative distances should use 'unknown 341 scale'. 'No scale' is intended to be used where Media Captures 342 from different devices (with potentially different scales) will be 343 forwarded alongside one another (e.g. in the case of a middle 344 box). 346 * "millimeters" means the scale is in millimeters 348 * "Unknown" means the scale is not necessarily millimeters, but 349 the scale is the same for every capture in the capture scene. 351 * "No Scale" means the scale could be different for each capture 352 - an MCU provider that advertises two adjacent captures and 353 picks sources (which can change quickly) from different 354 endpoints might use this value; the scale could be different 355 and changing for each capture. But the areas of capture still 356 represent a spatial relation between captures. 358 o The coordinate system is Cartesian X, Y, Z with the origin at a 359 spot of the provider's choosing. The provider must use the same 360 origin for all coordinates within the same capture scene. 362 The direction of increasing coordinate values is: 363 X increases from camera left to camera right 364 Y increases from front to back 365 Z increases from low to high 367 6. Media Captures and Capture Scenes 369 This section describes how media providers can describe the content 370 of media to consumers. 372 6.1. Media Captures 374 Media captures are the fundamental representations of streams that a 375 device can transmit. What a Media Capture actually represents is 376 flexible: 378 o It can represent the immediate output of a physical source (e.g. 379 camera, microphone) or 'synthetic' source (e.g. laptop computer, 380 DVD player). 382 o It can represent the output of an audio mixer or video composer 384 o It can represent a concept such as 'the loudest speaker' 386 o It can represent a conceptual position such as 'the leftmost 387 stream' 389 To distinguish between multiple instances, video and audio captures 390 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 391 two different video captures and AC1 and AC2 refer to two different 392 audio captures. 394 Each Media Capture can be associated with attributes to describe what 395 it represents. 397 6.1.1. Media Capture Attributes 399 Media Capture Attributes describe static information about the 400 captures. A provider uses the media capture attributes to describe 401 the media captures to the consumer. The consumer will select the 402 captures it wants to receive. Attributes are defined by a variable 403 and its value. 
The currently defined attributes and their values 404 are: 406 Content: {slides, speaker, sl, main, alt} 408 A field with enumerated values which describes the role of the media 409 capture and can be applied to any media type. The enumerated values 410 are defined by [RFC4796]. The values for this attribute are the same 411 as the mediacnt values for the content attribute in [RFC4796]. This 412 attribute can have multiple values, for example content={main, 413 speaker}. 415 Composed: {true, false} 417 A field with a Boolean value which indicates whether or not the Media 418 Capture is a mix (audio) or composition (video) of streams. 420 This attribute is useful for a media consumer to avoid nesting a 421 composed video capture into another composed capture or rendering. 422 This attribute is not intended to describe the layout a media 423 provider uses when composing video streams. 425 Audio Channel Format: {mono, stereo} A field with enumerated values 426 which describes the method of encoding used for audio. 428 A value of 'mono' means the Audio Capture has one channel. 430 A value of 'stereo' means the Audio Capture has two audio channels, 431 left and right. 433 This attribute applies only to Audio Captures. A single stereo 434 capture is different from two mono captures that have a left-right 435 spatial relationship. A stereo capture maps to a single RTP stream, 436 while each mono audio capture maps to a separate RTP stream. 438 Switched: {true, false} 440 A field with a Boolean value which indicates whether or not the Media 441 Capture represents the (dynamic) most appropriate subset of a 442 'whole'. What is 'most appropriate' is up to the provider and could 443 be the active speaker, a lecturer or a VIP. 445 Point of Capture: {(X, Y, Z)} A field with a single Cartesian (X, Y, 446 Z) point value which describes the spatial location, virtual or 447 physical, of the capturing device (such as camera). 449 When the Point of Capture attribute is specified, it must include X, 450 Y and Z coordinates. If the point of capture is not specified, it 451 means the consumer should not assume anything about the spatial 452 location of the capturing device. Even if the provider specifies an 453 area of capture attribute, it does not need to specify the point of 454 capture. 456 Area of Capture: 458 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 459 Z3), top right(X4, Y4, Z4)} 461 A field with a set of four (X, Y, Z) points as a value which describe 462 the spatial location of what is being "captured". By comparing the 463 Area of Capture for different Media Captures within the same capture 464 scene a consumer can determine the spatial relationships between them 465 and render them correctly. 467 The four points should be co-planar. The four points form a 468 quadrilateral, not necessarily a rectangle. 470 The quadrilateral described by the four (X, Y, Z) points defines the 471 plane of interest for the particular media capture. 473 If the area of capture attribute is specified, it must include X, Y 474 and Z coordinates for all four points. If the area of capture is not 475 specified, it means the media capture is not spatially related to any 476 other media capture (but this can change in a subsequent provider 477 advertisement). 479 For a switched capture that switches between different sections 480 within a larger area, the area of capture should use coordinates for 481 the larger potential area. 
483 EncodingGroup: {} 485 A field with a value equal to the encodeGroupID of the encoding group 486 associated with the media capture. 488 6.2. Capture Scene 490 In order for a provider's individual media captures to be used 491 effectively by a consumer, the provider organizes the media captures 492 into capture scenes, with the structure and contents of these capture 493 scenes being sent from the provider to the consumer. 495 A capture scene is a structure representing the scene that is 496 captured by a collection of capture devices. A capture scene 497 includes one or more capture scene entries, with each entry including 498 one or more media captures. A capture scene represents, for example, 499 the video image of a group of people seated next to each other, along 500 with the sound of their voices, which could be represented by some 501 number of VCs and ACs in the capture scene entries. A middle box may 502 also express capture scenes that it constructs from media streams it 503 receives. 505 A provider may advertise multiple capture scenes or just a single 506 capture scene. A media provider might typically use one capture 507 scene for main participant media and another capture scene for a 508 computer generated presentation. A capture scene may include more 509 than one type of media. For example, a capture scene can include 510 several capture scene entries for video captures, and several capture 511 scene entries for audio captures. 513 A provider can express spatial relationships between media captures 514 that are included in the same capture scene. But there is no spatial 515 relationship between media captures that are in different capture 516 scenes. 518 A media provider arranges media captures in a capture scene to help 519 the media consumer choose which captures it wants. The capture scene 520 entries in a capture scene are different alternatives the provider is 521 suggesting for representing the capture scene. The media consumer 522 can choose to receive all media captures from one capture scene entry 523 for each media type (e.g. audio and video), or it can pick and choose 524 media captures regardless of how the provider arranges them in 525 capture scene entries. 527 Media captures within the same capture scene entry must be of the 528 same media type - it is not possible to mix audio and video captures 529 in the same capture scene entry, for instance. The provider must be 530 capable of encoding and sending all media captures in a single entry 531 simultaneously. A consumer may decide to receive all the media 532 captures in a single capture scene entry, but a consumer could also 533 decide to receive just a subset of those captures. A consumer can 534 also decide to receive media captures from different capture scene 535 entries. 537 When a provider advertises a capture scene with multiple entries, it 538 is essentially signaling that there are multiple representations of 539 the same scene available. In some cases, these multiple 540 representations would typically be used simultaneously (for instance 541 a "video entry" and an "audio entry"). In some cases the entries 542 would conceptually be alternatives (for instance an entry consisting 543 of 3 video captures versus an entry consisting of just a single video 544 capture). In this latter example, the provider would in the simple 545 case end up providing to the consumer the entry containing the number 546 of video captures that most closely matched the media consumer's 547 number of display devices. 
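To make the relationship between capture scenes, capture scene entries and media captures concrete, the following non-normative sketch shows one hypothetical way an implementation might represent these structures internally, written here in Python purely for illustration. The names (MediaCapture, CaptureSceneEntry, CaptureScene) are invented for this sketch and are not defined by this framework.

   # Hypothetical internal representation of the advertisement data
   # model described above; not a normative part of the framework.
   from dataclasses import dataclass, field
   from typing import Dict, List

   @dataclass
   class MediaCapture:
       capture_id: str                  # e.g. "VC0" or "AC0"
       media_type: str                  # "video" or "audio"
       encoding_group_id: str           # encodeGroupID of the associated group
       attributes: Dict[str, object] = field(default_factory=dict)

   @dataclass
   class CaptureSceneEntry:
       captures: List[MediaCapture]

       def media_type(self) -> str:
           # Captures within one entry must all be of the same media type.
           types = {c.media_type for c in self.captures}
           assert len(types) == 1, "mixed media types in one entry"
           return types.pop()

   @dataclass
   class CaptureScene:
       description: str
       entries: List[CaptureSceneEntry] = field(default_factory=list)

   # A scene offering three spatially related video captures, a single
   # switched alternative, and one audio capture.
   scene = CaptureScene("main room", [
       CaptureSceneEntry([MediaCapture("VC0", "video", "EG0"),
                          MediaCapture("VC1", "video", "EG1"),
                          MediaCapture("VC2", "video", "EG2")]),
       CaptureSceneEntry([MediaCapture("VC3", "video", "EG1",
                                       {"switched": True})]),
       CaptureSceneEntry([MediaCapture("AC0", "audio", "EG3")]),
   ])
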
549 The following is an example of 4 potential capture scene entries for 550 an endpoint-style media provider: 552 1. (VC0, VC1, VC2) - left, center and right camera video captures 554 2. (VC3) - video capture associated with loudest room segment 556 3. (VC4) - video capture zoomed out view of all people in the room 558 4. (AC0) - main audio 560 The first entry in this capture scene example is a list of video 561 captures with a spatial relationship to each other. Determination of 562 the order of these captures (VC0, VC1 and VC2) for rendering purposes 563 is accomplished through use of their Area of Capture attributes. The 564 second entry (VC3) and the third entry (VC4) are additional 565 alternatives of how to capture the same room in different ways. The 566 inclusion of the audio capture in the same capture scene indicates 567 that AC0 is associated with those video captures, meaning it comes 568 from the same scene. The audio should be rendered in conjunction 569 with any rendered video captures from the same capture scene. 571 6.2.1. Capture scene attributes 573 Attributes can be applied to capture scenes as well as to individual 574 media captures. Attributes specified at this level apply to all 575 constituent media captures. 577 Description attribute 579 The description attribute is a human readable text string which 580 describes the capture scene. A provider that advertises multiple 581 capture scenes may use different descriptions to differentiate 582 between them. This attribute can contain text in any language. 584 Description Language attribute 586 This attribute contains only one language, which is the language of 587 the text in the description attribute. The possible values of this 588 element are the values of the 'Subtag' column of the "Language Subtag 589 Registry" at [IANA-Lan] originally defined in [RFC5646]. 591 Area of Scene attribute 593 The area of scene attribute for a capture scene has the same format 594 as the area of capture attribute for a media capture. The area of 595 scene is for the entire scene, which is captured by the one or more 596 media captures in the capture scene entries. If the provider does 597 not specify the area of scene, but does specify areas of capture, 598 then the consumer may assume the area of scene is greater than or 599 equal to the outer extents of the individual areas of capture. 601 Scale attribute 603 An optional attribute indicating if the numbers used for area of 604 scene, area of capture and point of capture are in terms of 605 millimeters, unknown scale factor, or not any scale, as described in 606 Section 5. If any media captures have an area of capture attribute 607 or point of capture attribute, then this scale attribute must also be 608 defined. The possible values for this attribute are: 610 "millimeters" 611 "unknown" 612 "no scale" 614 6.2.2. Capture scene entry attributes 616 Attributes can be applied to capture scene entries. Attributes 617 specified at this level apply to the capture scene entry as a whole. 619 Scene-switch-policy: {site-switch, segment-switch} 621 A media provider uses this scene-switch-policy attribute to indicate 622 its support for different switching policies. In the provider's 623 advertisement, this attribute can have multiple values, which means 624 the provider supports each of the indicated policies. 
The consumer, 625 when it requests media captures from this capture scene entry, should 626 also include this attribute but with only the single value (from 627 among the values indicated by the provider) indicating the consumer's 628 choice for which policy it wants the provider to use. If the 629 provider does not support any of these policies, it should omit this 630 attribute.

632 The "site-switch" policy means all captures are switched at the same 633 time to keep captures from the same endpoint site together. Let's 634 say the speaker is at site A and everyone else is at a "remote" site. 635 When the room at site A is shown, all the camera images from site A are 636 forwarded to the remote sites. Therefore at each receiving remote 637 site, all the screens display camera images from site A. This can be 638 used to preserve full size image display, and also provide full 639 visual context of the displayed far end, site A. In site switching, 640 there is a fixed relation between the cameras in each room and the 641 displays in remote rooms. The room or participants being shown is 642 switched from time to time based on who is speaking or by manual 643 control.

645 The "segment-switch" policy means different captures can switch at 646 different times, and can be coming from different endpoints. Still 647 using site A as where the speaker is, and "remote" to refer to all 648 the other sites, in segment switching, rather than sending all the 649 images from site A, only the image containing the speaker at site A 650 is shown. The camera images of the current speaker and previous 651 speakers (if any) are forwarded to the other sites in the conference. 652 Therefore the screens in each site are usually displaying images from 653 different remote sites - the current speaker at site A and the 654 previous ones. This strategy can be used to preserve full size image 655 display, and also capture the non-verbal communication between the 656 speakers. In segment switching, the display depends on the activity 657 in the remote rooms - generally, but not necessarily, based on audio / 658 speech detection.

660 6.3. Simultaneous Transmission Set Constraints

662 The provider may have constraints or limitations on its ability to 663 send media captures. One type is caused by the physical limitations 664 of capture mechanisms; these constraints are represented by a 665 simultaneous transmission set. The second type of limitation 666 reflects the encoding resources available - bandwidth and 667 macroblocks/second. This type of constraint is captured by encoding 668 groups, discussed below.

670 An endpoint or MCU can send multiple captures simultaneously; however, 671 sometimes there are constraints that limit which captures can be sent 672 simultaneously with other captures. A device may not be able to be 673 used in different ways at the same time. Provider advertisements are 674 made so that the consumer will choose one of several possible 675 mutually exclusive usages of the device. This type of constraint is 676 expressed in a Simultaneous Transmission Set, which lists all the 677 media captures that can be sent at the same time. This is easier to 678 show in an example.

680 Consider the example of a room system where there are 3 cameras, each 681 of which can send a separate capture covering 2 persons each - VC0, 682 VC1, VC2. The middle camera can also zoom out and show all 6 683 persons, VC3.
But the middle camera cannot be used in both modes at 684 the same time - it has to either show the space where 2 participants 685 sit or the whole 6 seats, but not both at the same time.

687 Simultaneous transmission sets are expressed as sets of the MCs that 688 could physically be transmitted at the same time (though it may not 689 make sense to do so). In this example the two simultaneous sets are 690 shown in Table 1. The consumer must make sure that it chooses one 691 and not more of the mutually exclusive sets. A consumer may choose 692 any subset of the media captures in a simultaneous set; it does not 693 have to choose all the captures in a simultaneous set if it does not 694 want to receive all of them.

696   +-------------------+
697   | Simultaneous Sets |
698   +-------------------+
699   | {VC0, VC1, VC2}   |
700   | {VC0, VC3, VC2}   |
701   +-------------------+

703   Table 1: Two Simultaneous Transmission Sets

705 A media provider includes the simultaneous sets in its provider 706 advertisement. These simultaneous set constraints apply across all 707 the capture scenes in the advertisement. The simultaneous 708 transmission sets MUST allow all the media captures in a particular 709 capture scene entry to be used simultaneously.

711 7. Encodings

713 We have considered how providers can describe the content of media to 714 consumers. We will now consider how the providers communicate 715 information about their abilities to send streams. We introduce two 716 constructs - individual encodings and encoding groups. Consumers 717 will then map the media captures they want onto the encodings with 718 encoding parameters they want. This process is then described.

720 7.1. Individual Encodings

722 An individual encoding represents a way to encode a media capture to 723 become an encoded media stream sent from the media provider to the 724 media consumer. An individual encoding has a set of parameters 725 characterizing how the media is encoded. Different media types have 726 different parameters, and different encoding algorithms may have 727 different parameters. An individual encoding can be used for only 728 one actual encoded media stream at a time.

730 The parameters of an individual encoding represent the maximum 731 values for certain aspects of the encoding. A particular 732 instantiation into an encoded stream might use lower values than 733 these maximums.

735 The following tables show the variables for audio and video encoding.
737 +--------------+----------------------------------------------------+ 738 | Name | Description | 739 +--------------+----------------------------------------------------+ 740 | encodeID | A unique identifier for the individual encoding | 741 | maxBandwidth | Maximum number of bits per second | 742 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 743 | | + 15) / 16) * ((height + 15) / 16) * | 744 | | framesPerSecond | 745 | maxWidth | Video resolution's maximum supported width, | 746 | | expressed in pixels | 747 | maxHeight | Video resolution's maximum supported height, | 748 | | expressed in pixels | 749 | maxFrameRate | Maximum supported frame rate | 750 +--------------+----------------------------------------------------+ 752 Table 2: Individual Video Encoding Parameters 754 +--------------+-----------------------------------+ 755 | Name | Description | 756 +--------------+-----------------------------------+ 757 | maxBandwidth | Maximum number of bits per second | 758 +--------------+-----------------------------------+ 760 Table 3: Individual Audio Encoding Parameters 762 7.2. Encoding Group 764 An encoding group includes a set of one or more individual encodings, 765 plus some parameters that apply to the group as a whole. By grouping 766 multiple individual encodings together, an encoding group describes 767 additional constraints on bandwidth and other parameters for the 768 group. Table 4 shows the parameters and individual encoding sets 769 that are part of an encoding group. 771 +-------------------+-----------------------------------------------+ 772 | Name | Description | 773 +-------------------+-----------------------------------------------+ 774 | encodeGroupID | A unique identifier for the encoding group | 775 | maxGroupBandwidth | Maximum number of bits per second relating to | 776 | | all encodings combined | 777 | maxGroupH264Mbps | Maximum number of macroblocks per second | 778 | | relating to all video encodings combined | 779 | videoEncodings[] | Set of potential encodings (list of | 780 | | encodeIDs) | 781 | audioEncodings[] | Set of potential encodings (list of | 782 | | encodeIDs) | 783 +-------------------+-----------------------------------------------+ 785 Table 4: Encoding Group 787 When the individual encodings in a group are instantiated into actual 788 encoded media streams, each stream has a bandwidth that must be less 789 than or equal to the maxBandwidth for the particular individual 790 encoding. The maxGroupBandwidth parameter gives the additional 791 restriction that the sum of all the individual instantiated 792 bandwidths must be less than or equal to the maxGroupBandwidth value. 794 Likewise, the sum of the macroblocks per second of each instantiated 795 encoding in the group must not exceed the maxGroupH264Mbps value. 797 The following diagram illustrates the structure of a media provider's 798 Encoding Groups and their contents. 800 ,-------------------------------------------------. 801 | Media Provider | 802 | | 803 | ,--------------------------------------. | 804 | | ,--------------------------------------. | 805 | | | ,--------------------------------------. | 806 | | | | Encoding Group | | 807 | | | | ,-----------. | | 808 | | | | | | ,---------. 
| | 809 | | | | | | | | ,---------.| | 810 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 811 | `.| | | | | | `---------'| | 812 | `.| `-----------' `---------' | | 813 | `--------------------------------------' | 814 `-------------------------------------------------' 816 Figure 1: Encoding Group Structure 818 A media provider advertises one or more encoding groups. Each 819 encoding group includes one or more individual encodings. Each 820 individual encoding can represent a different way of encoding media. 821 For example one individual encoding may be 1080p60 video, another 822 could be 720p30, with a third being CIF. 824 While a typical 3 codec/display system might have one encoding group 825 per "codec box", there are many possibilities for the number of 826 encoding groups a provider may be able to offer and for the encoding 827 values in each encoding group. 829 There is no requirement for all encodings within an encoding group to 830 be instantiated at once. 832 8. Associating Media Captures with Encoding Groups 834 Every media capture is associated with an encoding group, which is 835 used to instantiate that media capture into one or more encoded 836 streams. Each media capture has an encoding group attribute. The 837 value of this attribute is the encodeGroupID for the encoding group 838 with which it is associated. More than one media capture may use the 839 same encoding group. 841 The maximum number of streams that can result from a particular 842 encoding group constraint is equal to the number of individual 843 encodings in the group. The actual number of streams used at any 844 time may be less than this maximum. Any of the media captures that 845 use a particular encoding group can be encoded according to any of 846 the individual encodings in the group. If there are multiple 847 individual encodings in the group, then a single media capture can be 848 encoded into multiple different streams at the same time, with each 849 stream following the constraints of a different individual encoding. 851 The Encoding Groups MUST allow all the media captures in a particular 852 capture scene entry to be used simultaneously. 854 9. Consumer's Choice of Streams to Receive from the Provider 856 After receiving the provider's advertised media captures and 857 associated constraints, the consumer must choose which media captures 858 it wishes to receive, and which individual encodings from the 859 provider it wants to use to encode the capture. Each media capture 860 has an encoding group ID attribute which specifies which individual 861 encodings are available to be used for that media capture. 863 For each media capture the consumer wants to receive, it configures 864 one or more of the encodings in that capture's encoding group. The 865 consumer does this by telling the provider the resolution, frame 866 rate, bandwidth, etc. when asking for streams for its chosen 867 captures. Upon receipt of this configuration command from the 868 consumer, the provider generates streams for each such configured 869 encoding and sends those streams to the consumer. 871 The consumer must have received at least one capture advertisement 872 from the provider to be able to configure the provider's generation 873 of media streams. 875 The consumer is able to change its configuration of the provider's 876 encodings any number of times during the call, either in response to 877 a new capture advertisement from the provider or autonomously. 
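A consumer's configuration is only usable by the provider if each requested stream fits within the individual encoding it is mapped to, and if the per-group totals fit within the encoding group limits (see Section 9.3). The following non-normative Python sketch illustrates such a check; the function and field names are hypothetical and are not defined by this framework.

   # Hypothetical validity check for a consumer's chosen configuration;
   # illustrative only, not a normative algorithm.
   def configuration_is_valid(streams, encodings, groups):
       # streams:   list of {"encodeID", "bandwidth", "h264Mbps"} per stream
       # encodings: encodeID -> {"maxBandwidth", "maxH264Mbps", "groupID"}
       # groups:    encodeGroupID -> {"maxGroupBandwidth", "maxGroupH264Mbps"}
       group_bw = {}
       group_mbps = {}
       for s in streams:
           enc = encodings[s["encodeID"]]
           # Each stream must respect its individual encoding limits.
           if s["bandwidth"] > enc["maxBandwidth"]:
               return False
           if s.get("h264Mbps", 0) > enc.get("maxH264Mbps", 0):
               return False
           gid = enc["groupID"]
           group_bw[gid] = group_bw.get(gid, 0) + s["bandwidth"]
           group_mbps[gid] = group_mbps.get(gid, 0) + s.get("h264Mbps", 0)
       # The sums per encoding group must respect the group-wide limits.
       for gid in group_bw:
           if group_bw[gid] > groups[gid]["maxGroupBandwidth"]:
               return False
           if group_mbps[gid] > groups[gid]["maxGroupH264Mbps"]:
               return False
       return True
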
The 878 consumer need not send a new configure message to the provider when 879 it receives a new capture advertisement from the provider unless the 880 contents of the new capture advertisement cause the consumer's 881 current configure message to become invalid.

883 When choosing which streams to receive from the provider, and the 884 encoding characteristics of those streams, the consumer needs to take 885 several things into account: its local preference, simultaneity 886 restrictions, and encoding limits.

888 9.1. Local preference

890 A variety of local factors will influence the consumer's choice of 891 streams to be received from the provider:

893 o if the consumer is an endpoint, it is likely that it would choose, 894 where possible, to receive video and audio captures that match the 895 number of display devices and audio system it has

897 o if the consumer is a middle box such as an MCU, it may choose to 898 receive loudest speaker streams (in order to perform its own media 899 composition) and avoid pre-composed video captures

901 o user choice (for instance, selection of a new layout) may result 902 in a different set of media captures, or different encoding 903 characteristics, being required by the consumer

905 9.2. Physical simultaneity restrictions

907 There may be physical simultaneity constraints imposed by the 908 provider that affect the provider's ability to simultaneously send 909 all of the captures the consumer would wish to receive. For 910 instance, a middle box such as an MCU, when connected to a multi- 911 camera room system, might prefer to receive both individual camera 912 streams of the people present in the room and an overall view of the 913 room from a single camera. Some endpoint systems might be able to 914 provide both of these sets of streams simultaneously, whereas others 915 may not (if the overall room view were produced by changing the zoom 916 level on the center camera, for instance).

918 9.3. Encoding and encoding group limits

920 Each of the provider's encoding groups has limits on bandwidth and 921 macroblocks per second, and the constituent potential encodings have 922 limits on the bandwidth, macroblocks per second, video frame rate, 923 and resolution that can be provided. When choosing the media 924 captures to be received from a provider, a consumer device must 925 ensure that the encoding characteristics requested for each 926 individual media capture fit within the capability of the encoding 927 it is being configured to use, as well as ensuring that the combined 928 encoding characteristics for media captures fit within the 929 capabilities of their associated encoding groups. In some cases, 930 this could cause an otherwise "preferred" choice of streams to be 931 passed over in favour of different streams - for instance, if a set 932 of 3 media captures could only be provided at a low resolution then a 933 3 screen device could switch to favoring a single, higher quality, 934 stream.

936 9.4. Message Flow

938 The following diagram shows the basic flow of messages between a 939 media provider and a media consumer. The usage of the "capture 940 advertisement" and "configure encodings" messages is described above.

942 The consumer also sends its own capability message to the provider 943 which may contain information about its own capabilities or 944 restrictions.
946 Diagram for Message Flow

948      Media Consumer                        Media Provider
949      --------------                        --------------
950            |                                     |
951            |----- Consumer Capability ---------->|
952            |                                     |
953            |                                     |
954            |<---- Capture advertisement ---------|
955            |                                     |
956            |                                     |
957            |------ Configure encodings --------->|
958            |                                     |

960 In order for a maximally-capable provider to be able to advertise a 961 manageable number of video captures to a consumer, there is a 962 potential use for the consumer, at the start of CLUE, to be able to 963 inform the provider of its capabilities. One example here would be 964 the video capture attribute set - a consumer could tell the provider 965 the complete set of video capture attributes it is able to understand 966 and so the provider would be able to reduce the capture scene it 967 advertises so that it is tailored to the consumer.

969 TBD - the content of the consumer capability message needs to be 970 better defined. The authors believe there is a need for this 971 message, but have not worked out the details yet.

973 10. Extensibility

975 One of the most important characteristics of the Framework is its 976 extensibility. Telepresence is a relatively new industry and while 977 we can foresee certain directions, we also do not know everything 978 about how it will develop. The standard for interoperability and 979 handling multiple streams must be future-proof.

981 The framework itself is inherently extensible through expanding the 982 data model types. For example:

984 o Adding more types of media, such as telemetry, can be done by 985 defining additional types of captures in addition to audio and 986 video.

988 o Adding new functionality, such as 3-D, will require 989 additional attributes describing the captures.

991 o Adding new codecs, such as H.265, can be accomplished by 992 defining new encoding variables.

994 The infrastructure is designed to be extended rather than requiring 995 new infrastructure elements. Extension comes through adding to 996 defined types.

998 Assuming the implementation is in something like XML, adding data 999 elements and attributes makes extensibility easy.

1001 11. Examples - Using the Framework

1003 This section shows in more detail some examples of how to use the 1004 framework to represent typical cases for telepresence rooms. First 1005 an endpoint is illustrated, then an MCU case is shown.

1007 11.1. Three screen endpoint media provider

1009 Consider an endpoint with the following description:

1011 o 3 cameras, 3 displays, a 6 person table

1013 o Each video device can provide one capture for each 1/3 section of 1014 the table

1016 o A single capture representing the active speaker can be provided

1018 o A single capture representing the active speaker with the other 2 1019 captures shown picture in picture within the stream can be 1020 provided

1022 o A capture showing a zoomed out view of all 6 seats in the room can 1023 be provided

1025 The audio and video captures for this endpoint can be described as 1026 follows.
1028 Video Captures: 1030 o VC0- (the camera-left camera stream), encoding group=EG0, 1031 content=main, switched=false 1033 o VC1- (the center camera stream), encoding group=EG1, content=main, 1034 switched=false 1036 o VC2- (the camera-right camera stream), encoding group=EG2, 1037 content=main, switched=false 1039 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1040 switched=true 1042 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1043 content=main, composed=true, switched=true 1045 o VC5- (the zoomed out view of all people in the room), encoding 1046 group=EG1, content=main, composed=false, switched=false 1048 o VC6- (presentation stream), encoding group=EG1, content=slides, 1049 switched=false 1051 The following diagram is a top view of the room with 3 cameras, 3 1052 displays, and 6 seats. Each camera is capturing 2 people. The six 1053 seats are not all in a straight line. 1055 ,-. d 1056 ( )`--.__ +---+ 1057 `-' / `--.__ | | 1058 ,-. | `-.._ |_-+Camera 2 (VC2) 1059 ( ).' ___..-+-''`+-+ 1060 `-' |_...---'' | | 1061 ,-.c+-..__ +---+ 1062 ( )| ``--..__ | | 1063 `-' | ``+-..|_-+Camera 1 (VC1) 1064 ,-. | __..--'|+-+ 1065 ( )| __..--' | | 1066 `-'b|..--' +---+ 1067 ,-. |``---..___ | | 1068 ( )\ ```--..._|_-+Camera 0 (VC0) 1069 `-' \ _..-''`-+ 1070 ,-. \ __.--'' | | 1071 ( ) |..-'' +---+ 1072 `-' a 1074 The two points labeled b and c are intended to be at the midpoint 1075 between the seating positions, and where the fields of view of the 1076 cameras intersect. 1077 The plane of interest for VC0 is a vertical plane that intersects 1078 points 'a' and 'b'. 1079 The plane of interest for VC1 intersects points 'b' and 'c'. 1080 The plane of interest for VC2 intersects points 'c' and 'd'. 1081 This example uses an area scale of millimeters. 1083 Areas of capture: 1084 bottom left bottom right top left top right 1085 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1086 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1087 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1088 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1089 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1090 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1091 VC6 none 1093 Points of capture: 1094 VC0 (-1678,0,800) 1095 VC1 (0,0,800) 1096 VC2 (1678,0,800) 1097 VC3 none 1098 VC4 none 1099 VC5 (0,0,800) 1100 VC6 none 1102 In this example, the right edge of the VC0 area lines up with the 1103 left edge of the VC1 area. It doesn't have to be this way. There 1104 could be a gap or an overlap. One additional thing to note for this 1105 example is the distance from a to b is equal to the distance from b 1106 to c and the distance from c to d. All these distances are 1346 mm. 1107 This is the planar width of each area of capture for VC0, VC1, and 1108 VC2. 1110 Note the text in parentheses (e.g. "the camera-left camera stream") 1111 is not explicitly part of the model, it is just explanatory text for 1112 this example, and is not included in the model with the media 1113 captures and attributes. Also, the "composed" boolean attribute 1114 doesn't say anything about how a capture is composed, so the media 1115 consumer can't tell based on this attribute that VC4 is composed of a 1116 "loudest panel with PiPs". 
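The area of capture values above are enough for a consumer to recover the camera-left to camera-right arrangement of VC0, VC1 and VC2, because X increases from camera left to camera right (Section 5). The following non-normative Python sketch, using the example coordinates, is one hypothetical way a consumer might derive a rendering order; it is purely illustrative and not part of the framework.

   # Non-normative sketch: derive a camera-left to camera-right ordering
   # from the example area of capture values (bottom left corner X, in mm).
   areas = {
       # capture: (bottom left, bottom right, top left, top right)
       "VC0": ((-2011, 2850, 0), (-673, 3000, 0),
               (-2011, 2850, 757), (-673, 3000, 757)),
       "VC1": ((-673, 3000, 0), (673, 3000, 0),
               (-673, 3000, 757), (673, 3000, 757)),
       "VC2": ((673, 3000, 0), (2011, 2850, 0),
               (673, 3000, 757), (2011, 3000, 757)),
   }

   def left_to_right(entry):
       # Sort captures by the X coordinate of the bottom left corner.
       return sorted(entry, key=lambda cap: areas[cap][0][0])

   print(left_to_right(["VC2", "VC0", "VC1"]))   # ['VC0', 'VC1', 'VC2']
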
1118 Audio Captures: 1120 o AC0 (camera-left), encoding group=EG3, content=main, channel 1121 format=mono 1123 o AC1 (camera-right), encoding group=EG3, content=main, channel 1124 format=mono 1126 o AC2 (center) encoding group=EG3, content=main, channel format=mono 1128 o AC3 being a simple pre-mixed audio stream from the room (mono), 1129 encoding group=EG3, content=main, channel format=mono 1131 o AC4 audio stream associated with the presentation video (mono) 1132 encoding group=EG3, content=slides, channel format=mono 1134 Areas of capture: 1135 bottom left bottom right top left top right 1136 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1137 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1138 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1139 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1140 AC4 none 1142 The physical simultaneity information is: 1144 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1146 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1148 This constraint indicates it is not possible to use all the VCs at 1149 the same time. VC5 can not be used at the same time as VC1 or VC3 or 1150 VC4. Also, using every member in the set simultaneously may not make 1151 sense - for example VC3(loudest) and VC4 (loudest with PIP). (In 1152 addition, there are encoding constraints that make choosing all of 1153 the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and 1154 EG1 has only 3 ENCs. This constraint shows up in the encoding 1155 groups, not in the simultaneous transmission sets.) 1157 In this example there are no restrictions on which audio captures can 1158 be sent simultaneously. 1160 Encoding Groups: 1162 This example has three encoding groups associated with the video 1163 captures. Each group can have 3 encodings, but with each potential 1164 encoding having a progressively lower specification. In this 1165 example, 1080p60 transmission is possible (as ENC0 has a maxMbps 1166 value compatible with that) as long as it is the only active encoding 1167 in the group(as maxMbps for the entire encoding group is also 1168 489600). Significantly, as up to 3 encodings are available per 1169 group, it is possible to transmit some video captures simultaneously 1170 that are not in the same entry in the capture scene. For example VC1 1171 and VC3 at the same time. 1173 It is also possible to transmit multiple encodings of a single video 1174 capture. For example VC0 can be encoded using ENC0 and ENC1 at the 1175 same time, as long as the encoding parameters satisfy the constraints 1176 of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30. 
1178 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1179 encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1180 maxH264Mbps=489600, maxBandwidth=4000000 1181 encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1182 maxH264Mbps=108000, maxBandwidth=4000000 1183 encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, 1184 maxH264Mbps=61200, maxBandwidth=4000000 1186 encodeGroupID=EG1 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000 1187 encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1188 maxH264Mbps=489600, maxBandwidth=4000000 1189 encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1190 maxH264Mbps=108000, maxBandwidth=4000000 1191 encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, 1192 maxH264Mbps=61200, maxBandwidth=4000000 1194 encodeGroupID=EG2 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000 1195 encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1196 maxH264Mbps=489600, maxBandwidth=4000000 1197 encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1198 maxH264Mbps=108000, maxBandwidth=4000000 1199 encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, 1200 maxH264Mbps=61200, maxBandwidth=4000000 1202 Figure 2: Example Encoding Groups for Video 1204 For audio, there are five potential encodings available, so all five 1205 audio captures can be encoded at the same time. 1207 encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000 1208 encodeID=ENC9, maxBandwidth=64000 1209 encodeID=ENC10, maxBandwidth=64000 1210 encodeID=ENC11, maxBandwidth=64000 1211 encodeID=ENC12, maxBandwidth=64000 1212 encodeID=ENC13, maxBandwidth=64000 1214 Figure 3: Example Encoding Group for Audio 1216 Capture Scenes: 1218 The following table represents the capture scenes for this provider. 1219 Recall that a capture scene is composed of alternative capture scene 1220 entries covering the same scene. Capture Scene #1 is for the main 1221 people captures, and Capture Scene #2 is for presentation. 1223 Each row in the table is a separate entry in the capture scene 1225 +------------------+ 1226 | Capture Scene #1 | 1227 +------------------+ 1228 | VC0, VC1, VC2 | 1229 | VC3 | 1230 | VC4 | 1231 | VC5 | 1232 | AC0, AC1, AC2 | 1233 | AC3 | 1234 +------------------+ 1236 +------------------+ 1237 | Capture Scene #2 | 1238 +------------------+ 1239 | VC6 | 1240 | AC4 | 1241 +------------------+ 1243 Different capture scenes are unique to each other, non-overlapping. 1244 A consumer can choose an entry from each capture scene. In this case 1245 the three captures VC0, VC1, and VC2 are one way of representing the 1246 video from the endpoint. These three captures should appear adjacent 1247 next to each other. Alternatively, another way of representing the 1248 Capture Scene is with the capture VC3, which automatically shows the 1249 person who is talking. Similarly for the VC4 and VC5 alternatives. 1251 As in the video case, the different entries of audio in Capture Scene 1252 #1 represent the "same thing", in that one way to receive the audio 1253 is with the 3 audio captures (AC0, AC1, AC2), and another way is with 1254 the mixed AC3. The Media Consumer can choose an audio capture entry 1255 it is capable of receiving. 1257 The spatial ordering is understood by the media capture attributes 1258 area and point of capture. 1260 A Media Consumer would likely want to choose a capture scene entry to 1261 receive based in part on how many streams it can simultaneously 1262 receive. 
   For example, a consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1 (VC0,
   VC1, VC2) and not receive the other entries.  A consumer that can
   receive only one people stream would probably choose one of the
   other entries.

   If the consumer can receive a presentation stream too, it would also
   choose to receive the presentation video entry from Capture Scene #2
   (VC6).

11.2.  Encoding Group Example

   This example illustrates how an encoding group can express
   dependencies between its encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (1080p30 corresponds to 244800 in maxH264Mbps terms), but
   its encodings are capable of a maxFrameRate of 60 frames per second
   (fps).  To achieve the maximum resolution (1920 x 1088) the frame
   rate is limited to 30 fps.  However, 60 fps can be achieved at a
   lower resolution if required by the consumer.  Although the encoding
   group is capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the consumer.  A system that
   does not wish or need to combine bandwidth limitations in this way
   should instead use separate encoding groups for audio and video, so
   that the bandwidth limitations on audio and video do not interact.

   Audio and video can be expressed in separate encoding groups, as in
   the following illustration.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

11.3.  The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations; it can be
   associated (e.g., lip-synced) with any combination of video captures
   at the consumer.
   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer   |
   | VC1, VC2           | video capture for 2 screen consumer        |
   | VC3, VC4, VC5      | video capture for 3 screen consumer        |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
   | AC0                | audio capture representing all participants|
   +--------------------+---------------------------------------------+

   If or when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   +------------------+--------------------------------------+

11.4.  Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three screen endpoint
   described above.

   The receive side of a call needs to balance its requirements (based
   on its number of screens and speakers, its decoding capabilities,
   and its available bandwidth) against the provider's capabilities in
   order to optimally configure the provider's streams.  Typically it
   would want to receive and decode media from each capture scene
   advertised by the provider.

   A sane, basic algorithm might be for the consumer to go through each
   capture scene in turn and find the collection of video captures that
   best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   entries in the video capture scenes based either on hard-coded
   preferences or on user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
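   The following non-normative Python sketch illustrates this basic
   algorithm.  The scene structures and the per-scene "content" label
   used to separate people video from presentation video are
   assumptions made purely for illustration; after this selection step,
   a real consumer would go on to configure the relevant encoding
   groups within its bandwidth and decoding limits.

      # Illustrative sketch only; not part of the CLUE framework.
      def best_fit(entries, screens):
          # Largest entry that does not exceed the available screens,
          # otherwise the smallest entry offered.
          ok = [e for e in entries if len(e) <= screens]
          return max(ok, key=len) if ok else min(entries, key=len)

      def choose_captures(scenes, people_screens, presentation_screens):
          choice = {}
          for scene in scenes:
              screens = (presentation_screens
                         if scene["content"] == "slides" else people_screens)
              choice[scene["name"]] = best_fit(scene["video_entries"],
                                               screens)
          return choice

      # The three-camera endpoint example from earlier in this section.
      scenes = [
          {"name": "Capture Scene #1", "content": "main",
           "video_entries": [["VC0", "VC1", "VC2"],
                             ["VC3"], ["VC4"], ["VC5"]]},
          {"name": "Capture Scene #2", "content": "slides",
           "video_entries": [["VC6"]]},
      ]
      print(choose_captures(scenes, people_screens=3,
                            presentation_screens=1))
      # {'Capture Scene #1': ['VC0', 'VC1', 'VC2'],
      #  'Capture Scene #2': ['VC6']}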
11.4.1.  One screen consumer

   VC3, VC4 and VC5 are all separate entries by themselves, not grouped
   together in a single entry, so the receiving device should choose
   one of them.  The choice would come down to whether to see the
   greatest number of participants simultaneously at roughly equal
   precedence (VC5), a switched view of just the loudest region (VC3),
   or a switched view with PiPs (VC4).  An endpoint device with some
   knowledge of these differences could offer a dynamic choice of these
   options, in call, to the user.

11.4.2.  Two screen consumer configuring the example

   Mixing systems with an even number ("2n") of screens and systems
   with an odd number ("2n+1") of cameras, and vice versa, is always
   likely to be the problematic case.  In this instance, the behavior
   is likely to be determined by whether a "2 screen" system is really
   a "2 decoder" system, i.e., whether only one received stream can be
   displayed per screen or whether more than 2 streams can be received
   and spread across the available screen area.  To enumerate 3
   possible behaviors here for the 2 screen system when it learns that
   the far end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen consumer case above) and either leave one
       screen blank or use it for presentation if or when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens (either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across 2 screens, or,
       as would be necessary if there were large bezels on the screens,
       with each stream being scaled to 1/2 the screen width and height
       and there being a 4th "blank" panel).  This 4th panel could
       potentially be used for any presentation that became active
       during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   it might again be appropriate to offer the user the choice of
   display mode.

11.4.3.  Three screen consumer configuring the example

   This is the most straightforward case - the consumer would look to
   identify a set of streams to receive that best matches its available
   screens, and so VC0 plus VC1 plus VC2 would be the optimal match.
   The spatial ordering would give sufficient information for the
   correct video capture to be shown on the correct screen.  The
   consumer would either need to divide a single encoding group's
   capability by 3 to determine what resolution and frame rate to
   configure the provider with (for example, splitting a group's
   maxGroupH264Mbps of 489600 three ways leaves 163200 per capture,
   enough for 720p30 at 108000 but not for 1080p30 at 244800), or to
   configure the individual video captures' encoding groups with what
   makes most sense (taking into account the receive side decode
   capabilities, overall call bandwidth, the resolution of the screens,
   plus any user preferences such as motion vs. sharpness).

12.  Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

13.  IANA Considerations

   TBD

14.  Security Considerations

   TBD

15.  Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 04 to 05:

   1.  Clarify limitations of "composed" attribute.

   2.  Add new section "capture scene entry attributes" and add the
       attribute "scene-switch-policy".

   3.  Add capture scene description attribute and description language
       attribute.

   4.  Editorial changes to examples section for consistency with the
       rest of the document.

   Changes from 03 to 04:

   1.  Remove sentence from overview - "This constitutes a significant
       change ..."

   2.  Clarify a consumer can choose a subset of captures from a
       capture scene entry or a simultaneous set (in section "capture
       scene" and "consumer's choice...").

   3.  Reword first paragraph of Media Capture Attributes section.

   4.  Clarify a stereo audio capture is different from two mono audio
       captures (description of audio channel format attribute).

   5.  Clarify what it means when coordinate information is not
       specified for area of capture, point of capture, area of scene.

   6.  Change the term "producer" to "provider" to be consistent (it
       was just in two places).

   7.  Change name of "purpose" attribute to "content" and refer to
       RFC 4796 for values.
   8.  Clarify simultaneous sets are part of a provider advertisement,
       and apply across all capture scenes in the advertisement.

   9.  Remove sentence about lip-sync between all media captures in a
       capture scene.

   10. Combine the concepts of "capture scene" and "capture set" into a
       single concept, using the term "capture scene" to replace the
       previous term "capture set", and eliminating the original
       separate capture scene concept.

16.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [IANA-Lan] IANA, "Language Subtag Registry", .

Appendix A.  Open Issues

A.1.  Video layout arrangements and centralized composition

   In the context of a conference with a central MCU, there has been
   discussion about a consumer requesting the provider to provide a
   certain type of layout arrangement or to perform a certain
   composition algorithm, such as combining some number of most recent
   talkers, or producing a video layout using a 2x2 grid or 1 large
   cell with 5 smaller cells around it.  The current framework does not
   address this.  It is not clear whether this topic should be
   addressed in this framework, in a different part of CLUE, or outside
   of CLUE altogether.

A.2.  Source is selectable

   A Boolean variable.  True indicates the media consumer can request
   that a particular media source be mapped to a media capture.  The
   default is false.

   TBD - how does the consumer make the request for a particular
   source?  How does the consumer know what is available?  Need to
   explain better how multiple media captures are different from a
   single media capture with choices for the source, and when each
   concept should be used.

A.3.  Media Source Selection

   The use cases include a case where the person at a receiving
   endpoint can request to receive media from a particular other
   endpoint, for example in a multipoint call to request to receive the
   video from a certain section of a certain room, whether or not
   people there are talking.

   TBD - this framework should address this case.  Maybe a roster list
   of rooms or people in the conference is needed, with a mechanism to
   select from the roster and associate the selection with media
   captures.  This is different from selecting a particular media
   capture from a capture scene.  The mechanism to do this will
   probably need to be different from selecting media captures based on
   capture scenes and attributes.
A.4.  Endpoint requesting many streams from MCU

   TBD - how to do VC selection for a system where the endpoint media
   consumers want to receive many streams and do their own composition,
   rather than the MCU doing transcoding and composing.  An example is
   a 3 screen consumer that wants 3 large loudest-speaker streams, plus
   a number of small ones to render as PiPs.  It is an open question
   how the small ones are chosen; they could potentially be chosen by
   either the endpoint or the MCU.  There are other, more complicated
   examples as well.  Is the current framework adequate to support
   this?

A.5.  VAD (voice activity detection) tagging of audio streams

   TBD - do we want VAD to be mandatory?  All audio streams originating
   from a media provider must be tagged with VAD information.  This
   tagging would include an overall energy value for the stream plus
   information on which sections of the capture scene are "active".

   Each audio stream which forms a constituent of an entry within a
   capture scene should include this tagging, with the energy value
   within it calculated using a fixed, consistent algorithm.

   When a system determines the most active area of a capture scene
   (either "loudest", or determined by other means such as a button
   press), it should convey that information to the corresponding media
   stream consumer via any audio streams being sent within that capture
   scene.  Specifically, there should be a list of active coordinates
   and their VAD characteristics within the audio stream, in addition
   to the overall VAD information for the capture scene.  This is to
   ensure all media stream consumers receive the same, consistent audio
   energy information whichever audio capture or captures they choose
   to receive for a capture scene.  Additionally, coordinate
   information can be mapped to video captures by a media stream
   consumer so that it can perform "panel switching" if required.

A.6.  Private Information

   Do we want a way to include private information?

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Langley, England
   UK

   Email: apeppere@gmail.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com