idnits 2.17.1 draft-ietf-clue-framework-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1112 has weird spacing: '...om left bot...' == Line 1163 has weird spacing: '...om left bot...' -- The document date (October 22, 2012) is 4205 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 CLUE WG A. Romanow 3 Internet-Draft Cisco Systems 4 Intended status: Informational M. Duckworth, Ed. 5 Expires: April 25, 2013 Polycom 6 A. Pepperell 7 Silverflare 8 B. Baldino 9 Cisco Systems 10 October 22, 2012 12 Framework for Telepresence Multi-Streams 13 draft-ietf-clue-framework-07.txt 15 Abstract 17 This memo offers a framework for a protocol that enables devices in a 18 telepresence conference to interoperate by specifying the 19 relationships between multiple media streams. 21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on April 25, 2013. 38 Copyright Notice 40 Copyright (c) 2012 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . 
. . . . . . . . . . . . . 3 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 6 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 7 60 6. Media Captures and Capture Scenes . . . . . . . . . . . . . . 8 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 9 63 6.2. Capture Scene . . . . . . . . . . . . . . . . . . . . . . 12 64 6.2.1. Capture scene attributes . . . . . . . . . . . . . . . 13 65 6.2.2. Capture scene entry attributes . . . . . . . . . . . . 14 66 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 15 67 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 16 68 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 16 69 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 17 70 8. Associating Media Captures with Encoding Groups . . . . . . . 19 71 9. Consumer's Choice of Streams to Receive from the Provider . . 19 72 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 20 73 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 20 74 9.3. Encoding and encoding group limits . . . . . . . . . . . . 21 75 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 21 76 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 22 77 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 22 78 11.1. Three screen endpoint media provider . . . . . . . . . . . 23 79 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 29 80 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 30 81 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 30 82 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 31 83 11.4.2. Two screen consumer configuring the example . . . . . 31 84 11.4.3. Three screen consumer configuring the example . . . . 32 85 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 86 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 87 14. Security Considerations . . . . . . . . . . . . . . . . . . . 32 88 15. Changes Since Last Version . . . . . . . . . . . . . . . . . . 32 89 16. Informative References . . . . . . . . . . . . . . . . . . . . 34 90 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35 92 1. Introduction 94 Current telepresence systems, though based on open standards such as 95 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 96 other. A major factor limiting the interoperability of telepresence 97 systems is the lack of a standardized way to describe and negotiate 98 the use of the multiple streams of audio and video comprising the 99 media flows. This draft provides a framework for a protocol to 100 enable interoperability by handling multiple streams in a 101 standardized way. It is intended to support the use cases described 102 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 103 requirements in draft-ietf-clue-telepresence-requirements-01. 105 The solution described here is strongly focused on what is being done 106 today, rather than on a vision of future conferencing. At the same 107 time, the highest priority has been given to creating an extensible 108 framework to make it easy to accommodate future conferencing 109 functionality as it evolves. 
111 The purpose of this effort is to make it possible to handle multiple
112 streams of media in such a way that a satisfactory user experience is
113 possible even when participants are using different vendor equipment,
114 and also when they are using devices with different types of
115 communication capabilities. Information about the relationship of
116 media streams at the provider's end must be communicated so that
117 streams can be chosen and audio/video rendering can be done in the
118 best possible manner.
120 There is no attempt here to dictate to the renderer what it should
121 do. What the renderer does is up to the renderer.
123 After the following Definitions, a short section introduces key
124 concepts. The body of the text comprises several sections about the
125 key elements of the framework, how a consumer chooses streams to
126 receive, and some examples. The appendix describes topics that are
127 under discussion for adding to the document.
129 2. Terminology
131 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
132 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
133 document are to be interpreted as described in RFC 2119 [RFC2119].
135 3. Definitions
137 The definitions marked with an "*" are new; all the others are from
138 *Audio Capture: Media Capture for audio. Denoted as ACn.
140 Camera-Left and Right: For media captures, camera-left and camera-
141 right are from the point of view of a person observing the rendered
142 media. They are the opposite of stage-left and stage-right.
144 Capture Device: A device that converts audio and video input into an
145 electrical signal, in most cases to be fed into a media encoder.
146 Cameras and microphones are examples of capture devices.
148 *Capture Encoding: A specific encoding of a media capture, to be sent
149 by a media provider to a media consumer via RTP.
151 *Capture Scene: a structure representing the scene that is captured
152 by a collection of capture devices. A capture scene includes
153 attributes and one or more capture scene entries, with each entry
154 including one or more media captures.
156 *Capture Scene Entry: a list of media captures of the same media type
157 that together form one way to represent the capture scene.
159 Conference: used as defined in [RFC4353], A Framework for
160 Conferencing within the Session Initiation Protocol (SIP).
162 *Individual Encoding: A variable with a set of attributes that
163 describes the maximum values of a single audio or video capture
164 encoding. The attributes include maximum bandwidth and, for video,
165 maximum macroblocks (for H.264), maximum width, maximum height, and
166 maximum frame rate.
168 *Encoding Group: A set of encoding parameters representing a media
169 provider's encoding capabilities. Media stream providers formed of
170 multiple physical units, in each of which resides some encoding
171 capability, would typically advertise themselves to the remote media
172 stream consumer using multiple encoding groups. Within each encoding
173 group, multiple potential encodings are possible, with the sum of the
174 chosen encodings' characteristics constrained to be less than or
175 equal to the group-wide constraints.
177 Endpoint: The logical point of final termination through receiving,
178 decoding and rendering, and/or initiation through capturing,
179 encoding, and sending of media streams.
An endpoint consists of one 180 or more physical devices which source and sink media streams, and 181 exactly one [RFC4353] Participant (which, in turn, includes exactly 182 one SIP User Agent). In contrast to an endpoint, an MCU may also 183 send and receive media streams, but it is not the initiator nor the 184 final terminator in the sense that Media is Captured or Rendered. 185 Endpoints can be anything from multiscreen/multicamera rooms to 186 handheld devices. 188 Front: the portion of the room closest to the cameras. In going 189 towards back you move away from the cameras. 191 MCU: Multipoint Control Unit (MCU) - a device that connects two or 192 more endpoints together into one single multimedia conference 193 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 194 tardy in requiring that media from the mixer be sent to EACH 195 participant. I think we have practical use cases where this is not 196 the case. But the bug (if it is one) is in 4353 and not herein.] 198 Media: Any data that, after suitable encoding, can be conveyed over 199 RTP, including audio, video or timed text. 201 *Media Capture: a source of Media, such as from one or more Capture 202 Devices. A Media Capture (MC) may be the source of one or more 203 capture encodings. A Media Capture may also be constructed from 204 other Media streams. A middle box can express Media Captures that it 205 constructs from Media streams it receives. 207 *Media Consumer: an Endpoint or middle box that receives media 208 streams 210 *Media Provider: an Endpoint or middle box that sends Media streams 212 Model: a set of assumptions a telepresence system of a given vendor 213 adheres to and expects the remote telepresence system(s) also to 214 adhere to. 216 *Plane of Interest: The spatial plane containing the most relevant 217 subject matter. 219 Render: the process of generating a representation from a media, such 220 as displayed motion video or sound emitted from loudspeakers. 222 *Simultaneous Transmission Set: a set of media captures that can be 223 transmitted simultaneously from a Media Provider. 225 Spatial Relation: The arrangement in space of two objects, in 226 contrast to relation in time or other relationships. See also 227 Camera-Left and Right. 229 Stage-Left and Right: For media captures, stage-left and stage-right 230 are the opposite of camera-left and camera-right. For the case of a 231 person facing (and captured by) a camera, stage-left and stage-right 232 are from the point of view of that person. 234 *Stream: a capture encoding sent from a media provider to a media 235 consumer via RTP [RFC3550]. 237 Stream Characteristics: the media stream attributes commonly used in 238 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 239 resolution, profile/level etc.) as well as CLUE specific attributes, 240 such as the ID of a capture or a spatial location. 242 Telepresence: an environment that gives non co-located users or user 243 groups a feeling of (co-located) presence - the feeling that a Local 244 user is in the same room with other Local users and the Remote 245 parties. The inclusion of Remote parties is achieved through 246 multimedia communication including at least audio and video signals 247 of high fidelity. 249 *Video Capture: Media Capture for video. Denoted as VCn. 251 Video composite: A single image that is formed from combining visual 252 elements from separate sources. 254 4. 
Overview of the Framework/Model 256 The CLUE framework specifies how multiple media streams are to be 257 handled in a telepresence conference. 259 The main goals include: 261 o Interoperability 263 o Extensibility 265 o Flexibility 267 Interoperability is achieved by the media provider describing the 268 relationships between media streams in constructs that are understood 269 by the consumer, who can then render the media. Extensibility is 270 achieved through abstractions and the generality of the model, making 271 it easy to add new parameters. Flexibility is achieved largely by 272 having the consumer choose what content and format it wants to 273 receive from what the provider is capable of sending. 275 A transmitting endpoint or MCU describes specific aspects of the 276 content of the media and the formatting of the media streams it can 277 send (advertisement); and the receiving end responds to the provider 278 by specifying which content and media streams it wants to receive 279 (configuration). The provider then transmits the asked for content 280 in the specified streams. 282 This advertisement and configuration occurs at call initiation but 283 may also happen at any time throughout the conference, whenever there 284 is a change in what the consumer wants or the provider can send. 286 An endpoint or MCU typically acts as both provider and consumer at 287 the same time, sending advertisements and sending configurations in 288 response to receiving advertisements. (It is possible to be just one 289 or the other.) 291 The data model is based around two main concepts: a capture and an 292 encoding. A media capture (MC), such as audio or video, describes 293 the content a provider can send. Media captures are described in 294 terms of CLUE-defined attributes, such as spatial relationships and 295 purpose of the capture. Providers tell consumers which media 296 captures they can provide, described in terms of the media capture 297 attributes. 299 A provider organizes its media captures that represent the same scene 300 into capture scenes. A consumer chooses which media captures it 301 wants to receive according to the capture scenes sent by the 302 provider. 304 In addition, the provider sends the consumer a description of the 305 individual encodings it can send in terms of the media attributes of 306 the encodings, in particular, well-known audio and video parameters 307 such as bandwidth, frame rate, macroblocks per second. 309 The provider also specifies constraints on its ability to provide 310 media, and the consumer must take these into account in choosing the 311 content and capture encodings it wants. Some constraints are due to 312 the physical limitations of devices - for example, a camera may not 313 be able to provide zoom and non-zoom views simultaneously. Other 314 constraints are system based constraints, such as maximum bandwidth 315 and maximum macroblocks/second. 317 The following sections discuss these constructs and processes in 318 detail, followed by use cases showing how the framework specification 319 can be used. 321 5. Spatial Relationships 323 In order for a consumer to perform a proper rendering, it is often 324 necessary to provide spatial information about the streams it is 325 receiving. CLUE defines a coordinate system that allows media 326 providers to describe the spatial relationships of their media 327 captures to enable proper scaling and spatial rendering of their 328 streams. 
The coordinate system is based on a few principles: 330 o Simple systems which do not have multiple Media Captures to 331 associate spatially need not use the coordinate model. 333 o Coordinates can either be in real, physical units (millimeters), 334 have an unknown scale or have no physical scale. Systems which 335 know their physical dimensions should always provide those real- 336 world measurements. Systems which don't know specific physical 337 dimensions but still know relative distances should use 'unknown 338 scale'. 'No scale' is intended to be used where Media Captures 339 from different devices (with potentially different scales) will be 340 forwarded alongside one another (e.g. in the case of a middle 341 box). 343 * "millimeters" means the scale is in millimeters 345 * "Unknown" means the scale is not necessarily millimeters, but 346 the scale is the same for every capture in the capture scene. 348 * "No Scale" means the scale could be different for each capture 349 - an MCU provider that advertises two adjacent captures and 350 picks sources (which can change quickly) from different 351 endpoints might use this value; the scale could be different 352 and changing for each capture. But the areas of capture still 353 represent a spatial relation between captures. 355 o The coordinate system is Cartesian X, Y, Z with the origin at a 356 spot of the provider's choosing. The provider must use the same 357 coordinate system with same scale and origin for all coordinates 358 within the same capture scene. 360 The direction of increasing coordinate values is: 361 X increases from camera left to camera right 362 Y increases from front to back 363 Z increases from low to high 365 6. Media Captures and Capture Scenes 367 This section describes how media providers can describe the content 368 of media to consumers. 370 6.1. Media Captures 372 Media captures are the fundamental representations of streams that a 373 device can transmit. What a Media Capture actually represents is 374 flexible: 376 o It can represent the immediate output of a physical source (e.g. 377 camera, microphone) or 'synthetic' source (e.g. laptop computer, 378 DVD player). 380 o It can represent the output of an audio mixer or video composer 382 o It can represent a concept such as 'the loudest speaker' 384 o It can represent a conceptual position such as 'the leftmost 385 stream' 387 To distinguish between multiple instances, video and audio captures 388 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 389 two different video captures and AC1 and AC2 refer to two different 390 audio captures. 392 Each Media Capture can be associated with attributes to describe what 393 it represents. 395 6.1.1. Media Capture Attributes 397 Media Capture Attributes describe static information about the 398 captures. A provider uses the media capture attributes to describe 399 the media captures to the consumer. The consumer will select the 400 captures it wants to receive. Attributes are defined by a variable 401 and its value. The currently defined attributes and their values 402 are: 404 Content: {slides, speaker, sl, main, alt} 406 A field with enumerated values which describes the role of the media 407 capture and can be applied to any media type. The enumerated values 408 are defined by [RFC4796]. The values for this attribute are the same 409 as the mediacnt values for the content attribute in [RFC4796]. This 410 attribute can have multiple values, for example content={main, 411 speaker}. 
413 Composed: {true, false} 415 A field with a Boolean value which indicates whether or not the Media 416 Capture is a mix (audio) or composition (video) of streams. 418 This attribute is useful for a media consumer to avoid nesting a 419 composed video capture into another composed capture or rendering. 420 This attribute is not intended to describe the layout a media 421 provider uses when composing video streams. 423 Audio Channel Format: {mono, stereo} A field with enumerated values 424 which describes the method of encoding used for audio. 426 A value of 'mono' means the Audio Capture has one channel. 428 A value of 'stereo' means the Audio Capture has two audio channels, 429 left and right. 431 This attribute applies only to Audio Captures. A single stereo 432 capture is different from two mono captures that have a left-right 433 spatial relationship. A stereo capture maps to a single RTP stream, 434 while each mono audio capture maps to a separate RTP stream. 436 Switched: {true, false} 438 A field with a Boolean value which indicates whether or not the Media 439 Capture represents the (dynamic) most appropriate subset of a 440 'whole'. What is 'most appropriate' is up to the provider and could 441 be the active speaker, a lecturer or a VIP. 443 Point of Capture: {(X, Y, Z)} 445 A field with a single Cartesian (X, Y, Z) point value which describes 446 the spatial location, virtual or physical, of the capturing device 447 (such as camera). 449 When the Point of Capture attribute is specified, it must include X, 450 Y and Z coordinates. If the point of capture is not specified, it 451 means the consumer should not assume anything about the spatial 452 location of the capturing device. Even if the provider specifies an 453 area of capture attribute, it does not need to specify the point of 454 capture. 456 Point on Line of Capture: {(X,Y,Z)} 458 A field with a single Cartesian (X, Y, Z) point value (virtual or 459 physical) which describes a position in space of a second point on 460 the axis of the capturing device; the first point being the Point of 461 Capture (see above). This point MUST lie between the Point of 462 Capture and the Area of Capture. 464 The Point on Line of Capture MUST be ignored if the Point of Capture 465 is not present for this capture device. When the Point on Line of 466 Capture attribute is specified, it must include X, Y and Z 467 coordinates. These coordinates MUST NOT be identical to the Point of 468 Capture coordinates. If the Point on Line of Capture is not 469 specified, no assumptions are made about the axis of the capturing 470 device. 472 Area of Capture: 474 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 475 Z3), top right(X4, Y4, Z4)} 477 A field with a set of four (X, Y, Z) points as a value which describe 478 the spatial location of what is being "captured". By comparing the 479 Area of Capture for different Media Captures within the same capture 480 scene a consumer can determine the spatial relationships between them 481 and render them correctly. 483 The four points should be co-planar. The four points form a 484 quadrilateral, not necessarily a rectangle. 486 The quadrilateral described by the four (X, Y, Z) points defines the 487 plane of interest for the particular media capture. 489 If the area of capture attribute is specified, it must include X, Y 490 and Z coordinates for all four points. 
If the area of capture is not 491 specified, it means the media capture is not spatially related to any 492 other media capture (but this can change in a subsequent provider 493 advertisement). 495 For a switched capture that switches between different sections 496 within a larger area, the area of capture should use coordinates for 497 the larger potential area. 499 EncodingGroup: {} 501 A field with a value equal to the encodeGroupID of the encoding group 502 associated with the media capture. 504 Max Capture Encodings: {unsigned integer} 506 An optional attribute indicating the maximum number of capture 507 encodings that can be simultaneously active for the media capture. 508 If absent, this parameter defaults to 1. The minimum value for this 509 attribute is 1. The number of simultaneous capture encodings is also 510 limited by the restrictions of the encoding group for the media 511 capture. 513 6.2. Capture Scene 515 In order for a provider's individual media captures to be used 516 effectively by a consumer, the provider organizes the media captures 517 into capture scenes, with the structure and contents of these capture 518 scenes being sent from the provider to the consumer. 520 A capture scene is a structure representing the scene that is 521 captured by a collection of capture devices. A capture scene 522 includes one or more capture scene entries, with each entry including 523 one or more media captures. A capture scene represents, for example, 524 the video image of a group of people seated next to each other, along 525 with the sound of their voices, which could be represented by some 526 number of VCs and ACs in the capture scene entries. A middle box may 527 also express capture scenes that it constructs from media streams it 528 receives. 530 A provider may advertise multiple capture scenes or just a single 531 capture scene. A media provider might typically use one capture 532 scene for main participant media and another capture scene for a 533 computer generated presentation. A capture scene may include more 534 than one type of media. For example, a capture scene can include 535 several capture scene entries for video captures, and several capture 536 scene entries for audio captures. 538 A provider can express spatial relationships between media captures 539 that are included in the same capture scene. But there is no spatial 540 relationship between media captures that are in different capture 541 scenes. 543 A media provider arranges media captures in a capture scene to help 544 the media consumer choose which captures it wants. The capture scene 545 entries in a capture scene are different alternatives the provider is 546 suggesting for representing the capture scene. The media consumer 547 can choose to receive all media captures from one capture scene entry 548 for each media type (e.g. audio and video), or it can pick and choose 549 media captures regardless of how the provider arranges them in 550 capture scene entries. Different capture scene entries of the same 551 media type are not necessarily mutually exclusive alternatives. 553 Media captures within the same capture scene entry must be of the 554 same media type - it is not possible to mix audio and video captures 555 in the same capture scene entry, for instance. The provider must be 556 capable of encoding and sending all media captures in a single entry 557 simultaneously. 
A consumer may decide to receive all the media 558 captures in a single capture scene entry, but a consumer could also 559 decide to receive just a subset of those captures. A consumer can 560 also decide to receive media captures from different capture scene 561 entries. 563 When a provider advertises a capture scene with multiple entries, it 564 is essentially signaling that there are multiple representations of 565 the same scene available. In some cases, these multiple 566 representations would typically be used simultaneously (for instance 567 a "video entry" and an "audio entry"). In some cases the entries 568 would conceptually be alternatives (for instance an entry consisting 569 of 3 video captures versus an entry consisting of just a single video 570 capture). In this latter example, the provider would in the simple 571 case end up providing to the consumer the entry containing the number 572 of video captures that most closely matched the media consumer's 573 number of display devices. 575 The following is an example of 4 potential capture scene entries for 576 an endpoint-style media provider: 578 1. (VC0, VC1, VC2) - left, center and right camera video captures 580 2. (VC3) - video capture associated with loudest room segment 582 3. (VC4) - video capture zoomed out view of all people in the room 584 4. (AC0) - main audio 586 The first entry in this capture scene example is a list of video 587 captures with a spatial relationship to each other. Determination of 588 the order of these captures (VC0, VC1 and VC2) for rendering purposes 589 is accomplished through use of their Area of Capture attributes. The 590 second entry (VC3) and the third entry (VC4) are additional 591 alternatives of how to capture the same room in different ways. The 592 inclusion of the audio capture in the same capture scene indicates 593 that AC0 is associated with those video captures, meaning it comes 594 from the same scene. The audio should be rendered in conjunction 595 with any rendered video captures from the same capture scene. 597 6.2.1. Capture scene attributes 599 Attributes can be applied to capture scenes as well as to individual 600 media captures. Attributes specified at this level apply to all 601 constituent media captures. 603 Description attribute - list of {, } 605 The optional description attribute is a list of human readable text 606 strings which describe the capture scene. If there is more than one 607 string in the list, then each string in the list should contain the 608 same description, but in a different language. A provider that 609 advertises multiple capture scenes can provide descriptions for each 610 of them. This attribute can contain text in any number of languages. 612 The language tag identifies the language of the corresponding 613 description text. The possible values for a language tag are the 614 values of the 'Subtag' column for the "Type: language" entries in the 615 "Language Subtag Registry" at [IANA-Lan] originally defined in 616 [RFC5646]. A particular language tag value MUST NOT be used more 617 than once in the description attribute list. 619 Area of Scene attribute 621 The area of scene attribute for a capture scene has the same format 622 as the area of capture attribute for a media capture. The area of 623 scene is for the entire scene, which is captured by the one or more 624 media captures in the capture scene entries. 
If the provider does
625 not specify the area of scene, but does specify areas of capture,
626 then the consumer may assume the area of scene is greater than or
627 equal to the outer extents of the individual areas of capture.
629 Scale attribute
631 An optional attribute indicating if the numbers used for area of
632 scene, area of capture and point of capture are in terms of
633 millimeters, unknown scale factor, or not any scale, as described in
634 Section 5. If any media captures have an area of capture attribute
635 or point of capture attribute, then this scale attribute must also be
636 defined. The possible values for this attribute are:
638 "millimeters"
639 "unknown"
640 "no scale"
642 6.2.2. Capture scene entry attributes
644 Attributes can be applied to capture scene entries. Attributes
645 specified at this level apply to the capture scene entry as a whole.
647 Scene-switch-policy: {site-switch, segment-switch}
649 A media provider uses this scene-switch-policy attribute to indicate
650 its support for different switching policies. In the provider's
651 advertisement, this attribute can have multiple values, which means
652 the provider supports each of the indicated policies. The consumer,
653 when it requests media captures from this capture scene entry, should
654 also include this attribute but with only the single value (from
655 among the values indicated by the provider) indicating the consumer's
656 choice for which policy it wants the provider to use. If the
657 provider does not support any of these policies, it should omit this
658 attribute.
660 The "site-switch" policy means all captures are switched at the same
661 time to keep captures from the same endpoint site together. Let's
662 say the speaker is at site A and everyone else is at a "remote" site.
663 When the room at site A is shown, all the camera images from site A
664 are forwarded to the remote sites. Therefore, at each receiving remote
665 site, all the screens display camera images from site A. This can be
666 used to preserve full size image display, and also provide full
667 visual context of the displayed far end, site A. In site switching,
668 there is a fixed relation between the cameras in each room and the
669 displays in remote rooms. The room or participants being shown is
670 switched from time to time based on who is speaking or by manual
671 control.
673 The "segment-switch" policy means different captures can switch at
674 different times, and can be coming from different endpoints. Still
675 using site A as where the speaker is, and "remote" to refer to all
676 the other sites, in segment switching, rather than sending all the
677 images from site A, only the image containing the speaker at site A
678 is shown. The camera images of the current speaker and previous
679 speakers (if any) are forwarded to the other sites in the conference.
680 Therefore the screens in each site are usually displaying images from
681 different remote sites - the current speaker at site A and the
682 previous ones. This strategy can be used to preserve full size image
683 display, and also capture the non-verbal communication between the
684 speakers. In segment switching, the display depends on the activity
685 in the remote rooms - generally, but not necessarily, based on audio /
686 speech detection.
688 6.3. Simultaneous Transmission Set Constraints
690 The provider may have constraints or limitations on its ability to
691 send media captures.
One type is caused by the physical limitations
692 of capture mechanisms; these constraints are represented by a
693 simultaneous transmission set. The second type of limitation
694 reflects the encoding resources available - bandwidth and
695 macroblocks/second. This type of constraint is captured by encoding
696 groups, discussed below.
698 An endpoint or MCU can send multiple captures simultaneously; however,
699 sometimes there are constraints that limit which captures can be sent
700 simultaneously with other captures. A device may not be able to be
701 used in different ways at the same time. Provider advertisements are
702 made so that the consumer will choose one of several possible
703 mutually exclusive usages of the device. This type of constraint is
704 expressed in a Simultaneous Transmission Set, which lists all the
705 media captures that can be sent at the same time. This is easier to
706 show in an example.
708 Consider the example of a room system where there are 3 cameras, each
709 of which can send a separate capture covering 2 persons each - VC0,
710 VC1, VC2. The middle camera can also zoom out and show all 6
711 persons, VC3. But the middle camera cannot be used in both modes at
712 the same time - it has to either show the space where 2 participants
713 sit or the whole 6 seats, but not both at the same time.
715 Simultaneous transmission sets are expressed as sets of the MCs that
716 could physically be transmitted at the same time (though it may not
717 make sense to do so). In this example the two simultaneous sets are
718 shown in Table 1. The consumer must make sure that it chooses one
719 and not more of the mutually exclusive sets. A consumer may choose
720 any subset of the media captures in a simultaneous set; it does not
721 have to choose all the captures in a simultaneous set if it does not
722 want to receive all of them.
724 +-------------------+
725 | Simultaneous Sets |
726 +-------------------+
727 | {VC0, VC1, VC2} |
728 | {VC0, VC3, VC2} |
729 +-------------------+
731 Table 1: Two Simultaneous Transmission Sets
733 A media provider includes the simultaneous sets in its provider
734 advertisement. These simultaneous set constraints apply across all
735 the capture scenes in the advertisement. The simultaneous
736 transmission sets MUST allow all the media captures in a particular
737 capture scene entry to be used simultaneously.
739 7. Encodings
741 We have considered how providers can describe the content of media to
742 consumers. We will now consider how the providers communicate
743 information about their abilities to send streams. We introduce two
744 constructs - individual encodings and encoding groups. Consumers
745 will then map the media captures they want onto the encodings with
746 encoding parameters they want. This process is described below.
748 7.1. Individual Encodings
750 An individual encoding represents a way to encode a media capture to
751 become a capture encoding, to be sent as an encoded media stream from
752 the media provider to the media consumer. An individual encoding has
753 a set of parameters characterizing how the media is encoded.
754 Different media types have different parameters, and different
755 encoding algorithms may have different parameters. An individual
756 encoding can be assigned to only one capture encoding at a time.
758 The parameters of an individual encoding represent the maximum values
759 for certain aspects of the encoding.
A particular instantiation into 760 a capture encoding might use lower values than these maximums. 762 The following tables show the variables for audio and video encoding. 764 +--------------+----------------------------------------------------+ 765 | Name | Description | 766 +--------------+----------------------------------------------------+ 767 | encodeID | A unique identifier for the individual encoding | 768 | maxBandwidth | Maximum number of bits per second | 769 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 770 | | + 15) / 16) * ((height + 15) / 16) * | 771 | | framesPerSecond | 772 | maxWidth | Video resolution's maximum supported width, | 773 | | expressed in pixels | 774 | maxHeight | Video resolution's maximum supported height, | 775 | | expressed in pixels | 776 | maxFrameRate | Maximum supported frame rate | 777 +--------------+----------------------------------------------------+ 779 Table 2: Individual Video Encoding Parameters 781 +--------------+-----------------------------------+ 782 | Name | Description | 783 +--------------+-----------------------------------+ 784 | maxBandwidth | Maximum number of bits per second | 785 +--------------+-----------------------------------+ 787 Table 3: Individual Audio Encoding Parameters 789 7.2. Encoding Group 791 An encoding group includes a set of one or more individual encodings, 792 plus some parameters that apply to the group as a whole. By grouping 793 multiple individual encodings together, an encoding group describes 794 additional constraints on bandwidth and other parameters for the 795 group. Table 4 shows the parameters and individual encoding sets 796 that are part of an encoding group. 798 +-------------------+-----------------------------------------------+ 799 | Name | Description | 800 +-------------------+-----------------------------------------------+ 801 | encodeGroupID | A unique identifier for the encoding group | 802 | maxGroupBandwidth | Maximum number of bits per second relating to | 803 | | all encodings combined | 804 | maxGroupH264Mbps | Maximum number of macroblocks per second | 805 | | relating to all video encodings combined | 806 | videoEncodings[] | Set of potential encodings (list of | 807 | | encodeIDs) | 808 | audioEncodings[] | Set of potential encodings (list of | 809 | | encodeIDs) | 810 +-------------------+-----------------------------------------------+ 812 Table 4: Encoding Group 814 When the individual encodings in a group are instantiated into 815 capture encodings, each capture encoding has a bandwidth that must be 816 less than or equal to the maxBandwidth for the particular individual 817 encoding. The maxGroupBandwidth parameter gives the additional 818 restriction that the sum of all the individual capture encoding 819 bandwidths must be less than or equal to the maxGroupBandwidth value. 821 Likewise, the sum of the macroblocks per second of each instantiated 822 encoding in the group must not exceed the maxGroupH264Mbps value. 824 The following diagram illustrates the structure of a media provider's 825 Encoding Groups and their contents. 827 ,-------------------------------------------------. 828 | Media Provider | 829 | | 830 | ,--------------------------------------. | 831 | | ,--------------------------------------. | 832 | | | ,--------------------------------------. | 833 | | | | Encoding Group | | 834 | | | | ,-----------. | | 835 | | | | | | ,---------. 
| | 836 | | | | | | | | ,---------.| | 837 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 838 | `.| | | | | | `---------'| | 839 | `.| `-----------' `---------' | | 840 | `--------------------------------------' | 841 `-------------------------------------------------' 843 Figure 1: Encoding Group Structure 845 A media provider advertises one or more encoding groups. Each 846 encoding group includes one or more individual encodings. Each 847 individual encoding can represent a different way of encoding media. 848 For example one individual encoding may be 1080p60 video, another 849 could be 720p30, with a third being CIF. 851 While a typical 3 codec/display system might have one encoding group 852 per "codec box", there are many possibilities for the number of 853 encoding groups a provider may be able to offer and for the encoding 854 values in each encoding group. 856 There is no requirement for all encodings within an encoding group to 857 be instantiated at once. 859 8. Associating Media Captures with Encoding Groups 861 Every media capture is associated with an encoding group, which is 862 used to instantiate that media capture into one or more capture 863 encodings. Each media capture has an encoding group attribute. The 864 value of this attribute is the encodeGroupID for the encoding group 865 with which it is associated. More than one media capture may use the 866 same encoding group. 868 The maximum number of streams that can result from a particular 869 encoding group constraint is equal to the number of individual 870 encodings in the group. The actual number of capture encodings used 871 at any time may be less than this maximum. Any of the media captures 872 that use a particular encoding group can be encoded according to any 873 of the individual encodings in the group. If there are multiple 874 individual encodings in the group, then the media consumer can 875 configure the media provider to encode a single media capture into 876 multiple different capture encodings at the same time, subject to the 877 Max Capture Encodings constraint, with each capture encoding 878 following the constraints of a different individual encoding. 880 The Encoding Groups MUST allow all the media captures in a particular 881 capture scene entry to be used simultaneously. 883 9. Consumer's Choice of Streams to Receive from the Provider 885 After receiving the provider's advertised media captures and 886 associated constraints, the consumer must choose which media captures 887 it wishes to receive, and which individual encodings from the 888 provider it wants to use to encode the captures. Each media capture 889 has an encoding group ID attribute which specifies which individual 890 encodings are available to be used for that media capture. 892 For each media capture the consumer wants to receive, it configures 893 one or more of the encodings in that capture's encoding group. The 894 consumer does this by telling the provider the resolution, frame 895 rate, bandwidth, etc. when asking for capture encodings for its 896 chosen captures. Upon receipt of this configuration command from the 897 consumer, the provider generates a stream for each such configured 898 capture encoding and sends those streams to the consumer. 900 The consumer must have received at least one capture advertisement 901 from the provider to be able to configure the provider's generation 902 of media streams. 
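The syntax of the CLUE messages is not defined by this framework document (see the TBD note in Section 9.4). Purely as a non-normative illustration of the bookkeeping described above, the Python sketch below shows one way a consumer might represent its chosen capture encodings and check them against an advertised encoding group before sending its configuration; all type and field names here are invented for this example and are not part of the model.

   # Non-normative sketch; structures and names are invented for
   # illustration and are not defined by CLUE.
   from dataclasses import dataclass

   @dataclass
   class IndividualEncoding:        # parameters from Table 2
       encode_id: str
       max_bandwidth: int           # bits per second
       max_h264_mbps: int           # macroblocks per second

   @dataclass
   class EncodingGroup:             # parameters from Table 4
       encode_group_id: str
       max_group_bandwidth: int
       max_group_h264_mbps: int
       encodings: dict              # encodeID -> IndividualEncoding

   @dataclass
   class ConfiguredEncoding:        # one requested capture encoding
       capture_id: str              # e.g. "VC0"
       encode_id: str               # chosen from the capture's encoding group
       bandwidth: int               # requested values, each no greater than
       h264_mbps: int               # the chosen individual encoding's maxima

   def fits_group(group, configured):
       # Per-encoding limits first, then the group-wide sums (Section 7.2).
       for c in configured:
           enc = group.encodings[c.encode_id]
           if c.bandwidth > enc.max_bandwidth or c.h264_mbps > enc.max_h264_mbps:
               return False
       total_bw = sum(c.bandwidth for c in configured)
       total_mbps = sum(c.h264_mbps for c in configured)
       return (total_bw <= group.max_group_bandwidth and
               total_mbps <= group.max_group_h264_mbps)

A consumer would apply a check of this kind to each encoding group it draws on, in addition to respecting the simultaneous transmission sets of Section 6.3.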
904 The consumer is able to change its configuration of the provider's 905 encodings any number of times during the call, either in response to 906 a new capture advertisement from the provider or autonomously. The 907 consumer need not send a new configure message to the provider when 908 it receives a new capture advertisement from the provider unless the 909 contents of the new capture advertisement cause the consumer's 910 current configure message to become invalid. 912 When choosing which streams to receive from the provider, and the 913 encoding characteristics of those streams, the consumer needs to take 914 several things into account: its local preference, simultaneity 915 restrictions, and encoding limits. 917 9.1. Local preference 919 A variety of local factors will influence the consumer's choice of 920 streams to be received from the provider: 922 o if the consumer is an endpoint, it is likely that it would choose, 923 where possible, to receive video and audio captures that match the 924 number of display devices and audio system it has 926 o if the consumer is a middle box such as an MCU, it may choose to 927 receive loudest speaker streams (in order to perform its own media 928 composition) and avoid pre-composed video captures 930 o user choice (for instance, selection of a new layout) may result 931 in a different set of media captures, or different encoding 932 characteristics, being required by the consumer 934 9.2. Physical simultaneity restrictions 936 There may be physical simultaneity constraints imposed by the 937 provider that affect the provider's ability to simultaneously send 938 all of the captures the consumer would wish to receive. For 939 instance, a middle box such as an MCU, when connected to a multi- 940 camera room system, might prefer to receive both individual camera 941 streams of the people present in the room and an overall view of the 942 room from a single camera. Some endpoint systems might be able to 943 provide both of these sets of streams simultaneously, whereas others 944 may not (if the overall room view were produced by changing the zoom 945 level on the center camera, for instance). 947 9.3. Encoding and encoding group limits 949 Each of the provider's encoding groups has limits on bandwidth and 950 macroblocks per second, and the constituent potential encodings have 951 limits on the bandwidth, macroblocks per second, video frame rate, 952 and resolution that can be provided. When choosing the media 953 captures to be received from a provider, a consumer device must 954 ensure that the encoding characteristics requested for each 955 individual media capture fits within the capability of the encoding 956 it is being configured to use, as well as ensuring that the combined 957 encoding characteristics for media captures fit within the 958 capabilities of their associated encoding groups. In some cases, 959 this could cause an otherwise "preferred" choice of capture encodings 960 to be passed over in favour of different capture encodings - for 961 instance, if a set of 3 media captures could only be provided at a 962 low resolution then a 3 screen device could switch to favoring a 963 single, higher quality, capture encoding. 965 9.4. Message Flow 967 The following diagram shows the basic flow of messages between a 968 media provider and a media consumer. The usage of the "capture 969 advertisement" and "configure encodings" message is described above. 
970 The consumer also sends its own capability message to the provider
971 which may contain information about its own capabilities or
972 restrictions.
974 Diagram for Message Flow
976 Media Consumer Media Provider
977 -------------- ------------
978 | |
979 |----- Consumer Capability ---------->|
980 | |
981 | |
982 |<---- Capture advertisement ---------|
983 | |
984 | |
985 |------ Configure encodings --------->|
986 | |
988 In order for a maximally-capable provider to be able to advertise a
989 manageable number of video captures to a consumer, there is a
990 potential use for the consumer, at the start of CLUE, to be able to
991 inform the provider of its capabilities. One example here would be
992 the video capture attribute set - a consumer could tell the provider
993 the complete set of video capture attributes it is able to understand
994 and so the provider would be able to reduce the capture scene it
995 advertises to be tailored to the consumer.
997 TBD - the content of the consumer capability message needs to be
998 better defined. The authors believe there is a need for this
999 message, but have not worked out the details yet.
1001 10. Extensibility
1003 One of the most important characteristics of the Framework is its
1004 extensibility. Telepresence is a relatively new industry and while
1005 we can foresee certain directions, we also do not know everything
1006 about how it will develop. The standard for interoperability and
1007 handling multiple streams must be future-proof.
1009 The framework itself is inherently extensible through expanding the
1010 data model types. For example:
1012 o Adding more types of media, such as telemetry, can be done by
1013 defining additional types of captures in addition to audio and
1014 video.
1016 o Adding new functionalities, such as 3-D, will require
1017 additional attributes describing the captures.
1019 o Adding new codecs, such as H.265, can be accomplished by
1020 defining new encoding variables.
1022 The infrastructure is designed to be extended rather than requiring
1023 new infrastructure elements. Extension comes through adding to
1024 defined types.
1026 Assuming the implementation is in something like XML, adding data
1027 elements and attributes makes extensibility easy.
1029 11. Examples - Using the Framework
1031 This section shows in more detail some examples of how to use the
1032 framework to represent a typical case for telepresence rooms. First
1033 an endpoint is illustrated; then an MCU case is shown.
1035 11.1. Three screen endpoint media provider
1037 Consider an endpoint with the following description:
1039 o 3 cameras, 3 displays, a 6 person table
1041 o Each video device can provide one capture for each 1/3 section of
1042 the table
1044 o A single capture representing the active speaker can be provided
1046 o A single capture representing the active speaker with the other 2
1047 captures shown picture in picture within the stream can be
1048 provided
1050 o A capture showing a zoomed out view of all 6 seats in the room can
1051 be provided
1053 The audio and video captures for this endpoint can be described as
1054 follows.
1056 Video Captures: 1058 o VC0- (the camera-left camera stream), encoding group=EG0, 1059 content=main, switched=false 1061 o VC1- (the center camera stream), encoding group=EG1, content=main, 1062 switched=false 1064 o VC2- (the camera-right camera stream), encoding group=EG2, 1065 content=main, switched=false 1067 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1068 switched=true 1070 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1071 content=main, composed=true, switched=true 1073 o VC5- (the zoomed out view of all people in the room), encoding 1074 group=EG1, content=main, composed=false, switched=false 1076 o VC6- (presentation stream), encoding group=EG1, content=slides, 1077 switched=false 1079 The following diagram is a top view of the room with 3 cameras, 3 1080 displays, and 6 seats. Each camera is capturing 2 people. The six 1081 seats are not all in a straight line. 1083 ,-. d 1084 ( )`--.__ +---+ 1085 `-' / `--.__ | | 1086 ,-. | `-.._ |_-+Camera 2 (VC2) 1087 ( ).' ___..-+-''`+-+ 1088 `-' |_...---'' | | 1089 ,-.c+-..__ +---+ 1090 ( )| ``--..__ | | 1091 `-' | ``+-..|_-+Camera 1 (VC1) 1092 ,-. | __..--'|+-+ 1093 ( )| __..--' | | 1094 `-'b|..--' +---+ 1095 ,-. |``---..___ | | 1096 ( )\ ```--..._|_-+Camera 0 (VC0) 1097 `-' \ _..-''`-+ 1098 ,-. \ __.--'' | | 1099 ( ) |..-'' +---+ 1100 `-' a 1102 The two points labeled b and c are intended to be at the midpoint 1103 between the seating positions, and where the fields of view of the 1104 cameras intersect. 1105 The plane of interest for VC0 is a vertical plane that intersects 1106 points 'a' and 'b'. 1107 The plane of interest for VC1 intersects points 'b' and 'c'. 1108 The plane of interest for VC2 intersects points 'c' and 'd'. 1109 This example uses an area scale of millimeters. 1111 Areas of capture: 1112 bottom left bottom right top left top right 1113 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1114 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1115 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1116 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1117 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1118 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1119 VC6 none 1121 Points of capture: 1122 VC0 (-1678,0,800) 1123 VC1 (0,0,800) 1124 VC2 (1678,0,800) 1125 VC3 none 1126 VC4 none 1127 VC5 (0,0,800) 1128 VC6 none 1130 In this example, the right edge of the VC0 area lines up with the 1131 left edge of the VC1 area. It doesn't have to be this way. There 1132 could be a gap or an overlap. One additional thing to note for this 1133 example is the distance from a to b is equal to the distance from b 1134 to c and the distance from c to d. All these distances are 1346 mm. 1135 This is the planar width of each area of capture for VC0, VC1, and 1136 VC2. 1138 Note the text in parentheses (e.g. "the camera-left camera stream") 1139 is not explicitly part of the model, it is just explanatory text for 1140 this example, and is not included in the model with the media 1141 captures and attributes. Also, the "composed" boolean attribute 1142 doesn't say anything about how a capture is composed, so the media 1143 consumer can't tell based on this attribute that VC4 is composed of a 1144 "loudest panel with PiPs". 
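As a non-normative illustration of how a consumer might use these coordinates, the Python sketch below derives the camera-left to camera-right rendering order of VC0, VC1 and VC2 from the bottom left X values of their areas of capture listed above, using the Section 5 rule that X increases from camera left to camera right; the variable names are invented for this example.

   # Non-normative sketch; bottom left / bottom right X values are taken
   # from the "Areas of capture" table above.
   areas = {
       "VC0": (-2011, -673),    # (bottom left X, bottom right X)
       "VC1": ( -673,  673),
       "VC2": (  673, 2011),
   }

   # X increases from camera left to camera right, so sorting on the
   # bottom left X coordinate gives the rendering order.
   render_order = sorted(areas, key=lambda vc: areas[vc][0])
   print(render_order)          # ['VC0', 'VC1', 'VC2']

The same values also show the adjacency described above: the right edge of each area meets the left edge of the next.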
1146 Audio Captures: 1148 o AC0 (camera-left), encoding group=EG3, content=main, channel 1149 format=mono 1151 o AC1 (camera-right), encoding group=EG3, content=main, channel 1152 format=mono 1154 o AC2 (center) encoding group=EG3, content=main, channel format=mono 1156 o AC3 being a simple pre-mixed audio stream from the room (mono), 1157 encoding group=EG3, content=main, channel format=mono 1159 o AC4 audio stream associated with the presentation video (mono) 1160 encoding group=EG3, content=slides, channel format=mono 1162 Areas of capture: 1163 bottom left bottom right top left top right 1164 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1165 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1166 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1167 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1168 AC4 none 1170 The physical simultaneity information is: 1172 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1174 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1176 This constraint indicates it is not possible to use all the VCs at 1177 the same time. VC5 can not be used at the same time as VC1 or VC3 or 1178 VC4. Also, using every member in the set simultaneously may not make 1179 sense - for example VC3(loudest) and VC4 (loudest with PIP). (In 1180 addition, there are encoding constraints that make choosing all of 1181 the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and 1182 EG1 has only 3 ENCs. This constraint shows up in the encoding 1183 groups, not in the simultaneous transmission sets.) 1185 In this example there are no restrictions on which audio captures can 1186 be sent simultaneously. 1188 Encoding Groups: 1190 This example has three encoding groups associated with the video 1191 captures. Each group can have 3 encodings, but with each potential 1192 encoding having a progressively lower specification. In this 1193 example, 1080p60 transmission is possible (as ENC0 has a maxMbps 1194 value compatible with that) as long as it is the only active encoding 1195 in the group(as maxMbps for the entire encoding group is also 1196 489600). Significantly, as up to 3 encodings are available per 1197 group, it is possible to transmit some video captures simultaneously 1198 that are not in the same entry in the capture scene. For example VC1 1199 and VC3 at the same time. 1201 It is also possible to transmit multiple capture encodings of a 1202 single video capture. For example VC0 can be encoded using ENC0 and 1203 ENC1 at the same time, as long as the encoding parameters satisfy the 1204 constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 1205 720p30. 
1207 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1208    encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1209    maxH264Mbps=489600, maxBandwidth=4000000
1210    encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1211    maxH264Mbps=108000, maxBandwidth=4000000
1212    encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1213    maxH264Mbps=61200, maxBandwidth=4000000

1215 encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1216    encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1217    maxH264Mbps=489600, maxBandwidth=4000000
1218    encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1219    maxH264Mbps=108000, maxBandwidth=4000000
1220    encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1221    maxH264Mbps=61200, maxBandwidth=4000000

1223 encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1224    encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1225    maxH264Mbps=489600, maxBandwidth=4000000
1226    encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1227    maxH264Mbps=108000, maxBandwidth=4000000
1228    encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1229    maxH264Mbps=61200, maxBandwidth=4000000

1231            Figure 2: Example Encoding Groups for Video

1233 For audio, there are five potential encodings available, so all five
1234 audio captures can be encoded at the same time.

1236 encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
1237    encodeID=ENC9, maxBandwidth=64000
1238    encodeID=ENC10, maxBandwidth=64000
1239    encodeID=ENC11, maxBandwidth=64000
1240    encodeID=ENC12, maxBandwidth=64000
1241    encodeID=ENC13, maxBandwidth=64000

1243            Figure 3: Example Encoding Group for Audio

1245 Capture Scenes:

1247 The following table represents the capture scenes for this provider.
1248 Recall that a capture scene is composed of alternative capture scene
1249 entries covering the same scene.  Capture Scene #1 is for the main
1250 people captures, and Capture Scene #2 is for presentation.

1252 Each row in the table is a separate entry in the capture scene.

1254         +------------------+
1255         | Capture Scene #1 |
1256         +------------------+
1257         | VC0, VC1, VC2    |
1258         | VC3              |
1259         | VC4              |
1260         | VC5              |
1261         | AC0, AC1, AC2    |
1262         | AC3              |
1263         +------------------+

1265         +------------------+
1266         | Capture Scene #2 |
1267         +------------------+
1268         | VC6              |
1269         | AC4              |
1270         +------------------+

1272 Different capture scenes are unique to each other and non-overlapping.
1273 A consumer can choose an entry from each capture scene.  In this case
1274 the three captures VC0, VC1, and VC2 are one way of representing the
1275 video from the endpoint.  These three captures should appear adjacent
1276 to each other.  Alternatively, the capture scene can be represented
1277 with the single capture VC3, which automatically shows the person who
1278 is talking.  Similarly for the VC4 and VC5 alternatives.

1280 As in the video case, the different entries of audio in Capture Scene
1281 #1 represent the "same thing", in that one way to receive the audio
1282 is with the 3 audio captures (AC0, AC1, AC2), and another way is with
1283 the mixed AC3.  The Media Consumer can choose an audio capture entry
1284 it is capable of receiving.

1286 The spatial ordering is conveyed by the media capture attributes area
1287 of capture and point of capture.

1289 A Media Consumer would likely want to choose a capture scene entry to
1290 receive based in part on how many streams it can simultaneously
1291 receive.  A consumer that can receive three people streams would
1292 probably prefer to receive the first entry of Capture Scene #1 (VC0,
1293 VC1, VC2) and not receive the other entries.  A consumer that can
1294 receive only one people stream would probably choose one of the other
1295 entries.
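One simplistic way to make this choice, sketched below purely for
illustration (the Python entry lists and the max_streams parameter
are our own framing and are not part of the model), is to take the
entry with the largest number of captures that still fits within the
consumer's receive budget.

   # Illustrative only; not part of the CLUE model.  Video entries of
   # Capture Scene #1, as advertised above.
   scene1_video_entries = [
       ["VC0", "VC1", "VC2"],   # three spatially related camera captures
       ["VC3"],                 # switched view of the loudest panel
       ["VC4"],                 # switched view with PiPs
       ["VC5"],                 # zoomed out view of the whole room
   ]

   def choose_entry(entries, max_streams):
       """Pick the largest entry the consumer can receive in full."""
       usable = [e for e in entries if len(e) <= max_streams]
       return max(usable, key=len) if usable else None

   print(choose_entry(scene1_video_entries, 3))   # ['VC0', 'VC1', 'VC2']
   print(choose_entry(scene1_video_entries, 1))   # ['VC3'] (first 1-capture entry)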
1297 If the consumer can receive a presentation stream too, it would also
1298 choose to receive the only entry from Capture Scene #2 (VC6).

1300 11.2.  Encoding Group Example

1302 This is an example of an encoding group to illustrate how it can
1303 express dependencies between encodings.

1305 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1306    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1307    maxH264Mbps=244800, maxBandwidth=4000000
1308    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1309    maxH264Mbps=244800, maxBandwidth=4000000
1310    encodeID=AUDENC0, maxBandwidth=96000
1311    encodeID=AUDENC1, maxBandwidth=96000
1312    encodeID=AUDENC2, maxBandwidth=96000

1314 Here, the encoding group is EG0.  It can transmit up to two 1080p30
1315 capture encodings (the maxH264Mbps value for 1080p30 is 244800), but
1316 it is capable of transmitting a maxFrameRate of 60 frames per second
1317 (fps).  To achieve the maximum resolution (1920 x 1088) the frame
1318 rate is limited to 30 fps.  However, 60 fps can be achieved at a
1319 lower resolution if required by the consumer.  Although the encoding
1320 group is capable of transmitting up to 6 Mbit/s, no individual video
1321 encoding can exceed 4 Mbit/s.

1323 This encoding group also allows up to 3 audio encodings, AUDENC<0-2>.
1324 It is not required that audio and video encodings reside within the
1325 same encoding group, but if they do then the group's overall
1326 maxGroupBandwidth value is a limit on the sum of all audio and video
1327 encodings configured by the consumer.  A system that does not wish or
1328 need to combine bandwidth limitations in this way should instead use
1329 separate encoding groups for audio and video, so that the bandwidth
1330 limitations on audio and video do not interact.

1332 Audio and video can be expressed in separate encoding groups, as in
1333 this illustration:

1335 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1336    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1337    maxH264Mbps=244800, maxBandwidth=4000000
1338    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1339    maxH264Mbps=244800, maxBandwidth=4000000

1341 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
1342    encodeID=AUDENC0, maxBandwidth=96000
1343    encodeID=AUDENC1, maxBandwidth=96000
1344    encodeID=AUDENC2, maxBandwidth=96000
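To show how these limits interact when a consumer configures such a
group, the fragment below (illustrative only; the dictionary layout
and helper name are ours, not protocol elements) checks a proposed
pair of video capture encodings against the group-wide limits of EG0
from the first listing above.

   # Illustrative only; not part of the CLUE model.  Group-wide limits
   # of EG0 from the first listing in this section.
   MAX_GROUP_H264_MBPS = 489600
   MAX_GROUP_BANDWIDTH = 6000000

   def fits_group(encodings):
       """Check configured encodings against the group-wide limits."""
       total_mbps = sum(e["h264_mbps"] for e in encodings)
       total_bw = sum(e["bandwidth"] for e in encodings)
       return (total_mbps <= MAX_GROUP_H264_MBPS
               and total_bw <= MAX_GROUP_BANDWIDTH)

   # Two 1080p30 encodings at the per-encoding bandwidth ceiling exceed
   # the group bandwidth (8 Mbit/s > 6 Mbit/s) ...
   print(fits_group([{"h264_mbps": 244800, "bandwidth": 4000000},
                     {"h264_mbps": 244800, "bandwidth": 4000000}]))   # False

   # ... so the consumer would configure them lower, e.g. 3 Mbit/s each.
   print(fits_group([{"h264_mbps": 244800, "bandwidth": 3000000},
                     {"h264_mbps": 244800, "bandwidth": 3000000}]))   # True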
1346 11.3.  The MCU Case

1348 This section shows how an MCU might express its Capture Scenes,
1349 intending to offer different choices for consumers that can handle
1350 different numbers of streams.  A single audio capture stream is
1351 provided for all single and multi-screen configurations that can be
1352 associated (e.g., lip-synced) with any combination of video captures
1353 at the consumer.

1355   +--------------------+---------------------------------------------+
1356   | Capture Scene #1   | note                                        |
1357   +--------------------+---------------------------------------------+
1358   | VC0                | video capture for single screen consumer    |
1359   | VC1, VC2           | video capture for 2 screen consumer         |
1360   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1361   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1362   | AC0                | audio capture representing all participants |
1363   +--------------------+---------------------------------------------+

1365 If/when a presentation stream becomes active within the conference,
1366 the MCU might re-advertise the available media as:

1368      +------------------+--------------------------------------+
1369      | Capture Scene #2 | note                                 |
1370      +------------------+--------------------------------------+
1371      | VC10             | video capture for presentation       |
1372      | AC1              | presentation audio to accompany VC10 |
1373      +------------------+--------------------------------------+

1375 11.4.  Media Consumer Behavior

1377 This section gives an example of how a media consumer might behave
1378 when deciding how to request streams from the three screen endpoint
1379 described in Section 11.1.

1381 The receive side of a call needs to balance its requirements, based
1382 on its number of screens and speakers, its decoding capabilities, and
1383 its available bandwidth, against the provider's capabilities in order
1384 to configure the provider's streams optimally.  Typically, it would
1385 want to receive and decode media from each capture scene advertised
1386 by the provider.

1388 A sane, basic algorithm might be for the consumer to go through each
1389 capture scene in turn and find the collection of video captures that
1390 best matches the number of screens it has (this might include
1391 consideration of screens dedicated to presentation video display
1392 rather than "people" video) and then decide between alternative
1393 entries in the video capture scenes based either on hard-coded
1394 preferences or user choice.  Once this choice has been made, the
1395 consumer would then decide how to configure the provider's encoding
1396 groups in order to make best use of the available network bandwidth
1397 and its own decoding capabilities.

1399 11.4.1.  One screen consumer

1401 VC3, VC4 and VC5 are all different entries by themselves, not grouped
1402 together in a single entry, so the receiving device should choose one
1403 of them.  The choice would come down to whether to see the greatest
1404 number of participants simultaneously at roughly equal precedence
1405 (VC5), a switched view of just the loudest region (VC3), or a
1406 switched view with PiPs (VC4).  An endpoint device with even a small
1407 amount of knowledge of these differences could offer a dynamic,
1408 in-call choice of these options to the user.

1410 11.4.2.  Two screen consumer configuring the example

1412 Mixing systems with an even number of screens ("2n") and systems with
1413 an odd number of cameras ("2n+1"), and vice versa, is always likely
1414 to be the problematic case.  In this instance, the behavior is likely
1415 to be determined by whether a "2 screen" system is really a "2
1416 decoder" system, i.e., whether only one received stream can be
1417 displayed per screen or whether more than 2 streams can be received
1418 and spread across the available screen area.  To enumerate 3 possible
1419 behaviors for the 2 screen system when it learns that the far end is
1420 "ideally" expressed via 3 capture streams:
1422 1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
1423     per the 1 screen consumer case above) and either leave one screen
1424     blank or use it for presentation if/when a presentation becomes
1425     active.

1427 2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
1428     screens, either with each capture being scaled to 2/3 of a screen
1429     and the center capture being split across the 2 screens, or, as
1430     would be necessary if there were large bezels on the screens,
1431     with each stream being scaled to 1/2 the screen width and height,
1432     leaving a 4th "blank" panel.  This 4th panel could potentially be
1433     used for any presentation that became active during the call.

1435 3.  Receive 3 streams, decode all 3, and use control information
1436     indicating which is the most active to switch between showing
1437     the left and center streams (one per screen) and the center and
1438     right streams.

1440 For an endpoint capable of all 3 methods of working described above,
1441 it might again be appropriate to offer the user the choice of display
1442 mode.

1444 11.4.3.  Three screen consumer configuring the example

1446 This is the most straightforward case - the consumer would look to
1447 identify a set of streams to receive that best matches its available
1448 screens, and so VC0 plus VC1 plus VC2 should match optimally.  The
1449 spatial ordering would give sufficient information for the correct
1450 video capture to be shown on the correct screen, and the consumer
1451 would need either to divide a single encoding group's capability by 3
1452 to determine what resolution and frame rate to configure the provider
1453 with, or to configure the individual video captures' encoding groups
1454 with whatever makes most sense (taking into account the receive side
1455 decode capabilities, overall call bandwidth, and the resolution of
1456 the screens, plus any user preferences such as motion vs. sharpness).

1458 12.  Acknowledgements

1460 Mark Gorzyinski contributed much to the approach.  We want to thank
1461 Stephen Botzko for helpful discussions on audio.

1463 13.  IANA Considerations

1465 TBD

1467 14.  Security Considerations

1469 TBD

1471 15.  Changes Since Last Version

1473 NOTE TO THE RFC-Editor: Please remove this section prior to
1474 publication as an RFC.

1476 Changes from 06 to 07:

1478 1.  Ticket #9.  Rename Axis of Capture Point attribute to Point on
1479     Line of Capture.  Clarify the description of this attribute.

1481 2.  Ticket #17.  Add "capture encoding" definition.  Use this new
1482     term throughout the document as appropriate, replacing some usage
1483     of the terms "stream" and "encoding".

1485 3.  Ticket #18.  Add Max Capture Encodings media capture attribute.

1487 4.  Add clarification that different capture scene entries are not
1488     necessarily mutually exclusive.

1490 Changes from 05 to 06:

1492 1.  Capture scene description attribute is a list of text strings,
1493     each in a different language, rather than just a single string.

1495 2.  Add new Axis of Capture Point attribute.

1497 3.  Remove appendices A.1 through A.6.

1499 4.  Clarify that the provider must use the same coordinate system
1500     with the same scale and origin for all coordinates within the
1501     same capture scene.

1503 Changes from 04 to 05:

1505 1.  Clarify limitations of "composed" attribute.

1507 2.  Add new section "capture scene entry attributes" and add the
1508     attribute "scene-switch-policy".

1510 3.  Add capture scene description attribute and description language
1511     attribute.
1513 4.  Editorial changes to examples section for consistency with the
1514     rest of the document.

1516 Changes from 03 to 04:

1518 1.  Remove sentence from overview - "This constitutes a significant
1519     change ..."

1521 2.  Clarify a consumer can choose a subset of captures from a
1522     capture scene entry or a simultaneous set (in section "capture
1523     scene" and "consumer's choice...").

1525 3.  Reword first paragraph of Media Capture Attributes section.

1527 4.  Clarify a stereo audio capture is different from two mono audio
1528     captures (description of audio channel format attribute).

1530 5.  Clarify what it means when coordinate information is not
1531     specified for area of capture, point of capture, area of scene.

1533 6.  Change the term "producer" to "provider" to be consistent (it
1534     was just in two places).

1536 7.  Change name of "purpose" attribute to "content" and refer to
1537     RFC4796 for values.

1539 8.  Clarify simultaneous sets are part of a provider advertisement,
1540     and apply across all capture scenes in the advertisement.

1542 9.  Remove sentence about lip-sync between all media captures in a
1543     capture scene.

1545 10. Combine the concepts of "capture scene" and "capture set" into a
1546     single concept, using the term "capture scene" to replace the
1547     previous term "capture set", and eliminating the original
1548     separate capture scene concept.

1550 16.  Informative References

1552 [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1553            Requirement Levels", BCP 14, RFC 2119, March 1997.

1555 [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
1556            A., Peterson, J., Sparks, R., Handley, M., and E.
1557            Schooler, "SIP: Session Initiation Protocol", RFC 3261,
1558            June 2002.

1560 [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1561            Jacobson, "RTP: A Transport Protocol for Real-Time
1562            Applications", STD 64, RFC 3550, July 2003.

1564 [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
1565            Session Initiation Protocol (SIP)", RFC 4353,
1566            February 2006.

1568 [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
1569            Protocol (SDP) Content Attribute", RFC 4796,
1570            February 2007.

1572 [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
1573            January 2008.

1575 [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
1576            Languages", BCP 47, RFC 5646, September 2009.

1578 [IANA-Lan]
1579            IANA, "Language Subtag Registry",
1580            .

1583 Authors' Addresses

1585 Allyn Romanow
1586 Cisco Systems
1587 San Jose, CA 95134
1588 USA

1590 Email: allyn@cisco.com

1592 Mark Duckworth (editor)
1593 Polycom
1594 Andover, MA 01810
1595 USA

1597 Email: mark.duckworth@polycom.com

1599 Andrew Pepperell
1600 Silverflare
1601 Uxbridge, England
1602 UK

1604 Email: apeppere@gmail.com

1606 Brian Baldino
1607 Cisco Systems
1608 San Jose, CA 95134
1609 USA

1611 Email: bbaldino@cisco.com