CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                         M. Duckworth, Ed.
Expires: January 7, 2013                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            July 6, 2012

                Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-06.txt

Abstract

   This memo offers a framework for a protocol that enables devices in a telepresence conference to interoperate by specifying the relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Overview of the Framework/Model
   5.  Spatial Relationships
   6.  Media Captures and Capture Scenes
     6.1.  Media Captures
       6.1.1.  Media Capture Attributes
     6.2.  Capture Scene
       6.2.1.  Capture scene attributes
       6.2.2.  Capture scene entry attributes
     6.3.  Simultaneous Transmission Set Constraints
   7.  Encodings
     7.1.  Individual Encodings
     7.2.  Encoding Group
   8.  Associating Media Captures with Encoding Groups
   9.  Consumer's Choice of Streams to Receive from the Provider
     9.1.  Local preference
     9.2.  Physical simultaneity restrictions
     9.3.  Encoding and encoding group limits
     9.4.  Message Flow
   10.  Extensibility
   11.  Examples - Using the Framework
     11.1.  Three screen endpoint media provider
     11.2.  Encoding Group Example
     11.3.  The MCU Case
     11.4.  Media Consumer Behavior
       11.4.1.  One screen consumer
       11.4.2.  Two screen consumer configuring the example
       11.4.3.  Three screen consumer configuring the example
   12.  Acknowledgements
   13.  IANA Considerations
   14.  Security Considerations
   15.  Changes Since Last Version
   16.  Informative References
   Authors' Addresses

1. Introduction

   Current telepresence systems, though based on open standards such as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each other.  A major factor limiting the interoperability of telepresence systems is the lack of a standardized way to describe and negotiate the use of the multiple streams of audio and video comprising the media flows.  This draft provides a framework for a protocol to enable interoperability by handling multiple streams in a standardized way.  It is intended to support the use cases described in draft-ietf-clue-telepresence-use-cases-02 and to meet the requirements in draft-ietf-clue-telepresence-requirements-01.

   The solution described here is strongly focused on what is being done today, rather than on a vision of future conferencing.  At the same time, the highest priority has been given to creating an extensible framework to make it easy to accommodate future conferencing functionality as it evolves.
   The purpose of this effort is to make it possible to handle multiple streams of media in such a way that a satisfactory user experience is possible even when participants are using different vendor equipment, and also when they are using devices with different types of communication capabilities.  Information about the relationship of media streams at the provider's end must be communicated so that streams can be chosen and audio/video rendering can be done in the best possible manner.

   There is no attempt here to dictate to the renderer what it should do.  What the renderer does is up to the renderer.

   After the following Definitions, a short section introduces key concepts.  The body of the text comprises several sections about the key elements of the framework, how a consumer chooses streams to receive, and some examples.  The appendix describes topics that are under discussion for adding to the document.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

   The definitions marked with an "*" are new; all the others are from

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media.  They are the opposite of stage-left and stage-right.

   Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder.  Cameras and microphones are examples of capture devices.

   *Capture Scene: a structure representing the scene that is captured by a collection of capture devices.  A capture scene includes attributes and one or more capture scene entries, with each entry including one or more media captures.

   *Capture Scene Entry: a list of media captures of the same media type that together form one way to represent the capture scene.

   Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding.  The attributes include maximum bandwidth and, for video, maximum macroblocks per second (for H.264), maximum width, maximum height, and maximum frame rate.

   *Encoding Group: A set of encoding parameters representing a media provider's encoding capabilities.  Media stream providers formed of multiple physical units, in each of which resides some encoding capability, would typically advertise themselves to the remote media stream consumer using multiple encoding groups.  Within each encoding group, multiple potential encodings are possible, with the sum of the chosen encodings' characteristics constrained to being less than or equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving, decoding and rendering, and/or initiation through capturing, encoding, and sending of media streams.  An endpoint consists of one or more physical devices which source and sink media streams, and exactly one [RFC4353] Participant (which, in turn, includes exactly one SIP User Agent).
   In contrast to an endpoint, an MCU may also send and receive media streams, but it is neither the initiator nor the final terminator in the sense that Media is Captured or Rendered.  Endpoints can be anything from multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras.  In going towards the back you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or more endpoints together into one single multimedia conference [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt.  RFC4353 is tardy in requiring that media from the mixer be sent to EACH participant.  I think we have practical use cases where this is not the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture Devices.  A Media Capture (MC) may be the source of one or more Media streams.  A Media Capture may also be constructed from other Media streams.  A middle box can express Media Captures that it constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media streams.

   *Media Provider: an Endpoint or middle box that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor adheres to and expects the remote telepresence system(s) also to adhere to.

   *Plane of Interest: The spatial plane containing the most relevant subject matter.

   Render: the process of generating a representation from media, such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in contrast to relation in time or other relationships.  See also Camera-Left and Right.

   Stage-Left and Right: For media captures, stage-left and stage-right are the opposite of camera-left and camera-right.  For the case of a person facing (and captured by) a camera, stage-left and stage-right are from the point of view of that person.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: the media stream attributes commonly used in non-CLUE SIP/SDP environments (such as media codec, bit rate, resolution, profile/level, etc.) as well as CLUE-specific attributes, such as the ID of a capture or a spatial location.

   Telepresence: an environment that gives non-co-located users or user groups a feeling of (co-located) presence - the feeling that a Local user is in the same room with other Local users and the Remote parties.  The inclusion of Remote parties is achieved through multimedia communication including at least audio and video signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining visual elements from separate sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be handled in a telepresence conference.
255 The main goals include: 257 o Interoperability 259 o Extensibility 261 o Flexibility 263 Interoperability is achieved by the media provider describing the 264 relationships between media streams in constructs that are understood 265 by the consumer, who can then render the media. Extensibility is 266 achieved through abstractions and the generality of the model, making 267 it easy to add new parameters. Flexibility is achieved largely by 268 having the consumer choose what content and format it wants to 269 receive from what the provider is capable of sending. 271 A transmitting endpoint or MCU describes specific aspects of the 272 content of the media and the formatting of the media streams it can 273 send (advertisement); and the receiving end responds to the provider 274 by specifying which content and media streams it wants to receive 275 (configuration). The provider then transmits the asked for content 276 in the specified streams. 278 This advertisement and configuration occurs at call initiation but 279 may also happen at any time throughout the conference, whenever there 280 is a change in what the consumer wants or the provider can send. 282 An endpoint or MCU typically acts as both provider and consumer at 283 the same time, sending advertisements and sending configurations in 284 response to receiving advertisements. (It is possible to be just one 285 or the other.) 287 The data model is based around two main concepts: a capture and an 288 encoding. A media capture (MC), such as audio or video, describes 289 the content a provider can send. Media captures are described in 290 terms of CLUE-defined attributes, such as spatial relationships and 291 purpose of the capture. Providers tell consumers which media 292 captures they can provide, described in terms of the media capture 293 attributes. 295 A provider organizes its media captures that represent the same scene 296 into capture scenes. A consumer chooses which media captures it 297 wants to receive according to the capture scenes sent by the 298 provider. 300 In addition, the provider sends the consumer a description of the 301 streams it can send in terms of the media attributes of the stream, 302 in particular, well-known audio and video parameters such as 303 bandwidth, frame rate, macroblocks per second. 305 The provider also specifies constraints on its ability to provide 306 media, and the consumer must take these into account in choosing the 307 content and streams it wants. Some constraints are due to the 308 physical limitations of devices - for example, a camera may not be 309 able to provide zoom and non-zoom views simultaneously. Other 310 constraints are system based constraints, such as maximum bandwidth 311 and maximum macroblocks/second. 313 The following sections discuss these constructs and processes in 314 detail, followed by use cases showing how the framework specification 315 can be used. 317 5. Spatial Relationships 319 In order for a consumer to perform a proper rendering, it is often 320 necessary to provide spatial information about the streams it is 321 receiving. CLUE defines a coordinate system that allows media 322 providers to describe the spatial relationships of their media 323 captures to enable proper scaling and spatial rendering of their 324 streams. The coordinate system is based on a few principles: 326 o Simple systems which do not have multiple Media Captures to 327 associate spatially need not use the coordinate model. 
329 o Coordinates can either be in real, physical units (millimeters), 330 have an unknown scale or have no physical scale. Systems which 331 know their physical dimensions should always provide those real- 332 world measurements. Systems which don't know specific physical 333 dimensions but still know relative distances should use 'unknown 334 scale'. 'No scale' is intended to be used where Media Captures 335 from different devices (with potentially different scales) will be 336 forwarded alongside one another (e.g. in the case of a middle 337 box). 339 * "millimeters" means the scale is in millimeters 341 * "Unknown" means the scale is not necessarily millimeters, but 342 the scale is the same for every capture in the capture scene. 344 * "No Scale" means the scale could be different for each capture 345 - an MCU provider that advertises two adjacent captures and 346 picks sources (which can change quickly) from different 347 endpoints might use this value; the scale could be different 348 and changing for each capture. But the areas of capture still 349 represent a spatial relation between captures. 351 o The coordinate system is Cartesian X, Y, Z with the origin at a 352 spot of the provider's choosing. The provider must use the same 353 coordinate system with same scale and origin for all coordinates 354 within the same capture scene. 356 The direction of increasing coordinate values is: 357 X increases from camera left to camera right 358 Y increases from front to back 359 Z increases from low to high 361 6. Media Captures and Capture Scenes 363 This section describes how media providers can describe the content 364 of media to consumers. 366 6.1. Media Captures 368 Media captures are the fundamental representations of streams that a 369 device can transmit. What a Media Capture actually represents is 370 flexible: 372 o It can represent the immediate output of a physical source (e.g. 373 camera, microphone) or 'synthetic' source (e.g. laptop computer, 374 DVD player). 376 o It can represent the output of an audio mixer or video composer 378 o It can represent a concept such as 'the loudest speaker' 380 o It can represent a conceptual position such as 'the leftmost 381 stream' 383 To distinguish between multiple instances, video and audio captures 384 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 385 two different video captures and AC1 and AC2 refer to two different 386 audio captures. 388 Each Media Capture can be associated with attributes to describe what 389 it represents. 391 6.1.1. Media Capture Attributes 393 Media Capture Attributes describe static information about the 394 captures. A provider uses the media capture attributes to describe 395 the media captures to the consumer. The consumer will select the 396 captures it wants to receive. Attributes are defined by a variable 397 and its value. The currently defined attributes and their values 398 are: 400 Content: {slides, speaker, sl, main, alt} 402 A field with enumerated values which describes the role of the media 403 capture and can be applied to any media type. The enumerated values 404 are defined by [RFC4796]. The values for this attribute are the same 405 as the mediacnt values for the content attribute in [RFC4796]. This 406 attribute can have multiple values, for example content={main, 407 speaker}. 409 Composed: {true, false} 411 A field with a Boolean value which indicates whether or not the Media 412 Capture is a mix (audio) or composition (video) of streams. 
414 This attribute is useful for a media consumer to avoid nesting a 415 composed video capture into another composed capture or rendering. 416 This attribute is not intended to describe the layout a media 417 provider uses when composing video streams. 419 Audio Channel Format: {mono, stereo} A field with enumerated values 420 which describes the method of encoding used for audio. 422 A value of 'mono' means the Audio Capture has one channel. 424 A value of 'stereo' means the Audio Capture has two audio channels, 425 left and right. 427 This attribute applies only to Audio Captures. A single stereo 428 capture is different from two mono captures that have a left-right 429 spatial relationship. A stereo capture maps to a single RTP stream, 430 while each mono audio capture maps to a separate RTP stream. 432 Switched: {true, false} 434 A field with a Boolean value which indicates whether or not the Media 435 Capture represents the (dynamic) most appropriate subset of a 436 'whole'. What is 'most appropriate' is up to the provider and could 437 be the active speaker, a lecturer or a VIP. 439 Point of Capture: {(X, Y, Z)} 441 A field with a single Cartesian (X, Y, Z) point value which describes 442 the spatial location, virtual or physical, of the capturing device 443 (such as camera). 445 When the Point of Capture attribute is specified, it must include X, 446 Y and Z coordinates. If the point of capture is not specified, it 447 means the consumer should not assume anything about the spatial 448 location of the capturing device. Even if the provider specifies an 449 area of capture attribute, it does not need to specify the point of 450 capture. 452 Axis of Capture Point: {(X, Y, Z)} 454 A field with a single Cartesian (X, Y, Z) point value (virtual or 455 physical) which describes a position in space of a second point on 456 the axis of capture of the capturing device; the first point being 457 the Point of Capture (see above). 459 The axis of capture point MUST NOT be specified if the Point of 460 Capture is not present for this capture device. When the Axis of 461 Capture Point attribute is specified, it must include X, Y and Z 462 coordinates. These coordinates MUST NOT be identical to the Point of 463 Capture coordinates. If the Axis of Capture point is not specified, 464 it means the consumer should not assume anything about the axis of 465 Capture of the capturing device. 467 Area of Capture: 469 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 470 Z3), top right(X4, Y4, Z4)} 472 A field with a set of four (X, Y, Z) points as a value which describe 473 the spatial location of what is being "captured". By comparing the 474 Area of Capture for different Media Captures within the same capture 475 scene a consumer can determine the spatial relationships between them 476 and render them correctly. 478 The four points should be co-planar. The four points form a 479 quadrilateral, not necessarily a rectangle. 481 The quadrilateral described by the four (X, Y, Z) points defines the 482 plane of interest for the particular media capture. 484 If the area of capture attribute is specified, it must include X, Y 485 and Z coordinates for all four points. If the area of capture is not 486 specified, it means the media capture is not spatially related to any 487 other media capture (but this can change in a subsequent provider 488 advertisement). 
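   To illustrate how a consumer might make use of these spatial attributes, the following sketch (in Python, purely illustrative; the class, field, and function names are hypothetical and CLUE does not define any concrete syntax) records the point of capture and area of capture for a media capture and orders spatially related captures from camera-left to camera-right by comparing the X coordinates of their areas of capture, since X increases from camera-left to camera-right in the coordinate system of Section 5.

   # Illustrative only: CLUE does not define a concrete syntax; the
   # class, field, and function names below are hypothetical.
   from dataclasses import dataclass
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # (X, Y, Z); X grows camera-left to camera-right

   @dataclass
   class MediaCapture:
       name: str                                    # e.g. "VC0"
       point_of_capture: Optional[Point] = None     # position of the capturing device
       area_of_capture: Optional[Tuple[Point, Point, Point, Point]] = None
       # corner order: bottom left, bottom right, top left, top right

   def camera_left_to_right(captures: List[MediaCapture]) -> List[MediaCapture]:
       """Order spatially related captures by the X midpoint of their
       areas of capture (same capture scene and same scale assumed)."""
       def mid_x(capture: MediaCapture) -> float:
           return sum(corner[0] for corner in capture.area_of_capture) / 4.0
       return sorted((c for c in captures if c.area_of_capture is not None), key=mid_x)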
490 For a switched capture that switches between different sections 491 within a larger area, the area of capture should use coordinates for 492 the larger potential area. 494 EncodingGroup: {} 496 A field with a value equal to the encodeGroupID of the encoding group 497 associated with the media capture. 499 6.2. Capture Scene 501 In order for a provider's individual media captures to be used 502 effectively by a consumer, the provider organizes the media captures 503 into capture scenes, with the structure and contents of these capture 504 scenes being sent from the provider to the consumer. 506 A capture scene is a structure representing the scene that is 507 captured by a collection of capture devices. A capture scene 508 includes one or more capture scene entries, with each entry including 509 one or more media captures. A capture scene represents, for example, 510 the video image of a group of people seated next to each other, along 511 with the sound of their voices, which could be represented by some 512 number of VCs and ACs in the capture scene entries. A middle box may 513 also express capture scenes that it constructs from media streams it 514 receives. 516 A provider may advertise multiple capture scenes or just a single 517 capture scene. A media provider might typically use one capture 518 scene for main participant media and another capture scene for a 519 computer generated presentation. A capture scene may include more 520 than one type of media. For example, a capture scene can include 521 several capture scene entries for video captures, and several capture 522 scene entries for audio captures. 524 A provider can express spatial relationships between media captures 525 that are included in the same capture scene. But there is no spatial 526 relationship between media captures that are in different capture 527 scenes. 529 A media provider arranges media captures in a capture scene to help 530 the media consumer choose which captures it wants. The capture scene 531 entries in a capture scene are different alternatives the provider is 532 suggesting for representing the capture scene. The media consumer 533 can choose to receive all media captures from one capture scene entry 534 for each media type (e.g. audio and video), or it can pick and choose 535 media captures regardless of how the provider arranges them in 536 capture scene entries. 538 Media captures within the same capture scene entry must be of the 539 same media type - it is not possible to mix audio and video captures 540 in the same capture scene entry, for instance. The provider must be 541 capable of encoding and sending all media captures in a single entry 542 simultaneously. A consumer may decide to receive all the media 543 captures in a single capture scene entry, but a consumer could also 544 decide to receive just a subset of those captures. A consumer can 545 also decide to receive media captures from different capture scene 546 entries. 548 When a provider advertises a capture scene with multiple entries, it 549 is essentially signaling that there are multiple representations of 550 the same scene available. In some cases, these multiple 551 representations would typically be used simultaneously (for instance 552 a "video entry" and an "audio entry"). In some cases the entries 553 would conceptually be alternatives (for instance an entry consisting 554 of 3 video captures versus an entry consisting of just a single video 555 capture). 
In this latter example, the provider would in the simple 556 case end up providing to the consumer the entry containing the number 557 of video captures that most closely matched the media consumer's 558 number of display devices. 560 The following is an example of 4 potential capture scene entries for 561 an endpoint-style media provider: 563 1. (VC0, VC1, VC2) - left, center and right camera video captures 565 2. (VC3) - video capture associated with loudest room segment 567 3. (VC4) - video capture zoomed out view of all people in the room 569 4. (AC0) - main audio 571 The first entry in this capture scene example is a list of video 572 captures with a spatial relationship to each other. Determination of 573 the order of these captures (VC0, VC1 and VC2) for rendering purposes 574 is accomplished through use of their Area of Capture attributes. The 575 second entry (VC3) and the third entry (VC4) are additional 576 alternatives of how to capture the same room in different ways. The 577 inclusion of the audio capture in the same capture scene indicates 578 that AC0 is associated with those video captures, meaning it comes 579 from the same scene. The audio should be rendered in conjunction 580 with any rendered video captures from the same capture scene. 582 6.2.1. Capture scene attributes 584 Attributes can be applied to capture scenes as well as to individual 585 media captures. Attributes specified at this level apply to all 586 constituent media captures. 588 Description attribute - list of {, } 590 The optional description attribute is a list of human readable text 591 strings which describe the capture scene. If there is more than one 592 string in the list, then each string in the list should contain the 593 same description, but in a different language. A provider that 594 advertises multiple capture scenes can provide descriptions for each 595 of them. This attribute can contain text in any number of languages. 597 The language tag identifies the language of the corresponding 598 description text. The possible values for a language tag are the 599 values of the 'Subtag' column for the "Type: language" entries in the 600 "Language Subtag Registry" at [IANA-Lan] originally defined in 601 [RFC5646]. A particular language tag value MUST NOT be used more 602 than once in the description attribute list. 604 Area of Scene attribute 606 The area of scene attribute for a capture scene has the same format 607 as the area of capture attribute for a media capture. The area of 608 scene is for the entire scene, which is captured by the one or more 609 media captures in the capture scene entries. If the provider does 610 not specify the area of scene, but does specify areas of capture, 611 then the consumer may assume the area of scene is greater than or 612 equal to the outer extents of the individual areas of capture. 614 Scale attribute 616 An optional attribute indicating if the numbers used for area of 617 scene, area of capture and point of capture are in terms of 618 millimeters, unknown scale factor, or not any scale, as described in 619 Section 5. If any media captures have an area of capture attribute 620 or point of capture attribute, then this scale attribute must also be 621 defined. The possible values for this attribute are: 623 "millimeters" 624 "unknown" 625 "no scale" 627 6.2.2. Capture scene entry attributes 629 Attributes can be applied to capture scene entries. Attributes 630 specified at this level apply to the capture scene entry as a whole. 
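   Pulling the pieces of Sections 6.2 through 6.2.2 together, the following sketch (Python, purely illustrative; the structure and field names are hypothetical and do not imply any particular CLUE message encoding) shows one possible in-memory representation of a capture scene with scene-level attributes, plus entries that group captures of a single media type and can carry their own entry-level attributes.  It reuses a few of the entries from the endpoint example above.

   # Illustrative only: the structure and names are hypothetical and do
   # not imply any particular CLUE message encoding.
   from dataclasses import dataclass, field
   from typing import Dict, List

   @dataclass
   class CaptureSceneEntry:
       media_type: str                       # "audio" or "video"
       captures: List[str]                   # e.g. ["VC0", "VC1", "VC2"]
       attributes: Dict[str, object] = field(default_factory=dict)  # entry-level attributes

   @dataclass
   class CaptureScene:
       attributes: Dict[str, object]         # scene-level: description, area of scene, scale
       entries: List[CaptureSceneEntry]

   main_scene = CaptureScene(
       attributes={"description": [("en", "main participant media")],
                   "scale": "millimeters"},
       entries=[
           CaptureSceneEntry("video", ["VC0", "VC1", "VC2"]),   # left, center, right cameras
           CaptureSceneEntry("video", ["VC3"]),                 # loudest room segment
           CaptureSceneEntry("audio", ["AC0"]),                 # main audio
       ],
   )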
632 Scene-switch-policy: {site-switch, segment-switch} 634 A media provider uses this scene-switch-policy attribute to indicate 635 its support for different switching policies. In the provider's 636 advertisement, this attribute can have multiple values, which means 637 the provider supports each of the indicated policies. The consumer, 638 when it requests media captures from this capture scene entry, should 639 also include this attribute but with only the single value (from 640 among the values indicated by the provider) indicating the consumer's 641 choice for which policy it wants the provider to use. If the 642 provider does not support any of these policies, it should omit this 643 attribute. 645 The "site-switch" policy means all captures are switched at the same 646 time to keep captures from the same endpoint site together. Let's 647 say the speaker is at site A and everyone else is at a "remote" site. 648 When the room at site A shown, all the camera images from site A are 649 forwarded to the remote sites. Therefore at each receiving remote 650 site, all the screens display camera images from site A. This can be 651 used to preserve full size image display, and also provide full 652 visual context of the displayed far end, site A. In site switching, 653 there is a fixed relation between the cameras in each room and the 654 displays in remote rooms. The room or participants being shown is 655 switched from time to time based on who is speaking or by manual 656 control. 658 The "segment-switch" policy means different captures can switch at 659 different times, and can be coming from different endpoints. Still 660 using site A as where the speaker is, and "remote" to refer to all 661 the other sites, in segment switching, rather than sending all the 662 images from site A, only the image containing the speaker at site A 663 is shown. The camera images of the current speaker and previous 664 speakers (if any) are forwarded to the other sites in the conference. 665 Therefore the screens in each site are usually displaying images from 666 different remote sites - the current speaker at site A and the 667 previous ones. This strategy can be used to preserve full size image 668 display, and also capture the non-verbal communication between the 669 speakers. In segment switching, the display depends on the activity 670 in the remote rooms - generally, but not necessarily based on audio / 671 speech detection. 673 6.3. Simultaneous Transmission Set Constraints 675 The provider may have constraints or limitations on its ability to 676 send media captures. One type is caused by the physical limitations 677 of capture mechanisms; these constraints are represented by a 678 simultaneous transmission set. The second type of limitation 679 reflects the encoding resources available - bandwidth and 680 macroblocks/second. This type of constraint is captured by encoding 681 groups, discussed below. 683 An endpoint or MCU can send multiple captures simultaneously, however 684 sometimes there are constraints that limit which captures can be sent 685 simultaneously with other captures. A device may not be able to be 686 used in different ways at the same time. Provider advertisements are 687 made so that the consumer will choose one of several possible 688 mutually exclusive usages of the device. This type of constraint is 689 expressed in a Simultaneous Transmission Set, which lists all the 690 media captures that can be sent at the same time. This is easier to 691 show in an example. 
   Consider the example of a room system where there are 3 cameras, each of which can send a separate capture covering 2 persons: VC0, VC1, VC2.  The middle camera can also zoom out and show all 6 persons, VC3.  But the middle camera cannot be used in both modes at the same time - it has to either show the space where 2 participants sit or the whole 6 seats, but not both at the same time.

   Simultaneous transmission sets are expressed as sets of the MCs that could physically be transmitted at the same time (though it may not make sense to do so).  In this example the two simultaneous sets are shown in Table 1.  The consumer must make sure that it chooses one and not more of the mutually exclusive sets.  A consumer may choose any subset of the media captures in a simultaneous set; it does not have to choose all the captures in a simultaneous set if it does not want to receive all of them.

                 +-------------------+
                 | Simultaneous Sets |
                 +-------------------+
                 |  {VC0, VC1, VC2}  |
                 |  {VC0, VC3, VC2}  |
                 +-------------------+

             Table 1: Two Simultaneous Transmission Sets

   A media provider includes the simultaneous sets in its provider advertisement.  These simultaneous set constraints apply across all the capture scenes in the advertisement.  The simultaneous transmission sets MUST allow all the media captures in a particular capture scene entry to be used simultaneously.

7. Encodings

   We have considered how providers can describe the content of media to consumers.  We will now consider how the providers communicate information about their abilities to send streams.  We introduce two constructs - individual encodings and encoding groups.  Consumers will then map the media captures they want onto the encodings with the encoding parameters they want.  This process is then described.

7.1. Individual Encodings

   An individual encoding represents a way to encode a media capture to become an encoded media stream sent from the media provider to the media consumer.  An individual encoding has a set of parameters characterizing how the media is encoded.  Different media types have different parameters, and different encoding algorithms may have different parameters.  An individual encoding can be used for only one actual encoded media stream at a time.

   The parameters of an individual encoding represent the maximum values for certain aspects of the encoding.  A particular instantiation into an encoded stream might use lower values than these maximums.

   The following tables show the variables for audio and video encoding.
750 +--------------+----------------------------------------------------+ 751 | Name | Description | 752 +--------------+----------------------------------------------------+ 753 | encodeID | A unique identifier for the individual encoding | 754 | maxBandwidth | Maximum number of bits per second | 755 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 756 | | + 15) / 16) * ((height + 15) / 16) * | 757 | | framesPerSecond | 758 | maxWidth | Video resolution's maximum supported width, | 759 | | expressed in pixels | 760 | maxHeight | Video resolution's maximum supported height, | 761 | | expressed in pixels | 762 | maxFrameRate | Maximum supported frame rate | 763 +--------------+----------------------------------------------------+ 765 Table 2: Individual Video Encoding Parameters 767 +--------------+-----------------------------------+ 768 | Name | Description | 769 +--------------+-----------------------------------+ 770 | maxBandwidth | Maximum number of bits per second | 771 +--------------+-----------------------------------+ 773 Table 3: Individual Audio Encoding Parameters 775 7.2. Encoding Group 777 An encoding group includes a set of one or more individual encodings, 778 plus some parameters that apply to the group as a whole. By grouping 779 multiple individual encodings together, an encoding group describes 780 additional constraints on bandwidth and other parameters for the 781 group. Table 4 shows the parameters and individual encoding sets 782 that are part of an encoding group. 784 +-------------------+-----------------------------------------------+ 785 | Name | Description | 786 +-------------------+-----------------------------------------------+ 787 | encodeGroupID | A unique identifier for the encoding group | 788 | maxGroupBandwidth | Maximum number of bits per second relating to | 789 | | all encodings combined | 790 | maxGroupH264Mbps | Maximum number of macroblocks per second | 791 | | relating to all video encodings combined | 792 | videoEncodings[] | Set of potential encodings (list of | 793 | | encodeIDs) | 794 | audioEncodings[] | Set of potential encodings (list of | 795 | | encodeIDs) | 796 +-------------------+-----------------------------------------------+ 797 Table 4: Encoding Group 799 When the individual encodings in a group are instantiated into actual 800 encoded media streams, each stream has a bandwidth that must be less 801 than or equal to the maxBandwidth for the particular individual 802 encoding. The maxGroupBandwidth parameter gives the additional 803 restriction that the sum of all the individual instantiated 804 bandwidths must be less than or equal to the maxGroupBandwidth value. 806 Likewise, the sum of the macroblocks per second of each instantiated 807 encoding in the group must not exceed the maxGroupH264Mbps value. 809 The following diagram illustrates the structure of a media provider's 810 Encoding Groups and their contents. 812 ,-------------------------------------------------. 813 | Media Provider | 814 | | 815 | ,--------------------------------------. | 816 | | ,--------------------------------------. | 817 | | | ,--------------------------------------. | 818 | | | | Encoding Group | | 819 | | | | ,-----------. | | 820 | | | | | | ,---------. 
| | 821 | | | | | | | | ,---------.| | 822 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 823 | `.| | | | | | `---------'| | 824 | `.| `-----------' `---------' | | 825 | `--------------------------------------' | 826 `-------------------------------------------------' 828 Figure 1: Encoding Group Structure 830 A media provider advertises one or more encoding groups. Each 831 encoding group includes one or more individual encodings. Each 832 individual encoding can represent a different way of encoding media. 833 For example one individual encoding may be 1080p60 video, another 834 could be 720p30, with a third being CIF. 836 While a typical 3 codec/display system might have one encoding group 837 per "codec box", there are many possibilities for the number of 838 encoding groups a provider may be able to offer and for the encoding 839 values in each encoding group. 841 There is no requirement for all encodings within an encoding group to 842 be instantiated at once. 844 8. Associating Media Captures with Encoding Groups 846 Every media capture is associated with an encoding group, which is 847 used to instantiate that media capture into one or more encoded 848 streams. Each media capture has an encoding group attribute. The 849 value of this attribute is the encodeGroupID for the encoding group 850 with which it is associated. More than one media capture may use the 851 same encoding group. 853 The maximum number of streams that can result from a particular 854 encoding group constraint is equal to the number of individual 855 encodings in the group. The actual number of streams used at any 856 time may be less than this maximum. Any of the media captures that 857 use a particular encoding group can be encoded according to any of 858 the individual encodings in the group. If there are multiple 859 individual encodings in the group, then a single media capture can be 860 encoded into multiple different streams at the same time, with each 861 stream following the constraints of a different individual encoding. 863 The Encoding Groups MUST allow all the media captures in a particular 864 capture scene entry to be used simultaneously. 866 9. Consumer's Choice of Streams to Receive from the Provider 868 After receiving the provider's advertised media captures and 869 associated constraints, the consumer must choose which media captures 870 it wishes to receive, and which individual encodings from the 871 provider it wants to use to encode the capture. Each media capture 872 has an encoding group ID attribute which specifies which individual 873 encodings are available to be used for that media capture. 875 For each media capture the consumer wants to receive, it configures 876 one or more of the encodings in that capture's encoding group. The 877 consumer does this by telling the provider the resolution, frame 878 rate, bandwidth, etc. when asking for streams for its chosen 879 captures. Upon receipt of this configuration command from the 880 consumer, the provider generates streams for each such configured 881 encoding and sends those streams to the consumer. 883 The consumer must have received at least one capture advertisement 884 from the provider to be able to configure the provider's generation 885 of media streams. 887 The consumer is able to change its configuration of the provider's 888 encodings any number of times during the call, either in response to 889 a new capture advertisement from the provider or autonomously. 
   The consumer need not send a new configure message to the provider when it receives a new capture advertisement from the provider unless the contents of the new capture advertisement cause the consumer's current configure message to become invalid.

   When choosing which streams to receive from the provider, and the encoding characteristics of those streams, the consumer needs to take several things into account: its local preferences, simultaneity restrictions, and encoding limits.

9.1. Local preference

   A variety of local factors will influence the consumer's choice of streams to be received from the provider:

   o  if the consumer is an endpoint, it is likely that it would choose, where possible, to receive video and audio captures that match the number of display devices and audio system it has

   o  if the consumer is a middle box such as an MCU, it may choose to receive loudest speaker streams (in order to perform its own media composition) and avoid pre-composed video captures

   o  user choice (for instance, selection of a new layout) may result in a different set of media captures, or different encoding characteristics, being required by the consumer

9.2. Physical simultaneity restrictions

   There may be physical simultaneity constraints imposed by the provider that affect the provider's ability to simultaneously send all of the captures the consumer would wish to receive.  For instance, a middle box such as an MCU, when connected to a multi-camera room system, might prefer to receive both individual camera streams of the people present in the room and an overall view of the room from a single camera.  Some endpoint systems might be able to provide both of these sets of streams simultaneously, whereas others may not (if the overall room view were produced by changing the zoom level on the center camera, for instance).

9.3. Encoding and encoding group limits

   Each of the provider's encoding groups has limits on bandwidth and macroblocks per second, and the constituent potential encodings have limits on the bandwidth, macroblocks per second, video frame rate, and resolution that can be provided.  When choosing the media captures to be received from a provider, a consumer device must ensure that the encoding characteristics requested for each individual media capture fit within the capability of the encoding it is being configured to use, as well as ensuring that the combined encoding characteristics for media captures fit within the capabilities of their associated encoding groups.  In some cases, this could cause an otherwise "preferred" choice of streams to be passed over in favour of different streams - for instance, if a set of 3 media captures could only be provided at a low resolution then a 3 screen device could switch to favoring a single, higher quality, stream.

9.4. Message Flow

   The following diagram shows the basic flow of messages between a media provider and a media consumer.  The usage of the "capture advertisement" and "configure encodings" messages is described above.  The consumer also sends its own capability message to the provider which may contain information about its own capabilities or restrictions.
   Diagram for Message Flow

      Media Consumer                           Media Provider
      --------------                           --------------
            |                                        |
            |-------- Consumer Capability ---------->|
            |                                        |
            |<------- Capture advertisement ---------|
            |                                        |
            |-------- Configure encodings ---------->|
            |                                        |

   In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, there is a potential use for the consumer, at the start of CLUE, to be able to inform the provider of its capabilities.  One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand and so the provider would be able to reduce the capture scene it advertises to be tailored to the consumer.

   TBD - the content of the consumer capability message needs to be better defined.  The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

   One of the most important characteristics of the Framework is its extensibility.  Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop.  The standard for interoperability and handling multiple streams must be future-proof.

   The framework itself is inherently extensible through expanding the data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

   o  Adding new functionality, such as 3-D, will require additional attributes describing the captures.

   o  Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

   The infrastructure is designed to be extended rather than requiring new infrastructure elements.  Extension comes through adding to defined types.

   Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

   This section shows, in more detail, some examples of how to use the framework to represent a typical case for telepresence rooms.  First an endpoint is illustrated, then an MCU case is shown.

11.1. Three screen endpoint media provider

   Consider an endpoint with the following description:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other 2 captures shown picture in picture within the stream can be provided

   o  A capture showing a zoomed out view of all 6 seats in the room can be provided

   The audio and video captures for this endpoint can be described as follows.
1039 Video Captures: 1041 o VC0- (the camera-left camera stream), encoding group=EG0, 1042 content=main, switched=false 1044 o VC1- (the center camera stream), encoding group=EG1, content=main, 1045 switched=false 1047 o VC2- (the camera-right camera stream), encoding group=EG2, 1048 content=main, switched=false 1050 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1051 switched=true 1053 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1054 content=main, composed=true, switched=true 1056 o VC5- (the zoomed out view of all people in the room), encoding 1057 group=EG1, content=main, composed=false, switched=false 1059 o VC6- (presentation stream), encoding group=EG1, content=slides, 1060 switched=false 1062 The following diagram is a top view of the room with 3 cameras, 3 1063 displays, and 6 seats. Each camera is capturing 2 people. The six 1064 seats are not all in a straight line. 1066 ,-. d 1067 ( )`--.__ +---+ 1068 `-' / `--.__ | | 1069 ,-. | `-.._ |_-+Camera 2 (VC2) 1070 ( ).' ___..-+-''`+-+ 1071 `-' |_...---'' | | 1072 ,-.c+-..__ +---+ 1073 ( )| ``--..__ | | 1074 `-' | ``+-..|_-+Camera 1 (VC1) 1075 ,-. | __..--'|+-+ 1076 ( )| __..--' | | 1077 `-'b|..--' +---+ 1078 ,-. |``---..___ | | 1079 ( )\ ```--..._|_-+Camera 0 (VC0) 1080 `-' \ _..-''`-+ 1081 ,-. \ __.--'' | | 1082 ( ) |..-'' +---+ 1083 `-' a 1085 The two points labeled b and c are intended to be at the midpoint 1086 between the seating positions, and where the fields of view of the 1087 cameras intersect. 1088 The plane of interest for VC0 is a vertical plane that intersects 1089 points 'a' and 'b'. 1090 The plane of interest for VC1 intersects points 'b' and 'c'. 1091 The plane of interest for VC2 intersects points 'c' and 'd'. 1092 This example uses an area scale of millimeters. 1094 Areas of capture: 1095 bottom left bottom right top left top right 1096 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1097 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1098 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1099 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1100 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1101 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1102 VC6 none 1104 Points of capture: 1105 VC0 (-1678,0,800) 1106 VC1 (0,0,800) 1107 VC2 (1678,0,800) 1108 VC3 none 1109 VC4 none 1110 VC5 (0,0,800) 1111 VC6 none 1113 In this example, the right edge of the VC0 area lines up with the 1114 left edge of the VC1 area. It doesn't have to be this way. There 1115 could be a gap or an overlap. One additional thing to note for this 1116 example is the distance from a to b is equal to the distance from b 1117 to c and the distance from c to d. All these distances are 1346 mm. 1118 This is the planar width of each area of capture for VC0, VC1, and 1119 VC2. 1121 Note the text in parentheses (e.g. "the camera-left camera stream") 1122 is not explicitly part of the model, it is just explanatory text for 1123 this example, and is not included in the model with the media 1124 captures and attributes. Also, the "composed" boolean attribute 1125 doesn't say anything about how a capture is composed, so the media 1126 consumer can't tell based on this attribute that VC4 is composed of a 1127 "loudest panel with PiPs". 
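   The planar widths quoted above can be checked directly from the advertised coordinates.  The following fragment (Python, purely illustrative) computes the width of the bottom edge of the areas of capture for VC0 and VC1 from the example values, confirming the 1346 mm figure.

   # Check the stated 1346 mm planar width from the example coordinates.
   import math

   vc0_width = math.dist((-2011, 2850, 0), (-673, 3000, 0))   # bottom edge of VC0
   vc1_width = math.dist(( -673, 3000, 0), ( 673, 3000, 0))   # bottom edge of VC1

   print(round(vc0_width), round(vc1_width))   # -> 1346 1346 (millimeters)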
   Audio Captures:

   o  AC0 (camera-left), encoding group=EG3, content=main, channel format=mono

   o  AC1 (camera-right), encoding group=EG3, content=main, channel format=mono

   o  AC2 (center) encoding group=EG3, content=main, channel format=mono

   o  AC3 being a simple pre-mixed audio stream from the room (mono), encoding group=EG3, content=main, channel format=mono

   o  AC4 audio stream associated with the presentation video (mono) encoding group=EG3, content=slides, channel format=mono

   Areas of capture:
          bottom left       bottom right      top left          top right
     AC0  (-2011,2850,0)    ( -673,3000,0)    (-2011,2850,757)  ( -673,3000,757)
     AC1  (  673,3000,0)    ( 2011,2850,0)    (  673,3000,757)  ( 2011,3000,757)
     AC2  ( -673,3000,0)    (  673,3000,0)    ( -673,3000,757)  (  673,3000,757)
     AC3  (-2011,2850,0)    ( 2011,2850,0)    (-2011,2850,757)  ( 2011,3000,757)
     AC4  none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates it is not possible to use all the VCs at the same time.  VC5 cannot be used at the same time as VC1 or VC3 or VC4.  Also, using every member in the set simultaneously may not make sense - for example VC3 (loudest) and VC4 (loudest with PiPs).  (In addition, there are encoding constraints that make choosing all of the VCs in a set impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1 has only 3 ENCs.  This constraint shows up in the encoding groups, not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures can be sent simultaneously.

   Encoding Groups:

   This example has three encoding groups associated with the video captures.  Each group can have 3 encodings, but with each potential encoding having a progressively lower specification.  In this example, 1080p60 transmission is possible (as ENC0 has a maxH264Mbps value compatible with that) as long as it is the only active encoding in the group (as maxGroupH264Mbps for the entire encoding group is also 489600).  Significantly, as up to 3 encodings are available per group, it is possible to transmit some video captures simultaneously that are not in the same entry in the capture scene.  For example VC1 and VC3 at the same time.

   It is also possible to transmit multiple encodings of a single video capture.  For example VC0 can be encoded using ENC0 and ENC1 at the same time, as long as the encoding parameters satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30.
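   To make the arithmetic behind these statements concrete, the following sketch (Python, purely illustrative; not a normative algorithm) applies the macroblocks-per-second formula from Table 2 and the group-level limits from Section 7.2 to check whether encoding VC0 as both 1080p30 and 720p30 would fit within EG0, whose values appear in Figure 2 below.

   # Would encoding VC0 as 1080p30 (via ENC0) plus 720p30 (via ENC1)
   # fit within encoding group EG0?  Limits are taken from Figure 2;
   # the 2 Mbit/s figure for the second stream is an arbitrary choice
   # within ENC1's 4 Mbit/s per-encoding limit.

   def h264_mbps(width: int, height: int, fps: int) -> int:
       # macroblocks per second, per the formula in Table 2
       return ((width + 15) // 16) * ((height + 15) // 16) * fps

   MAX_GROUP_H264_MBPS = 489600      # EG0: maxGroupH264Mbps
   MAX_GROUP_BANDWIDTH = 6000000     # EG0: maxGroupBandwidth, bits per second

   streams = [
       ("ENC0", h264_mbps(1920, 1088, 30), 4000000),   # 244800 macroblocks/s
       ("ENC1", h264_mbps(1280,  720, 30), 2000000),   # 108000 macroblocks/s
   ]

   fits = (sum(mbps for _, mbps, _ in streams) <= MAX_GROUP_H264_MBPS and
           sum(bw for _, _, bw in streams) <= MAX_GROUP_BANDWIDTH)
   print(fits)   # True: 352800 <= 489600 and 6000000 <= 6000000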
   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

              Figure 2: Example Encoding Groups for Video

   For audio, there are five potential encodings available, so all five audio captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
      encodeID=ENC9, maxBandwidth=64000
      encodeID=ENC10, maxBandwidth=64000
      encodeID=ENC11, maxBandwidth=64000
      encodeID=ENC12, maxBandwidth=64000
      encodeID=ENC13, maxBandwidth=64000

              Figure 3: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the capture scenes for this provider.  Recall that a capture scene is composed of alternative capture scene entries covering the same scene.  Capture Scene #1 is for the main people captures, and Capture Scene #2 is for presentation.

   Each row in the table is a separate entry in the capture scene.

           +------------------+
           | Capture Scene #1 |
           +------------------+
           | VC0, VC1, VC2    |
           | VC3              |
           | VC4              |
           | VC5              |
           | AC0, AC1, AC2    |
           | AC3              |
           +------------------+

           +------------------+
           | Capture Scene #2 |
           +------------------+
           | VC6              |
           | AC4              |
           +------------------+

   Different capture scenes are distinct from each other and non-overlapping.  A consumer can choose an entry from each capture scene.  In this case the three captures VC0, VC1, and VC2 are one way of representing the video from the endpoint.  These three captures should appear adjacent to each other.  Alternatively, another way of representing the Capture Scene is with the capture VC3, which automatically shows the person who is talking.  Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different entries of audio in Capture Scene #1 represent the "same thing", in that one way to receive the audio is with the 3 audio captures (AC0, AC1, AC2), and another way is with the mixed AC3.  The Media Consumer can choose an audio capture entry it is capable of receiving.

   The spatial ordering is conveyed by the media capture attributes area of capture and point of capture.

   A Media Consumer would likely want to choose a capture scene entry to receive based in part on how many streams it can simultaneously receive.
   A Media Consumer would likely want to choose a capture scene entry
   to receive based in part on how many streams it can simultaneously
   receive.  A consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1 (VC0,
   VC1, VC2) and not receive the other entries.  A consumer that can
   receive only one people stream would probably choose one of the
   other entries.

   If the consumer can also receive a presentation stream, it would
   choose to receive the only entry from Capture Scene #2 (VC6).

11.2. Encoding Group Example

   This is an example of an encoding group to illustrate how it can
   express dependencies between encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the H.264 macroblock rate for 1080p30 is 244800), and it
   is capable of transmitting a maxFrameRate of 60 frames per second
   (fps).  To achieve the maximum resolution (1920 x 1088) the frame
   rate is limited to 30 fps.  However, 60 fps can be achieved at a
   lower resolution if required by the consumer.  Although the encoding
   group is capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do, the group's
   overall maxBandwidth value is a limit on the sum of all audio and
   video encodings configured by the consumer.  A system that does not
   wish or need to combine bandwidth limitations in this way should
   instead use separate encoding groups for audio and video, so that
   the bandwidth limits on audio and video do not interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000
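   To make the bandwidth interaction concrete, the following sketch
   (illustrative only; the group and per-encoding limits come from the
   two listings above, while the configured bit rates are hypothetical)
   compares the combined group with the split groups.

   <CODE BEGINS>
   # Illustrative sketch only: how the group bandwidth limit behaves
   # when audio and video share one encoding group versus when they
   # are split into separate groups.

   def total_fits(max_group_bandwidth, configured_bandwidths):
       return sum(configured_bandwidths) <= max_group_bandwidth

   video = [2952000, 2952000]      # two configured video encodings, bit/s
   audio = [96000, 96000, 96000]   # three configured audio encodings, bit/s

   # Combined group (first listing): audio and video draw on the same
   # 6,000,000 bit/s budget, so the audio pushes the total over.
   print(total_fits(6000000, video + audio))   # False

   # Split groups (second listing): each medium is checked only against
   # its own group limit, so the same configuration is acceptable.
   print(total_fits(6000000, video))           # True
   print(total_fits(500000, audio))            # True
   <CODE ENDS>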
11.3. The MCU Case

   This section shows how an MCU might express its capture scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single- and multi-screen configurations; it can be
   associated (e.g. lip-synced) with any combination of video captures
   at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer    |
   | VC1, VC2           | video capture for 2 screen consumer         |
   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If and when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   +------------------+--------------------------------------+

11.4. Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three screen endpoint
   described above.

   The receive side of a call needs to balance its requirements (based
   on its number of screens and speakers, its decoding capabilities,
   and the available bandwidth) against the provider's capabilities in
   order to optimally configure the provider's streams.  Typically it
   would want to receive and decode media from each capture scene
   advertised by the provider.

   A sensible, basic algorithm might be for the consumer to go through
   each capture scene in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   entries in the video capture scenes based either on hard-coded
   preferences or on user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
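   A minimal sketch of this basic algorithm is shown below.  It is
   illustrative only and assumes the hypothetical scene representation
   used in the earlier sketch (repeated here so the example is self-
   contained); preference handling between equally sized entries is
   deliberately omitted.

   <CODE BEGINS>
   # Illustrative sketch only: pick, for each scene, the video entry
   # that best matches the consumer's screen count without exceeding it.

   capture_scenes = [
       {"entries": [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"],
                    ["AC0", "AC1", "AC2"], ["AC3"]]},  # Capture Scene #1
       {"entries": [["VC6"], ["AC4"]]},                # Capture Scene #2
   ]

   def choose_video_entries(scenes, num_screens):
       choices = []
       for scene in scenes:
           video_entries = [e for e in scene["entries"]
                            if all(c.startswith("VC") for c in e)]
           fitting = [e for e in video_entries if len(e) <= num_screens]
           if fitting:
               choices.append(max(fitting, key=len))       # largest that fits
           elif video_entries:
               choices.append(min(video_entries, key=len))  # fall back
       return choices

   print(choose_video_entries(capture_scenes, 3))
   # [['VC0', 'VC1', 'VC2'], ['VC6']]
   print(choose_video_entries(capture_scenes, 1))
   # [['VC3'], ['VC6']]  (choosing among VC3/VC4/VC5 would in practice
   #                      be a preference or user choice)
   <CODE ENDS>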
11.4.1. One screen consumer

   VC3, VC4 and VC5 are all in separate entries, not grouped together
   in a single entry, so the receiving device should choose one of
   them.  The choice comes down to whether to see the greatest number
   of participants simultaneously at roughly equal precedence (VC5), a
   switched view of just the loudest region (VC3), or a switched view
   with PIPs (VC4).  An endpoint device with some knowledge of these
   differences could offer a dynamic, in-call choice of these options
   to the user.

11.4.2. Two screen consumer configuring the example

   Mixing systems with an even number of screens ("2n") and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  Three possible behaviors for the
   2 screen system, when it learns that the far end is "ideally"
   expressed via 3 capture streams, are:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5, as
       per the 1 screen consumer case above) and either leave one
       screen blank or use it for presentation if and when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens, either with each capture scaled to 2/3 of a screen and
       the center capture split across the 2 screens, or, as would be
       necessary if there were large bezels on the screens, with each
       stream scaled to 1/2 the screen width and height and a 4th
       "blank" panel.  This 4th panel could potentially be used for any
       presentation that became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and center streams (one per screen) and the center and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   it might again be appropriate to offer the user the choice of
   display mode.

11.4.3. Three screen consumer configuring the example

   This is the most straightforward case: the consumer would look to
   identify a set of streams to receive that best matches its available
   screens, and so VC0 plus VC1 plus VC2 should match optimally.  The
   spatial ordering gives sufficient information for the correct video
   capture to be shown on the correct screen.  The consumer would then
   either divide a single encoding group's capability by 3 to determine
   what resolution and frame rate to configure the provider with, or
   configure the individual video captures' encoding groups with
   whatever makes most sense (taking into account the receive side
   decode capabilities, the overall call bandwidth, the resolution of
   the screens, and any user preferences such as motion versus
   sharpness).

12. Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

13. IANA Considerations

   TBD

14. Security Considerations

   TBD

15. Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 05 to 06:

   1.  Capture scene description attribute is a list of text strings,
       each in a different language, rather than just a single string.

   2.  Add new Axis of Capture Point attribute.

   3.  Remove appendices A.1 through A.6.

   4.  Clarify that the provider must use the same coordinate system,
       with the same scale and origin, for all coordinates within the
       same capture scene.

   Changes from 04 to 05:

   1.  Clarify limitations of "composed" attribute.

   2.  Add new section "capture scene entry attributes" and add the
       attribute "scene-switch-policy".

   3.  Add capture scene description attribute and description language
       attribute.

   4.  Editorial changes to examples section for consistency with the
       rest of the document.

   Changes from 03 to 04:

   1.  Remove sentence from overview - "This constitutes a significant
       change ..."

   2.  Clarify a consumer can choose a subset of captures from a
       capture scene entry or a simultaneous set (in sections "capture
       scene" and "consumer's choice...").

   3.  Reword first paragraph of Media Capture Attributes section.
   4.  Clarify a stereo audio capture is different from two mono audio
       captures (description of audio channel format attribute).

   5.  Clarify what it means when coordinate information is not
       specified for area of capture, point of capture, area of scene.

   6.  Change the term "producer" to "provider" to be consistent (it
       was just in two places).

   7.  Change name of "purpose" attribute to "content" and refer to
       RFC 4796 for values.

   8.  Clarify simultaneous sets are part of a provider advertisement,
       and apply across all capture scenes in the advertisement.

   9.  Remove sentence about lip-sync between all media captures in a
       capture scene.

   10. Combine the concepts of "capture scene" and "capture set" into a
       single concept, using the term "capture scene" to replace the
       previous term "capture set", and eliminating the original
       separate capture scene concept.

16. Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [IANA-Lan] IANA, "Language Subtag Registry",
              <http://www.iana.org/assignments/language-subtag-registry>.

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   USA

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Langley, England
   UK

   Email: apeppere@gmail.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: bbaldino@cisco.com