

2	CLUE WG                                                       A. Romanow
3	Internet-Draft                                             Cisco Systems
4	Intended status: Informational                              M. Duckworth
5	Expires: April 5, 2012                                           Polycom
6	                                                            A. Pepperell
7	                                                              B. Baldino
8	                                                           Cisco Systems
9	                                                         October 3, 2011

11	                Framework for Telepresence Multi-Streams
12	                  draft-romanow-clue-framework-01.txt

14	Abstract

16	   This memo offers a framework for a protocol that enables devices in a
17	   telepresence conference to interoperate by specifying the
18	   relationships between multiple RTP streams.

20	Status of this Memo

22	   This Internet-Draft is submitted in full conformance with the
23	   provisions of BCP 78 and BCP 79.

25	   Internet-Drafts are working documents of the Internet Engineering
26	   Task Force (IETF).  Note that other groups may also distribute
27	   working documents as Internet-Drafts.  The list of current Internet-
28	   Drafts is at http://datatracker.ietf.org/drafts/current/.

30	   Internet-Drafts are draft documents valid for a maximum of six months
31	   and may be updated, replaced, or obsoleted by other documents at any
32	   time.  It is inappropriate to use Internet-Drafts as reference
33	   material or to cite them other than as "work in progress."

35	   This Internet-Draft will expire on April 5, 2012.

37	Copyright Notice

39	   Copyright (c) 2011 IETF Trust and the persons identified as the
40	   document authors.  All rights reserved.

42	   This document is subject to BCP 78 and the IETF Trust's Legal
43	   Provisions Relating to IETF Documents
44	   (http://trustee.ietf.org/license-info) in effect on the date of
45	   publication of this document.  Please review these documents
46	   carefully, as they describe your rights and restrictions with respect
47	   to this document.  Code Components extracted from this document must
48	   include Simplified BSD License text as described in Section 4.e of
49	   the Trust Legal Provisions and are provided without warranty as
50	   described in the Simplified BSD License.

52	Table of Contents

54	   1.  Introduction
55	   2.  Terminology
56	   3.  Definitions
57	   4.  Framework Features
58	   5.  Stream Information
59	     5.1.  Media capture -- Audio and Video
60	     5.2.  Attributes
61	       5.2.1.  Purpose
62	       5.2.2.  Audio mixed
63	       5.2.3.  Audio Channel Format
64	       5.2.4.  Area of capture
65	       5.2.5.  Point of capture
66	       5.2.6.  Area Scale Millimeters
67	       5.2.7.  Video composed
68	       5.2.8.  Auto-switched
69	     5.3.  Capture Set
70	   6.  Choosing Streams
71	     6.1.  Message Flow
72	       6.1.1.  Provider Capabilities Announcement
73	       6.1.2.  Consumer Capability Message
74	       6.1.3.  Consumer Configure Request
75	     6.2.  Physical Simultaneity
76	     6.3.  Encoding Groups
77	       6.3.1.  Encoding Group Structure
78	       6.3.2.  Individual Encodes
79	       6.3.3.  More on Encoding Groups
80	       6.3.4.  Examples of Encoding Groups
81	   7.  Using the Framework
82	     7.1.  The MCU Case
83	     7.2.  Media Consumer Behavior
84	       7.2.1.  One screen consumer
85	       7.2.2.  Two screen consumer configuring the example
86	       7.2.3.  Three screen consumer configuring the example
87	   8.  Acknowledgements
88	   9.  IANA Considerations
89	   10. Security Considerations
90	   11. Informative References
91	   Appendix A.  Open Issues
92	     A.1.  Video layout arrangements and centralized composition
93	     A.2.  Source is selectable
94	     A.3.  Media Source Selection
95	     A.4.  Endpoint requesting many streams from MCU
96	     A.5.  VAD (voice activity detection) tagging of audio streams
97	     A.6.  Private Information
98	   Authors' Addresses

100	1.  Introduction

102	   Current telepresence systems, though based on open standards such as
103	   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each
104	   other.  A major factor limiting the interoperability of telepresence
105	   systems is the lack of a standardized way to describe and negotiate
106	   the use of the multiple streams of audio and video comprising the
107	   media flows.  This draft provides a framework for a protocol to
108	   enable interoperability by handling multiple streams in a
109	   standardized way.  It is intended to support the use cases described
110	   in draft-ietf-clue-telepresence-use-cases-00 and to meet the
111	   requirements in draft-romanow-clue-requirements-xx.

113	   The solution described here is strongly focused on what is being done
114	   today, rather than on a vision of future conferencing.  At the same
115	   time, the highest priority has been given to creating an extensible
116	   framework to make it easy to accommodate future conferencing
117	   functionality as it evolves.

119	   The purpose of this effort is to make it possible to handle multiple
120	   streams of media in such a way that a satisfactory user experience is
121	   possible even when participants are on different vendor equipment and
122	   when they are using devices with different types of communication
123	   capabilities.  Information about the relationship of media streams
124	   must be communicated so that audio/video rendering can be done in the
125	   best possible manner.  In addition, it is necessary to choose which
126	   media streams are sent.

128	   There is no attempt here to dictate to the renderer what it should
129	   do.  What the renderer does is up to the renderer.

131	   After the following Definitions, two short sections introduce key
132	   concepts.  The body of the text comprises three sections that deal
133	   in turn with stream content, choosing streams, and an implementation
134	   example.  The media provider and media consumer behavior are
135	   described in separate sections as well.  Several appendices describe
136	   further details for using the framework.

138	2.  Terminology

140	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
141	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
142	   document are to be interpreted as described in RFC 2119 [RFC2119].

144	3.  Definitions

146	   The definitions marked with an "*" are new; all the others are from
147	   draft-wenger-clue-definitions-00-01.txt.

149	   *Audio Capture: Media Capture for audio.  Denoted as ACn.

151	   Capture Device: A device that converts audio and video input into an
152	   electrical signal, in most cases to be fed into a media encoder.
153	   Cameras and microphones are examples of capture devices.

155	   Capture Scene: the scene that is captured by a collection of Capture
156	   Devices.  A Capture Scene may be represented by more than one type of
157	   Media.  A Capture Scene may include more than one Media Capture of
158	   the same type.  An example of a Capture Scene is the video image of a
159	   group of people seated next to each other, along with the sound of
160	   their voices, which could be represented by some number of VCs and
161	   ACs.  A middle box may also express Capture Scenes that it constructs
162	   from Media streams it receives.

164	   A Capture Set includes Media Captures that all represent some aspect
165	   of the same Capture Scene.  The items (rows) in a Capture Set
166	   represent different alternatives for representing the same Capture
167	   Scene.

169	   Conference: used as defined in [RFC4353], A Framework for
170	   Conferencing within the Session Initiation Protocol (SIP).

172	   Individual Encode: A variable with a set of attributes that describes
173	   the maximum values of a single audio or video capture encoding.  The
174	   attributes include: maximum bandwidth and, for video, maximum
175	   macroblocks, maximum width, maximum height, maximum frame rate.
176	   [Edt. These are based on H.264.]

178	   *Encoding Group: A set of encoding parameters
179	   representing a device's complete encoding capabilities or a
180	   subdivision of them.  Media stream providers formed of multiple
181	   physical units, in each of which resides some encoding capability,
182	   would typically advertise themselves to the remote media stream
183	   consumer as being formed of multiple encoding groups.  Within each
184	   encoding group, multiple potential actual encodings are possible,
185	   with the sum of those encodings' characteristics constrained to being
186	   less than or equal to the group-wide constraints.
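
   The group-wide constraint in the Encoding Group definition can be
   illustrated with a minimal sketch.  The field names, and the choice
   of bandwidth and macroblocks as the summed characteristics, are
   assumptions made for illustration only; this excerpt does not fix
   which characteristics are summed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Encode:
    """One potential encoding; field names are illustrative assumptions."""
    bandwidth: int      # bits per second
    macroblocks: int    # macroblocks per second


@dataclass
class EncodingGroup:
    """Group-wide limits; the two limits shown are assumed for illustration."""
    max_bandwidth: int
    max_macroblocks: int


def fits_group(group: EncodingGroup, encodes: List[Encode]) -> bool:
    """Return True if the summed characteristics stay within the group limits."""
    total_bw = sum(e.bandwidth for e in encodes)
    total_mb = sum(e.macroblocks for e in encodes)
    return total_bw <= group.max_bandwidth and total_mb <= group.max_macroblocks


# Hypothetical numbers, chosen only to exercise both outcomes.
group = EncodingGroup(max_bandwidth=6_000_000, max_macroblocks=245_760)
ok = fits_group(group, [Encode(2_000_000, 108_000), Encode(2_000_000, 108_000)])
too_much = fits_group(group, [Encode(4_000_000, 108_000), Encode(4_000_000, 108_000)])
```

   In this sketch a consumer-side check would reject the second
   combination because the summed bandwidth exceeds the group limit.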

188	   Endpoint: The logical point of final termination through receiving,
189	   decoding and rendering, and/or initiation through capturing,
190	   encoding, and sending of media streams.  An endpoint consists of one
191	   or more physical devices which source and sink media streams, and
192	   exactly one [RFC4353] Participant (which, in turn, includes exactly
193	   one SIP User Agent).  In contrast to an endpoint, an MCU may also
194	   send and receive media streams, but it is not the initiator nor the
195	   final terminator in the sense that Media is Captured or Rendered.
196	   Endpoints can be anything from multiscreen/multicamera rooms to
197	   handheld devices.

199	   Endpoint Characteristics: include placement of Capture and Rendering
200	   Devices, capture/render angle, resolution of cameras and screens,
201	   spatial location and mixing parameters of microphones.  Endpoint
202	   characteristics are not specific to individual media streams sent by
203	   the endpoint.

205	   Left: For media captures, left and right is from the point of view of
206	   a person observing the rendered media.

208	   MCU: Multipoint Control Unit (MCU) - a device that connects two or
209	   more endpoints together into one single multimedia conference
210	   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is
211	   tardy in requiring that media from the mixer be sent to EACH
212	   participant.  I think we have practical use cases where this is not
213	   the case.  But the bug (if it is one) is in 4353 and not herein.]

215	   Media: Any data that, after suitable encoding, can be conveyed over
216	   RTP, including audio, video or timed text.

218	   *Media Capture: a source of Media, such as from one or more Capture
219	   Devices.  A Media Capture may be the source of one or more Media
220	   streams.  A Media Capture may also be constructed from other Media
221	   streams.  A middle box can express Media Captures that it constructs
222	   from Media streams it receives.

224	   *Media Consumer: an Endpoint or middle box that receives Media
225	   streams.

227	   *Media Provider: an Endpoint or middle box that sends Media streams.

229	   Model: a set of assumptions a telepresence system of a given vendor
230	   adheres to and expects the remote telepresence system(s) also to
231	   adhere to.

233	   Right: For media captures, left and right is from the point of view
234	   of a person observing the rendered media.

236	   Render: the process of generating a representation from a media, such
237	   as displayed motion video or sound emitted from loudspeakers.

239	   *Simultaneous Transmission Set: a set of media captures that can be
240	   transmitted simultaneously from a Media Provider.

242	   Spatial Relation: The arrangement in space of two objects, in
243	   contrast to relation in time or other relationships.  See also Left
244	   and Right.

246	   *Stream: RTP stream as in [RFC3550].

248	   Stream Characteristics: include media stream attributes commonly used
249	   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
250	   resolution, profile/level etc.) as well as CLUE specific attributes
251	   (which could include for example and depending on the solution found:
252	   the I-D or spatial location of a capture device a stream originates
253	   from).

255	   Telepresence: an environment that gives non co-located users or user
256	   groups a feeling of (co-located) presence - the feeling that a Local
257	   user is in the same room with other Local users and the Remote
258	   parties.  The inclusion of Remote parties is achieved through
259	   multimedia communication including at least audio and video signals
260	   of high fidelity.

262	   *Video Capture: Media Capture for video.  Denoted as VCn.

264	   Video composite: A single image that is formed from combining visual
265	   elements from separate sources.

267	4.  Framework Features

269	   Two key functions must be accomplished so that multiple media streams
270	   can be handled in a telepresence conference.  These are:

272	   o  How to choose which streams the provider should send to the
273	      consumer

275	   o  What information needs to be added to the streams to allow a
276	      rendering of the capture scene

278	   The framework/model we present here can be understood as specifying
279	   these two functions.

281	   Media stream providers and consumers are central to the framework.
282	   The provider's job is to advertise its capabilities (as described
283	   here) to the consumer, whose job it is to configure the provider's
284	   encoding capabilities as described below.  Providers and consumers
285	   can each send and receive information; that is, no party is
286	   exclusively a provider or exclusively a consumer, and every party
287	   has both a sending and a receiving part.  Most devices function as
288	   both a media provider and a media consumer.

290	   For two devices to communicate bidirectionally, with media flowing in
291	   both directions, both devices act as both a media provider and a
292	   media consumer.  The protocol exchange shown later in the "Choosing
293	   Streams" section happens twice independently between the two
294	   bidirectional devices.

296	   Both endpoints and MCUs, or more generally "middleboxes", can be
297	   media providers and consumers.

299	   Generally, the provider is capable of sending alternate captures of a
300	   capture scene.  These are described by the provider as capabilities
301	   and chosen by the consumer.

303	5.  Stream Information

305	   This section describes the structure for communicating information
306	   between providers and consumers.  The diagram below illustrates how
307	   the information to be communicated is organized.  Each construct
308	   illustrated in the diagram is discussed in the sections below.

310	   Diagram for Stream Content

312	                                 +---------------+
313	                                 |               |
314	                                 |  Capture Set  |
315	                                 |               |
316	                                 +-------+-------+
317	                              _..-'      |    ``-._
318	                          _.-'           |         ``-._
319	                      _.-'               |              ``-._
320	             +----------------+  +----------------+  +----------------+
321	             | Media Capture  |  | Media Capture  |  | Media Capture  |
322	             | Audio or Video |  | Audio or Video |  | Audio or Video |
323	             +----------------+  +----------------+  +----------------+
324	                .'     `.   `-..__
325	              .'         `.       ``-..__
326	          ,-----.       ,---------.      ``,----------.
327	        ,' Encode`.   ,'           `.    ,'Simultaneous`.
328	       (   Group   ) (  Attributes   )  (  Transmission  )
329	        `.       ,'   `.           ,'    `.   Sets     ,'
330	          `-----'       `---------'        `----------'

332	5.1.  Media capture -- Audio and Video

334	   A media capture, as defined in Section 3, is a fundamental concept
335	   of the model.  Media can be captured in different ways, for example
336	   by various arrangements of cameras and microphones.  The model uses
337	   the terms "video capture" (VC) and "audio capture" (AC) to refer to
338	   sources of media streams.  To distinguish between multiple instances,
339	   they are numbered; for example, VC1, VC2, and VC3 could refer to
340	   three different video captures which can be used simultaneously.

342	   Media captures are dynamic.  They can come and go in a conference -
343	   and their parameters can change.  A provider can advertise a new list
344	   of captures at any time.  Both the media provider and media consumer
345	   can send "their messages" (i.e., capture set advertisements, stream
346	   configurations) any number of times during a call, and the other end
347	   is always required to act on any new information received (e.g.,
348	   stopping streams it had previously configured that are no longer
349	   valid).

351	   A media capture can be a media source such as video from a specific
352	   camera, or it can be more conceptual such as a composite image from
353	   several cameras, or an automatic dynamically switched capture
354	   choosing from several cameras depending on who is talking or other
355	   factors.

357	   A media capture is described by Attributes and is associated with an
358	   Encode Group and a Physical Simultaneity Set.

360	   Audio and video captures are aggregated into Capture Sets as
361	   described below.

363	5.2.  Attributes

365	   Audio and video capture attributes describe information about streams
366	   and their relationships.  [Edt: We do not mean to duplicate SDP, if
367	   an SDP description can be used, great.]  The attributes of media
368	   captures refer to static aspects of those captures that can be used
369	   by the consumer for selecting the captures offered by the provider.

371	   The mechanism of Attributes makes the framework extensible.  Although
372	   we are defining some attributes now based on the most common use
373	   cases, new attributes can be added for new use cases as they arise.
374	   In general, the way to extend the solution to handle new features is
375	   by adding attributes and/or values.

377	   We describe attributes by variables and their values.  The current
378	   attributes are listed below and then described.  The variable is
379	   shown in parentheses, and the values follow after the colon:

381	   o  (Purpose): main, presentation

383	   o  (Audio mixed): true, false

385	   o  (Audio Channel Format): mono, stereo, tbd

387	   o  (Area of Capture): A set of 'Ranges' describing the relevant area
388	      being captured by a capture device

390	   o  (Point of Capture): A 'Point' describing the location of the
391	      capture device or pseudo-device

393	   o  (Area scale): true, false indicating if area numbers are in
394	      millimeters

396	   o  (Video composed): true, false

398	   o  (Auto-switched): true, false
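
   The attribute list above can be pictured as a simple record.  The
   following is a sketch only: the class name, field names, and the
   dictionary representation of ranges and points are assumptions, not
   part of the framework.  Attributes that do not apply to a capture
   are left as None.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Range = Tuple[float, float]   # (Begin, End) along one spatial dimension


@dataclass
class MediaCapture:
    """Hypothetical record holding the attributes listed in this section."""
    name: str                                            # e.g. "VC0" or "AC0"
    purpose: str = "main"                                # main, presentation
    audio_mixed: Optional[bool] = None                   # audio captures only
    audio_channel_format: Optional[str] = None           # mono, stereo, tbd
    area_of_capture: Optional[Dict[str, Range]] = None   # keyed by 'x', 'y', 'z'
    point_of_capture: Optional[Dict[str, float]] = None  # keyed by 'x', 'y', 'z'
    area_scale_mm: bool = False                          # true if numbers are millimeters
    video_composed: Optional[bool] = None                # video captures only
    auto_switched: Optional[bool] = None


# Illustrative values; the coordinate numbers are arbitrary.
vc0 = MediaCapture("VC0", area_of_capture={"x": (0, 33)}, video_composed=False)
ac0 = MediaCapture("AC0", audio_mixed=False, audio_channel_format="mono")
```

   New attributes would be added as further fields, which mirrors the
   extension mechanism described above.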

400	5.2.1.  Purpose

402	   A variable with enumerated values describing the purpose or role of
403	   the Media Capture.  It could be applied to any media type.  Possible
404	   values: main, presentation, others TBD.

406	   Main:

408	   The audio or video capture is of one or more people participating in
409	   a conference (or where they would be if they were there).  It is of
410	   part or all of the Capture Scene.

412	   Presentation:

414	   The stream provides a presentation, e.g., from a connected laptop or
415	   other input device.

417	5.2.2.  Audio mixed

419	   A Boolean variable to indicate whether the AC is a mix of other ACs
420	   or Streams.

422	5.2.3.  Audio Channel Format

424	   The "channel format" attribute of an Audio Capture indicates how the
425	   meaning of the channels is determined.  It is an enumerated variable
426	   describing the type of audio channel or channels in the Audio
427	   Capture.  The possible values of the "channel format" attribute are:

429	   o  mono

431	   o  stereo

433	   o  TBD - other possible future values (to potentially include other
434	      things like 3.0, 3.1, 5.1 surround sound and binaural)

436	   All ACs in the same row of a Capture Set MUST have the same value of
437	   the "channel format" attribute.
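
   The requirement that all ACs in a row share one channel format lends
   itself to a simple check.  The sketch below assumes a row is given
   as a list of channel format strings; the function name is
   hypothetical.

```python
from typing import List


def row_channel_formats_consistent(row_formats: List[str]) -> bool:
    """Return True if every audio capture in the row has the same channel format."""
    return len(set(row_formats)) <= 1


# A row of three mono captures is acceptable; mixing mono and stereo is not.
good_row = row_channel_formats_consistent(["mono", "mono", "mono"])
bad_row = row_channel_formats_consistent(["mono", "stereo"])
```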

439	   There can be multiple ACs of a particular type, or even different
440	   types.  These multiple ACs could each have an area of capture
441	   attribute to indicate they represent different areas of the capture
442	   scene.

444	   If there are multiple audio streams, they might be correlated (that
445	   is, someone talking might be heard in multiple captures from the same
446	   room).  Echo cancellation and stream synchronization in consumers
447	   should take this into account.

449	   Mono:

451	   An AC with channel format="mono" has one audio channel.

453	   Stereo:

455	   An AC with channel format = "stereo" has exactly two audio channels,
456	   left and right, as part of the same AC.  [Edt: should we mention RFC
457	   3551 here?  The channel format may be related to how Audio Captures
458	   are mapped to RTP streams.  This stereo is not the same as the effect
459	   produced from two mono ACs one from the left and one from the right.
460	   ]

462	5.2.4.  Area of capture

464	   The area_of_capture attribute is used to describe the relevant area
465	   that a media capture is "capturing".  By comparing the area of
466	   capture for different media captures, a consumer can determine the
467	   spatial relationships of the captures on the provider so that they
468	   can be rendered correctly.  The attribute consists of a set of
469	   'Ranges', one range for each spatial dimension, where each range has
470	   a Begin and End coordinate.  It is not necessary to fill out all of
471	   the dimensions if they are not relevant (e.g., if an endpoint's
472	   captures span only a single dimension, only the 'x' coordinate need
473	   be used).  There is no need to pre-define a possible range for this
474	   coordinate system; a device may choose what is most appropriate for
475	   describing its captures.  However, it is specified that as numbers
476	   move from lower to higher, the location is going from: left to right
477	   (in the case of the 'x' dimension), front to back (in the case of the
478	   'y' dimension), or low to high (in the case of the 'z' dimension).
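
   Because lower 'x' values are further left, a consumer can recover
   the left-to-right order of captures by sorting on the Begin
   coordinate of the 'x' range.  The sketch below assumes each area of
   capture is a dictionary of (Begin, End) pairs keyed by dimension;
   the coordinate values are arbitrary, since no range is pre-defined.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]   # (Begin, End)


def left_to_right(areas: Dict[str, Dict[str, Range]]) -> List[str]:
    """Order capture names by the Begin coordinate of their 'x' range."""
    return sorted(areas, key=lambda name: areas[name]["x"][0])


# Three captures advertised in arbitrary order, using only the 'x' dimension.
areas = {
    "VC2": {"x": (66, 99)},
    "VC0": {"x": (0, 33)},
    "VC1": {"x": (33, 66)},
}
order = left_to_right(areas)
```

   Here the resulting order is VC0, VC1, VC2, regardless of the order
   in which the captures were listed.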

480	5.2.5.  Point of capture

482	   The point_of_capture attribute can be used to describe the location
483	   of a capture device or pseudo-device.  If there are multiple captures
484	   which share the same 'area_of_capture' value, then it is useful to
485	   know the location from which they are capturing that area (e.g. a
486	   device which has multiview).  Point of capture is expressed as a
487	   single {x, y, z} coordinate where, as with area_of_capture, only the
488	   necessary dimensions need be expressed.

490	5.2.6.  Area Scale Millimeters

492	   An optional Boolean variable indicating if the numbers used for area
493	   of capture and point of capture are in terms of millimeters.  If this
494	   attribute is true, then the x,y,z numbers represent millimeters.  If
495	   this attribute is false, then there is no physical scale.  The
496	   default value is false.

498	5.2.7.  Video composed

500	   An optional Boolean variable indicating if the VC is constructed by
501	   composing multiple other video captures together.  (This could
502	   indicate for example a continuous presence view of multiple images in
503	   a grid, or a large image with smaller picture-in-picture images in
504	   it.)

506	   Note: this attribute is not intended to differentiate between
507	   different ways of composing images.  For possible extension of the
508	   framework, additional attributes could be defined to distinguish
509	   between different ways of composing images, with different video
510	   layout arrangements of composing multiple images into one.

512	5.2.8.  Auto-switched

514	   A Boolean variable that may be used for audio and/or video streams.
515	   In this case the offered AC or VC varies depending on some rule; it
516	   is auto-switched between possible VCs, or between possible ACs.  The
517	   most common example of this is sending the video capture associated
518	   with the "loudest" speaker according to an audio detection algorithm.

520	5.3.  Capture Set

522	   A capture set describes the alternative media streams that the
523	   provider offers to send to the consumer.  As shown in the content
524	   diagram above, the capture set is an aggregation of all audio and
525	   video captures for a particular scene that a provider is willing to
526	   send.

528	   A provider describes its ability to send alternative media streams in
529	   the capture set, which lists the media captures in rows, as shown
530	   below.  Each row of the capture set consists of either a single
531	   capture or a group of captures.  A group means the individual
532	   captures in the group are spatially related with the specific
533	   ordering of the captures described through the use of attributes.

535	   Here is an example of a simple capture set with three video captures
536	   and three audio captures:

538	      (VC0, VC1, VC2)

540	      (AC0, AC1, AC2)

542	   The three VCs together in a row indicate those captures are spatially
543	   related to each other.  Similarly for the 3 ACs in the second row.
544	   The ACs and VCs in the same capture set are spatially related to each
545	   other.

547	   Multiple Media Captures of the same media type are often spatially
548	   related to each other.  Typically multiple Video Captures should be
549	   rendered next to each other in a particular order, or multiple audio
550	   channels should be rendered to match different speakers in a
551	   particular way.  Also, media of different types are often associated
552	   with each other, for example a group of Video Captures can be
553	   associated with a group of Audio Captures meaning they should be
554	   rendered together.

556	   Media Captures of the same media type are associated with each other
557	   by grouping them together in a single row of a Capture Set. Media
558	   Captures of different media types are associated with each other by
559	   putting them in different rows of the same Capture Set.

561	   Since all captures have an area_of_capture associated with them, a
562	   consumer can determine the spatial relationships of captures by
563	   comparing the locations of their areas of capture with one another.

565	   Association between audio and video can be made by finding audio and
566	   video captures which share overlapping areas of capture.
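
   One possible way for a consumer to make this association is
   sketched below.  It assumes areas of capture are dictionaries of
   (Begin, End) pairs keyed by dimension, compares only the dimensions
   that both captures express, and treats ranges that merely touch at
   an endpoint as not overlapping.  These choices, and the function
   names, are assumptions of the sketch.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]   # (Begin, End)
Area = Dict[str, Range]       # keyed by 'x', 'y', 'z'


def ranges_overlap(a: Range, b: Range) -> bool:
    """True if two ranges share more than a single boundary point."""
    return a[0] < b[1] and b[0] < a[1]


def areas_overlap(a: Area, b: Area) -> bool:
    """True if the areas overlap in every dimension that both express."""
    shared = set(a) & set(b)
    return bool(shared) and all(ranges_overlap(a[d], b[d]) for d in shared)


def associate(video: Dict[str, Area], audio: Dict[str, Area]) -> Dict[str, List[str]]:
    """Map each video capture to the audio captures whose areas overlap it."""
    return {
        v: [a for a, a_area in audio.items() if areas_overlap(v_area, a_area)]
        for v, v_area in video.items()
    }


# Arbitrary coordinates: three cameras and two microphones along 'x'.
video = {"VC0": {"x": (0, 33)}, "VC1": {"x": (33, 66)}, "VC2": {"x": (66, 99)}}
audio = {"AC0": {"x": (0, 50)}, "AC1": {"x": (50, 99)}}
links = associate(video, audio)
```

   With these values the center camera is associated with both
   microphones, while each outer camera is associated with one.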

568	   The items (rows) in a capture set represent different alternatives
569	   for representing the same Capture Scene.  For example the following
570	   are alternative ways of capturing the same Capture Scene - two
571	   cameras each viewing half of a room, or one camera viewing the whole
572	   room, or one stream that automatically captures the person in the
573	   room who is currently speaking.  Each row of the Capture Set contains
574	   either a single media capture or one group of media captures.

576	   The following example shows a capture set for an endpoint media
577	   provider where:

579	   o  (VC0, VC1, VC2) - left camera capture, center camera capture,
580	      right camera capture

582	   o  (VC3) - capture associated with loudest speaker

584	   o  (VC4) - zoomed out view of all people in the room

586	   o  (AC0) - main audio

588	   The first item in this capture set example is a group of video
589	   captures with a spatial relationship to each other.  These are VC0,
590	   VC1, and VC2.  VC3 and VC4 are additional alternatives of how to
591	   capture the same room in different ways.  The audio capture is
592	   included in the same capture set to indicate AC0 is associated with
593	   those video captures, meaning the audio should be rendered along with
594	   the video in the same set.

596	   The idea is to have sets of captures that represent the same
597	   information ("information" in this context might be a set of people
598	   and their associated audio / video streams, or might be a
599	   presentation supplied by a laptop, perhaps with an accompanying audio
600	   commentary).  Spatial ordering of media captures is described through
601	   the use of attributes.

603	   A media consumer could choose one row of each media type (e.g., audio
604	   and video) from a capture set.  For example a three stream consumer
605	   could choose the first video row plus the audio row, while a single
606	   stream consumer could choose the second or third video row plus the
607	   audio row.  An MCU consumer might choose to receive multiple rows.

609	   The Simultaneous Transmission Sets and Encoding Groups discussed in
610	   the next section apply to the media captures listed in capture sets.
611	   The Simultaneous Transmission Sets and Encoding Groups MUST allow
612	   all the Media Captures in a particular row of the capture set to be
613	   used simultaneously.  However, media captures in different rows of
614	   the capture set might not be usable simultaneously.

617	6.  Choosing Streams

619	   This section describes the process of choosing which streams the
620	   provider sends to the consumer.  In order for appropriate streams to
621	   be sent from providers to consumers, certain characteristics of the
622	   multiple streams must be understood by both providers and consumers.
623	   Two separate aspects of streams suffice to describe the necessary
624	   information to be shared by providers and consumers.  The first
625	   aspect we call "physical simultaneity" and the other aspect we refer
626	   to as "encoding group".  These are described in the following
627	   sections, after the message flow is discussed.

629	6.1.  Message Flow

631	   The following diagram shows the flow of messages between a media
632	   provider and a media consumer.  The consumer may first send its own
633	   capability message to the provider, describing its capabilities or
634	   restrictions, in which case the provider might tailor its
635	   announcement to the consumer.  The provider then sends information
636	   about its capabilities (as specified in this section), after which
637	   the consumer chooses which streams it wants, a step we refer to as
638	   "configure".

640	   Diagram for Message Flow

642	    Media Consumer                         Media Provider
643	    --------------                         ------------
644	          |                                     |
645	          |----- Consumer Capability ---------->|
646	          |                                     |
647	          |                                     |
648	          |<---- Capabilities (announce) -------|
649	          |                                     |
650	          |                                     |
651	          |------ Configure (request) --------->|
652	          |                                     |

654	6.1.1.  Provider Capabilities Announcement

656	   The provider capabilities announce message includes:

658	   o  the list of captures and their attributes

660	   o  the list of capture sets

662	   o  the list of Simultaneous Transmission Sets

664	   o  the list of the encoding groups
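
   As an illustration only, the contents listed above could be modeled
   as a simple record.  The class and field names below are assumptions
   made for this sketch and are not CLUE syntax.  The helper checks that
   every capture named in a capture set or Simultaneous Transmission Set
   also appears in the list of captures.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderAnnouncement:
    """Hypothetical container for the four lists in an announce message."""
    captures: dict            # capture id -> attribute dict
    capture_sets: list        # each capture set is a list of rows (tuples of ids)
    simultaneous_sets: list   # each entry is a set of capture ids
    encoding_groups: dict = field(default_factory=dict)  # group id -> description

    def undefined_captures(self):
        """Return ids used in a set but missing from the capture list."""
        referenced = set()
        for capture_set in self.capture_sets:
            for row in capture_set:
                referenced.update(row)
        for sim_set in self.simultaneous_sets:
            referenced.update(sim_set)
        return referenced - set(self.captures)
```

   An empty result from undefined_captures() means the announcement is
   at least internally consistent in this respect.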

666	6.1.2.  Consumer Capability Message

668	   In order for a maximally-capable provider to be able to advertise a
669	   manageable number of video captures to a consumer, it is potentially
670	   useful for the consumer to be able to inform the provider of its
671	   capabilities at the start of CLUE.  One example here would be the
672	   video capture attribute set: a consumer could tell the provider the
673	   complete set of video capture attributes it is able to understand,
674	   and the provider could then tailor the capture set it advertises to
675	   that consumer.

677	   TBD - the content of this message needs to be better defined.  The
678	   authors believe there is a need for this message, but have not worked
679	   out the details yet.

681	6.1.3.  Consumer Configure Request

683	   After receiving a set of video capture information from a provider
684	   and making its choice of what media streams to receive based on the
685	   consumer's own capabilities and any provider-side simultaneity
686	   restrictions, the consumer needs to essentially configure the
687	   provider to transmit the chosen set.

689	   The expectation is that this message will enumerate each of the
690	   encoding groups and potential encoders within those groups that the
691	   consumer wishes to be active (this may well be a subset of the
692	   complete set available).  For each such encoder within an encoding
693	   group, the consumer would specify the video capture (i.e., VC<n> as
694	   described above) along with the specifics of the video encoding
695	   required, i.e. width, height, frame rate and bit rate.  At this
696	   stage, the consumer would also provide RTP demultiplexing information
697	   as required to distinguish each stream from the others being
698	   configured by the same mechanism.
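
   The per-encoder content of such a request might look like the record
   sketched below.  This is a sketch under assumptions: the message
   format is not yet defined, so every field name is hypothetical, and
   "demux_id" merely stands in for whatever RTP demultiplexing
   information is eventually chosen.  The helper flags an encoder that
   has been configured more than once in the same request.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical per-encoder entry of a configure request."""
    encoding_group: str   # e.g. "EG0"
    encoder: str          # e.g. "ENC0"
    capture: str          # e.g. "VC0"
    width: int            # pixels
    height: int           # pixels
    frame_rate: int       # frames per second
    bit_rate: int         # bits per second
    demux_id: int         # placeholder for RTP demultiplexing information


def duplicate_encoders(request):
    """Return (group, encoder) pairs that appear more than once."""
    counts = Counter((c.encoding_group, c.encoder) for c in request)
    return sorted(pair for pair, n in counts.items() if n > 1)
```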

700	6.2.  Physical Simultaneity

702	   An endpoint or MCU can send multiple captures simultaneously.
703	   However, there may be constraints that limit which captures can be
704	   sent simultaneously with other captures.

706	   Physical or device simultaneity refers to the fact that a device may
707	   not be usable in different ways at the same time.  This shapes the
708	   way that offers are made by the provider: the offers are made so
709	   that the consumer chooses one of several possible usages of the
710	   device.  This type of constraint is expressed in Simultaneous
711	   Transmission Sets, and is easiest to show with an example.

713	   Consider the example of a room system with 3 cameras, each of which
714	   can send a separate capture covering 2 persons: VC0, VC1 and VC2.
715	   The middle camera can also zoom out and show all 6 persons (VC3).
716	   But the middle camera cannot be used in both modes at the same
717	   time; it has to show either the space where 2 participants sit or
718	   all 6 seats.  We refer to this as a physical device simultaneity
719	   constraint.

721	   The following illustration shows 3 cameras with 4 video streams.  The
722	   middle camera can be used as main video zoomed in on 2 people or it
723	   could be used in zoomed out mode and capture the whole endpoint.  The
724	   idea here is that the middle camera cannot be used for both zoomed in
725	   and zoomed out captures simultaneously.  This is a constraint imposed
726	   by the physical limitations of the devices.

728	   Diagram for Simultaneity

730	   `-.   +--------+   VC2
731	      .-'|Camera 3|---------->
732	   .-'   +--------+
733	                       VC3
734	                     -------->
735	   `-.   +--------+ /
736	      .-'|Camera 2|<
737	   .-'   +--------+ \  VC1
738	                     -------->

740	   `-.   +--------+   VC0
741	      .-'|Camera 1|---------->
742	   .-'   +--------+

744	   VC0- video zoomed in on 2 people   VC2- video zoomed in on 2 people
745	   VC1- video zoomed in on 2 people   VC3- video zoomed out on 6 people

747	   Simultaneous transmission sets can be expressed as sets of the VCs
748	   that could physically be transmitted at the same time, though it may
749	   not make sense to do so.

751	   In this example the two simultaneous sets are:

753	   {VC0, VC1, VC2}

755	   {VC0, VC3, VC2}

757	   In this example VC0, VC1 and VC2 can be sent OR VC0, VC3 and VC2.
758	   Only one set can be transmitted at a time.  These are physical
759	   capabilities describing what can physically be sent at the same time,
760	   not what might make sense to send.  For example, in the second set
761	   both VC0 and VC2 are redundant if VC3 is included.
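
   One way to read the two sets above is as a membership test: a group
   of captures can be sent together only if it is contained in at least
   one Simultaneous Transmission Set.  The following minimal sketch
   assumes that interpretation; the function name is illustrative.

```python
# The two Simultaneous Transmission Sets from the example above.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2"},
    {"VC0", "VC3", "VC2"},
]


def can_send_together(captures, simultaneous_sets=SIMULTANEOUS_SETS):
    """True if all requested captures fit inside one simultaneous set."""
    wanted = set(captures)
    return any(wanted <= allowed for allowed in simultaneous_sets)
```

   For instance, VC1 and VC3 both come from the middle camera and never
   appear in the same set, so can_send_together(["VC1", "VC3"]) is
   False.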

763	   In describing its capabilities, the provider must take physical
764	   simultaneity into account and send a list of its Simultaneous
765	   Transmission Sets to the consumer, along with the Capture Sets and
766	   Encoding Groups.

768	6.3.  Encoding Groups

770	   The second aspect of multiple streams that must be understood by
771	   providers and consumers in order to create the best experience
772	   possible, i.e., for the "right" or "best" streams to be sent, is the
773	   encoding characteristics of the possible audio and video streams
774	   that can be sent.  Just as physical limitations impose constraints
775	   on the multiple streams, so do encoding limitations.  These are
776	   described by the four variables that make up an Encoding Group, as
777	   shown in the following table:

780	   Table: Encoding Group

782	   +----------------+--------------------------------------------------+
783	   | Name           | Description                                      |
784	   +----------------+--------------------------------------------------+
785	   | maxBandwidth   | Maximum number of bits per second relating to    |
786	   |                | all encodes combined                             |
787	   | maxVideoMbps   | Maximum number of macroblocks per second         |
788	   |                | relating to all video encodes combined:          |
789	   |                | ((width + 15) / 16) * ((height + 15) / 16) *     |
790	   |                | framesPerSecond                                  |
791	   | videoEncodes[] | Potential video encodes that can be generated    |
792	   | audioEncodes[] | Potential audio encodes that can be generated    |
793	   +----------------+--------------------------------------------------+

795	   An encoding group is the basic concept for describing encoding
796	   capability.  As shown in the Table, it has overall macroblocks per
797	   second and bandwidth limits, and it comprises sets of individual
798	   encodes, which are described in more detail below.
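
   The macroblocks per second formula in the table can be evaluated as
   follows.  This sketch assumes that each division is an integer
   division, so that width and height are rounded up to a whole number
   of 16x16 macroblocks; the function name is illustrative.

```python
def macroblocks_per_second(width, height, frames_per_second):
    """((width + 15) / 16) * ((height + 15) / 16) * framesPerSecond."""
    mb_wide = (width + 15) // 16    # macroblock columns, rounded up
    mb_high = (height + 15) // 16   # macroblock rows, rounded up
    return mb_wide * mb_high * frames_per_second


print(macroblocks_per_second(1920, 1088, 30))  # 244800
print(macroblocks_per_second(1920, 1088, 60))  # 489600
print(macroblocks_per_second(1280, 720, 30))   # 108000
```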

800	   Each media stream provider includes one or more encoding groups.
801	   There may be multiple encoding groups per endpoint.  For example,
802	   each video capture device might have an associated encoding group
803	   that describes the video streams that can result from that capture.

805	   A remote receiver (i.e., stream consumer) configures some or all of
806	   the specific encodings within one or more groups in order to provide
807	   it with media streams to decode.

809	6.3.1.  Encoding Group Structure

811	   This section shows more detail on the media stream provider's
812	   encoding group structure.  The encoding group includes several
813	   individual encodes, each with different encoding values.  For
814	   example, one may be high definition video 1080p60, another 720p30,
815	   and a third CIF.  While a typical 3 codec/display system would have
816	   one encoding group per "box", there are many possibilities for the
817	   number of encoding groups a provider may be able to offer and for
818	   the encoding values in each encoding group.

820	   Diagram for Encoding Group Structure

822	   ,-------------------------------------------------.
823	   |             Media Provider                      |
824	   |                                                 |
825	   |  ,--------------------------------------.       |
826	   |  | ,--------------------------------------.     |
827	   |  | | ,--------------------------------------.   |
828	   |  | | |          Encoding Group              |   |
829	   |  | | | ,-----------.                        |   |
830	   |  | | | |           | ,---------.            |   |
831	   |  | | | |           | |         | ,---------.|   |
832	   |  | | | |  Encode1  | | Encode2 | | Encode3 ||   |
833	   |  `.| | |           | |         | `---------'|   |
834	   |    `.| `-----------' `---------'            |   |
835	   |      `--------------------------------------'   |
836	   `-------------------------------------------------'

838	   As shown in the diagram, each encoding group has multiple potential
839	   individual encodes within it.  Not all encodes are equally capable;
840	   the stream consumer chooses the encodes it wants by configuring the
841	   provider to send it what it wants to receive.

843	   Some encoding endpoints are fixed, others are flexible, e.g., a
844	   single box with multiple DSPs where the resources are shared.

846	6.3.2.  Individual Encodes

848	   An encoding group is associated with a media capture through the
849	   individual encodes, that is, an audio or video capture is encoded in
850	   one or more individual encodes, as described by the videoEncodes[]
851	   and audioEncodes[] variables.

853	   The following table shows the variables for a Video Encode.  (There
854	   is a similar table for audio.)

855	   Table: Individual Video Encode

857	   +--------------+----------------------------------------------------+
858	   | Name         | Description                                        |
859	   +--------------+----------------------------------------------------+
860	   | maxBandwidth | Maximum number of bits per second relating to a    |
861	   |              | single video encoding                              |
862	   | maxMbps      | Maximum number of macroblocks per second relating  |
863	   |              | to a single video encoding: ((width + 15) / 16) *  |
864	   |              | ((height + 15) / 16) * framesPerSecond             |
865	   | maxWidth     | Video resolution's maximum supported width,        |
866	   |              | expressed in pixels                                |
867	   | maxHeight    | Video resolution's maximum supported height,       |
868	   |              | expressed in pixels                                |
869	   | maxFrameRate | Maximum supported frame rate                       |
870	   +--------------+----------------------------------------------------+

872	   A remote receiver configures (i. e., instantiates) some or all of the
873	   specific encodes such that:

875	   o  The configuration of each active ENC<n> does not exceed that
876	      individual encode's maxWidth, maxHeight, maxFrameRate.

878	   o  The total bandwidth of the configured ENC<n> does not exceed
879	      the maxBandwidth of the encoding group.

881	   o  The sum of the macroblocks per second of each configured encode
882	      does not exceed the maxMbps attribute of the encoding group.

884	   An equivalent set of attributes holds for audio encodes within an
885	   audio encoding group.
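
   The three rules listed above for video can be sketched as a simple
   check.  The dictionary layout and function names are assumptions
   made for illustration; only the three listed rules are tested, and
   the divisions in the macroblock formula are assumed to be integer
   divisions.

```python
def macroblocks_per_second(width, height, frame_rate):
    return ((width + 15) // 16) * ((height + 15) // 16) * frame_rate


def violations(group, configured):
    """Return a list of rule violations; an empty list means acceptable.

    group:      {"maxMbps", "maxBandwidth", "encodes": {name: limits}}
    configured: {name: {"width", "height", "frameRate", "bandwidth"}}
    """
    problems = []
    # Rule 1: each active encode stays within its own size and rate limits.
    for name, cfg in configured.items():
        limits = group["encodes"][name]
        if (cfg["width"] > limits["maxWidth"]
                or cfg["height"] > limits["maxHeight"]
                or cfg["frameRate"] > limits["maxFrameRate"]):
            problems.append(name + ": exceeds maxWidth, maxHeight or maxFrameRate")
    # Rule 2: total bandwidth stays within the group's maxBandwidth.
    if sum(c["bandwidth"] for c in configured.values()) > group["maxBandwidth"]:
        problems.append("group: maxBandwidth exceeded")
    # Rule 3: total macroblocks per second stays within the group's maxMbps.
    total = sum(macroblocks_per_second(c["width"], c["height"], c["frameRate"])
                for c in configured.values())
    if total > group["maxMbps"]:
        problems.append("group: maxMbps exceeded")
    return problems
```

   A consumer could run such a check before sending a configure
   request, and a provider could run the same check on receipt.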

887	6.3.3.  More on Encoding Groups

889	   An encoding group EG<n> comprises one or more potential encodings
890	   ENC<n>.  For example,

892	   EG0:  maxMbps=489600, maxBandwidth=6000000
893	        VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
894	                    maxMbps=244800, maxBandwidth=4000000
895	        VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
896	                    maxMbps=244800, maxBandwidth=4000000
897	        AUDIO_ENC0: maxBandwidth=96000
898	        AUDIO_ENC1: maxBandwidth=96000
899	        AUDIO_ENC2: maxBandwidth=96000

901	   Here, the encoding group is EG0.  It can transmit up to two 1080p30
902	   encodings (macroblocks per second for 1080p30 = 244800), and each
903	   encode has a maxFrameRate of 60 frames per second (fps).  At the
904	   maximum resolution (1920 x 1088) the frame rate is limited to 30
905	   fps by the per-encode maxMbps.  However, 60 fps can be achieved at a
906	   lower resolution if required by the consumer.  Although the encoding
907	   group is capable of transmitting up to 6Mbit/s, no individual video
908	   encoding can exceed 4Mbit/s.

910	   This encoding group also allows up to 3 audio encodings,
911	   AUDIO_ENC<0-2>.  It is not required that audio and video encodings
912	   reside within the same encoding group, but if so then the group's
913	   overall maxBandwidth value is a limit on the sum of all audio and
914	   video encodings configured by the consumer.  A system that does not
915	   wish or need to combine bandwidth limitations in this way should
916	   instead use separate encoding groups for audio and video in order for
917	   the bandwidth limitations on audio and video to not interact.

919	   Audio and video can be expressed in separate encoding groups, as in
920	   this illustration.

922	   VIDEO_EG0:  maxMbps=489600, maxBandwidth=6000000
923	        VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
924	                    maxMbps=244800, maxBandwidth=4000000
925	        VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
926	                    maxMbps=244800, maxBandwidth=4000000
927	   AUDIO_EG0: maxBandwidth=500000
928	        AUDIO_ENC0: maxBandwidth=96000
929	        AUDIO_ENC1: maxBandwidth=96000
930	        AUDIO_ENC2: maxBandwidth=96000

932	6.3.4.  Examples of Encoding Groups

934	   This section illustrates further examples of encoding groups.  In the
935	   first example, the capability parameters are the same across ENCs.
936	   In the second example, they vary.

938	   An endpoint that has 3 similar video capture devices would advertise
939	   3 encoding groups that can each transmit up to two 1080p30
940	   encodings, as follows:

942	   EG0:  maxMbps = 489600, maxBandwidth=6000000
943	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
944	             maxMbps=244800, maxBandwidth=4000000
945	       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
946	             maxMbps=244800, maxBandwidth=4000000
947	   EG1:  maxMbps = 489600, maxBandwidth=6000000
948	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
949	             maxMbps=244800, maxBandwidth=4000000
950	       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
951	             maxMbps=244800, maxBandwidth=4000000
952	   EG2:  maxMbps = 489600, maxBandwidth=6000000
953	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
954	             maxMbps=244800, maxBandwidth=4000000
955	       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
956	             maxMbps=244800, maxBandwidth=4000000

958	   A remote consumer configures some or all of the specific encodings
959	   such that:

961	   o  The configuration of each active ENC<n> does not exceed that
962	      encoding's maxWidth, maxHeight, or maxFrameRate

965	   o  The total bandwidth of the configured ENC<n> encodings does not
966	      exceed the maxBandwidth of the encoding group

968	   o  The sum of the "macroblocks per second" values of each configured
969	      encoding does not exceed the maxMbps of the encoding group

971	   There is no requirement for all encodings within an encoding group to
972	   be activated when configured by the consumer.

974	   Depending on the provider's encoding methods, the consumer may be
975	   able to request fixed encode values or choose encode values
976	   anywhere up to the maximum offered.  Consumer behavior is discussed
977	   in more detail in Section 7.2.

979	6.3.4.1.  Sample video encoding group specification #2

981	   This example specification expresses a system whose encoding groups
982	   can each transmit up to 3 encodings, but with each potential encoding
983	   having a progressively lower specification.  In this example, 1080p60
984	   transmission is possible (as ENC0 has a maxMbps value compatible with
985	   that) as long as it is the only active encoding (as maxMbps for the
986	   entire encoding group is also 489600).  Significantly, as up to 3
987	   encodings are available per group, some sets of captures which
988	   could not be transmitted simultaneously in example #1 above now
989	   become possible, for instance VC1, VC3 and VC6 together.  In common
990	   with example #1, all encoding groups have an identical specification.

992	   EG0:  maxMbps = 489600, maxBandwidth=6000000
993	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
994	             maxMbps=489600, maxBandwidth=4000000
995	       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
996	             maxMbps=108000, maxBandwidth=4000000
997	       ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
998	             maxMbps=61200, maxBandwidth=4000000
999	   EG1:  maxMbps = 489600, maxBandwidth=6000000
1000	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1001	             maxMbps=489600, maxBandwidth=4000000
1002	       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
1003	             maxMbps=108000, maxBandwidth=4000000
1004	       ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
1005	             maxMbps=61200, maxBandwidth=4000000
1006	   EG2:  maxMbps = 489600, maxBandwidth=6000000
1007	       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1008	             maxMbps=489600, maxBandwidth=4000000
1009	       ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
1010	             maxMbps=108000, maxBandwidth=4000000
1011	       ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
1012	             maxMbps=61200, maxBandwidth=4000000
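
   The macroblock budget behind this example can be checked with a
   little arithmetic.  The sketch below considers only the group-level
   maxMbps of 489600 and assumes integer division in the macroblock
   formula; the function names are illustrative.

```python
GROUP_MAX_MBPS = 489600  # maxMbps of each encoding group in example #2


def macroblocks_per_second(width, height, frame_rate):
    return ((width + 15) // 16) * ((height + 15) // 16) * frame_rate


def fits_group(encodings):
    """encodings: list of (width, height, frame_rate) tuples."""
    total = sum(macroblocks_per_second(*e) for e in encodings)
    return total <= GROUP_MAX_MBPS


# 1080p60 alone uses the whole group budget.
print(fits_group([(1920, 1088, 60)]))                    # True
# 1080p60 plus anything else exceeds it.
print(fits_group([(1920, 1088, 60), (1280, 720, 30)]))   # False
# Three encodings fit when ENC0 runs at 30 fps:
# 244800 + 108000 + 61200 = 414000.
print(fits_group([(1920, 1088, 30), (1280, 720, 30), (960, 544, 30)]))  # True
```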

1014	7.  Using the Framework

1016	   This section shows in more detail how to use the framework to
1017	   represent a typical case for telepresence rooms.  First an endpoint
1018	   is illustrated, then an MCU case is shown.

1020	   Consider an endpoint with the following characteristics:

1022	   o  3 cameras, 3 displays, a 6 person table

1024	   o  Each video device can provide one capture for each 1/3 section of
1025	      the table

1027	   o  A single capture representing the active speaker can be provided

1029	   o  A single capture representing the active speaker with the other 2
1030	      captures shown picture in picture within the stream can be
1031	      provided

1033	   o  A capture showing a zoomed out view of all 6 seats in the room can
1034	      be provided

1036	   The audio and video captures for this endpoint can be described as
1037	   follows.  The Encode Group specifications can be found above in
1038	   Section 6.3.4.1, Sample video encoding group specification #2.

1040	   Video Captures:

1042	   o  VC0- (the left camera stream), encoding group:EG0, attributes:
1043	      purpose=main;auto-switched:no; area_of_capture={xBegin=0, xEnd=33}

1045	   o  VC1- (the center camera stream), encoding group:EG1, attributes:
1046	      purpose=main; auto-switched:no; area_of_capture={xBegin=33,
1047	      xEnd=66}

1049	   o  VC2- (the right camera stream), encoding group:EG2, attributes:
1050	      purpose=main;auto-switched:no; area_of_capture={xBegin=66,
1051	      xEnd=99}

1053	   o  VC3- (the loudest panel stream), encoding group:EG1, attributes:
1054	      purpose=main;auto-switched:yes; area_of_capture={xBegin=0,
1055	      xEnd=99}

1057	   o  VC4- (the loudest panel stream with PiPs), encoding group:EG1,
1058	      attributes: purpose=main; composed=true; auto-switched:yes;
1059	      area_of_capture={xBegin=0, xEnd=99}

1061	   o  VC5- (the zoomed out view of all people in the room), encoding
1062	      group:EG1, attributes: purpose=main;auto-switched:no;
1063	      area_of_capture={xBegin=0, xEnd=99}

1065	   o  VC6- (presentation stream), encoding group:EG1, attributes:
1066	      purpose=presentation;auto-switched:no; area_of_capture={xBegin=0,
1067	      xEnd=99}

1069	   In summary, there are 3 codecs; the center one is used for the
1070	   center camera, presentation, auto-switched, and zoomed-out streams.

1072	   Note the text in parentheses (e.g. "the left camera stream") is not
1073	   explicitly part of the model, it is just explanatory text for this
1074	   example, and is not included in the model with the media captures and
1075	   attributes.

1077	   [edt.  It is arbitrary that for this example the alternative views
1078	   are on EG1 - they could have been spread out- it was not a necessary
1079	   choice.]

1081	   Audio Captures:

1083	   o  AC0 (left), attributes: purpose=main;channel format=mono;
1084	      area_of_capture={xBegin=0, xEnd=33}

1086	   o  AC1 (right), attributes: purpose=main;channel format=mono;
1087	      area_of_capture={xBegin=66, xEnd=99}

1089	   o  AC2 (center), attributes: purpose=main;channel format=mono;
1090	      area_of_capture={xBegin=33, xEnd=66}

1092	   o  AC3 being a simple pre-mixed audio stream from the room (mono),
1093	      attributes: purpose=main;channel format=mono; mixed=true;
1094	      area_of_capture={xBegin=0, xEnd=99}

1096	   o  AC4 audio stream associated with the presentation video (mono)
1097	      attributes: purpose=presentation;channel format=mono;
1098	      area_of_capture={xBegin=0, xEnd=99}

1100	   The physical simultaneity information is:

1102	      {VC0, VC1, VC2, VC3, VC4, VC6}

1104	      {VC0, VC2, VC5, VC6}

1106	   It is possible to select any or all of the rows in a capture set.
1107	   This is strictly what is possible from the devices.  However, using
1108	   every member in the set simultaneously may not make sense, for
1109	   example VC3 (loudest) and VC4 (loudest with PiP).  (In addition,
1110	   there are encoding constraints that make choosing all of the VCs in
1111	   a set impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1 has
1112	   only 3 ENCs.  This constraint shows up in the Capture list and
1113	   encoding groups, not in the simultaneous transmission sets.)
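
   The interaction between the two kinds of constraint can be sketched
   as below.  The sketch assumes that each chosen capture occupies
   exactly one ENC of its encoding group and that each group offers 3
   ENCs, as in sample specification #2; the names are illustrative.

```python
from collections import Counter

# Encoding group of each video capture in this example.
CAPTURE_GROUP = {"VC0": "EG0", "VC1": "EG1", "VC2": "EG2", "VC3": "EG1",
                 "VC4": "EG1", "VC5": "EG1", "VC6": "EG1"}
ENCODES_PER_GROUP = {"EG0": 3, "EG1": 3, "EG2": 3}
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]


def selection_problems(captures):
    """Return reasons why the chosen captures cannot all be sent."""
    wanted = set(captures)
    problems = []
    # Physical simultaneity: the choice must fit inside one set.
    if not any(wanted <= allowed for allowed in SIMULTANEOUS_SETS):
        problems.append("not within any simultaneous transmission set")
    # Encoding constraint: no group may need more ENCs than it has.
    load = Counter(CAPTURE_GROUP[c] for c in wanted)
    for group, needed in sorted(load.items()):
        if needed > ENCODES_PER_GROUP[group]:
            problems.append("%s: %d captures but only %d encodes"
                            % (group, needed, ENCODES_PER_GROUP[group]))
    return problems
```

   Choosing every member of the first set fails only on the encoding
   constraint, since four of its captures (VC1, VC3, VC4, VC6) share
   EG1.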

1115	   In this example there are no restrictions on which audio captures can
1116	   be sent simultaneously.

1118	   The following table represents the capture sets for this provider.
1119	   Recall that a capture set is composed of alternative captures
1120	   covering the same scene.  Capture Set #1 is for the main people
1121	   captures, and Capture Set #2 is for presentation.

1123	                            +----------------+
1124	                            | Capture Set #1 |
1125	                            +----------------+
1126	                            | VC0, VC1, VC2  |
1127	                            | VC3            |
1128	                            | VC4            |
1129	                            | VC5            |
1130	                            | AC0, AC1, AC2  |
1131	                            | AC3            |
1132	                            +----------------+

1134	                            +----------------+
1135	                            | Capture Set #2 |
1136	                            +----------------+
1137	                            | VC6            |
1138	                            | AC4            |
1139	                            +----------------+

1141	   Different capture sets are distinct and non-overlapping.  A consumer
1142	   chooses a capture row from each capture set.  In this case the three
1143	   captures VC0, VC1, and VC2 are one way of representing the video
1144	   from the endpoint.  These three captures should appear adjacent to
1145	   each other.  Alternatively, the Capture Scene can be represented by
1146	   the capture VC3, which automatically shows the person who is
1147	   talking.  VC4 and VC5 are further alternatives.

1149	   As in the video case, the different rows of audio in Capture Set #1
1150	   represent the "same thing", in that one way to receive the audio is
1151	   with the 3 linear position audio captures (AC0, AC1, AC2), and
1152	   another way is with the single channel monaural format AC3.  The
1153	   Media Consumer would choose the one audio capture row it is capable
1154	   of receiving.

1156	   The spatial ordering is conveyed by the media capture attributes
1157	   area and point of capture.

1159	   The consumer finds the "row" it wants in each capture set of the
1160	   table.  It configures the streams according to the encoding group
1161	   for the row.

1163	   A Media Consumer would likely want to choose a row to receive based
1164	   in part on how many streams it can simultaneously receive.  A
1165	   consumer that can receive three people streams would probably prefer
1166	   to receive the first row of Capture Set #1 (VC0, VC1, VC2) and not
1167	   receive the other rows.  A consumer that can receive only one people
1168	   stream would probably choose one of the other rows.

1170	   If the consumer can receive a presentation stream too, it would also
1171	   choose to receive the only row from Capture Set #2 (VC6).

1173	7.1.  The MCU Case

1175	   This section shows how an MCU might express its Capture Sets,
1176	   intending to offer different choices for consumers that can handle
1177	   different numbers of streams.  A single audio capture stream, which
1178	   can be associated (e.g. lip-synced) with any combination of video
1179	   captures at the consumer, is provided for all single and
1180	   multi-screen configurations.

1182	   +--------------------+---------------------------------------------+
1183	   | Capture Set #1     | note                                        |
1184	   +--------------------+---------------------------------------------+
1185	   | VC0                | video capture for single screen consumer    |
1186	   | VC1, VC2           | video capture for 2 screen consumer         |
1187	   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1188	   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1189	   | AC0                | audio capture representing all participants |
1190	   +--------------------+---------------------------------------------+

1192	   If / when a presentation stream becomes active within the conference,
1193	   the MCU might re-advertise the available media as:

1195	         +----------------+--------------------------------------+
1196	         | Capture Set #2 | note                                 |
1197	         +----------------+--------------------------------------+
1198	         | VC10           | video capture for presentation       |
1199	         | AC1            | presentation audio to accompany VC10 |
1200	         +----------------+--------------------------------------+

1202	7.2.  Media Consumer Behavior

1204	   [Edt. Should this be moved to appendix?]

1206	   The receive side of a call needs to balance its requirements, based
1207	   on number of screens and speakers, its decoding capabilities and
1208	   available bandwidth, and the provider's capabilities in order to
1209	   optimally configure the provider's streams.  Typically it would want
1210	   to receive and decode media from each capture set advertised by the
1211	   provider.

1213	   A sane, basic algorithm might be for the consumer to go through each
1214	   capture set in turn and find the collection of video captures that
1215	   best matches the number of screens it has (this might include
1216	   consideration of screens dedicated to presentation video display
1217	   rather than "people" video) and then decide between alternative rows
1218	   in the video capture sets based either on hard-coded preferences or
1219	   user choice.  Once this choice has been made, the consumer would then
1220	   decide how to configure the provider's encode groups in order to make
1221	   best use of the available network bandwidth and its own decoding
1222	   capabilities.
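
   The row selection step of such an algorithm might be sketched as
   follows.  The policy shown is only one possibility and is an
   assumption of this sketch: take the video row with the most captures
   that does not exceed the number of screens, and break ties using a
   preference list (standing in for hard-coded preferences or user
   choice), falling back to the first row listed.

```python
def pick_video_row(rows, screens, preferred=()):
    """Pick one video row from a capture set.

    rows:      list of tuples of capture ids, in advertised order
    screens:   number of screens available for this capture set
    preferred: rows to favor when several rows match equally well
    """
    fitting = [row for row in rows if len(row) <= screens]
    if not fitting:
        return None
    best_size = max(len(row) for row in fitting)
    candidates = [row for row in fitting if len(row) == best_size]
    for row in preferred:
        if row in candidates:
            return row
    return candidates[0]


# Video rows of Capture Set #1 for the endpoint example above.
ENDPOINT_ROWS = [("VC0", "VC1", "VC2"), ("VC3",), ("VC4",), ("VC5",)]

print(pick_video_row(ENDPOINT_ROWS, 3))                       # ('VC0', 'VC1', 'VC2')
print(pick_video_row(ENDPOINT_ROWS, 1, preferred=[("VC5",)]))  # ('VC5',)
```

   Configuring the encode groups for the chosen row is a separate step
   that is not covered by this sketch.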

1224	7.2.1.  One screen consumer

1226	   VC3, VC4 and VC5 are all on different rows by themselves, not in a
1227	   group, so the receiving device should choose one of those.
1228	   The choice would come down to whether to see the greatest number of
1229	   participants simultaneously at roughly equal precedence (VC5), a
1230	   switched view of just the loudest region (VC3) or a switched view
1231	   with PiPs (VC4).  An endpoint device with a small amount of knowledge
1232	   of these differences could offer a dynamic choice of these options,
1233	   in-call, to the user.

1235	7.2.2.  Two screen consumer configuring the example

1237	   Mixing systems with an even number of screens, "2n", and those with
1238	   "2n+1" cameras (and vice versa) is always likely to be the
1239	   problematic case.  In this instance, the behavior is likely to be
1240	   determined by whether a "2 screen" system is really a "2 decoder"
1241	   system, i.e., whether only one received stream can be displayed per
1242	   screen or whether more than 2 streams can be received and spread
1243	   across the available screen area.  There are 3 possible behaviors
1244	   for the 2 screen system when it learns that the far end is
1245	   "ideally" expressed via 3 capture streams:

1249	   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
1250	       per the 1 screen consumer case above) and either leave one screen
1251	       blank or use it for presentation if / when a presentation becomes
1252	       active.

1254	   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
1255	       screens, either with each capture scaled to 2/3 of a screen and
1256	       the centre capture split across the 2 screens or, as would be
1257	       necessary if there were large bezels on the screens, with each
1258	       stream scaled to 1/2 the screen width and height, leaving a 4th
1259	       "blank" panel.  This 4th panel could potentially be used for any
1260	       presentation that became active during the call.

1262	   3.  Receive 3 streams, decode all 3, and use control information
1263	       indicating which was the most active to switch between showing
1264	       the left and centre streams (one per screen) and the centre and
1265	       right streams.

1267	   For an endpoint capable of all 3 methods of working described above,
1268	   again it might be appropriate to offer the user the choice of display
1269	   mode.
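
   Two of the behaviors above lend themselves to a short sketch.  The
   function names, the "left" / "centre" / "right" labels and the rule
   of keeping the current pair when the centre is most active are
   assumptions made for illustration; the framework does not define
   them.  capture_spans shows the 2/3 scaling arithmetic of behavior 2,
   and switched_pair shows the switching decision of behavior 3.

```python
from fractions import Fraction


def capture_spans(num_captures=3, num_screens=2):
    """Horizontal extent of each capture, in units of one screen width.

    Each capture is scaled to num_screens / num_captures of a screen and
    the captures are laid side by side across the whole display area.
    """
    width = Fraction(num_screens, num_captures)
    return [(i * width, (i + 1) * width) for i in range(num_captures)]


def switched_pair(active, current=("left", "centre")):
    """Return the (screen 1, screen 2) streams for a 2 screen consumer.

    `active` is the most active of "left", "centre" and "right", as
    reported by control information from the provider.
    """
    if active == "left":
        return ("left", "centre")
    if active == "right":
        return ("centre", "right")
    if active == "centre":
        # The centre stream is visible in both pairs, so keep the current
        # pair to avoid needless switching (an assumption of this sketch).
        return current
    raise ValueError(f"unknown position: {active!r}")
```

   For 3 captures on 2 screens, the centre capture spans from 2/3 to
   4/3 of a screen width, which is why it straddles the boundary
   between the two screens.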

1271	7.2.3.  Three screen consumer configuring the example

1273	   This is the most straightforward case: the consumer would look to
1274	   identify a set of streams to receive that best matched its available
1275	   screens, and so VC0 plus VC1 plus VC2 should match optimally.  The
1276	   spatial ordering would give sufficient information for the correct
1277	   video capture to be shown on the correct screen.  The consumer would
1278	   then either need to divide a single encode group's capability by 3
1279	   to determine what resolution and frame rate to configure the provider
1280	   with, or configure the individual video captures' encode groups with
1281	   what makes most sense (taking into account the receive side decode
1282	   capabilities, overall call bandwidth, the resolution of the screens,
1283	   plus any user preferences such as motion vs sharpness).
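
   The "divide by 3" step can be worked through as below.  This is a
   sketch under stated assumptions: the EncodeGroupLimits type and its
   field names are hypothetical, and the group's capability is assumed
   to be expressed as a bandwidth limit plus a macroblocks per second
   limit, with 16x16 pixel macroblocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodeGroupLimits:
    """Hypothetical limits advertised for one encode group."""
    max_bandwidth_bps: int
    max_macroblocks_per_sec: int


def per_capture_budget(limits, num_captures):
    """Split a single encode group's capability evenly between captures."""
    return EncodeGroupLimits(
        limits.max_bandwidth_bps // num_captures,
        limits.max_macroblocks_per_sec // num_captures,
    )


def max_frame_rate(budget, width, height):
    """Highest whole frame rate the budget allows at width x height."""
    # Round each dimension up to a whole number of 16x16 macroblocks.
    mbs_per_frame = -(-width // 16) * -(-height // 16)
    return budget.max_macroblocks_per_sec // mbs_per_frame
```

   For example, a group limited to 734400 macroblocks per second, split
   across VC0, VC1 and VC2, leaves 244800 macroblocks per second for
   each capture, which is enough for 1920x1080 at 30 frames per second.
   The consumer would then weigh that against its own decode
   capabilities and screen resolution before configuring the provider.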

1285	8.  Acknowledgements

1287	   Mark Gorzynski contributed much to the approach.  We want to thank
1288	   Stephen Botzko for helpful discussions on audio.

1290	9.  IANA Considerations

1292	   TBD

1294	10.  Security Considerations

1296	   TBD

1298	11.  Informative References

1300	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1301	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1303	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
1304	              A., Peterson, J., Sparks, R., Handley, M., and E.
1305	              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
1306	              June 2002.

1308	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1309	              Jacobson, "RTP: A Transport Protocol for Real-Time
1310	              Applications", STD 64, RFC 3550, July 2003.

1312	   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
1313	              Session Initiation Protocol (SIP)", RFC 4353,
1314	              February 2006.

1316	   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
1317	              January 2008.

1319	Appendix A.  Open Issues

1321	A.1.  Video layout arrangements and centralized composition

1323	   In the context of a conference with a central MCU, there has been
1324	   discussion about a consumer requesting the provider to provide a
1325	   certain type of layout arrangement or perform a certain composition
1326	   algorithm, such as combining some number of most recent talkers, or
1327	   producing a video layout using a 2x2 grid or 1 large cell with 5
1328	   smaller cells around it.  The current framework does not address
1329	   this.  It is not clear whether this topic should be included in this
1330	   framework, in a different part of CLUE, or outside of CLUE
1331	   altogether.

1333	A.2.  Source is selectable

1335	   A Boolean variable.  True indicates the media consumer can request a
1336	   particular media source be mapped to a media capture.  Default is
1337	   false.

1339	   TBD - how does the consumer make the request for a particular source?
1340	   How does the consumer know what is available?  Need to explain better
1341	   how multiple media captures are different from a single media capture
1342	   with choices for the source, and when each concept should be used.

1344	A.3.  Media Source Selection

1346	   The use cases include a case where the person at a receiving endpoint
1347	   can request to receive media from a particular other endpoint, for
1348	   example in a multipoint call to request to receive the video from a
1349	   certain section of a certain room, whether or not people there are
1350	   talking.

1352	   TBD - this framework should address this case.  It may need a roster
1353	   list of rooms or people in the conference, with a mechanism to select
1354	   from the roster and associate it with media captures.  This is
1355	   different from selecting a particular media capture from a capture
1356	   set.  The mechanism to do this will probably need to be different
1357	   from selecting media captures based on capture sets and attributes.

1359	A.4.  Endpoint requesting many streams from MCU

1361	   TBD - how to do VC selection for a system where the endpoint media
1362	   consumers want to receive lots of streams and do their own
1363	   composition, rather than the MCU doing transcoding and composing.
1364	   An example is a 3 screen consumer that wants 3 large loudest speaker
1365	   streams, and a bunch of small ones to render as PiP.  How the small
1366	   ones are chosen is open; they could potentially be chosen by either
1367	   the endpoint or the MCU.  There are other more complicated examples
1368	   also.  Is the current framework adequate to support this?

1370	A.5.  VAD (voice activity detection) tagging of audio streams

1372	   TBD - do we want to have VAD be mandatory?  If so, all audio streams
1373	   originating from a media provider must be tagged with VAD
1374	   information.  This tagging would include an overall energy value for
1375	   the stream plus information on which sections of the capture scene
1376	   are "active".

1378	   Each audio stream which forms a constituent of a row within a capture
1379	   set should include this tagging, with the energy value within it
1380	   calculated using a fixed, consistent algorithm.

1382	   When a system determines the most active area of a capture scene
1383	   (either "loudest", or determined by other means such as a button
1384	   press) it should convey that information to the corresponding media
1385	   stream consumer via any audio streams being sent within that capture
1386	   set.  Specifically, there should be a list of active linear positions
1387	   and their VAD characteristics within the audio stream in addition to
1388	   the overall VAD information for the capture set.  This is to ensure
1389	   all media stream consumers receive the same, consistent, audio energy
1390	   information whichever audio capture or captures they choose to
1391	   receive for a capture set.  Additionally, linear position information
1392	   can be mapped to video captures by a media stream consumer in order
1393	   that it can perform "panel switching" if required.
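
   One possible shape for this tagging, and for the mapping a consumer
   would use for "panel switching", is sketched below.  Everything here
   is an assumption for illustration: the framework does not yet define
   the tag format, so the VadTag and PositionActivity types, the use of
   integer linear positions, and the most_active_capture function are
   hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PositionActivity:
    """VAD characteristics for one active linear position (hypothetical)."""
    position: int   # index of the linear position within the capture scene
    energy: float   # energy value from the fixed, consistent algorithm


@dataclass
class VadTag:
    """Tag carried in every audio stream of a capture set (hypothetical)."""
    overall_energy: float
    active_positions: list = field(default_factory=list)


def most_active_capture(tag, position_to_capture):
    """Map the loudest active position to a video capture ID.

    `position_to_capture` is the consumer's own mapping from linear
    positions to the video captures it receives.  Returns None when no
    active position corresponds to a received capture.
    """
    known = [p for p in tag.active_positions
             if p.position in position_to_capture]
    if not known:
        return None
    loudest = max(known, key=lambda p: p.energy)
    return position_to_capture[loudest.position]
```

   Because the same tag would be present in whichever audio capture a
   consumer chooses to receive, two consumers receiving different audio
   captures from the same capture set would reach the same switching
   decision.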

1395	A.6.  Private Information

1397	Authors' Addresses

1399	   Allyn Romanow
1400	   Cisco Systems
1401	   San Jose, CA  95134
1402	   USA

1404	   Email: allyn@cisco.com

1405	   Mark Duckworth
1406	   Polycom
1407	   Andover, MA  01810
1408	   US

1410	   Email: mark.duckworth@polycom.com

1412	   Andrew Pepperell
1413	   Cisco Systems
1414	   Langley, England
1415	   UK

1417	   Email: apeppere@cisco.com

1419	   Brian Baldino
1420	   Cisco Systems
1421	   San Jose, CA  95134
1422	   US

1424	   Email: bbaldino@cisco.com