CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                        M. Duckworth, Ed.
Expires: May 3, 2012                                             Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                        October 31, 2011

              Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-01.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Framework Features
   5.  Stream Information
     5.1.  Overview of the Model
     5.2.  Media capture -- Audio and Video
     5.3.  Attributes for Media Captures
       5.3.1.  Purpose
       5.3.2.  Composed
       5.3.3.  Audio Channel Format
       5.3.4.  Area of capture
       5.3.5.  Point of capture
       5.3.6.  Auto-switched
     5.4.  Capture Set
     5.5.  Attributes for Capture Sets
       5.5.1.  Area of Scene
       5.5.2.  Area Scale Millimeters
   6.  Choosing Streams
     6.1.  Message Flow
       6.1.1.  Consumer Capability Message
       6.1.2.  Provider Capabilities Announcement
       6.1.3.  Consumer Configure Request
     6.2.  Physical Simultaneity
     6.3.  Encoding Groups
       6.3.1.  Encoding Group Structure
       6.3.2.  Individual Encodes
       6.3.3.  More on Encoding Groups
       6.3.4.  Examples of Encoding Groups
   7.  Extensibility
   8.  Other aspects of the framework
   9.  Using the Framework
     9.1.  The MCU Case
     9.2.  Media Consumer Behavior
       9.2.1.  One screen consumer
       9.2.2.  Two screen consumer configuring the example
       9.2.3.  Three screen consumer configuring the example
   10.  Acknowledgements
   11.  IANA Considerations
   12.  Security Considerations
   13.  Informative References
   Appendix A.  Open Issues
     A.1.  Video layout arrangements and centralized composition
     A.2.  Source is selectable
     A.3.  Media Source Selection
     A.4.  Endpoint requesting many streams from MCU
     A.5.  VAD (voice activity detection) tagging of audio streams
     A.6.  Private Information
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases-00 and to meet
   the requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.  At the
   same time, the highest priority has been given to creating an
   extensible framework, to make it easy to accommodate future
   conferencing functionality as it evolves.

   The purpose of this effort is to make it possible to handle multiple
   streams of media in such a way that a satisfactory user experience
   is possible even when participants are on different vendor
   equipment and when they are using devices with different types of
   communication capabilities.  Information about the relationship of
   media streams must be communicated so that audio/video rendering
   can be done in the best possible manner.  In addition, it is
   necessary to choose which media streams are sent.

   There is no attempt here to dictate to the renderer what it should
   do.  What the renderer does is up to the renderer.

   After the following definitions, a short section introduces key
   concepts.  The body of the text comprises three sections that deal,
   in turn, with stream content, choosing streams, and an
   implementation example.  The media provider and media consumer
   behavior are described in separate sections as well.  Several
   appendices describe topics that are under discussion for adding to
   the document.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-
   right are from the point of view of a person observing the rendered
   media.  They are the opposite of stage-left and stage-right.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   Capture Set: A Capture Set includes Media Captures that all
   represent some aspect of the same Capture Scene.  The items (rows)
   in a Capture Set represent different alternatives for representing
   the same Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encode: A variable with a set of attributes that
   describes the maximum values of a single audio or video capture
   encoding.  The attributes include: maximum bandwidth and, for
   video, maximum macroblocks, maximum width, maximum height, and
   maximum frame rate.  [Edt. These are based on H.264.]

   *Encoding Group: A set of encoding parameters representing a
   device's complete encoding capabilities or a subdivision of them.
   Media stream providers formed of multiple physical units, in each
   of which resides some encoding capability, would typically
   advertise themselves to the remote media stream consumer as being
   formed of multiple encoding groups.  Within each encoding group,
   multiple potential actual encodings are possible, with the sum of
   those encodings' characteristics constrained to being less than or
   equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, spatial location, and mixing parameters of microphones.
   Endpoint characteristics are not specific to individual media
   streams sent by the endpoint.

   Front: the portion of the room closest to the cameras.  Moving
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is
   tardy in requiring that media from the mixer be sent to EACH
   participant.  I think we have practical use cases where this is not
   the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture (MC) may be the source of one or more
   Media streams.  A Media Capture may also be constructed from other
   Media streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media
   streams.

   *Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For media captures, stage-left and stage-
   right are the opposite of camera-left and camera-right.  For the
   case of a person facing (and captured by) a camera, stage-left and
   stage-right are from the point of view of that person.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as: media codec, bit
   rate, resolution, profile/level etc.) as well as CLUE specific
   attributes (which could include, for example and depending on the
   solution found, the ID or spatial location of a capture device a
   stream originates from).

   Telepresence: an environment that gives non-co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining
   visual elements from separate sources.

4.  Framework Features

   Two key functions must be accomplished so that multiple media
   streams can be handled in a telepresence conference.  These are:

   o  How to choose which streams the provider should send to the
      consumer

   o  What information needs to be added to the streams to allow a
      rendering of the capture scene

   The framework/model we present here can be understood as specifying
   these two functions.

   Media stream providers and consumers are central to the framework.
   The provider's job is to advertise its capabilities (as described
   here) to the consumer, whose job it is to configure the provider's
   encoding capabilities as described below.  Both providers and
   consumers can send and receive information; that is, we do not have
   one party exclusively as the provider and one as the consumer, but
   all parties have both sending and receiving roles.  Most devices
   function as both a media provider and a media consumer.

   For two devices to communicate bidirectionally, with media flowing
   in both directions, both devices act as both a media provider and a
   media consumer.  The protocol exchange shown later in the "Choosing
   Streams" section happens twice, independently, between the two
   bidirectional devices.

   Both endpoints and MCUs, or more generally "middleboxes", can be
   media providers and consumers.

5.  Stream Information

   This section describes the structure for communicating information
   between providers and consumers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct
   illustrated in the diagram is discussed in the sections below.

   Diagram for Stream Content

                          +---------------+
                          |               |
                          |  Capture Set  |
                          |               |
                          +-------+-------+
                     _..-'        |        ``-._
                 _.-'             |             ``-._
             _.-'                 |                  ``-._
   +----------------+   +----------------+   +----------------+
   | Media Capture  |   | Media Capture  |   | Media Capture  |
   | Audio or Video |   | Audio or Video |   | Audio or Video |
   +----------------+   +----------------+   +----------------+
        .'        `.               `-..__
      .'            `.                   ``-..__
     ,-----.       ,---------.                 ``,----------.
   ,' Encode`.   ,'           `.               ,'Simultaneous`.
  (   Group   ) (  Attributes  )              ( Transmission   )
   `.        ,'  `.           ,'               `.    Sets     ,'
     `-----'       `---------'                   `----------'

5.1.  Overview of the Model

   The basic method of operation is that a provider describes to a
   consumer what streams it has to offer.  It describes them both in
   terms of attributes of the media (e.g. audio and video) captures
   and in terms of the encoding characteristics of the streams for
   these captures.  The consumer then tells the provider which streams
   it wants to receive.  Prior to this exchange, the consumer sends
   information about itself to the provider, which the provider may
   use in determining what to advertise to the consumer.

   A media provider provides media for one or more capture scenes.  As
   defined, a capture scene is the source scene that is captured by
   media devices.  An endpoint is likely to have more than one capture
   scene, for example one for people and one for presentation.  Each
   capture scene is represented by a capture set, which describes all
   the collections of media captures for that scene.  A capture set
   consists of one or more rows of media captures, where each row
   represents a way of capturing the scene.

   A media capture, typically audio or video, is the basic data
   structure, as defined in the definitions and described below in
   Section 5.2.  Media captures have attributes that describe them,
   such as their spatial properties and relationships.  These
   attributes are described in Section 5.3 and Section 5.5.

   Media Captures are also associated with data constructs that
   capture encoding aspects of the streams - that is, simultaneous
   transmission sets and encoding groups, described in Section 6.2 and
   Section 6.3.

   Generally, the provider is capable of sending alternate captures of
   a capture scene - a different number of captures for the scene, or
   captures with differing characteristics like bandwidth or
   resolution.  These are described by the provider as capabilities,
   using the capture set and media capture model mentioned above, and
   chosen by the consumer.  The message exchange to accomplish this is
   described in Section 6.1.

   There are some additional separate aspects of the framework
   mentioned in Section 8.
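
   As an aid to understanding (and not as part of the framework
   itself), the relationship between capture sets, rows, and media
   captures sketched above can be expressed as a small data model.
   The following Python fragment is a minimal sketch; all class and
   field names are illustrative assumptions of this example, not a
   proposed encoding.

      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class MediaCapture:
          # One audio or video capture, e.g. "VC0" or "AC1".
          name: str                 # e.g. "VC0"
          media_type: str           # "video" or "audio"
          attributes: Dict[str, object] = field(default_factory=dict)
          encoding_group: str = ""  # name of the associated encoding group

      @dataclass
      class CaptureSet:
          # Each row is one alternative way of capturing the same scene;
          # captures within a row are spatially related to each other.
          rows: List[List[MediaCapture]] = field(default_factory=list)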

5.2.  Media capture -- Audio and Video

   A media capture, as defined in the definitions, is a fundamental
   concept of the model.  Media can be captured in different ways, for
   example by various arrangements of cameras and microphones.  The
   model uses the terms "video capture" (VC) and "audio capture" (AC)
   to refer to sources of media streams.  To distinguish between
   multiple instances, they are numbered; for example, VC1, VC2, and
   VC3 could refer to three different video captures which can be used
   simultaneously.

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture can also come from synthetically generated sources,
   such as a computer generated audiovisual presentation, or from the
   playback of a recording.  Any media type that can be carried over
   RTP can be represented by a media capture.

   A media capture is described by Attributes and associated with an
   Encoding Group and a Simultaneous Transmission Set.

   Media captures are aggregated into Capture Sets as described below.

5.3.  Attributes for Media Captures

   Media capture attributes describe information about streams and
   their relationships.  [Edt: We do not mean to duplicate SDP; if an
   SDP description can be used, great.]  The attributes of media
   captures refer to static aspects of those captures that can be used
   by the consumer for selecting the captures offered by the provider.

   The mechanism of Attributes makes the framework extensible.
   Although we are defining some attributes now, based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.  In general, the way to extend the solution to handle
   new features is by adding attributes and/or values.

   We describe attributes by variables and their values.  The current
   attributes are listed below and then described.  The variable is
   shown in parentheses, and the values follow after the colon:

   o  (Purpose): main, presentation

   o  (Composed): true, false

   o  (Audio Channel Format): mono, stereo, tbd

   o  (Area of Capture): A set of 'Ranges' describing the relevant
      area being captured by a capture device

   o  (Point of Capture): A 'Point' describing the location of the
      capture device or pseudo-device

   o  (Auto-switched): true, false
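
   Reusing the hypothetical MediaCapture sketch from Section 5.1, a
   provider might populate the attributes of an auto-switched video
   capture as follows.  The dictionary keys mirror the attribute list
   above; their exact spelling is an assumption of this example.

      vc3 = MediaCapture(
          name="VC3",
          media_type="video",
          attributes={
              "purpose": "main",                  # main or presentation
              "composed": False,                  # not a mix of other MCs
              "auto_switched": True,              # e.g. follows the talker
              "area_of_capture": {"x": (0, 99)},  # one Range per dimension
          },
      )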

5.3.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It could be applied to any media type.
   Possible values: main, presentation, others TBD.

   Main:

   The audio or video capture is of one or more people participating
   in a conference (or where they would be if they were there).  It is
   of part or all of the Capture Scene.

   Presentation:

   The capture provides a presentation, e.g., from a connected laptop
   or other input device.

5.3.2.  Composed

   A Boolean variable to indicate whether the MC is a mix or
   composition of other MCs or Streams.  (This could indicate, for
   example, a continuous presence view of multiple images in a grid,
   or a large image with smaller picture-in-picture images in it.
   When applied to an audio capture, it indicates a composition of ACs
   by some mixing algorithm.)

   This attribute is not intended to differentiate between different
   ways of composing or mixing images.  For possible extension of the
   framework, additional attributes could be defined to distinguish
   between different ways of composing or mixing captures, for
   example, different video layout arrangements for composing multiple
   images into one, or different audio mixing algorithms.

5.3.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how
   the meaning of the channels is determined.  It is an enumerated
   variable describing the type of audio channel or channels in the
   Audio Capture.  The possible values of the "channel format"
   attribute are:

   o  mono

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value
   of the "channel format" attribute.

   There can be multiple ACs of a particular type, or even different
   types.  These multiple ACs could each have an area of capture
   attribute to indicate they represent different areas of the capture
   scene.

   If there are multiple audio streams, they might be correlated (that
   is, someone talking might be heard in multiple captures from the
   same room).  Echo cancellation and stream synchronization in
   consumers should take this into account.

   Mono:

   An AC with channel format="mono" has one audio channel.

   Stereo:

   An AC with channel format="stereo" has exactly two audio channels,
   left and right, as part of the same AC.  [Edt: should we mention
   RFC 3551 here?  The channel format may be related to how Audio
   Captures are mapped to RTP streams.  This stereo is not the same as
   the effect produced from two mono ACs, one from the left and one
   from the right.]

5.3.4.  Area of capture

   The area_of_capture attribute is used to describe the relevant area
   that a media capture is "capturing".  By comparing the areas of
   capture for different media captures, a consumer can determine the
   spatial relationships of the captures on the provider so that they
   can be rendered correctly.  The attribute consists of a set of
   'Ranges', one range for each spatial dimension, where each range
   has a Begin and an End coordinate.  It is not necessary to fill out
   all of the dimensions if they are not relevant (i.e., if an
   endpoint's captures only span a single dimension, only the 'x'
   coordinate need be used).  There is no need to pre-define a
   possible range for this coordinate system; a device may choose what
   is most appropriate for describing its captures.  However, it is
   specified that as numbers move from lower to higher, the location
   is going from: camera-left to camera-right (in the case of the 'x'
   dimension), front to back (in the case of the 'y' dimension), or
   low to high (in the case of the 'z' dimension).

5.3.5.  Point of capture

   The point_of_capture attribute can be used to describe the location
   of a capture device or pseudo-device.  If there are multiple
   captures which share the same 'area_of_capture' value, then it is
   useful to know the location from which they are capturing that area
   (e.g., a device which has multiview).  Point of capture is
   expressed as a single {x, y, z} coordinate where, as with
   area_of_capture, only the necessary dimensions need be expressed.

5.3.6.  Auto-switched

   A Boolean variable that may be used for audio and/or video streams.
   In this case the offered AC or VC varies depending on some rule; it
   is auto-switched between possible VCs, or between possible ACs.
   The most common example of this is sending the video capture
   associated with the "loudest" speaker according to an audio
   detection algorithm.
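
   The coordinate conventions above can be made concrete with a short
   sketch.  Continuing with the hypothetical MediaCapture
   representation from Section 5.1, a consumer might order video
   captures from camera-left to camera-right by their 'x' ranges, and
   pair audio with video through overlapping areas; the helper names
   are inventions of this example.

      def x_range(capture):
          # area_of_capture holds one (begin, end) range per dimension.
          return capture.attributes["area_of_capture"]["x"]

      def left_to_right(captures):
          # Lower x values are camera-left, higher values camera-right.
          return sorted(captures, key=lambda c: x_range(c)[0])

      def areas_overlap(a, b):
          # True if the x ranges overlap, e.g. to associate an AC
          # with a VC covering the same part of the scene.
          (a0, a1), (b0, b1) = x_range(a), x_range(b)
          return a0 < b1 and b0 < a1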

5.4.  Capture Set

   A capture set describes the alternative media streams that the
   provider offers to send to the consumer.  As shown in the content
   diagram above, the capture set is an aggregation of all audio and
   video captures for a particular scene that a provider is willing to
   send.

   A provider can have more than one capture set, each representing a
   different scene.  For example, one capture set can be for main
   people audio and video, and another capture set can be for a
   computer generated presentation.

   A provider describes its ability to send alternative media streams
   in the capture set, which lists the media captures in rows, as
   shown below.  Each row of the capture set consists of either a
   single capture or a group of captures.  A group means the
   individual captures in the group are spatially related, with the
   specific ordering of the captures described through the use of
   attributes.

   Here is an example of a simple capture set with three video
   captures and three audio captures:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three VCs together in a row indicate those captures are
   spatially related to each other; similarly for the three ACs in the
   second row.  The ACs and VCs in the same capture set are spatially
   related to each other.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically, multiple Video Captures should
   be rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other; for example, a group of Video Captures
   can be associated with a group of Audio Captures, meaning they
   should be rendered together.

   Media Captures of the same media type are associated with each
   other by grouping them together in a single row of a Capture Set.
   Media Captures of different media types are associated with each
   other by putting them in different rows of the same Capture Set.

   Since all captures have an area_of_capture associated with them, a
   consumer can determine the spatial relationships of captures by
   comparing the locations of their areas of capture with one another.

   Association between audio and video can be made by finding audio
   and video captures which share overlapping areas of capture.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.  Each row of the Capture Set
   contains either a single media capture or one group of media
   captures.

   The following example shows a capture set for an endpoint media
   provider where:

   o  (VC0, VC1, VC2) - camera-left video capture, center video
      capture, camera-right video capture

   o  (VC3) - capture associated with the loudest speaker

   o  (VC4) - zoomed out view of all people in the room

   o  (AC0) - main audio

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other.  These are VC0,
   VC1, and VC2.  VC3 and VC4 are additional alternatives for
   capturing the same room in different ways.  The audio capture is
   included in the same capture set to indicate AC0 is associated with
   those video captures, meaning the audio should be rendered along
   with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio / video streams, or might be a
   presentation supplied by a laptop, perhaps with an accompanying
   audio commentary).  Spatial ordering of media captures is described
   through the use of attributes.

   A media consumer could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   consumer could choose the first video row plus the audio row, while
   a single stream consumer could choose the second or third video row
   plus the audio row.  An MCU consumer might choose to receive
   multiple rows.

   The Simultaneous Transmission Sets and Encoding Groups discussed in
   the next section apply to the media captures listed in capture
   sets.  The Simultaneous Transmission Sets and Encoding Groups MUST
   allow all the Media Captures in a particular row of the capture set
   to be used simultaneously.  But media captures in different rows of
   the capture set might not be able to be used simultaneously.
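
   A consumer's choice of one row per media type can be sketched as
   follows, reusing the hypothetical CaptureSet structure from
   Section 5.1.  The selection policy (the first row whose size the
   consumer can handle) is just one plausible choice, used here for
   illustration.

      def choose_rows(capture_set, max_streams):
          # max_streams: e.g. {"video": 3, "audio": 3} - how many
          # streams of each media type this consumer can receive.
          chosen = {}
          for row in capture_set.rows:
              mtype = row[0].media_type
              if mtype not in chosen and len(row) <= max_streams[mtype]:
                  chosen[mtype] = row
          return chosen  # e.g. {"video": [VC0, VC1, VC2], "audio": [AC0]}

   For the example above, a consumer with max_streams={"video": 1,
   "audio": 1} would skip the (VC0, VC1, VC2) row and select the (VC3)
   row instead.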

5.5.  Attributes for Capture Sets

   These are attributes that can be applied to a capture set:

   o  (Area of Scene): A set of 'Ranges' describing the area of the
      entire capture scene

   o  (Area scale): true, false - indicating if area numbers are in
      millimeters

5.5.1.  Area of Scene

   The area of scene attribute for a capture set has the same format
   as the area of capture attribute for a media capture.  The area of
   scene is for the entire scene, which is captured by the one or more
   media captures in the capture set rows.

5.5.2.  Area Scale Millimeters

   An optional Boolean variable indicating if the numbers used for
   area of scene, area of capture and point of capture are in terms of
   millimeters.  If this attribute is true, then the x, y, z numbers
   represent millimeters.  If this attribute is false, then there is
   no physical scale.  The default value is true.

   This attribute applies to all the MCs that are part of the capture
   set.

6.  Choosing Streams

   This section describes the process of choosing which streams the
   provider sends to the consumer.  In order for appropriate streams
   to be sent from providers to consumers, certain characteristics of
   the multiple streams must be understood by both providers and
   consumers.  Two separate aspects of streams suffice to describe the
   necessary information to be shared by providers and consumers.  The
   first aspect we call "physical simultaneity" and the other aspect
   we refer to as "encoding group".  These are described in the
   following sections, after the message flow is discussed.

6.1.  Message Flow

   The following diagram shows the flow of messages between a media
   provider and a media consumer.  The provider sends information
   about its capabilities (as specified in this section), then the
   consumer chooses which streams it wants, which we refer to as
   "configure".  The consumer sends its own capability message to the
   provider, which may contain information about its own capabilities
   or restrictions, in which case the provider might tailor its
   announcements to the consumer.

   Diagram for Message Flow

      Media Consumer                           Media Provider
      --------------                           --------------
            |                                        |
            |----- Consumer Capability ------------->|
            |                                        |
            |                                        |
            |<---- Capabilities (announce) ----------|
            |                                        |
            |                                        |
            |------ Configure (request) ------------>|
            |                                        |

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A provider can advertise a new
   list of captures at any time.  Both the media provider and media
   consumer can send "their messages" (i.e., capture set
   announcements, stream configurations) any number of times during a
   call, and the other end is always required to act on any new
   information received (e.g., stopping streams it had previously
   configured that are no longer valid).

   These messages do not always have to occur as a complete three-
   message exchange.  A provider can send a new capabilities announce
   message at any time, without first receiving a new consumer
   capability message.  Similarly, a consumer can send a new configure
   request at any time, to change what it wants to receive.  The new
   configure request must be compatible with the most recently
   received capabilities announce message.
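
   The information carried by the three messages in the flow above
   (and detailed in the following subsections) might be modeled as
   follows.  This is a sketch of message contents only, assuming the
   hypothetical MediaCapture and CaptureSet classes from Section 5.1;
   it says nothing about the actual wire format, which this framework
   leaves to the protocol specification.

      from dataclasses import dataclass
      from typing import Dict, List

      @dataclass
      class ConsumerCapability:
          # What the consumer can understand (contents TBD; see 6.1.1).
          understood_attributes: List[str]

      @dataclass
      class ProviderAnnouncement:
          # The four lists named in Section 6.1.2.
          captures: List["MediaCapture"]
          capture_sets: List["CaptureSet"]
          simultaneous_sets: List[List[str]]  # sets of capture names
          encoding_groups: Dict[str, dict]

      @dataclass
      class ConfigureRequest:
          # Per-stream choices; see Section 6.1.3 for expected fields.
          streams: List[dict]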

6.1.1.  Consumer Capability Message

   In order for a maximally-capable provider to be able to advertise a
   manageable number of video captures to a consumer, there is a
   potential use for the consumer being able, at the start of a CLUE
   session, to inform the provider of its capabilities.  One example
   here would be the video capture attribute set - a consumer could
   tell the provider the complete set of video capture attributes it
   is able to understand, and so the provider would be able to tailor
   the capture set it advertises to the consumer.

   TBD - the content of this message needs to be better defined.  The
   authors believe there is a need for this message, but have not
   worked out the details yet.

6.1.2.  Provider Capabilities Announcement

   The provider capabilities announce message includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of Simultaneous Transmission Sets

   o  the list of the encoding groups

6.1.3.  Consumer Configure Request

   After receiving a set of video capture information from a provider
   and making its choice of what media streams to receive, based on
   the consumer's own capabilities and any provider-side simultaneity
   restrictions, the consumer needs to essentially configure the
   provider to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups and the potential encoders within those groups that
   the consumer wishes to be active (this may well be a subset of the
   complete set available).  For each such encoder within an encoding
   group, the consumer would specify the video capture (i.e., VC as
   described above) along with the specifics of the video encoding
   required, i.e., width, height, frame rate and bit rate.  At this
   stage, the consumer would also provide RTP demultiplexing
   information as required to distinguish each stream from the others
   being configured by the same mechanism.
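
   As a concrete illustration, a configure request choosing a single
   720p30 stream of a capture VC0 from encoder ENC0 of group EG0 might
   carry the following information.  The field names and the RTP
   demultiplexing details are assumptions of this sketch.

      configure_request = {
          "streams": [
              {
                  "encoding_group": "EG0",
                  "encoder": "ENC0",
                  "capture": "VC0",        # the chosen video capture
                  "width": 1280,
                  "height": 720,
                  "frame_rate": 30,
                  "bandwidth": 1500000,    # bits per second
                  # Demultiplexing information to tell this stream
                  # apart from the others configured the same way:
                  "rtp": {"payload_type": 96, "ssrc": 0x1234ABCD},
              },
          ],
      }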

6.2.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to the fact that a device
   may not be able to be used in different ways at the same time.
   This shapes the way that offers are made from the provider.  The
   offers are made so that the consumer will choose one of several
   possible usages of the device.  This type of constraint is
   expressed in Simultaneous Transmission Sets.  This is easier to
   show in an example.

   Consider the example of a room system where there are 3 cameras,
   each of which can send a separate capture covering 2 persons each -
   VC0, VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes
   at the same time - it has to either show the space where 2
   participants sit or the whole 6 seats.  We refer to this as a
   physical device simultaneity constraint.

   The following illustration shows 3 cameras with 4 video streams.
   The middle camera can be used as main video zoomed in on 2 people,
   or it could be used in zoomed out mode to capture the whole
   endpoint.  The idea here is that the middle camera cannot be used
   for both zoomed in and zoomed out captures simultaneously.  This is
   a constraint imposed by the physical limitations of the devices.

   Diagram for Simultaneity

      `-.  +--------+   VC2
      .-'  |Camera 3|---------->
           +--------+
                          VC3
                      ---------->
      `-.  +--------+  /
      .-'  |Camera 2|<
           +--------+  \    VC1
                      ---------->

      `-.  +--------+   VC0
      .-'  |Camera 1|---------->
           +--------+

   VC0 - video zoomed in on 2 people   VC2 - video zoomed in on 2 people
   VC1 - video zoomed in on 2 people   VC3 - video zoomed out on 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not make sense to do so.

   In this example the two simultaneous sets are:

      {VC0, VC1, VC2}

      {VC0, VC3, VC2}

   In this example, either VC0, VC1 and VC2 can be sent, or VC0, VC3
   and VC2.  Only one set can be transmitted at a time.  These are
   physical capabilities describing what can physically be sent at the
   same time, not what might make sense to send.  For example, in the
   second set both VC0 and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send a list of its Simultaneous
   Transmission Sets to the consumer, along with the Capture Sets and
   Encoding Groups.
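
   A consumer-side check against this constraint can be sketched in a
   few lines.  Representing each simultaneous transmission set as a
   set of capture names is an assumption of this example.

      def allowed_simultaneously(wanted, simultaneous_sets):
          # wanted: the capture names the consumer wishes to receive.
          # The choice is valid if some simultaneous transmission set
          # contains all of them.
          return any(wanted <= s for s in simultaneous_sets)

      sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
      assert allowed_simultaneously({"VC0", "VC3"}, sets)
      # VC1 and VC3 both need the middle camera, so this is rejected:
      assert not allowed_simultaneously({"VC1", "VC3"}, sets)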

6.3.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   providers and consumers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible audio and video
   streams which can be sent.  Just as constraints are imposed on the
   multiple streams by physical limitations, there are also
   constraints due to encoding limitations.  These are described by
   four variables that make up an Encoding Group, as shown in the
   following table:

   Table: Encoding Group

   +----------------+------------------------------------------------+
   | Name           | Description                                    |
   +----------------+------------------------------------------------+
   | maxBandwidth   | Maximum number of bits per second relating to  |
   |                | all encodes combined                           |
   | maxVideoMbps   | Maximum number of macroblocks per second       |
   |                | relating to all video encodes combined:        |
   |                | ((width + 15) / 16) * ((height + 15) / 16) *   |
   |                | framesPerSecond                                |
   | videoEncodes[] | Set of potential video encodes that can be     |
   |                | generated                                      |
   | audioEncodes[] | Set of potential audio encodes that can be     |
   |                | generated                                      |
   +----------------+------------------------------------------------+

   An encoding group is the basic concept for describing encoding
   capability.  As shown in the table, it has overall maxMbps and
   bandwidth limits, as well as comprising sets of individual encodes,
   which are described in more detail below.

   Each media stream provider includes one or more encoding groups.
   There may be multiple encoding groups per endpoint.  For example,
   each video capture device might have an associated encoding group
   that describes the video streams that can result from that capture.

   A remote receiver (i.e., stream consumer) configures some or all of
   the specific encodings within one or more groups in order to
   provide it with media streams to decode.

6.3.1.  Encoding Group Structure

   This section shows more detail on the media stream provider's
   encoding group structure.  The encoding group includes several
   individual encodes, each of which can have different encoding
   values.  For example, one may be high definition video 1080p60,
   another 720p30, and a third CIF.  While a typical 3 codec/display
   system would have one encoding group per "box", there are many
   possibilities for the number of encoding groups a provider may be
   able to offer and for what encoding values there are in each
   encoding group.

   Diagram for Encoding Group Structure

   ,-------------------------------------------------.
   |                 Media Provider                  |
   |                                                 |
   |  ,--------------------------------------.      |
   |  | ,--------------------------------------.    |
   |  | | ,--------------------------------------.  |
   |  | | |            Encoding Group            |  |
   |  | | | ,-----------.                        |  |
   |  | | | |           | ,---------.            |  |
   |  | | | |           | |         | ,---------.|  |
   |  | | | |  Encode1  | | Encode2 | | Encode3 ||  |
   |  `.| | |           | |         | `---------'|  |
   |    `.| `-----------' `---------'            |  |
   |      `--------------------------------------'  |
   `-------------------------------------------------'

   As shown in the diagram, each encoding group has multiple potential
   individual encodes within it.  Not all encodes are equally capable;
   the stream consumer chooses the encodes it wants by configuring the
   provider to send it what it wants to receive.

   Some encoding endpoints are fixed, others are flexible, e.g., a
   single box with multiple DSPs where the resources are shared.

6.3.2.  Individual Encodes

   An encoding group is associated with a media capture through the
   individual encodes; that is, an audio or video capture is encoded
   in one or more individual encodes, as described by the
   videoEncodes[] and audioEncodes[] variables.

   The following table shows the variables for a Video Encode.  (There
   is a similar table for audio.)

   Table: Individual Video Encode

   +--------------+--------------------------------------------------+
   | Name         | Description                                      |
   +--------------+--------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a  |
   |              | single video encoding                            |
   | maxMbps      | Maximum number of macroblocks per second         |
   |              | relating to a single video encoding:             |
   |              | ((width + 15) / 16) * ((height + 15) / 16) *     |
   |              | framesPerSecond                                  |
   | maxWidth     | Video resolution's maximum supported width,      |
   |              | expressed in pixels                              |
   | maxHeight    | Video resolution's maximum supported height,     |
   |              | expressed in pixels                              |
   | maxFrameRate | Maximum supported frame rate                     |
   +--------------+--------------------------------------------------+

   A remote receiver configures (i.e., instantiates) some or all of
   the specific encodes such that:

   o  The configuration of each active ENC does not exceed that
      individual encode's maxWidth, maxHeight, or maxFrameRate.

   o  The total bandwidth of the configured ENCs does not exceed the
      maxBandwidth of the encoding group.

   o  The sum of the macroblocks per second of each configured encode
      does not exceed the maxMbps attribute of the encoding group.

   An equivalent set of attributes holds for audio encodes within an
   audio encoding group.
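
   The three rules above can be expressed as a short validation
   sketch.  The dictionary layout for groups and encodes, and the
   representation of a requested configuration as (width, height,
   frames per second, bandwidth) tuples, are assumptions of this
   example; the macroblocks-per-second formula is the one given in the
   tables above.

      def mbps(width, height, fps):
          # ((width + 15) / 16) * ((height + 15) / 16) * framesPerSecond
          return ((width + 15) // 16) * ((height + 15) // 16) * fps

      def valid_configuration(group, configs):
          # group:   {"maxMbps": ..., "maxBandwidth": ..., "encodes": [...]}
          # configs: one (width, height, fps, bandwidth) tuple per
          #          active encode, paired with the group's encodes.
          for enc, (w, h, fps, bw) in zip(group["encodes"], configs):
              if (w > enc["maxWidth"] or h > enc["maxHeight"]
                      or fps > enc["maxFrameRate"]
                      or mbps(w, h, fps) > enc["maxMbps"]
                      or bw > enc["maxBandwidth"]):
                  return False
          total_bw = sum(bw for (_, _, _, bw) in configs)
          total_mbps = sum(mbps(w, h, fps) for (w, h, fps, _) in configs)
          return (total_bw <= group["maxBandwidth"]
                  and total_mbps <= group["maxMbps"])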

6.3.3.  More on Encoding Groups

   An encoding group EG comprises one or more potential encodings ENC.
   For example:

      EG0: maxMbps=489600, maxBandwidth=6000000
          VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          AUDIO_ENC0: maxBandwidth=96000
          AUDIO_ENC1: maxBandwidth=96000
          AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the maxMbps value for 1080p30 is 244800: a 1920 x 1088
   image contains 120 x 68 = 8160 macroblocks, and 8160 * 30 frames
   per second = 244800), but it is capable of transmitting a
   maxFrameRate of 60 frames per second (fps).  To achieve the maximum
   resolution (1920 x 1088) the frame rate is limited to 30 fps.
   However, 60 fps can be achieved at a lower resolution if required
   by the consumer.  Although the encoding group is capable of
   transmitting up to 6 Mbit/s, no individual video encoding can
   exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDIO_ENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the consumer.  A system
   that does not wish or need to combine bandwidth limitations in this
   way should instead use separate encoding groups for audio and
   video, so that the bandwidth limitations on audio and video do not
   interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration:

      VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
          VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
      AUDIO_EG0: maxBandwidth=500000
          AUDIO_ENC0: maxBandwidth=96000
          AUDIO_ENC1: maxBandwidth=96000
          AUDIO_ENC2: maxBandwidth=96000
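
   Using the hypothetical valid_configuration sketch from
   Section 6.3.2, the EG0 example can be checked numerically: two
   1080p30 video encodings fit within the group, while two 1080p60
   encodings would exceed both the per-encode and group-wide maxMbps
   limits.

      eg0 = {"maxMbps": 489600, "maxBandwidth": 6000000,
             "encodes": [{"maxWidth": 1920, "maxHeight": 1088,
                          "maxFrameRate": 60, "maxMbps": 244800,
                          "maxBandwidth": 4000000}] * 2}

      # 2 x 244800 = 489600 macroblocks/s, exactly the group limit:
      assert valid_configuration(eg0, [(1920, 1088, 30, 3000000)] * 2)
      # 1080p60 needs 489600 macroblocks/s per encode - rejected:
      assert not valid_configuration(eg0, [(1920, 1088, 60, 3000000)] * 2)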

6.3.4.  Examples of Encoding Groups

   This section illustrates further examples of encoding groups.  In
   the first example, the capability parameters are the same across
   ENCs.  In the second example, they vary.

   An endpoint that has 3 similar video capture devices would
   advertise 3 encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

      EG0: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000

   A remote consumer configures some or all of the specific encodings
   such that:

   o  The configured parameter values of each active ENC do not exceed
      that encoding's maxWidth, maxHeight, or maxFrameRate

   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of each
      configured encoding does not exceed the maxMbps of the encoding
      group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the consumer.

   Depending on the provider's encoding methods, the consumer may be
   able to request fixed encode values, or choose encode values in a
   range less than the maximum offered.  We will discuss consumer
   behavior in more detail in a section below.

6.3.4.1.  Sample video encoding group specification #2

   This example specification expresses a system whose encoding groups
   can each transmit up to 3 encodings, but with each potential
   encoding having a progressively lower specification.  In this
   example, 1080p60 transmission is possible (as ENC0 has a maxMbps
   value compatible with that) as long as it is the only active
   encoding (as maxMbps for the entire encoding group is also 489600).
   Significantly, as up to 3 encodings are available per group, some
   sets of captures which couldn't be transmitted simultaneously in
   example #1 above now become possible, for instance VC1, VC3 and VC6
   together (these captures are defined in the endpoint example of
   Section 9, where all three use the same encoding group).  In common
   with example #1, all encoding groups have an identical
   specification.

      EG0: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000

7.  Extensibility

   One of the most important characteristics of the Framework is its
   extensibility.  Telepresence is a relatively new industry and,
   while we can foresee certain directions, we do not know everything
   about how it will develop.  The standard for interoperability and
   handling multiple streams must be future-proof.

   The framework itself is inherently extensible through expanding the
   data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by
      defining additional types of captures in addition to audio and
      video.

   o  Adding new functionality, such as 3-D, will require additional
      attributes describing the captures, such as x, y, z coordinates.

   o  Adding new codecs, such as H.265, can be accomplished by
      defining new encoding variables.

   The infrastructure is designed to be extended rather than requiring
   new infrastructure elements.  Extension comes through adding to
   defined types.

   Assuming the implementation is in something like XML, adding data
   elements and attributes makes extensibility easy.
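
   For instance, if the eventual encoding were XML, a new attribute
   could be carried as one more child element without disturbing
   existing ones.  The element and attribute names below are purely
   illustrative assumptions of this sketch, not a proposed schema.

      <mediaCapture name="VC1" type="video" encodingGroup="EG1">
        <purpose>main</purpose>
        <areaOfCapture xBegin="33" xEnd="66"/>
        <!-- a future extension could simply add a new element: -->
        <pointOfCapture x="50" y="0" z="0"/>
      </mediaCapture>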

8.  Other aspects of the framework

   A few other aspects of the framework are separate from the provider
   capture set model.  These include:

   o  Voice activity detection

   o  Indications about stream switching/composing, and information
      about the source media captures

   o  Associating captures/streams with a conference roster

   o  Mapping the model to specific protocol messages

   [Edt. much of this is work in progress and will need to be
   updated.]

9.  Using the Framework

   This section shows in more detail how to use the framework to
   represent a typical case for telepresence rooms.  First an endpoint
   is illustrated, then an MCU case is shown.

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The Encoding Group specifications can be found above in
   Section 6.3.4.1, Sample video encoding group specification #2.

   Video Captures:

   o  VC0 - (the camera-left camera stream), encoding group: EG0,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=33}

   o  VC1 - (the center camera stream), encoding group: EG1,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=33, xEnd=66}

   o  VC2 - (the camera-right camera stream), encoding group: EG2,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=66, xEnd=99}

   o  VC3 - (the loudest panel stream), encoding group: EG1,
      attributes: purpose=main; auto-switched:yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC4 - (the loudest panel stream with PiPs), encoding group: EG1,
      attributes: purpose=main; composed=true; auto-switched:yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC5 - (the zoomed out view of all people in the room), encoding
      group: EG1, attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC6 - (presentation stream), encoding group: EG1, attributes:
      purpose=presentation; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=99}

   Summary of video captures: 3 codecs; the center one is used for the
   center camera stream, the presentation stream, the auto-switched
   streams, and the zoomed views.

   Note that the text in parentheses (e.g. "the camera-left camera
   stream") is not explicitly part of the model; it is just
   explanatory text for this example, and is not included in the model
   with the media captures and attributes.

   [Edt. It is arbitrary that for this example the alternative views
   are on EG1 - they could have been spread out; it was not a
   necessary choice.]

   Audio Captures:

   o  AC0 (camera-left), attributes: purpose=main; channel
      format=mono; area_of_capture={xBegin=0, xEnd=33}

   o  AC1 (camera-right), attributes: purpose=main; channel
      format=mono; area_of_capture={xBegin=66, xEnd=99}

   o  AC2 (center), attributes: purpose=main; channel format=mono;
      area_of_capture={xBegin=33, xEnd=66}

   o  AC3 (a simple pre-mixed audio stream from the room, mono),
      attributes: purpose=main; channel format=mono; mixed=true;
      area_of_capture={xBegin=0, xEnd=99}

   o  AC4 (audio stream associated with the presentation video, mono),
      attributes: purpose=presentation; channel format=mono;
      area_of_capture={xBegin=0, xEnd=99}

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   It is possible to select any or all of the captures in a
   simultaneous transmission set.  This is strictly what is possible
   from the devices.  However, using every member in the set
   simultaneously may not make sense - for example VC3 (loudest) and
   VC4 (loudest with PiP).  (In addition, there are encoding
   constraints that make choosing all of the VCs in a set impossible:
   VC1, VC3, VC4, VC5 and VC6 all use EG1, and EG1 has only 3 ENCs.
   This constraint shows up in the capture list and encoding groups,
   not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following table represents the capture sets for this provider.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.
   The following tables represent the capture sets for this provider.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

         +----------------+
         | Capture Set #1 |
         +----------------+
         | VC0, VC1, VC2  |
         | VC3            |
         | VC4            |
         | VC5            |
         | AC0, AC1, AC2  |
         | AC3            |
         +----------------+

         +----------------+
         | Capture Set #2 |
         +----------------+
         | VC6            |
         | AC4            |
         +----------------+

   Different capture sets are distinct from each other and
   non-overlapping.  A consumer chooses a capture row from each
   capture set.  In this case the three captures VC0, VC1 and VC2 are
   one way of representing the video from the endpoint; these three
   captures should appear adjacent to each other.  Alternatively,
   another way of representing the capture scene is with the capture
   VC3, which automatically shows the person who is talking.
   Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the 3 linear-position audio captures (AC0, AC1, AC2), and
   another way is with the single-channel monaural capture AC3.  The
   media consumer would choose the one audio capture row it is capable
   of receiving.

   The spatial ordering is conveyed by the media capture attributes
   area of capture and point of capture.

   The consumer finds a row in each capture set that it wants, and
   configures the streams according to the encoding group for that
   row.

   A media consumer would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   consumer that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A consumer that can receive only
   one people stream would probably choose one of the other rows.

   If the consumer can also receive a presentation stream, it would in
   addition choose to receive the only row from Capture Set #2 (VC6).

9.1. The MCU Case

   This section shows how an MCU might express its capture sets,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single- and multi-screen configurations, and it
   can be associated (e.g. lip-synced) with any combination of video
   captures at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer   |
   | VC1, VC2           | video capture for 2 screen consumer        |
   | VC3, VC4, VC5      | video capture for 3 screen consumer        |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

      +----------------+--------------------------------------+
      | Capture Set #2 | note                                 |
      +----------------+--------------------------------------+
      | VC10           | video capture for presentation       |
      | AC1            | presentation audio to accompany VC10 |
      +----------------+--------------------------------------+

9.2. Media Consumer Behavior

   [Edt. Should this be moved to an appendix?]

   The receive side of a call needs to balance its own requirements
   (based on its number of screens and speakers), its decoding
   capabilities, the available bandwidth, and the provider's
   capabilities in order to configure the provider's streams
   optimally.  Typically it would want to receive and decode media
   from each capture set advertised by the provider.

   A sane, basic algorithm might be for the consumer to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encode
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.  A non-normative sketch of such
   an algorithm follows.
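   As an illustration only, such an algorithm might look roughly like
   the following; the data structures and function names are
   hypothetical and not defined by this framework.

      def choose_rows(capture_sets, num_screens, prefer=None):
          # Pick one row of video captures from each capture set,
          # favoring the largest row that still fits the number of
          # screens.  'prefer' is an optional callable (hard-coded
          # preference or user choice) used to break ties between
          # rows of equal size.
          chosen = []
          for cset in capture_sets:
              fitting = sorted(
                  (row for row in cset if len(row) <= num_screens),
                  key=len, reverse=True)
              if not fitting:
                  # Nothing fits; fall back to the smallest row.
                  fitting = sorted(cset, key=len)
              best = [r for r in fitting
                      if len(r) == len(fitting[0])]
              chosen.append(prefer(best) if prefer else best[0])
          return chosen

      # Video rows of Capture Sets #1 and #2 from the endpoint
      # example:
      set1 = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]
      set2 = [["VC6"]]
      print(choose_rows([set1, set2], num_screens=3))
      # -> [['VC0', 'VC1', 'VC2'], ['VC6']]
      print(choose_rows([set1, set2], num_screens=1))
      # -> [['VC3'], ['VC6']]  (VC4 or VC5 via 'prefer')

   Having chosen its rows, the consumer would still need to configure
   encodes within the relevant encoding groups, as the following
   subsections discuss.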
9.2.1. One screen consumer

   VC3, VC4 and VC5 are each on a row by themselves, not in a group,
   so the receiving device should choose one of them.  The choice
   comes down to whether to see the greatest number of participants
   simultaneously at roughly equal precedence (VC5), a switched view
   of just the loudest region (VC3), or a switched view with PiPs
   (VC4).  An endpoint device with a small amount of knowledge of
   these differences could offer a dynamic choice of these options,
   in-call, to the user.

9.2.2. Two screen consumer configuring the example

   Mixing systems with an even number of screens ("2n") and systems
   with "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5,
       as per the 1 screen consumer case above), and either leave one
       screen blank or use it for presentation if / when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens: either with each capture scaled to 2/3 of a screen and
       the centre capture split across the 2 screens, or, as would be
       necessary if there were large bezels on the screens, with each
       stream scaled to 1/2 the screen width and height and a 4th
       "blank" panel.  This 4th panel could potentially be used for
       any presentation that became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, it might again be appropriate to offer the user the choice
   of display mode.

9.2.3. Three screen consumer configuring the example

   This is the most straightforward case: the consumer would look to
   identify a set of streams to receive that best matches its
   available screens, so VC0 plus VC1 plus VC2 should match optimally.
   The spatial ordering gives sufficient information for the correct
   video capture to be shown on the correct screen.  The consumer
   would then either divide a single encode group's capability by 3 to
   determine what resolution and frame rate to request from the
   provider, or configure the individual video captures' encode groups
   with whatever makes most sense (taking into account the receive-
   side decode capabilities, the overall call bandwidth, the
   resolution of the screens, plus any user preferences such as motion
   vs. sharpness).
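   As a non-normative worked example of the "divide by 3" arithmetic,
   relevant when all three captures must come from a single encoding
   group, and assuming (as in H.264) that maxMbps counts macroblocks
   per second:

      MB = 16  # macroblock dimension in pixels (H.264-style
               # assumption for this sketch)

      def macroblocks_per_second(width, height, fps):
          # Width and height are rounded up to whole macroblocks.
          return -(-width // MB) * -(-height // MB) * fps

      group_max_mbps = 489600        # e.g. EG1 in the sample above
      budget = group_max_mbps // 3   # per-capture share: 163200

      assert macroblocks_per_second(1280, 720, 30) == 108000   # fits
      assert macroblocks_per_second(1920, 1088, 30) == 244800  # over

   Under that budget the consumer might configure each of the three
   captures at 1280x720, 30 frames per second, rather than requesting
   full 1920x1088.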
10. Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

11. IANA Considerations

   TBD

12. Security Considerations

   TBD

13. Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies",
              RFC 5117, January 2008.

Appendix A. Open Issues

A.1. Video layout arrangements and centralized composition

   In the context of a conference with a central MCU, there has been
   discussion about a consumer requesting the provider to use a
   certain type of layout arrangement or perform a certain composition
   algorithm, such as combining some number of most recent talkers, or
   producing a video layout using a 2x2 grid or 1 large cell with 5
   smaller cells around it.  The current framework does not address
   this.  It is not yet clear whether this topic belongs in this
   framework, in a different part of CLUE, or outside of CLUE
   altogether.

A.2. Source is selectable

   A Boolean variable.  True indicates the media consumer can request
   that a particular media source be mapped to a media capture.
   Default is false.

   TBD - how does the consumer make the request for a particular
   source?  How does the consumer know what is available?  We need to
   explain better how multiple media captures differ from a single
   media capture with a choice of sources, and when each concept
   should be used.

A.3. Media Source Selection

   The use cases include a case where the person at a receiving
   endpoint can request to receive media from a particular other
   endpoint; for example, in a multipoint call, to request the video
   from a certain section of a certain room, whether or not people
   there are talking.

   TBD - this framework should address this case.  It may need a
   roster list of rooms or people in the conference, with a mechanism
   to select from the roster and associate the selection with media
   captures.  This is different from selecting a particular media
   capture from a capture set, and the mechanism to do it will
   probably need to be different from selecting media captures based
   on capture sets and attributes.

A.4. Endpoint requesting many streams from MCU

   TBD - how to do VC selection for a system where the endpoint media
   consumers want to receive many streams and do their own
   composition, rather than having the MCU do transcoding and
   composing.  An example is a 3 screen consumer that wants 3 large
   loudest-speaker streams, plus a number of small ones to render as
   PiPs.  How the small ones are chosen is open; they could
   potentially be selected by either the endpoint or the MCU.  There
   are other, more complicated examples as well.  Is the current
   framework adequate to support this?

A.5. VAD (voice activity detection) tagging of audio streams

   TBD - do we want VAD to be mandatory?  All audio streams
   originating from a media provider would then be tagged with VAD
   information.  This tagging would include an overall energy value
   for the stream plus information on which sections of the capture
   scene are "active".

   Each audio stream which forms a constituent of a row within a
   capture set should include this tagging, with the energy value
   within it calculated using a fixed, consistent algorithm.

   When a system determines the most active area of a capture scene
   (either "loudest", or determined by other means such as a button
   press), it should convey that information to the corresponding
   media stream consumer via any audio streams being sent within that
   capture set.  Specifically, there should be a list of active linear
   positions and their VAD characteristics within the audio stream, in
   addition to the overall VAD information for the capture set.  This
   ensures that all media stream consumers receive the same,
   consistent audio energy information whichever audio capture or
   captures they choose to receive for a capture set.  Additionally,
   linear position information can be mapped to video captures by a
   media stream consumer so that it can perform "panel switching" if
   required.
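   Purely as an illustration of the shape such tagging might take
   (nothing here is defined by the framework, and all field names are
   hypothetical):

      from dataclasses import dataclass
      from typing import List

      @dataclass
      class PositionActivity:
          # Activity for one linear position of the capture scene.
          x_begin: int
          x_end: int
          energy: int   # computed with a fixed, common algorithm
          active: bool

      @dataclass
      class VadTag:
          # Per-stream tag: overall energy plus per-position detail,
          # so every consumer receives the same energy information
          # whichever audio capture row it chose.
          overall_energy: int
          positions: List[PositionActivity]

      tag = VadTag(overall_energy=40,
                   positions=[PositionActivity(0, 33, 40, True),
                              PositionActivity(33, 66, 7, False),
                              PositionActivity(66, 99, 3, False)])

      # A consumer could map the most active position to the video
      # capture covering the same area (e.g. xBegin 0..33 -> VC0)
      # to drive "panel switching".
      most_active = max(tag.positions, key=lambda p: p.energy)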
A.6. Private Information

   Do we want a way to include private information?

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Cisco Systems
   Langley, England
   UK

   Email: apeppere@cisco.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com