idnits 2.17.1

draft-ietf-clue-framework-17.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1117 has weird spacing: '... switch betwe...'

  == Line 1939 has weird spacing: '...om left bot...'

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'SHOULD not' in this paragraph:

     A separate data channel is established to transport the CLUE protocol
     messages. The contents of the CLUE protocol messages are based on
     information introduced in this document, which is represented by an XML
     schema for this information defined in the CLUE data model [ref]. Some
     of the information which could possibly introduce privacy concerns is
     the xCard information as described in section 7.1.1.11. In addition,
     the (text) description field in the Media Capture attribute (section
     7.1.1.7) could possibly reveal sensitive information or specific
     identities. The same would be true for the descriptions in the Capture
     Scene (section 7.3.1) and Capture Scene View (7.3.2) attributes.

     One other important consideration for the information in the xCard as
     well as the description field in the Media Capture and Capture Scene
     View attributes is that while the endpoints involved in the session
     have been authenticated, there is no assurance that the information in
     the xCard or description fields is authentic. Thus, this information
     SHOULD not be used to make any authorization decisions and the
     participants in the sessions SHOULD be made aware of this.

  -- The document date (September 29, 2014) is 3496 days in the past.  Is
     this intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC6351' is mentioned on line 868, but not defined

  == Missing Reference: 'RFC6350' is mentioned on line 879, but not defined

  == Missing Reference: 'RFC4566' is mentioned on line 1555, but not defined

  ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866)

  == Missing Reference: 'RFC 6503' is mentioned on line 2967, but not defined

  == Missing Reference: 'RFC 3261' is mentioned on line 2989, but not defined

  == Unused Reference: 'I-D.ietf-clue-data-model-schema' is defined on line
     3343, but no explicit reference was found in the text

  == Unused Reference: 'I-D.presta-clue-protocol' is defined on line 3348,
     but no explicit reference was found in the text

  == Unused Reference: 'RFC4579' is defined on line 3374, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-18) exists of
     draft-ietf-clue-datachannel-00

  ** Downref: Normative reference to an Experimental draft:
     draft-ietf-clue-datachannel (ref. 'I-D.ietf-clue-datachannel')

  == Outdated reference: A later version (-17) exists of
     draft-ietf-clue-data-model-schema-06

  -- No information found for draft-prestaclue-protocol - is the name correct?

  -- Possible downref: Normative reference to a draft: ref.
     'I-D.presta-clue-protocol' 

  == Outdated reference: A later version (-15) exists of
     draft-ietf-clue-signaling-03

  ** Downref: Normative reference to an Experimental draft:
     draft-ietf-clue-signaling (ref. 'I-D.ietf-clue-signaling')

  -- Obsolete informational reference (is this intentional?): RFC 5117
     (Obsoleted by RFC 7667)


     Summary: 3 errors (**), 0 flaws (~~), 16 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	CLUE WG                                              M. Duckworth, Ed.
2	Internet Draft                                                  Polycom
3	Intended status: Standards Track                           A. Pepperell
4	Expires: March 29, 2015                                           Acano
5	                                                              S. Wenger
6	                                                                  Vidyo
7	                                                     September 29, 2014

9	                Framework for Telepresence Multi-Streams
10	                    draft-ietf-clue-framework-17.txt

12	Abstract

14	   This document defines a framework for a protocol to enable devices
15	   in a telepresence conference to interoperate.  The protocol enables
16	   communication of information about multiple media streams so a
17	   sending system and receiving system can make reasonable decisions
18	   about transmitting, selecting and rendering the media streams.
19	   This protocol is used in addition to SIP signaling for setting up a
20	   telepresence session.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current
30	   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six
33	   months and may be updated, replaced, or obsoleted by other
34	   documents at any time.  It is inappropriate to use Internet-Drafts
35	   as reference material or to cite them other than as "work in
36	   progress."

38	   This Internet-Draft will expire on March 29, 2015.

40	Copyright Notice

42	   Copyright (c) 2013 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (http://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with
50	   respect to this document.  Code Components extracted from this
51	   document must include Simplified BSD License text as described in
52	   Section 4.e of the Trust Legal Provisions and are provided without
53	   warranty as described in the Simplified BSD License.

55	Table of Contents

57	   1. Introduction
58	   2. Terminology
59	   3. Definitions
60	   4. Overview & Motivation
61	   5. Overview of the Framework/Model
62	   6. Spatial Relationships
63	   7. Media Captures and Capture Scenes
64	      7.1. Media Captures
65	         7.1.1. Media Capture Attributes
66	      7.2. Multiple Content Capture
67	         7.2.1. MCC Attributes
68	      7.3. Capture Scene
69	         7.3.1. Capture Scene attributes
70	         7.3.2. Capture Scene View attributes
71	         7.3.3. Global View List
72	   8. Simultaneous Transmission Set Constraints
73	   9. Encodings
74	      9.1. Individual Encodings
75	      9.2. Encoding Group
76	      9.3. Associating Captures with Encoding Groups
77	   10. Consumer's Choice of Streams to Receive from the Provider
78	      10.1. Local preference
79	      10.2. Physical simultaneity restrictions
80	      10.3. Encoding and encoding group limits
81	   11. Extensibility
82	   12. Examples - Using the Framework (Informative)
83	      12.1. Provider Behavior
84	         12.1.1. Three screen Endpoint Provider
85	         12.1.2. Encoding Group Example
86	         12.1.3. The MCU Case

88	      12.2. Media Consumer Behavior
89	         12.2.1. One screen Media Consumer
90	         12.2.2. Two screen Media Consumer configuring the example
91	         12.2.3. Three screen Media Consumer configuring the example
92	      12.3. Multipoint Conference utilizing Multiple Content Captures
93	         12.3.1. Single Media Captures and MCC in the same
94	         Advertisement
95	         12.3.2. Several MCCs in the same Advertisement
96	         12.3.3. Heterogeneous conference with switching and
97	         composition
98	         12.3.4. Heterogeneous conference with voice activated
99	         switching
100	   13. Acknowledgements
101	   14. IANA Considerations
102	   15. Security Considerations
103	   16. Changes Since Last Version
104	   17. Normative References
105	   18. Informative References
106	   19. Authors' Addresses

108	1. Introduction

110	   Current telepresence systems, though based on open standards such
111	   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
112	   each other.  A major factor limiting the interoperability of
113	   telepresence systems is the lack of a standardized way to describe
114	   and negotiate the use of the multiple streams of audio and video
115	   comprising the media flows.  This document provides a framework for
116	   protocols to enable interoperability by handling multiple streams
117	   in a standardized way.  The framework is intended to support the
118	   use cases described in Use Cases for Telepresence Multistreams
119	   [RFC7205] and to meet the requirements in Requirements for
120	   Telepresence Multistreams [RFC7262].

122	   The basic session setup for the use cases is based on SIP [RFC3261]
123	   and SDP offer/answer [RFC3264].  In addition to basic SIP & SDP
124	   offer/answer, CLUE specific signaling is required to exchange the
125	   information describing the multiple media streams.  The motivation
126	   for this framework, an overview of the signaling, and information
127	   required to be exchanged is described in subsequent sections of
128	   this document.  Companion documents describe the signaling details
129	   [I-D.ietf-clue-signaling] and the data model [I-D.ietf-clue-data-
130	   model-schema].

132	2. Terminology

134	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
135	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
136	   this document are to be interpreted as described in RFC 2119
137	   [RFC2119].

139	3. Definitions

141	   The terms defined below are used throughout this document and
142	   companion documents and they are normative.  In order to easily
143	   identify the use of a defined term, those terms are capitalized.

145	   Advertisement: a CLUE message a Media Provider sends to a Media
146	   Consumer describing specific aspects of the content of the media,
147	   and any restrictions it has in terms of being able to provide
148	   certain Streams simultaneously.

150	   Audio Capture: Media Capture for audio.  Denoted as ACn in the
151	   examples in this document.

153	   Capture: Same as Media Capture.

155	   Capture Device: A device that converts physical input, such as
156	   audio, video or text, into an electrical signal, in most cases to
157	   be fed into a media encoder.

159	   Capture Encoding: A specific encoding of a Media Capture, to be
160	   sent by a Media Provider to a Media Consumer via RTP.

162	   Capture Scene: a structure representing a spatial region captured
163	   by one or more Capture Devices, each capturing media representing a
164	   portion of the region. The spatial region represented by a Capture
165	   Scene may or may not correspond to a real region in physical space,
166	   such as a room.  A Capture Scene includes attributes and one or
167	   more Capture Scene Views, with each view including one or more
168	   Media Captures.

170	   Capture Scene View (CSV): a list of Media Captures of the same
171	   media type that together form one way to represent the entire
172	   Capture Scene.

174	   CLUE-capable device: A device that supports the CLUE data channel
175	   [I-D.ietf-clue-datachannel], the CLUE protocol [I-D.presta-clue-
176	   protocol] and the principles of CLUE negotiation, and wishes to
177	   upgrade the call to CLUE-enabled status.

179	   CLUE-enabled call: A call in which two CLUE-capable devices have
180	   successfully negotiated support for a CLUE data channel in SDP. A
181	   CLUE-enabled call is not necessarily immediately able to send CLUE-
182	   controlled media; negotiation of the data channel and of the CLUE
183	   protocol must complete first. Calls between two CLUE-capable
184	   devices which have not yet successfully completed negotiation of
185	   support for the CLUE data channel in SDP are not considered CLUE-
186	   enabled.

188	   Conference: used as defined in [RFC4353], A Framework for
189	   Conferencing within the Session Initiation Protocol (SIP).

191	   Configure Message: A CLUE message a Media Consumer sends to a Media
192	   Provider specifying which content and media streams it wants to
193	   receive, based on the information in a corresponding Advertisement
194	   message.

196	   Consumer: short for Media Consumer.

198	   Encoding or Individual Encoding: a set of parameters representing a
199	   way to encode a Media Capture to become a Capture Encoding.

201	   Encoding Group: A set of encoding parameters representing a total
202	   media encoding capability to be sub-divided across potentially
203	   multiple Individual Encodings.

205	   Endpoint: A CLUE-capable device which is the logical point of final
206	   termination through receiving, decoding and rendering, and/or
207	   initiation through capturing, encoding, and sending of media
208	   streams.  An endpoint consists of one or more physical devices
209	   which source and sink media streams, and exactly one [RFC4353]
210	   Participant (which, in turn, includes exactly one SIP User Agent).
211	   Endpoints can be anything from multiscreen/multicamera rooms to
212	   handheld devices.

214	   Global View: A set of references to one or more Capture Scene Views
215	   of the same media type that are defined within scenes of the same
216	   advertisement.  Each Global View in the list is a suggestion from
217	   the Provider to the Consumer for which CSVs provide a complete
218	   representation of the simultaneous captures provided by the
219	   Provider, across multiple scenes.

221	   MCU: Multipoint Control Unit (MCU) - a CLUE-capable device that
222	   connects two or more endpoints together into one single multimedia
223	   conference [RFC5117].  An MCU includes an [RFC4353]-like Mixer,
224	   without the [RFC4353] requirement to send media to each
225	   participant.

227	   Media: Any data that, after suitable encoding, can be conveyed over
228	   RTP, including audio, video or timed text.

230	   Media Capture: a source of Media, such as from one or more Capture
231	   Devices or constructed from other Media streams.

233	   Media Consumer: a CLUE-capable device that is capable of
234	   receiving Capture Encodings.

236	   Media Provider: a CLUE-capable device that is capable of sending
237	   Capture Encodings.

239	   Multiple Content Capture (MCC): A Capture that mixes and/or
240	   switches other Captures of a single type (e.g., all audio or all
241	   video).  Particular Media Captures may or may not be present in the
242	   resultant Capture Encoding depending on time or space.  Denoted as
243	   MCCn in the example cases in this document.

245	   Plane of Interest: The spatial plane containing the most relevant
246	   subject matter.

248	   Provider: Same as Media Provider.

250	   Render: the process of generating a representation from media, such
251	   as displayed motion video or sound emitted from loudspeakers.

253	   Simultaneous Transmission Set: a set of Media Captures that can be
254	   transmitted simultaneously from a Media Provider.

256	   Single Media Capture: A capture which contains media from a single
257	   source capture device, e.g. an audio capture from a single
258	   microphone, a video capture from a single camera.

260	   Spatial Relation: The arrangement in space of two objects, in
261	   contrast to relation in time or other relationships.

263	   Stream: a Capture Encoding sent from a Media Provider to a Media
264	   Consumer via RTP [RFC3550].

266	   Stream Characteristics: the media stream attributes commonly used
267	   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
268	   resolution, profile/level etc.) as well as CLUE specific
269	   attributes, such as the Capture ID or a spatial location.

271	   Video Capture: Media Capture for video.  Denoted as VCn in the
272	   example cases in this document.

274	   Video Composite: A single image that is formed, normally by an RTP
275	   mixer inside an MCU, by combining visual elements from separate
276	   sources.

278	4. Overview & Motivation

280	   This section provides an overview of the functional elements
281	   defined in this document to represent a telepresence system.  The
282	   motivations for the framework described in this document are also
283	   provided.

285	   Two key concepts introduced in this document are the terms "Media
286	   Provider" and "Media Consumer". A Media Provider represents the
287	   entity that sends the media and a Media Consumer represents the
288	   entity that receives the media. A Media Provider provides Media in
289	   the form of RTP packets; a Media Consumer consumes those RTP
290	   packets.  Media Providers and Media Consumers can reside in
291	   Endpoints or in Multipoint Control Units (MCUs).  A Media Provider
292	   in an Endpoint is usually associated with the generation of media
293	   for Media Captures; these Media Captures are typically sourced
294	   from cameras, microphones, and the like.  Similarly, the Media
295	   Consumer in an Endpoint is usually associated with renderers, such
296	   as screens and loudspeakers.  In MCUs, Media Providers and
297	   Consumers can have the form of outputs and inputs, respectively,
298	   of RTP mixers, RTP translators, and similar devices.  Typically,
299	   telepresence devices such as Endpoints and MCUs would perform as
300	   both Media Providers and Media Consumers, the former being
301	   concerned with those devices' transmitted media and the latter
302	   with those devices' received media.  In a few circumstances, a
303	   CLUE-capable device includes only Consumer or Provider
304	   functionality, such as recorder-type Consumers or webcam-type
305	   Providers.

307	   The motivations for the framework outlined in this document
308	   include the following:

310	   (1) Endpoints in telepresence systems typically have multiple Media
311	   Capture and Media Render devices, e.g., multiple cameras and
312	   screens. While previous system designs were able to set up calls
313	   that would capture media using all cameras and display media on all
314	   screens, for example, there was no mechanism that could associate
315	   these Media Captures with each other in space and time.

317	   (2) The mere fact that there are multiple capturing and rendering
318	   devices, each of which may be configurable in aspects such as zoom,
319	   leads to the difficulty that a variable number of such devices can
320	   be used to capture different aspects of a region.  The Capture
321	   Scene concept allows for the description of multiple setups for
322	   those multiple capture devices that could represent sensible
323	   operation points of the physical capture devices in a room, chosen
324	   by the operator.  A Consumer can pick and choose from those
325	   configurations based on its rendering abilities and inform the
326	   Provider about its choices.  Details are provided in section 7.

328	   (3) In some cases, physical limitations or other reasons disallow
329	   the concurrent use of a device in more than one setup.  For
330	   example, the center camera in a typical three-camera conference
331	   room can set its zoom objective either to capture only the middle
332	   few seats, or all seats of a room, but not both concurrently.  The
333	   Simultaneous Transmission Set concept allows a Provider to signal
334	   such limitations.  Simultaneous Transmission Sets are part of the
335	   Capture Scene description, and discussed in section 8.

337	   (4) Often, the devices in a room do not have the computational
338	   complexity or connectivity to deal with multiple encoding options
339	   simultaneously, even if each of these options is sensible in
340	   certain scenarios, and even if the simultaneous transmission is
341	   also sensible (i.e., in case of multicast media distribution to
342	   multiple endpoints).  Such constraints can be expressed by the
343	   Provider using the Encoding Group concept, described in section 9.

345	   (5) Due to the potentially large number of RTP flows required for a
346	   Multimedia Conference involving potentially many Endpoints, each of
347	   which can have many Media Captures and media renderers, it has
348	   become common to multiplex multiple RTP media flows onto the same
349	   transport address, so as to avoid using the port number as a
350	   multiplexing point and the associated shortcomings such as
351	   NAT/firewall traversal.  While the actual mapping of those RTP
352	   flows to the header fields of the RTP packets is not a subject of
353	   this specification, the large number of possible permutations of
354	   sensible options a Media Provider can make available to a Media
355	   Consumer makes it desirable to have a mechanism that narrows down the
356	   number of possible options that a SIP offer-answer exchange has to
357	   consider.  Such information is made available using protocol
358	   mechanisms specified in this document and companion documents,
359	   although it should be stressed that its use in an implementation is
360	   OPTIONAL.  Also, there are aspects of the control of both Endpoints
361	   and MCUs that dynamically change during the progress of a call,
362	   such as audio-level based screen switching, layout changes, and so
363	   on, which need to be conveyed.  Note that these control aspects are
364	   complementary to those specified in traditional SIP based
365	   conference management such as BFCP.  An exemplary call flow can be
366	   found in section 5.

368	   Finally, all this information needs to be conveyed, and the notion
369	   of support for it needs to be established.  This is done by the
370	   negotiation of a "CLUE channel", a data channel negotiated early
371	   during the initiation of a call.  An Endpoint or MCU that rejects
372	   the establishment of this data channel, by definition, does not
373	   support CLUE based mechanisms, whereas an Endpoint or MCU that
374	   accepts it is REQUIRED to use it to the extent specified in this
375	   document and its companion documents.

377	5. Overview of the Framework/Model

379	   The CLUE framework specifies how multiple media streams are to be
380	   handled in a telepresence conference.

382	   A Media Provider (transmitting Endpoint or MCU) describes specific
383	   aspects of the content of the media and the media stream encodings
384	   it can send in an Advertisement; and the Media Consumer responds to
385	   the Media Provider by specifying which content and media streams it
386	   wants to receive in a Configure message.  The Provider then
387	   transmits the asked-for content in the specified streams.
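
   The exchange above can be sketched as follows.  This is only an
   illustration: the class names, the field layout and the "at most
   max_video video captures" selection policy are assumptions made for
   the example and are not defined by this document or by the CLUE
   data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    # Hypothetical layout: advertised Capture ID -> media type.
    captures: dict[str, str]


@dataclass(frozen=True)
class Configure:
    # Capture IDs the Consumer asks to receive.
    requested: tuple[str, ...]


def choose(adv: Advertisement, max_video: int) -> Configure:
    """Consumer side: take all audio plus at most max_video video captures."""
    audio = [cid for cid, kind in adv.captures.items() if kind == "audio"]
    video = [cid for cid, kind in adv.captures.items() if kind == "video"]
    return Configure(tuple(audio + video[:max_video]))


def streams_to_send(adv: Advertisement, cfg: Configure) -> list[str]:
    """Provider side: transmit only content that was advertised and asked for."""
    unknown = [cid for cid in cfg.requested if cid not in adv.captures]
    if unknown:
        raise ValueError(f"Configure names unadvertised captures: {unknown}")
    return list(cfg.requested)


adv = Advertisement({"VC0": "video", "VC1": "video", "VC2": "video",
                     "AC0": "audio"})
cfg = choose(adv, max_video=1)        # a one-screen Consumer
to_send = streams_to_send(adv, cfg)   # what the Provider then transmits
```

   A real Consumer would base its choice on the Media Capture
   attributes carried in the Advertisement rather than on a simple
   count.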

389	   This Advertisement and Configure typically occur during call
390	   initiation, after CLUE has been enabled in a call, but MAY also
391	   happen at any time throughout the call, whenever there is a change
392	   in what the Consumer wants to receive or (perhaps less common) the
393	   Provider can send.

395	   An Endpoint or MCU typically acts as both Provider and Consumer at
396	   the same time, sending Advertisements and sending Configurations in
397	   response to receiving Advertisements.  (It is possible to be just
398	   one or the other.)

400	   The data model is based around two main concepts: a Capture and an
401	   Encoding.  A Media Capture (MC), such as audio or video, has
402	   attributes to describe the content a Provider can send.  Media
403	   Captures are described in terms of CLUE-defined attributes, such as
404	   spatial relationships and purpose of the capture.  Providers tell
405	   Consumers which Media Captures they can provide, described in terms
406	   of the Media Capture attributes.

408	   A Provider organizes its Media Captures into one or more Capture
409	   Scenes, each representing a spatial region, such as a room.  A
410	   Consumer chooses which Media Captures it wants to receive from the
411	   Capture Scenes.
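
   As a rough illustration of this containment (Capture Scene, then
   Capture Scene Views, then Media Captures), the following sketch
   models the three terms as plain data classes.  The field names are
   hypothetical; the normative representation is the XML schema in the
   companion data model document.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MediaCapture:
    capture_id: str   # e.g. "VC0" or "AC0", as denoted in the examples
    media_type: str   # "audio" or "video"


@dataclass
class CaptureSceneView:
    captures: list[MediaCapture]

    def __post_init__(self) -> None:
        # A Capture Scene View lists Media Captures of the same media type.
        if not self.captures:
            raise ValueError("a view needs at least one Media Capture")
        if len({c.media_type for c in self.captures}) != 1:
            raise ValueError("a view must not mix media types")


@dataclass
class CaptureScene:
    scene_id: str     # hypothetical identifier
    views: list[CaptureSceneView] = field(default_factory=list)

    def capture_ids(self) -> set[str]:
        """All Media Captures a Consumer could pick from this scene."""
        return {c.capture_id for v in self.views for c in v.captures}


# One room offered two ways: three separate cameras, or one wide shot.
cams = [MediaCapture(f"VC{i}", "video") for i in range(3)]
wide = MediaCapture("VC3", "video")
room = CaptureScene("room", [CaptureSceneView(cams), CaptureSceneView([wide])])
```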

413	   In addition, the Provider can send the Consumer a description of
414	   the Individual Encodings it can send in terms of identifiers which
415	   relate to items in SDP.

417	   The Provider can also specify constraints on its ability to provide
418	   Media, and a sensible design choice for a Consumer is to take these
419	   into account when choosing the content and Capture Encodings it
420	   requests in the later offer-answer exchange.  Some constraints are
421	   due to the physical limitations of devices--for example, a camera
422	   may not be able to provide zoom and non-zoom views simultaneously.
423	   Other constraints are system based, such as maximum bandwidth.
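
   A Consumer-side check of both kinds of constraint might look like
   the sketch below.  It assumes, purely for illustration, that the
   physical constraints arrive as a list of sets of Capture IDs that
   can be sent together and that the system constraint is a single
   bandwidth ceiling; the function name, the Capture IDs and the
   bit rates are invented for the example.

```python
def is_feasible(
    requested: set[str],
    simultaneous_sets: list[set[str]],
    bitrate_kbps: dict[str, int],
    max_kbps: int,
) -> bool:
    """Return True if a request respects both kinds of constraint."""
    # Physical constraint: the whole request must fit inside at least
    # one set of captures the Provider can transmit at the same time.
    fits_a_set = any(requested <= s for s in simultaneous_sets)
    # System constraint: the summed bit rate must stay under the ceiling.
    total = sum(bitrate_kbps[c] for c in requested)
    return fits_a_set and total <= max_kbps


# Assumed example: VC1 and VC3 come from the same camera (zoomed in
# versus zoomed out), so they never appear in the same set.
sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
rates = {"VC0": 2000, "VC1": 2000, "VC2": 2000, "VC3": 2000}

ok = is_feasible({"VC0", "VC1"}, sets, rates, max_kbps=6000)
clash = is_feasible({"VC1", "VC3"}, sets, rates, max_kbps=6000)
too_big = is_feasible({"VC0", "VC1", "VC2"}, sets, rates, max_kbps=5000)
```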

425	   The following diagram illustrates the information contained in an
426	   Advertisement.

428	   ...................................................................
429	   .  Provider Advertisement             +--------------------+      .
430	   .                                     | Simultaneous Sets  |      .
431	   .        +------------------------+   +--------------------+      .
432	   .        |       Capture Scene N  |   +--------------------+      .
433	   .      +-+----------------------+ |   | Global View List   |      .
434	   .      |       Capture Scene 2  | |   +--------------------+      .
435	   .    +-+----------------------+ | |      +----------------------+ .
436	   .    |  Capture Scene 1       | | |      |  Encoding Group N    | .
437	   .    |    +---------------+   | | |    +-+--------------------+ | .
438	   .    |    | Attributes    |   | | |    |   Encoding Group 2   | | .
439	   .    |    +---------------+   | | |  +-+--------------------+ | | .
440	   .    |                        | | |  |   Encoding Group 1   | | | .
441	   .    |    +----------------+  | | |  |     parameters       | | | .
442	   .    |    |  V i e w s     |  | | |  |      bandwidth       | | | .
443	   .    |    |  +---------+   |  | | |  | +-------------------+| | | .
444	   .    |    |  |Attribute|   |  | | |  | | V i d e o         || | | .
445	   .    |    |  +---------+   |  | | |  | | E n c o d i n g s || | | .
446	   .    |    |                |  | | |  | | Encoding 1        || | | .
447	   .    |    | View 1         |  | | |  | |                   || | | .
448	   .    |    |  (list of MCs) |  | |-+  | +-------------------+| | | .
449	   .    |    +----|-|--|------+  |-+    |                      | | | .
450	   .    +---------|-|--|---------+      | +-------------------+| | | .
451	   .              | |  |                | | A u d i o         || | | .
452	   .              | |  |                | | E n c o d i n g s || | | .
453	   .              v |  |                | | Encoding 1        || | | .
454	   .      +---------|--|--------+       | |                   || | | .
455	   .      | Media Capture N     |------>| +-------------------+| | | .
456	   .    +-+---------v--|------+ |       |                      | | | .
457	   .    | Media Capture 2     | |       |                      | |-+ .
458	   .  +-+--------------v----+ |-------->|                      | |   .
459	   .  | Media Capture  1    | | |       |                      |-+   .
460	   .  |  +----------------+ |---------->|                      |     .
461	   .  |  | Attributes     | | |_+       +----------------------+     .
462	   .  |  +----------------+ |_+                                      .
463	   .  +---------------------+                                        .
464	   .                                                                 .
465	   ...................................................................
466	   Figure 1: Advertisement Structure

468	   A very brief outline of the call flow used by a simple system (two
469	   Endpoints) in compliance with this document can be described as
470	   follows, and as shown in the following figure.

472	         +-----------+                     +-----------+
473	         | Endpoint1 |                     | Endpoint2 |
474	         +----+------+                     +-----+-----+
475	              | INVITE (BASIC SDP+CLUECHANNEL)   |
476	              |--------------------------------->|
477	              |    200 OK (BASIC SDP+CLUECHANNEL)|
478	              |<---------------------------------|
479	              | ACK                              |
480	              |--------------------------------->|
481	              |                                  |
482	              |<################################>|
483	              |     BASIC SDP MEDIA SESSION      |
484	              |<################################>|
485	              |                                  |
486	              |    CONNECT (CLUE CTRL CHANNEL)   |
487	              |=================================>|
488	              |            ...                   |
489	              |<================================>|
490	              |   CLUE CTRL CHANNEL ESTABLISHED  |
491	              |<================================>|
492	              |                                  |
493	              | ADVERTISEMENT 1                  |
494	              |*********************************>|
495	              |                  ADVERTISEMENT 2 |
496	              |<*********************************|
497	              |                                  |
498	              |                      CONFIGURE 1 |
499	              |<*********************************|
500	              | CONFIGURE 2                      |
501	              |*********************************>|
502	              |                                  |
503	              | REINVITE (UPDATED SDP)           |
504	              |--------------------------------->|
505	              |              200 OK (UPDATED SDP)|
506	              |<---------------------------------|
507	              | ACK                              |
508	              |--------------------------------->|
509	              |                                  |
510	              |<################################>|
511	              |   UPDATED SDP MEDIA SESSION      |
512	              |<################################>|
513	              |                                  |
514	              v                                  v

516	                    Figure 2: Basic Information Flow

518	   An initial offer/answer exchange establishes a basic media session,
519	   for example audio-only, and a CLUE channel between two Endpoints.
520	   With the establishment of that channel, the endpoints have
521	   consented to use the CLUE protocol mechanisms and, therefore, MUST
522	   adhere to the CLUE protocol suite as outlined herein.

524	   Over this CLUE channel, the Provider in each Endpoint conveys its
525	   characteristics and capabilities by sending an Advertisement as
526	   specified herein.  The Advertisement is typically not sufficient to
527	   set up all media.  The Consumer in the Endpoint receives the
528	   information provided by the Provider, and can use it for two
529	   purposes.  First, it MUST construct and send a CLUE Configure
530	   message to tell the Provider what the Consumer wishes to receive.
531	   Second, it MAY, but is not necessarily REQUIRED to, use the
532	   information provided to tailor the SDP it is going to send during
533	   the following SIP offer/answer exchange, and its reaction to SDP it
534	   receives in that step.  It is often a sensible implementation
535	   choice to do so, as the representation of the media information
536	   conveyed over the CLUE channel can dramatically cut down on the
537	   size of SDP messages used in the O/A exchange that follows.
538	   Spatial relationships associated with the Media can be included in
539	   the Advertisement, and it is often sensible for the Media Consumer
540	   to take those spatial relationships into account when tailoring the
541	   SDP.

543	   This CLUE exchange MUST be followed by an SDP offer/answer exchange
544	   that not only establishes those aspects of the media that have not
545	   been "negotiated" over CLUE, but also has the side effect of
546	   setting up the media transmission itself, potentially involving
547	   security exchanges, ICE, and similar mechanisms.  This step is
548	   plain SIP, with the exception that the SDP used herein can, in most
549	   (but not necessarily all) cases, be considerably smaller than the
550	   SDP a system would typically need to exchange if there were no pre-
551	   established knowledge about the Provider and Consumer
552	   characteristics.  (The need for cutting down SDP size is not
553	   obvious for a point-to-point call involving simple endpoints;
554	   however, when considering a large multipoint conference involving
555	   many multi-screen/multi-camera endpoints, each of which can operate
556	   using multiple codecs for each camera and microphone, it becomes
557	   more apparent.)

559	   During the lifetime of a call, further exchanges MAY occur over the
560	   CLUE channel.  In some cases, those further exchanges lead to a
561	   modified system behavior of Provider or Consumer (or both) without
562	   any other protocol activity such as further offer/answer exchanges.
563	   For example, voice-activated screen switching, signaled over the
564	   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
565	   re-invites.  However, in other cases, after the CLUE negotiation an
566	   additional offer/answer exchange becomes necessary.  For example,
567	   if both sides decide to upgrade the call from a single screen to a
568	   multi-screen call and more bandwidth is required for the additional
569	   video channels compared to what was previously negotiated using
570	   offer/answer, a new O/A exchange is REQUIRED.

572	   One aspect of the protocol outlined herein and specified in more
573	   detail in companion documents is that it makes available
574	   information regarding the Provider's capabilities to deliver Media,
575	   and attributes related to that Media such as their spatial
576	   relationship, to the Consumer.  The operation of the renderer
577	   inside the Consumer is unspecified in that it can choose to ignore
578	   some information provided by the Provider, and/or not render media
579	   streams available from the Provider (although it MUST follow the
580	   CLUE protocol and, therefore, MUST gracefully receive and respond
581	   (through a Configure) to the Provider's information).  All CLUE
582	   protocol mechanisms are OPTIONAL in the Consumer in the sense that,
583	   while the Consumer MUST be able to receive (and, potentially,
584	   gracefully acknowledge) CLUE messages, it is free to ignore the
585	   information provided therein.

587	   A CLUE-implementing device interoperates with a device that does
588	   not support CLUE, because the non-CLUE device does, by definition,
589	   not understand the offer of a CLUE channel in the initial
590	   offer/answer exchange and, therefore, will reject it. This
591	   rejection MUST be used as the indication to the CLUE-implementing
592	   device that the other side of the communication is not compliant
593	   with CLUE, and to fall back to behavior that does not require CLUE.

595	   As for the media, Provider and Consumer have an end-to-end
596	   communication relationship with respect to (RTP transported) media;
597	   and the mechanisms described herein and in companion documents do
598	   not change the aspects of setting up those RTP flows and sessions.
599	   In other words, the RTP media sessions conform to the negotiated
600	   SDP whether or not CLUE is used.

602	6. Spatial Relationships

604	   In order for a Consumer to perform a proper rendering, it is often
605	   necessary or at least helpful for the Consumer to have received
606	   spatial information about the streams it is receiving.  CLUE
607	   defines a coordinate system that allows Media Providers to describe
608	   the spatial relationships of their Media Captures to enable proper
609	   scaling and spatially sensible rendering of their streams.  The
610	   coordinate system is based on a few principles:

612	   o  Simple systems which do not have multiple Media Captures to
613	      associate spatially need not use the coordinate model.

615	   o  Coordinates can be in real, physical units (millimeters),
616	      have an unknown scale, or have no physical scale.  Systems which
617	      know their physical dimensions (for example professionally
618	      installed Telepresence room systems) MUST always provide those
619	      real-world measurements.  Systems which don't know specific
620	      physical dimensions but still know relative distances MUST use
621	      'unknown scale'.  'No scale' is intended to be used where Media
622	      Captures from different devices (with potentially different
623	      scales) will be forwarded alongside one another (e.g. in the
624	      case of an MCU).

626	      *  "Millimeters" means the scale is in millimeters.

628	      *  "Unknown" means the scale is not necessarily millimeters, but
629	         the scale is the same for every Capture in the Capture Scene.

631	      *  "No Scale" means the scale could be different for each
632	         capture.  An MCU Provider that advertises two adjacent
633	         captures and picks sources (which can change quickly) from
634	         different endpoints might use this value; the scale could be
635	         different and changing for each capture.  But the areas of
636	         capture still represent a spatial relation between captures.

638	   o  The coordinate system is right-handed Cartesian X, Y, Z with the
639	      origin at a spatial location of the Provider's choosing.  The
640	      Provider MUST use the same coordinate system with the same scale
641	      and origin for all coordinates within the same Capture Scene.

643	   The direction of increasing coordinate values is:
644	   X increases from left to right, from the point of view of an
645	   observer at the front of the room looking toward the back
646	   Y increases from the front of the room to the back of the room
647	   Z increases from low to high (i.e. floor to ceiling)

649	   Cameras in a scene typically point in the direction of increasing
650	   Y, from front to back.  But there could be multiple cameras
651	   pointing in different directions.  If the physical space does not
652	   have a well-defined front and back, the provider chooses any
653	   direction for X and Y consistent with right-handed coordinates.
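
   As an illustration only, the coordinate conventions above can be
   sketched in Python.  The names Scale, Point, cross and
   distances_comparable are hypothetical and are not defined by CLUE
   or its data model; the rule in distances_comparable is one reading
   of the three scale values, not normative text.

```python
from dataclasses import dataclass
from enum import Enum


class Scale(Enum):
    """The three scale values described above (names are illustrative)."""
    MILLIMETERS = "millimeters"
    UNKNOWN = "unknown"
    NO_SCALE = "no scale"


@dataclass(frozen=True)
class Point:
    x: float  # increases left to right, seen from the front of the room
    y: float  # increases from the front of the room to the back
    z: float  # increases from floor to ceiling


def cross(a: Point, b: Point) -> Point:
    """Vector cross product of a and b."""
    return Point(a.y * b.z - a.z * b.y,
                 a.z * b.x - a.x * b.z,
                 a.x * b.y - a.y * b.x)


def distances_comparable(scale: Scale) -> bool:
    """Assumption: distances can be compared across Captures in a scene
    only when every Capture shares one scale (millimeters or unknown)."""
    return scale in (Scale.MILLIMETERS, Scale.UNKNOWN)


X_AXIS = Point(1.0, 0.0, 0.0)
Y_AXIS = Point(0.0, 1.0, 0.0)
Z_AXIS = Point(0.0, 0.0, 1.0)
```

   In a right-handed system the cross product of the X and Y unit
   vectors is the Z unit vector, which is what cross(X_AXIS, Y_AXIS)
   yields here.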

655	7. Media Captures and Capture Scenes

657	   This section describes how Providers can describe the content of
658	   media to Consumers.

660	7.1. Media Captures

662	   Media Captures are the fundamental representations of streams that
663	   a device can transmit.  What a Media Capture actually represents is
664	   flexible:

666	   o  It can represent the immediate output of a physical source (e.g.
667	      camera, microphone) or 'synthetic' source (e.g. laptop computer,
668	      DVD player).

670	   o  It can represent the output of an audio mixer or video composer

672	   o  It can represent a concept such as 'the loudest speaker'

674	   o  It can represent a conceptual position such as 'the leftmost
675	      stream'

677	   To identify and distinguish between multiple Capture instances,
678	   each Capture has a unique identity.  For instance: VC1, VC2 and AC1,
679	   AC2, where VC1 and VC2 refer to two different video captures and
680	   AC1 and AC2 refer to two different audio captures.

682	   Some key points about Media Captures:

684	     . A Media Capture is of a single media type (e.g. audio or
685	        video)
686	     . A Media Capture is defined in a Capture Scene and is given an
687	        advertisement unique identity.  The identity may be referenced
688	        outside the Capture Scene that defines it through a Multiple
689	        Content Capture (MCC)
690	     . A Media Capture may be associated with one or more Capture
691	        Scene Views
692	     . A Media Capture has exactly one set of spatial information
693	     . A Media Capture can be the source of one or more Capture
694	        Encodings

696	   Each Media Capture can be associated with attributes to describe
697	   what it represents.

699	7.1.1. Media Capture Attributes

701	   Media Capture Attributes describe information about the Captures.
702	   A Provider can use the Media Capture Attributes to describe the
703	   Captures for the benefit of the Consumer of the Advertisement
704	   message.  Media Capture Attributes include:

706	     . Spatial information, such as point of capture, point on line
707	        of capture, and area of capture, all of which, in combination
708	        define the capture field of, for example, a camera
709	     . Other descriptive information to help the Consumer choose
710	        between captures (description, presentation, view, priority,
711	        language, person information and type)
712	     . Control information for use inside the CLUE protocol suite

714	   The sub-sections below define the Capture attributes.

716	7.1.1.1. Point of Capture

718	   The Point of Capture attribute is a field with a single Cartesian
719	   (X, Y, Z) point value which describes the spatial location of the
720	   capturing device (such as camera).  For an Audio Capture with
721	   multiple microphones, the Point of Capture defines the nominal mid-
722	   point of the microphones.

724	7.1.1.2. Point on Line of Capture

726	   The Point on Line of Capture attribute is a field with a single
727	   Cartesian (X, Y, Z) point value which describes a position in space
728	   of a second point on the axis of the capturing device, toward the
729	   direction it is pointing; the first point being the Point of
730	   Capture (see above).

732	   Together, the Point of Capture and Point on Line of Capture define
733	   the direction and axis of the capturing device, for example the
734	   optical axis of a camera or the axis of a microphone.  The Media
735	   Consumer can use this information to adjust how it renders the
736	   received media if it so chooses.
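
   A minimal sketch of how a Consumer might derive the pointing
   direction from these two attributes follows.  The function name
   capture_direction and the tuple representation of points are
   assumptions made for this example, not part of CLUE.

```python
import math


def capture_direction(point_of_capture, point_on_line_of_capture):
    """Return the unit vector from the Point of Capture toward the
    Point on Line of Capture, both given as (x, y, z) tuples."""
    dx, dy, dz = (b - a for a, b in zip(point_of_capture,
                                        point_on_line_of_capture))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0.0:
        # Two identical points do not define an axis.
        raise ValueError("the two points must differ to define an axis")
    return (dx / length, dy / length, dz / length)
```

   For a camera at (0, 0, 1000) with a Point on Line of Capture of
   (0, 2000, 1000), the result is (0.0, 1.0, 0.0), that is, the camera
   points in the direction of increasing Y.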

738	   For an Audio Capture, the Media Consumer can use this information
739	   along with the Audio Capture Sensitivity Pattern to define a 3-
740	   dimensional volume of capture where sounds can be expected to be
741	   picked up by the microphone providing this specific audio capture.
742	   If the Consumer wants to associate an Audio Capture with a Video
743	   Capture, it can compare this volume with the area of capture for
744	   video media to provide a check on whether the audio capture is
745	   indeed spatially associated with the video capture. For example, a
746	   video area of capture that fails to intersect at all with the audio
747	   volume of capture, or is at such a long radial distance from the
748	   microphone point of capture that the audio level would be very low,
749	   would be inappropriate.

751	7.1.1.3. Area of Capture

753	   The Area of Capture is a field with a set of four (X, Y, Z) points
754	   as a value which describes the spatial location of what is being
755	   "captured".  This attribute applies only to video captures, not
756	   other types of media. By comparing the Area of Capture for
757	   different Video Captures within the same Capture Scene a Consumer
758	   can determine the spatial relationships between them and render
759	   them correctly.

761	   The four points MUST be co-planar, forming a quadrilateral, which
762	   defines the Plane of Interest for the particular media capture.
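
   The co-planarity requirement can be checked with a scalar triple
   product, as in the sketch below.  The function names and the
   absolute tolerance are assumptions for illustration; a real
   implementation would likely choose a tolerance relative to the
   coordinate scale in use.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_coplanar(points, tol=1e-9):
    """True if the four (x, y, z) points lie in one plane.

    The scalar triple product of the three edge vectors leaving the
    first point is zero exactly when all four points are co-planar.
    """
    if len(points) != 4:
        raise ValueError("an Area of Capture has exactly four points")
    p0, p1, p2, p3 = points
    triple = _dot(_sub(p1, p0), _cross(_sub(p2, p0), _sub(p3, p0)))
    return abs(triple) <= tol
```

   For instance, the four points (0,0,0), (9,0,0), (0,0,9), (9,0,9)
   all have Y = 0 and pass the check, while moving the last point to
   (9,5,9) makes it fail.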

764	   If the Area of Capture is not specified, it means the Video Capture
765	   is not spatially related to any other Video Capture.

767	   For a switched capture that switches between different sections
768	   within a larger area, the area of capture MUST use coordinates for
769	   the larger potential area.

771	7.1.1.4. Mobility of Capture

773	   The Mobility of Capture attribute indicates whether or not the
774	   point of capture, point on line of capture, and area of capture
775	   values stay the same over time, or are expected to change
776	   (potentially frequently).  Possible values are static, dynamic, and
777	   highly dynamic.

779	   An example of "dynamic" is a camera mounted on a stand which is
780	   occasionally hand-carried and placed at different positions in
781	   order to provide the best angle to capture a work task.  A camera
782	   worn by a person who moves around the room is an example of
783	   "highly dynamic". In either case, the effect is that the capture
784	   point, capture axis and area of capture change with time.

786	   The capture point of a static capture MUST NOT move for the life of
787	   the conference. The capture point of dynamic captures is
788	   characterized by a change in position followed by a reasonable
789	   period of stability, on the order of minutes. Highly dynamic
790	   captures are characterized by a capture point that is constantly
791	   moving.  If the "area of capture", "point of capture" and "point on
792	   line of capture" attributes are included with dynamic or highly
793	   dynamic captures they indicate spatial information at the time of
794	   the Advertisement.

796	7.1.1.5. Audio Capture Sensitivity Pattern

798	   The Audio Capture Sensitivity Pattern attribute applies only to
799	   audio captures.  This is an optional attribute.  This attribute
800	   gives information about the nominal sensitivity pattern of the
801	   microphone which is the source of the capture.  Possible values
802	   include patterns such as omni, shotgun, cardioid, hyper-cardioid.

804	7.1.1.6. Max Capture Encodings

806	   The Max Capture Encodings attribute is an optional attribute
807	   indicating the maximum number of Capture Encodings that can be
808	   simultaneously active for the Media Capture.  The number of
809	   simultaneous Capture Encodings is also limited by the restrictions
810	   of the Encoding Group for the Media Capture.

812	7.1.1.7. Description

814	   The Description attribute is a human-readable description (which
815	   could be in multiple languages) of the Capture.

817	7.1.1.8. Presentation

819	   The Presentation attribute indicates that the capture originates
820	   from a presentation device, that is, one that provides supplementary
821	   information to a conference through slides, video, still images,
822	   data etc.  Where more information is known about the capture it MAY
823	   be expanded hierarchically to indicate the different types of
824	   presentation media, e.g. presentation.slides, presentation.image
825	   etc.

827	   Note: It is expected that a number of keywords will be defined that
828	   provide more detail on the type of presentation.

830	7.1.1.9. View

832	   The View attribute is a field with enumerated values, indicating
833	   what type of view the Capture relates to.  The Consumer can use
834	   this information to help choose which Media Captures it wishes to
835	   receive.  The value MUST be one of:

837	   Room - Captures the entire scene

839	   Table - Captures the conference table with seated people

841	   Individual - Captures an individual person

843	   Lectern - Captures the region of the lectern including the
844	   presenter, for example in a classroom style conference room

846	   Audience - Captures a region showing the audience in a classroom
847	   style conference room

849	7.1.1.10. Language

851	   The language attribute indicates one or more languages used in the
852	   content of the Media Capture.  Captures MAY be offered in different
853	   languages in case of multilingual and/or accessible conferences.  A
854	   Consumer can use this attribute to differentiate between them and
855	   pick the appropriate one.

857	   Note that the Language attribute is defined and meaningful both for
858	   audio and video captures.  In case of audio captures, the meaning
859	   is obvious.  For a video capture, "Language" could, for example, be
860	   sign interpretation or text.

862	7.1.1.11. Person Information

864	   The person information attribute allows a Provider to provide
865	   specific information regarding the people in a Capture (regardless
866	   of whether or not the capture has a Presentation attribute). The
867	   Provider may gather the information automatically or manually from
868	   a variety of sources; however, the xCard [RFC6351] format is used to
869	   convey the information. This allows various information such as
870	   Identification information (section 6.2/[RFC6350]), Communication
871	   Information (section 6.4/[RFC6350]) and Organizational information
872	   (section 6.6/[RFC6350]) to be communicated. A Consumer may then
873	   automatically (i.e. via a policy) or manually select Captures
874	   based on information about who is in a Capture. It also allows a
875	   Consumer to render information regarding the people participating
876	   in the conference or to use it for further processing.

878	   The Provider may supply a minimal set of information or a larger
879	   set of information. However, it MUST be compliant with [RFC6350] and
880	   supply a "VERSION" and "FN" property. A Provider may supply
881	   multiple xCards per Capture of any KIND (section 6.1.4/[RFC6350]).

883	   In order to keep CLUE messages compact the Provider SHOULD use a
884	   URI to point to any LOGO, PHOTO or SOUND contained in the xCARD
885	   rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE
886	   message.

888	7.1.1.12. Person Type

890	   The person type attribute indicates the type of people contained in
891	   the capture in the conference with respect to the meeting agenda
892	   (regardless of whether or not the capture has a Presentation
893	   attribute). As a capture may include multiple people, the attribute
894	   may contain multiple values. However, values shall not be repeated
895	   within the attribute.

897	   An Advertiser associates the person type with an individual capture
898	   when it knows that a particular type is in the capture. If an
899	   Advertiser cannot link a particular type with some certainty to a
900	   capture then it is not included. A Consumer on reception of a
901	   capture with a person type attribute knows with some certainty that
902	   the capture contains that person type. The capture may contain
903	   other person types but the Advertiser has not been able to
904	   determine that this is the case.

906	   The types of Captured people include:

908	     . Chairman - the person responsible for running the conference
909	        according to the agenda.
910	     . Vice-Chairman - the person responsible for assisting the
911	        chairman in running the meeting.
912	     . Minute Taker - the person responsible for recording the
913	        minutes of the conference
914	     . Member - the person has no particular responsibilities with
915	        respect to running the meeting.
916	     . Presenter - the person is scheduled on the agenda to make a
917	        presentation in the meeting. Note: This is not related to any
918	        "active speaker" functionality.
920	     . Translator - the person is providing some form of translation
921	        or commentary in the meeting.
922	     . Timekeeper - the person is responsible for maintaining the
923	        meeting schedule.

925	   Furthermore the person type attribute may contain one or more
926	   strings allowing the Provider to indicate custom meeting specific
927	   roles.

929	7.1.1.13. Priority

931	   The priority attribute indicates a relative priority between
932	   different Media Captures.  The Provider sets this priority, and the
933	   Consumer MAY use the priority to help decide which captures it
934	   wishes to receive.

936	   The "priority" attribute is an integer which indicates a relative
937	   priority between Captures. For example it is possible to assign a
938	   priority between two presentation Captures that would allow a
939	   remote endpoint to determine which presentation is more important.
940	   Priority is assigned at the individual capture level. It represents
941	   the Provider's view of the relative priority between Captures with
942	   a priority. The same priority number MAY be used across multiple
943	   Captures. It indicates they are equally important. If no priority
944	   is assigned, no assumptions can be made regarding the relative
945	   importance of the Capture.

947	7.1.1.14. Embedded Text

949	   The Embedded Text attribute indicates that a Capture provides
950	   embedded textual information. For example the video Capture MAY
951	   contain speech to text information composed with the video image.
952	   This attribute is only applicable to video Captures and
953	   presentation streams with visual information.

955	7.1.1.15. Related To

957	   The Related To attribute indicates the Capture contains additional
958	   complementary information related to another Capture.  The value
959	   indicates the identity of the other Capture to which this Capture
960	   is providing additional information.

962	   For example, a conference can utilize translators or facilitators
963	   that provide an additional audio stream (i.e. a translation or
964	   description or commentary of the conference).  Where multiple
965	   captures are available, it may be advantageous for a Consumer to
966	   select a complementary Capture instead of or in addition to a
967	   Capture it relates to.

969	7.2. Multiple Content Capture

971	   The MCC indicates that one or more Single Media Captures are
972	   contained in one Media Capture.  Only one Capture type (i.e. audio,
973	   video, etc.) is allowed in each MCC instance.  The MCC may contain
974	   a reference to the Single Media Captures (which may have their own
975	   attributes) as well as attributes associated with the MCC itself.
976	   An MCC may also contain other MCCs.  The MCC MAY reference Captures
977	   from within the Capture Scene that defines it or from other Capture
978	   Scenes.  No ordering is implied by the order that Captures appear
979	   within an MCC. An MCC MAY contain no references to other Captures to
980	   indicate that the MCC contains content from multiple sources but no
981	   information regarding those sources is given.

983	   One or more MCCs may also be specified in a CSV.  This allows an
984	   Advertiser to indicate that several MCC captures are used to
985	   represent a capture scene.  Table 14 provides an example of this
986	   case.

988	   As outlined in section 7.1, each instance of the MCC has its own
989	   Capture identity, e.g. MCC1. It allows all the individual captures
990	   contained in the MCC to be referenced by a single MCC identity.

992	   The example below shows the use of a Multiple Content Capture:

994	        +-----------------------+---------------------------------+
995	        | Capture Scene #1      |                                 |
996	        +-----------------------|---------------------------------+
997	        | VC1                   | {attributes}                    |
998	        | VC2                   | {attributes}                    |
999	        | VCn                   | {attributes}                    |
1000	        | MCC1(VC1,VC2,...VCn)  | {attributes}                    |
1001	        | CSV(MCC1)             |                                 |
1002	        +---------------------------------------------------------+

1004	                Table 1: Multiple Content Capture concept

1006	   This indicates that MCC1 is a single capture that contains the
1007	   Captures VC1, VC2 through VCn according to any MCC1 attributes.

1009	7.2.1. MCC Attributes

1011	   Attributes may be associated with the MCC instance and the Single
1012	   Media Captures that the MCC references.  A Provider should avoid
1013	   providing conflicting attribute values between the MCC and Single
1014	   Media Captures. Where there is conflict the attributes of the MCC
1015	   override any that may be present in the individual captures.
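
   The override rule can be illustrated with a small sketch.  The
   function name effective_attributes and the dictionary
   representation of attributes are hypothetical; the attribute keys
   used in the usage note are examples only.

```python
def effective_attributes(mcc_attrs, capture_attrs):
    """Combine the attributes of a referenced Single Media Capture
    with those of the MCC; on conflict the MCC value wins."""
    merged = dict(capture_attrs)   # start from the individual capture
    merged.update(mcc_attrs)       # MCC attributes override conflicts
    return merged
```

   With an MCC carrying {"view": "room"} and a referenced capture
   carrying {"view": "individual", "language": "en"}, the combined
   result is {"view": "room", "language": "en"}.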

1017	   A Provider MAY include as much or as little of the original source
1018	   Capture information as it requires.

1020	   There are MCC specific attributes that MUST only be used with
1021	   Multiple Content Captures. These are described in the sections
1022	   below. The attributes described in section 7.1.1 MAY also be used
1023	   with MCCs.

1025	   The spatial related attributes of an MCC indicate its area of
1026	   capture and point of capture within the scene, just like any other
1027	   media capture.  The spatial information does not imply anything
1028	   about how other captures are composed within an MCC.

1030	   For example:  A virtual scene could be constructed for the MCC
1031	   capture with two Video Captures with a "MaxCaptures" attribute set
1032	   to 2 and an "Area of Capture" attribute provided with an overall
1033	   area.  Each of the individual Captures could then also include an
1034	   "Area of Capture" attribute with a sub-set of the overall area.
1035	   The Consumer would then know how each capture is related to others
1036	   within the scene, but not the relative position of the individual
1037	   captures within the composed capture.

1039	        +-----------------------+---------------------------------+
1040	        | Capture Scene #1      |                                 |
1041	        +-----------------------|---------------------------------+
1042	        | VC1                   | AreaofCapture=(0,0,0)(9,0,0)    |
1043	        |                       |               (0,0,9)(9,0,9)    |
1044	        | VC2                   | AreaofCapture=(10,0,0)(19,0,0)  |
1045	        |                       |               (10,0,9)(19,0,9)  |
1046	        | MCC1(VC1,VC2)         | MaxCaptures=2                   |
1047	        |                       | AreaofCapture=(0,0,0)(19,0,0)   |
1048	        |                       |               (0,0,9)(19,0,9)   |
1049	        | CSV(MCC1)             |                                 |
1050	        +---------------------------------------------------------+

1052	        Table 2: Example of MCC and Single Media Capture attributes
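
   Using the values from Table 2, a Consumer could verify that each
   individual area lies inside the overall MCC area.  The sketch below
   uses an axis-aligned bounding box comparison, which is a
   simplification chosen for this example (it is exact here only
   because all areas lie in the same axis-aligned plane); the function
   names are hypothetical.

```python
def bounding_box(area):
    """Return (min corner, max corner) of a list of (x, y, z) points."""
    xs, ys, zs = zip(*area)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def within(inner, outer):
    """True if the bounding box of inner lies inside that of outer."""
    (ilo, ihi), (olo, ohi) = bounding_box(inner), bounding_box(outer)
    return (all(o <= i for i, o in zip(ilo, olo)) and
            all(i <= o for i, o in zip(ihi, ohi)))


# Areas of Capture copied from Table 2.
VC1 = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
VC2 = [(10, 0, 0), (19, 0, 0), (10, 0, 9), (19, 0, 9)]
MCC1 = [(0, 0, 0), (19, 0, 0), (0, 0, 9), (19, 0, 9)]
```

   Both within(VC1, MCC1) and within(VC2, MCC1) hold, whereas
   within(MCC1, VC1) does not.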

1054	   The sections below describe the MCC only attributes.

1056	7.2.1.1. Maximum Number of Captures within a MCC

1058	   The Maximum Number of Captures MCC attribute indicates the maximum
1059	   number of individual captures that may appear in a Capture Encoding
1060	   at a time.  The actual number at any given time can be less than
1061	   this maximum.  It may be used to derive how the Single Media
1062	   Captures within the MCC are composed / switched with regards to
1063	   space and time.

1065	   A Provider can indicate that the number of captures in an MCC
1066	   capture encoding is equal to ("=") the MaxCaptures value or that
1067	   there may be any number of captures up to and including ("<=") the
1068	   MaxCaptures value. This allows a Provider to distinguish between an
1069	   MCC that purely represents a composition of sources versus an MCC
1070	   that represents switched or switched and composed sources.

1072	   MaxCaptures MAY be set to one so that only content related to one
1073	   of the sources is shown in the MCC Capture Encoding at a time or
1074	   it may be set to any value up to the total number of Source Media
1075	   Captures in the MCC.

1077	   The bullets below describe how the setting of MaxCaptures versus the
1078	   number of captures in the MCC affects how sources appear in a
1079	   capture encoding:

1081	     . When MaxCaptures is set to <= 1 and the number of captures in
1082	        the MCC is greater than 1 (or not specified) this is a
1083	        switched case. Zero or 1 captures may be switched into
1084	        the capture encoding. Note: zero is allowed because of the
1085	        "<=".
1086	     . When MaxCaptures is set to = 1 and the number of captures in
1087	        the MCC is greater than 1 (or not specified) this is a
1088	        switched case. Only one capture source is contained in a
1089	        capture encoding at a time.
1090	     . When MaxCaptures is set to <= N (with N > 1) and the number of
1091	        captures in the MCC is greater than N (or not specified) this
1092	        is a switched and composed case. The capture encoding may
1093	        contain purely switched sources (e.g. <=2 allows for 1 source
1094	        on its own), or may contain composed and switched sources
1095	        (e.g. a composition of 2 sources switched between the
1096	        sources).
1097	     . When MaxCaptures is set to = N (with N > 1) and the number of
1098	        captures in the MCC is greater than N (or not specified) this
1099	        is a switched and composed case. The capture encoding contains
1100	        composed and switched sources (i.e. a composition of N sources
1101	        switched between the sources). It is not possible to have a
1102	        single source.
1103	     . When MaxCaptures is set to <= the number of captures in the
1104	        MCC this is a switched and composed case. The capture encoding
1105	        may contain media switched between any number (up to the
1106	        MaxCaptures) of composed sources.
1107	     . When MaxCaptures is set to = the number of captures in the
1108	        MCC this is a composed case. All the sources are composed into
1109	        a single capture encoding.

1111	   If this attribute is not set then by default it is assumed that all
1112	   source content can appear concurrently in the Capture Encoding
1113	   associated with the MCC.

1115	   For example: the use of MaxCaptures equal to 1 on an MCC with three
1116	   Video Captures VC1, VC2 and VC3 would indicate that the Advertiser
1117	   would switch between VC1, VC2 and VC3 in the capture encoding, as
1118	   there may be at most one capture at a time.
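
   As an informal illustration only (it is not part of the CLUE data
   model), the bullets above can be sketched in Python.  The function
   name classify_mcc and its arguments are hypothetical, and a number
   of captures that is "not specified" is represented here as None.

```python
from typing import Optional


def classify_mcc(op: str, max_captures: int,
                 num_sources: Optional[int]) -> str:
    """Classify an MCC following the MaxCaptures bullets (hypothetical helper).

    op           -- "=" (exactly max_captures) or "<=" (up to max_captures)
    max_captures -- the MaxCaptures value
    num_sources  -- number of captures in the MCC, or None if not specified
    """
    if op not in ("=", "<="):
        raise ValueError("op must be '=' or '<='")
    if max_captures < 1:
        raise ValueError("MaxCaptures must be at least 1")
    if num_sources is not None and max_captures > num_sources:
        raise ValueError("MaxCaptures exceeds the number of captures")

    # Last two bullets: MaxCaptures equals the number of captures in the MCC.
    if num_sources is not None and max_captures == num_sources:
        return "composed" if op == "=" else "switched and composed"

    # Remaining bullets: more captures than MaxCaptures, or not specified.
    if max_captures == 1:
        return "switched"
    return "switched and composed"
```

   With the three Video Captures of the example, classify_mcc("=", 1, 3)
   gives "switched", while classify_mcc("=", 3, 3) gives "composed".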

1120	7.2.1.2. Policy

1122	   The Policy MCC Attribute indicates the criteria that the Provider
1123	   uses to determine when and/or where media content appears in the
1124	   Capture Encoding related to the MCC.

1126	   The attribute is in the form of a token that indicates the policy
1127	   and an index representing an instance of the policy.

1129	   The tokens are:

1131	   SoundLevel - This indicates that the content of the MCC is
1132	   determined by a sound level detection algorithm. For example: the
1133	   loudest (active) speaker is contained in the MCC.

1135	   RoundRobin - This indicates that the content of the MCC is
1136	   determined by a time based algorithm. For example: the Provider
1137	   provides content from a particular source for a period of time and
1138	   then provides content from another source and so on.

1140	   An index is used to represent an instance in the policy setting. An
1141	   index of 0 represents the most current instance of the policy (e.g.
1142	   the active speaker), 1 represents the previous instance (e.g. the
1143	   previous active speaker), and so on.

1145	   The following example shows a case where the Provider provides two
1146	   media streams, one showing the active speaker and a second stream
1147	   showing the previous speaker.

1149	        +-----------------------+---------------------------------+
1150	        | Capture Scene #1      |                                 |
1151	        +-----------------------+---------------------------------+
1152	        | VC1                   |                                 |
1153	        | VC2                   |                                 |
1154	        | MCC1(VC1,VC2)         | Policy=SoundLevel:0             |
1155	        |                       | MaxCaptures=1                   |
1156	        | MCC2(VC1,VC2)         | Policy=SoundLevel:1             |
1157	        |                       | MaxCaptures=1                   |
1158	        | CSV(MCC1,MCC2)        |                                 |
1159	        +-----------------------+---------------------------------+

1161	                Table 3: Example Policy MCC attribute usage
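
   The "token:index" notation used in Table 3 could be handled as in
   the following Python sketch.  This textual form is only the notation
   of the table; the actual encoding is defined by the CLUE data model.
   The names Policy and parse_policy are hypothetical.

```python
from dataclasses import dataclass

# The two policy tokens listed in this section.
KNOWN_POLICIES = {"SoundLevel", "RoundRobin"}


@dataclass(frozen=True)
class Policy:
    token: str  # which policy is used
    index: int  # 0 = most current instance, 1 = previous instance, ...


def parse_policy(value: str) -> Policy:
    """Parse the illustrative 'token:index' notation (hypothetical helper)."""
    token, sep, index_text = value.partition(":")
    if not sep:
        raise ValueError("expected 'token:index'")
    if token not in KNOWN_POLICIES:
        raise ValueError("unknown policy token: " + token)
    if not index_text.isdigit():
        raise ValueError("index must be a non-negative integer")
    return Policy(token, int(index_text))
```

   For Table 3, parse_policy("SoundLevel:0") describes MCC1 (the active
   speaker) and parse_policy("SoundLevel:1") describes MCC2 (the
   previous active speaker).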

1163	7.2.1.3. Synchronisation Identity

1165	   The Synchronisation Identity MCC attribute indicates how the
1166	   individual captures in multiple MCC captures are synchronised.  To
1167	   indicate that the Capture Encodings associated with MCCs contain
1168	   captures from the same source at the same time, a Provider should
1169	   set the same Synchronisation Identity on each of the concerned
1170	   MCCs.  It is the Provider that determines what the source for the
1171	   Captures is, so a Provider can choose how to group together Single
1172	   Media Captures into a combined "source" for the purpose of
1173	   switching them together to keep them synchronised according to the
1174	   SynchronisationID attribute.  For example, when the Provider is in
1175	   an MCU it may determine that each separate CLUE Endpoint is a
1176	   remote source of media. The Synchronisation Identity may be used
1177	   across media types, i.e. to synchronise audio and video related
1178	   MCCs.

1180	   Without this attribute it is assumed that multiple MCCs may provide
1181	   content from different sources at any particular point in time.

1183	   For example:

1185	        +=======================+=================================+
1186	        | Capture Scene #1      |                                 |
1187	        +-----------------------+---------------------------------+
1188	        | VC1                   | Description=Left                |
1189	        | VC2                   | Description=Centre              |
1190	        | VC3                   | Description=Right               |
1191	        | AC1                   | Description=room                |
1192	        | CSV(VC1,VC2,VC3)      |                                 |
1193	        | CSV(AC1)              |                                 |
1194	        +=======================+=================================+
1195	        | Capture Scene #2      |                                 |
1196	        +-----------------------+---------------------------------+
1197	        | VC4                   | Description=Left                |
1198	        | VC5                   | Description=Centre              |
1199	        | VC6                   | Description=Right               |
1200	        | AC2                   | Description=room                |
1201	        | CSV(VC4,VC5,VC6)      |                                 |
1202	        | CSV(AC2)              |                                 |
1203	        +=======================+=================================+
1204	        | Capture Scene #3      |                                 |
1205	        +-----------------------+---------------------------------+
1206	        | VC7                   |                                 |
1207	        | AC3                   |                                 |
1208	        +=======================+=================================+
1209	        | Capture Scene #4      |                                 |
1210	        +-----------------------+---------------------------------+
1211	        | VC8                   |                                 |
1212	        | AC4                   |                                 |
1213	        +=======================+=================================+
1214	        | Capture Scene #5      |                                 |
1215	        +-----------------------+---------------------------------+
1216	        | MCC1(VC1,VC4,VC7)     | SynchronisationID=1             |
1217	        |                       | MaxCaptures=1                   |
1218	        | MCC2(VC2,VC5,VC8)     | SynchronisationID=1             |
1219	        |                       | MaxCaptures=1                   |
1220	        | MCC3(VC3,VC6)         | MaxCaptures=1                   |
1221	        | MCC4(AC1,AC2,AC3,AC4) | SynchronisationID=1             |
1222	        |                       | MaxCaptures=1                   |
1223	        | CSV(MCC1,MCC2,MCC3)   |                                 |
1224	        | CSV(MCC4)             |                                 |
1225	        +=======================+=================================+

1227	       Table 4: Example Synchronisation Identity MCC attribute usage

1229	   The above Advertisement would indicate that MCC1, MCC2, MCC3 and
1230	   MCC4 make up a Capture Scene.  There would be four capture
1231	   encodings (one for each MCC).  Because MCC1 and MCC2 have the same
1232	   SynchronisationID, each encoding from MCC1 and MCC2 respectively
1233	   would together have content from only Capture Scene 1 or only
1234	   Capture Scene 2 or the combination of VC7 and VC8 at a particular
1235	   point in time.  In this case the Provider has decided the sources
1236	   to be synchronised are Scene #1, Scene #2, and Scene #3 and #4
1237	   together. The encoding from MCC3 would not be synchronised with
1238	   MCC1 or MCC2. As MCC4 also has the same Synchronisation Identity
1239	   as MCC1 and MCC2 the content of the audio encoding will be
1240	   synchronised with the video content.
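
   A Consumer-side view of this grouping might look like the following
   Python sketch.  It is an assumption-laden illustration: the function
   name group_by_sync_id and the dictionary representation of the
   Advertisement are hypothetical, and an MCC without the attribute is
   represented with None.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_sync_id(mccs: Dict[str, Optional[int]]) -> Dict[int, List[str]]:
    """Group MCC names by SynchronisationID (hypothetical helper).

    MCCs whose SynchronisationID is None are left out: without the
    attribute no synchronisation with other MCCs is assumed.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for name, sync_id in mccs.items():
        if sync_id is not None:
            groups[sync_id].append(name)
    return dict(groups)


# The MCCs of Table 4: MCC3 carries no SynchronisationID.
table_4 = {"MCC1": 1, "MCC2": 1, "MCC3": None, "MCC4": 1}
synchronised = group_by_sync_id(table_4)
```

   Here MCC1, MCC2 and MCC4 end up in the same group, so their
   encodings carry content from the same source at the same time,
   while MCC3 is in no group.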

1242	7.3. Capture Scene

1244	   In order for a Provider's individual Captures to be used
1245	   effectively by a Consumer, the Provider organizes the Captures into
1246	   one or more Capture Scenes, with the structure and contents of
1247	   these Capture Scenes being sent from the Provider to the Consumer
1248	   in the Advertisement.

1250	   A Capture Scene is a structure representing a spatial region
1251	   containing one or more Capture Devices, each capturing media
1252	   representing a portion of the region.  A Capture Scene includes one
1253	   or more Capture Scene Views (CSV), with each CSV including one or
1254	   more Media Captures of the same media type.  There can also be
1255	   Media Captures that are not included in a Capture Scene View. A
1256	   Capture Scene represents, for example, the video image of a group
1257	   of people seated next to each other, along with the sound of their
1258	   voices, which could be represented by some number of VCs and ACs in
1259	   the Capture Scene Views.  An MCU can also describe in Capture
1260	   Scenes what it constructs from media Streams it receives.

1262	   A Provider MAY advertise one or more Capture Scenes.  What
1263	   constitutes an entire Capture Scene is up to the Provider.  A
1264	   simple Provider might typically use one Capture Scene for
1265	   participant media (live video from the room cameras) and another
1266	   Capture Scene for a computer generated presentation.  In more
1267	   complex systems, the use of additional Capture Scenes is also
1268	   sensible.  For example, a classroom may advertise two Capture
1269	   Scenes involving live video, one including only the camera
1270	   capturing the instructor (and associated audio), the other
1271	   including camera(s) capturing students (and associated audio).

1273	   A Capture Scene MAY (and typically will) include more than one type
1274	   of media.  For example, a Capture Scene can include several Capture
1275	   Scene Views for Video Captures, and several Capture Scene Views for
1276	   Audio Captures.  A particular Capture MAY be included in more than
1277	   one Capture Scene View.

1279	   A Provider MAY express spatial relationships between Captures that
1280	   are included in the same Capture Scene.  However, there is no
1281	   spatial relationship between Media Captures from different Capture
1282	   Scenes.  In other words, Capture Scenes each use their own spatial
1283	   measurement system as outlined above in section 6.

1285	   A Provider arranges Captures in a Capture Scene to help the
1286	   Consumer choose which captures it wants to render.  The Capture
1287	   Scene Views in a Capture Scene are different alternatives the
1288	   Provider is suggesting for representing the Capture Scene.  Each
1289	   Capture Scene View is given an advertisement-unique identity.  The
1290	   order of Capture Scene Views within a Capture Scene has no
1291	   significance.  The Media Consumer can choose to receive all Media
1292	   Captures from one Capture Scene View for each media type (e.g.
1293	   audio and video), or it can pick and choose Media Captures
1294	   regardless of how the Provider arranges them in Capture Scene
1295	   Views.  Different Capture Scene Views of the same media type are
1296	   not necessarily mutually exclusive alternatives.  Also note that
1297	   the presence of multiple Capture Scene Views (with potentially
1298	   multiple encoding options in each view) in a given Capture Scene
1299	   does not necessarily imply that a Provider is able to serve all the
1300	   associated media simultaneously (although the construction of such
1301	   an over-rich Capture Scene is probably not sensible in many cases).
1302	   What a Provider can send simultaneously is determined through the
1303	   Simultaneous Transmission Set mechanism, described in section 8.

1305	   Captures within the same Capture Scene View MUST be of the same
1306	   media type - it is not possible to mix audio and video captures in
1307	   the same Capture Scene View, for instance.  The Provider MUST be
1308	   capable of encoding and sending all Captures (that have an encoding
1309	   group) in a single Capture Scene View simultaneously.  The order of
1310	   Captures within a Capture Scene View has no significance.  A
1311	   Consumer can decide to receive all the Captures in a single Capture
1312	   Scene View, but a Consumer could also decide to receive just a
1313	   subset of those captures.  A Consumer can also decide to receive
1314	   Captures from different Capture Scene Views, all subject to the
1315	   constraints set by Simultaneous Transmission Sets, as discussed in
1316	   section 8.
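
   The same-media-type rule for a Capture Scene View can be checked
   mechanically.  The Python sketch below is illustrative only: the
   function name csv_media_type and the mapping from capture identity
   to media type are assumptions made for this example.

```python
from typing import Dict, Iterable


def csv_media_type(captures: Iterable[str],
                   media_types: Dict[str, str]) -> str:
    """Return the single media type of a CSV (hypothetical helper).

    Raises ValueError if the CSV is empty or mixes media types, since
    Captures within one Capture Scene View must share a media type.
    """
    types = {media_types[capture] for capture in captures}
    if len(types) != 1:
        raise ValueError("a Capture Scene View must have exactly one media type")
    return types.pop()


# Hypothetical capture identities and their media types.
MEDIA_TYPES = {"VC0": "video", "VC1": "video", "VC2": "video", "AC0": "audio"}
```

   For instance, csv_media_type(["VC0", "VC1", "VC2"], MEDIA_TYPES)
   returns "video", whereas a view listing both VC0 and AC0 is
   rejected.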

1318	   When a Provider advertises a Capture Scene with multiple CSVs, it
1319	   is essentially signaling that there are multiple representations of
1320	   the same Capture Scene available.  In some cases, these multiple
1321	   views would typically be used simultaneously (for instance a "video
1322	   view" and an "audio view").  In some cases the views would
1323	   conceptually be alternatives (for instance a view consisting of
1324	   three Video Captures covering the whole room versus a view
1325	   consisting of just a single Video Capture covering only the center
1326	   of a room).  In this latter example, one sensible choice for a
1327	   Consumer would be to indicate (through its Configure and possibly
1328	   through an additional offer/answer exchange) the Captures of that
1329	   Capture Scene View that most closely matched the Consumer's number
1330	   of display devices or screen layout.

1332	   The following is an example of 4 potential Capture Scene Views for
1333	   an endpoint-style Provider:

1335	   1.  (VC0, VC1, VC2) - left, center and right camera Video Captures

1337	   2.  (VC3) - Video Capture associated with loudest room segment

1339	   3.  (VC4) - Video Capture zoomed out view of all people in the room

1341	   4.  (AC0) - main audio

1343	   The first view in this Capture Scene example is a list of Video
1344	   Captures which have a spatial relationship to each other.
1345	   Determination of the order of these captures (VC0, VC1 and VC2) for
1346	   rendering purposes is accomplished through use of their Area of
1347	   Capture attributes.  The second view (VC3) and the third view (VC4)
1348	   are alternative representations of the same room's video, which
1349	   might be better suited to some Consumers' rendering capabilities.
1350	   The inclusion of the Audio Capture in the same Capture Scene
1351	   indicates that AC0 is associated with all of those Video Captures,
1352	   meaning it comes from the same spatial region.  Therefore, if audio
1353	   were to be rendered at all, this audio would be the correct choice
1354	   irrespective of which Video Captures were chosen.

1356	7.3.1. Capture Scene attributes

1358	   Capture Scene Attributes can be applied to Capture Scenes as well
1359	   as to individual media captures.  Attributes specified at this
1360	   level apply to all constituent Captures.  Capture Scene attributes
1361	   include:

1363	     . Human-readable description of the Capture Scene, which could
1364	        be in multiple languages;
1365	     . xCard scene information;
1366	     . Scale information (millimeters, unknown, no scale), as
1367	        described in section 6.

1369	7.3.1.1. Scene Information

1371	   The Scene information attribute provides information regarding the
1372	   Capture Scene rather than individual participants. The Provider
1373	   may gather the information automatically or manually from a
1374	   variety of sources. The scene information attribute allows a
1375	   Provider to indicate information such as: organizational or
1376	   geographic information allowing a Consumer to determine which
1377	   Capture Scenes are of interest in order to then perform Capture
1378	   selection. It also allows a Consumer to render information
1379	   regarding the Scene or to use it for further processing.

1381	   As per section 7.1.1.11, the xCard format is used to convey this
1382	   information and the Provider may supply a minimal set of
1383	   information or a larger set of information.

1385	   In order to keep CLUE messages compact the Provider SHOULD use a
1386	   URI to point to any LOGO, PHOTO or SOUND contained in the xCard
1387	   rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE
1388	   message.

1390	7.3.2. Capture Scene View attributes

1392	   A Capture Scene can include one or more Capture Scene Views in
1393	   addition to the Capture Scene wide attributes described above.
1394	   Capture Scene View attributes apply to the Capture Scene View as a
1395	   whole, i.e. to all Captures that are part of the Capture Scene
1396	   View.

1398	   Capture Scene View attributes include:

1400	     . Human-readable description (which could be in multiple
1401	        languages) of the Capture Scene View

1403	7.3.3. Global View List

1405	   An Advertisement can include an optional Global View list.  Each
1406	   item in this list is a Global View.  A Global View is a set of
1407	   references to one or more Capture Scene Views of the same media
1408	   type that are defined within scenes of the same advertisement.
1409	   Each Global View in the list is a suggestion from the Provider to
1410	   the Consumer for which CSVs provide a complete representation of
1411	   the simultaneous captures provided by the Provider, across
1412	   multiple scenes.  The Provider can include multiple Global Views,
1413	   to allow a Consumer to choose sets of captures appropriate to its
1414	   capabilities or application.  The choice of how to make these
1415	   suggestions in the Global View list for what represents all the
1416	   scenes for which the Provider can send media is up to the
1417	   Provider.  This is very similar to how each CSV represents a
1418	   particular scene.

1420	   As an example, suppose an advertisement has three scenes, and each
1421	   scene has three CSVs, ranging from one to three video captures in
1422	   each CSV.  The Provider is advertising a total of nine video
1423	   Captures across three scenes.  The Provider can use the Global
1424	   View list to suggest alternatives for Consumers that can't receive
1425	   all nine video Captures as separate media streams.  For
1426	   accommodating a Consumer that wants to receive three video
1427	   Captures, a Provider might suggest a Global View containing just a
1428	   single CSV with three Captures and nothing from the other two
1429	   scenes.  Or a Provider might suggest a Global View containing
1430	   three different CSVs, one from each scene, with a single video
1431	   Capture in each.

1433	   Some additional rules:

1435	     . The ordering of Global Views in the Global View list is not
1436	        important.
1437	     . The ordering of CSVs within each Global View is not
1438	        important.
1439	     . A particular CSV may be used in multiple Global Views.
1440	     . The Provider must be capable of encoding and sending all
1441	        Captures within the CSVs of a given Global View
1442	        simultaneously.

1444	8. Simultaneous Transmission Set Constraints

1446	   In many practical cases, a Provider has constraints or limitations
1447	   on its ability to send Captures simultaneously.  One type of
1448	   limitation is caused by the physical limitations of capture
1449	   mechanisms; these constraints are represented by a simultaneous
1450	   transmission set.  The second type of limitation reflects the
1451	   encoding resources available, such as bandwidth or video encoding
1452	   throughput (macroblocks/second).  This type of constraint is
1453	   captured by encoding groups, discussed below.

1455	   Some Endpoints or MCUs can send multiple Captures simultaneously;
1456	   however sometimes there are constraints that limit which Captures
1457	   can be sent simultaneously with other Captures.  A device may not
1458	   be able to be used in different ways at the same time.  Provider
1459	   Advertisements are made so that the Consumer can choose one of
1460	   several possible mutually exclusive usages of the device.  This
1461	   type of constraint is expressed in a Simultaneous Transmission Set,
1462	   which lists all the Captures of a particular media type (e.g.
1463	   audio, video, text) that can be sent at the same time.  There are
1464	   different Simultaneous Transmission Sets for each media type in the
1465	   Advertisement.  This is easier to show in an example.

1467	   Consider the example of a room system where there are three cameras,
1468	   each of which can send a separate capture covering two persons:
1469	   VC0, VC1 and VC2.  The middle camera can also zoom out (using an
1470	   optical zoom lens) and show all six persons, VC3.  But the middle
1471	   camera cannot be used in both modes at the same time - it has to
1472	   either show the space where two participants sit or the whole six
1473	   seats, but not both at the same time.  As a result, VC1 and VC3
1474	   cannot be sent simultaneously.

1476	   Simultaneous Transmission Sets are expressed as sets of the Media
1477	   Captures that the Provider could transmit at the same time (though,
1478	   in some cases, it is not intuitive to do so).  If a Multiple
1479	   Content Capture is included in a Simultaneous Transmission Set it
1480	   indicates that the Capture Encoding associated with it could be
1481	   transmitted at the same time as the other Captures within the
1482	   Simultaneous Transmission Set. It does not imply that the Single
1483	   Media Captures contained in the Multiple Content Capture could all
1484	   be transmitted at the same time.

1486	   In this example the two simultaneous sets are shown in Table 5.  If
1487	   a Provider advertises one or more mutually exclusive Simultaneous
1488	   Transmission Sets, then for each media type the Consumer MUST
1489	   ensure that it chooses Media Captures that lie wholly within one of
1490	   those Simultaneous Transmission Sets.

1492	                           +-------------------+
1493	                           | Simultaneous Sets |
1494	                           +-------------------+
1495	                           | {VC0, VC1, VC2}   |
1496	                           | {VC0, VC3, VC2}   |
1497	                           +-------------------+

1499	                Table 5: Two Simultaneous Transmission Sets
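
   The rule that a Consumer's choice lies wholly within one
   Simultaneous Transmission Set amounts to a subset test, applied
   separately for each media type.  The following Python sketch shows
   this for the video sets of Table 5; the function name
   selection_allowed is hypothetical.

```python
from typing import Iterable, List, Set


def selection_allowed(selection: Iterable[str],
                      simultaneous_sets: List[Set[str]]) -> bool:
    """True if the selection fits wholly within one set (hypothetical helper).

    Assumes the Provider advertised at least one Simultaneous
    Transmission Set for this media type; an empty list yields False.
    """
    chosen = set(selection)
    return any(chosen <= allowed for allowed in simultaneous_sets)


# The two Simultaneous Transmission Sets of Table 5.
TABLE_5 = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
```

   With these sets, choosing VC0 and VC3 is allowed, but choosing VC1
   and VC3 together is not, because no single set contains both.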

1501	   A Provider OPTIONALLY can include the simultaneous sets in its
1502	   Advertisement.  These simultaneous set constraints apply across all
1503	   the Capture Scenes in the Advertisement.  It is a syntax
1504	   conformance requirement that the simultaneous transmission sets
1505	   MUST allow all the media captures in any particular Capture Scene
1506	   View to be used simultaneously.  Similarly, the simultaneous
1507	   transmission sets MUST reflect the simultaneity expressed by any
1508	   Global View.

1510	   For shorthand convenience, a Provider MAY describe a Simultaneous
1511	   Transmission Set in terms of Capture Scene Views and Capture
1512	   Scenes.  If a Capture Scene View is included in a Simultaneous
1513	   Transmission Set, then all Media Captures in the Capture Scene View
1514	   are included in the Simultaneous Transmission Set.  If a Capture
1515	   Scene is included in a Simultaneous Transmission Set, then all its
1516	   Capture Scene Views (of the corresponding media type) are included
1517	   in the Simultaneous Transmission Set.  The end result reduces to a
1518	   set of Media Captures, of a particular media type, in either case.
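
   The shorthand expansion described above can be sketched in Python
   as follows.  The data layout is an assumption made for this
   example: scenes maps a hypothetical Capture Scene identity to its
   CSV identities, and csvs maps a CSV identity to its media type and
   Media Captures.

```python
from typing import Dict, Iterable, List, Set, Tuple


def expand_set(members: Iterable[str],
               scenes: Dict[str, List[str]],
               csvs: Dict[str, Tuple[str, List[str]]],
               media_type: str) -> Set[str]:
    """Reduce a shorthand set to plain Media Captures (hypothetical helper)."""
    result: Set[str] = set()
    for member in members:
        if member in scenes:
            # A Capture Scene contributes its CSVs of the matching media type.
            for csv_id in scenes[member]:
                csv_type, captures = csvs[csv_id]
                if csv_type == media_type:
                    result.update(captures)
        elif member in csvs:
            # A Capture Scene View contributes all of its Media Captures.
            result.update(csvs[member][1])
        else:
            # Anything else is taken to be a Media Capture identity.
            result.add(member)
    return result


# Hypothetical identities, loosely based on the earlier CSV example.
SCENES = {"CS1": ["CSV1", "CSV2", "CSV3"]}
CSVS = {
    "CSV1": ("video", ["VC0", "VC1", "VC2"]),
    "CSV2": ("video", ["VC3"]),
    "CSV3": ("audio", ["AC0"]),
}
```

   Expanding ["CS1"] for video yields VC0, VC1, VC2 and VC3; the audio
   capture AC0 is not included because its CSV has a different media
   type.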

1520	   If an Advertisement does not include Simultaneous Transmission
1521	   Sets, then the Provider MUST be able to simultaneously provide all
1522	   the captures from any one CSV of each media type from each capture
1523	   scene.  Likewise, if there are no Simultaneous Transmission Sets
1524	   and there is a Global View list, then the Provider MUST be able to
1525	   simultaneously provide all the captures from any particular Global
1526	   View (of each media type) from the Global View list.

1528	   If an Advertisement includes multiple Capture Scene Views in a
1529	   Capture Scene then the Consumer MAY choose one Capture Scene View
1530	   for each media type, or MAY choose individual Captures based on the
1531	   Simultaneous Transmission Sets.

1533	9. Encodings

1535	   Individual encodings and encoding groups are CLUE's mechanisms
1536	   allowing a Provider to signal its limitations for sending Captures,
1537	   or combinations of Captures, to a Consumer.  Consumers can map the
1538	   Captures they want to receive onto the Encodings, with encoding
1539	   parameters they want.  As for the relationship between the CLUE-
1540	   specified mechanisms based on Encodings and the SIP Offer-Answer
1541	   exchange, please refer to section 5.

1543	9.1. Individual Encodings

1545	   An Individual Encoding represents a way to encode a Media Capture
1546	   to become a Capture Encoding, to be sent as an encoded media stream
1547	   from the Provider to the Consumer.  An Individual Encoding has a
1548	   set of parameters characterizing how the media is encoded.

1550	   Different media types have different parameters, and different
1551	   encoding algorithms may have different parameters.  An Individual
1552	   Encoding can be assigned to at most one Capture Encoding at any
1553	   given time.

1555	   Individual Encoding parameters are represented in SDP [RFC4566],
1556	   not in CLUE messages.  For example, for a video encoding using
1557	   H.26x compression technologies, this can include parameters such
1558	   as:

1560	     . Maximum bandwidth;
1561	     . Maximum picture size in pixels;
1562	     . Maximum number of pixels to be processed per second.

1564	   The bandwidth parameter is the only one that specifically relates
1565	   to a CLUE Advertisement, as it can be further constrained by the
1566	   maximum group bandwidth in an Encoding Group.

1568	9.2. Encoding Group

1570	   An Encoding Group includes a set of one or more Individual
1571	   Encodings, and parameters that apply to the group as a whole.  By
1572	   grouping multiple individual Encodings together, an Encoding Group
1573	   describes additional constraints on bandwidth for the group. A
1574	   single Encoding Group MAY refer to encodings for different media
1575	   types.

1577	   The Encoding Group data structure contains:

1579	     . Maximum bitrate for all encodings in the group combined;
1580	     . A list of identifiers for the Individual Encodings belonging
1581	        to the group.

1583	   When the Individual Encodings in a group are instantiated into
1584	   Capture Encodings, each Capture Encoding has a bitrate that MUST be
1585	   less than or equal to the max bitrate for the particular individual
1586	   encoding.  The "maximum bitrate for all encodings in the group"
1587	   parameter gives the additional restriction that the sum of all the
1588	   individual capture encoding bitrates MUST be less than or equal to
1589	   this group value.
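
   The two bitrate restrictions can be expressed as a simple check.
   The Python sketch below is illustrative only; the function name
   check_encoding_group, the encoding identities and the bitrate
   values are all hypothetical.

```python
from typing import Dict


def check_encoding_group(group_max: int,
                         encoding_max: Dict[str, int],
                         configured: Dict[str, int]) -> bool:
    """Check configured Capture Encoding bitrates (hypothetical helper).

    group_max    -- maximum bitrate for all encodings in the group combined
    encoding_max -- maximum bitrate of each Individual Encoding in the group
    configured   -- bitrate chosen for each instantiated Capture Encoding
    """
    for encoding_id, bitrate in configured.items():
        if encoding_id not in encoding_max:
            raise KeyError("encoding is not in this group: " + encoding_id)
        # Each Capture Encoding must respect its Individual Encoding maximum.
        if bitrate > encoding_max[encoding_id]:
            return False
    # The sum of all Capture Encoding bitrates must respect the group maximum.
    return sum(configured.values()) <= group_max


# Hypothetical group: three encodings, 6 Mbit/s for the group as a whole.
GROUP_MAX = 6_000_000
ENCODING_MAX = {"ENC1": 4_000_000, "ENC2": 4_000_000, "ENC3": 2_000_000}
```

   In this hypothetical group, using ENC1 and ENC2 at 4 Mbit/s each
   satisfies both individual maxima but fails the group maximum, since
   the sum is 8 Mbit/s.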

1591	   The following diagram illustrates one example of the structure of a
1592	   media Provider's Encoding Groups and their contents.

1594	   ,-------------------------------------------------.
1595	   |             Media Provider                      |
1596	   |                                                 |
1597	   |  ,--------------------------------------.       |
1598	   |  | ,--------------------------------------.     |
1599	   |  | | ,--------------------------------------.   |
1600	   |  | | |          Encoding Group              |   |
1601	   |  | | | ,-----------.                        |   |
1602	   |  | | | |           | ,---------.            |   |
1603	   |  | | | |           | |         | ,---------.|   |
1604	   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
1605	   |  `.| | |           | |         | `---------'|   |
1606	   |    `.| `-----------' `---------'            |   |
1607	   |      `--------------------------------------'   |
1608	   `-------------------------------------------------'

1610	                    Figure 3: Encoding Group Structure

1612	   A Provider advertises one or more Encoding Groups.  Each Encoding
1613	   Group includes one or more Individual Encodings.  Each Individual
1614	   Encoding can represent a different way of encoding media.  For
1615	   example one Individual Encoding may be 1080p60 video, another could
1616	   be 720p30, with a third being CIF, all in, for example, H.264
1617	   format.
1618	   While a typical three codec/display system might have one Encoding
1619	   Group per "codec box" (physical codec, connected to one camera and
1620	   one screen), there are many possibilities for the number of
1621	   Encoding Groups a Provider may be able to offer and for the
1622	   encoding values in each Encoding Group.

1624	   There is no requirement for all Encodings within an Encoding Group
1625	   to be instantiated at the same time.

1627	9.3. Associating Captures with Encoding Groups

1629	   Each Media Capture, including MCCs, MAY be associated with one or
1630	   more Encoding Groups. To be eligible for configuration, a Media
1631	   Capture MUST be associated with at least one Encoding Group, which
1632	   is used to instantiate that Capture into one or more Capture
1633	   Encodings. When an MCC is configured all the Media Captures
1634	   referenced by the MCC will appear in the Capture Encoding according
1635	   to the attributes of the chosen encoding of the MCC. This allows an
1636	   Advertiser to specify encoding attributes associated with the Media
1637	   Captures without the need to provide an individual Capture Encoding
1638	   for each of the inputs.

1640	   If an Encoding Group is assigned to a Media Capture referenced by
1641	   the MCC it indicates that this Capture may also have an individual
1642	   Capture Encoding.

1644	   For example:

1646	        +--------------------+------------------------------------+
1647	        | Capture Scene #1   |                                    |
1648	        +--------------------+------------------------------------+
1649	        | VC1                | EncodeGroupID=1                    |
1650	        | VC2                |                                    |
1651	        | MCC1(VC1,VC2)      | EncodeGroupID=2                    |
1652	        | CSV(VC1)           |                                    |
1653	        | CSV(MCC1)          |                                    |
1654	        +--------------------+------------------------------------+

1656	     Table 6: Example usage of Encoding with MCC and source Captures

1658	   This would indicate that VC1 may be sent as its own Capture
1659	   Encoding from EncodeGroupID=1 or that it may be sent as part of a
1660	   Capture Encoding from EncodeGroupID=2 along with VC2.

1662	   More than one Capture MAY use the same Encoding Group.

1664	   The maximum number of streams that can result from a particular
1665	   Encoding Group constraint is equal to the number of individual
1666	   Encodings in the group.  The actual number of Capture Encodings
1667	   used at any time MAY be less than this maximum.  Any of the
1668	   Captures that use a particular Encoding Group can be encoded
1669	   according to any of the Individual Encodings in the group.  If
1670	   there are multiple Individual Encodings in the group, then the
1671	   Consumer can configure the Provider, via a Configure message, to
1672	   encode a single Media Capture into multiple different Capture
1673	   Encodings at the same time, subject to the Max Capture Encodings
1674	   constraint, with each capture encoding following the constraints of
1675	   a different Individual Encoding.

1677	   It is a protocol conformance requirement that the Encoding Groups
1678	   MUST allow all the Captures in a particular Capture Scene View to
1679	   be used simultaneously.
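
   As a non-normative illustration of the requirement above, the
   following Python sketch checks whether every Capture in a Capture
   Scene View can be given its own Individual Encoding at the same
   time.  It only counts Encodings and ignores bandwidth limits.  It
   assumes each Capture is assigned to exactly one Encoding Group; the
   function name and dictionary layout are hypothetical and are not
   part of the CLUE data model.

```python
from collections import Counter

def view_is_simultaneously_encodable(view, capture_group, group_encodings):
    """Return True if each Capture in `view` can get a distinct Individual Encoding.

    view: list of Capture IDs in one Capture Scene View
    capture_group: Capture ID -> Encoding Group ID (assumed: one group per Capture)
    group_encodings: Encoding Group ID -> list of Individual Encoding IDs
    """
    # Count how many Captures of this view draw on each Encoding Group.
    needed = Counter(capture_group[capture] for capture in view)
    return all(count <= len(group_encodings[group])
               for group, count in needed.items())

# Hypothetical advertisement: two Captures share a group holding two Encodings.
capture_group = {"VC1": "EG1", "VC2": "EG1", "VC3": "EG2"}
group_encodings = {"EG1": ["ENC1", "ENC2"], "EG2": ["ENC3"]}

ok = view_is_simultaneously_encodable(
    ["VC1", "VC2", "VC3"], capture_group, group_encodings)
# The same two Captures no longer fit if EG1 offers a single Encoding.
too_many = view_is_simultaneously_encodable(
    ["VC1", "VC2"], capture_group, {"EG1": ["ENC1"]})
```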

1681	10. Consumer's Choice of Streams to Receive from the Provider

1683	   After receiving the Provider's Advertisement message (that includes
1684	   media captures and associated constraints), the Consumer composes
1685	   its reply to the Provider in the form of a Configure message.  The
1686	   Consumer is free to use the information in the Advertisement as it
1687	   chooses, but there are a few obviously sensible design choices,
1688	   which are outlined below.

1690	   If multiple Providers connect to the same Consumer (i.e. in an
1691	   MCU-less multiparty call), it is the responsibility of the Consumer
1692	   to compose a Configure for each Provider that fulfills both that
1693	   Provider's constraints, as expressed in its Advertisement, and the
1694	   Consumer's own capabilities.

1696	   In an MCU-based multiparty call, the MCU can logically terminate
1697	   the Advertisement/Configure negotiation in that it can hide the
1698	   characteristics of the receiving endpoint and rely on its own
1699	   capabilities (transcoding/transrating/...) to create Media Streams
1700	   that can be decoded at the Endpoint Consumers.  The timing of an
1701	   MCU's sending of Advertisements (for its outgoing ports) and
1702	   Configures (for its incoming ports, in response to Advertisements
1703	   received there) is up to the MCU and implementation dependent.

1705	   As a general outline, a Consumer can choose, based on the
1706	   Advertisement it has received, which Captures it wishes to receive,
1707	   and which Individual Encodings it wants the Provider to use to
1708	   encode the Captures.

1710	   On receipt of an Advertisement with an MCC the Consumer treats the
1711	   MCC as per other non-MCC Captures with the following differences:

1713	   - The Consumer would understand that the MCC is a Capture that
1714	   includes the referenced individual Captures and that these
1715	   individual Captures are delivered as part of the MCC's Capture
1716	   Encoding.

1718	   - The Consumer may utilise any of the attributes associated with
1719	   the referenced individual Captures and any Capture Scene attributes
1720	   from where the individual Captures were defined to choose Captures
1721	   and for rendering decisions.

1723	   - The Consumer may or may not choose to receive all the indicated
1724	   captures.  Therefore it can choose to receive a subset of the
1725	   Captures indicated by the MCC.

1727	   For example if the Consumer receives:

1729	           MCC1(VC1,VC2,VC3){attributes}

1731	   A Consumer could choose all the Captures within an MCC; however, if
1732	   the Consumer determines that it doesn't want VC3 it can return
1733	   MCC1(VC1,VC2).  If it wants all the individual Captures then it
1734	   returns only the MCC identity (i.e. MCC1).  If the MCC in the
1735	   advertisement does not reference any individual captures, then the
1736	   Consumer cannot choose what is included in the MCC; it is up to the
1737	   Provider to decide.
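
   The selection rules above can be sketched in Python as follows.
   This is only an illustration: the string form "MCC1(VC1,VC2)" is
   the notation used in this document, not the encoding carried in a
   Configure message, and the function name is hypothetical.

```python
def mcc_reference(mcc_id, advertised, wanted):
    """Return the MCC reference a Consumer could place in its Configure.

    advertised: individual Captures referenced by the MCC in the Advertisement
    wanted: the Captures the Consumer would like to receive
    """
    if not advertised:
        # No individual Captures were referenced: the Provider decides the content.
        return mcc_id
    unknown = [c for c in wanted if c not in advertised]
    if unknown:
        raise ValueError(f"not advertised in {mcc_id}: {unknown}")
    if not wanted:
        raise ValueError("choose at least one Capture")
    # Keep the Provider's ordering of the chosen Captures.
    chosen = [c for c in advertised if c in wanted]
    if len(chosen) == len(advertised):
        # Everything is wanted, so only the MCC identity is returned.
        return mcc_id
    return f"{mcc_id}({','.join(chosen)})"

subset = mcc_reference("MCC1", ["VC1", "VC2", "VC3"], ["VC1", "VC2"])
everything = mcc_reference("MCC1", ["VC1", "VC2", "VC3"], ["VC3", "VC1", "VC2"])
provider_decides = mcc_reference("MCC1", [], [])
```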

1739	   A Configure Message includes a list of Capture Encodings.  These
1740	   are the Capture Encodings the Consumer wishes to receive from the
1741	   Provider.  Each Capture Encoding refers to one Media Capture and
1742	   one Individual Encoding.  A Configure Message does not include
1743	   references to Capture Scenes or Capture Scene Views.

1745	   For each Capture the Consumer wants to receive, it configures one
1746	   or more of the Encodings in that Capture's Encoding Group.  The
1747	   Consumer does this by telling the Provider, in its Configure
1748	   Message, which Encoding to use for each chosen Capture.  Upon
1749	   receipt of this Configure from the Consumer, common knowledge is
1750	   established between Provider and Consumer regarding sensible
1751	   choices for the media streams.  The setup of the actual media
1752	   channels, at least in the simplest case, is left to a following
1753	   offer-answer exchange.  Optimized implementations MAY speed up the
1754	   reaction to the offer-answer exchange by reserving the resources at
1755	   the time of finalization of the CLUE handshake.

1757	   CLUE advertisements and configure messages don't necessarily
1758	   require a new SDP offer-answer for every CLUE message
1759	   exchange.  But the resulting encodings sent via RTP must conform to
1760	   the most recent SDP offer-answer result.

1762	   In order to meaningfully create and send an initial Configure, the
1763	   Consumer needs to have received at least one Advertisement, and an
1764	   SDP offer defining the Individual Encodings, from the Provider.

1766	   In addition, the Consumer can send a Configure at any time during
1767	   the call.  The Configure MUST be valid according to the most
1768	   recently received Advertisement.  The Consumer can send a Configure
1769	   either in response to a new Advertisement from the Provider or on
1770	   its own, for example because of a local change in conditions
1771	   (people leaving the room, connectivity changes, multipoint related
1772	   considerations).
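
   A Configure can be pictured as a list of (Capture, Individual
   Encoding) pairs that is checked against the most recently received
   Advertisement.  The Python sketch below shows one way a Consumer
   might perform that check before sending.  The data layout and
   function name are hypothetical, and the rule that an Individual
   Encoding is used by at most one Capture Encoding is an assumption
   drawn from the stream-count limit stated for Encoding Groups.

```python
def configure_errors(configure, capture_group, group_encodings):
    """List the problems found in a Configure, checked against one Advertisement.

    configure: list of (capture_id, encoding_id) pairs, one per Capture Encoding
    capture_group: Capture ID -> Encoding Group ID, from the Advertisement
    group_encodings: Encoding Group ID -> list of Individual Encoding IDs
    """
    errors = []
    used = set()
    for capture_id, encoding_id in configure:
        group = capture_group.get(capture_id)
        if group is None:
            errors.append(f"{capture_id}: no Encoding Group in the latest Advertisement")
        elif encoding_id not in group_encodings.get(group, []):
            errors.append(f"{capture_id}: {encoding_id} is not in {group}")
        elif encoding_id in used:
            # Assumption: one Capture Encoding per Individual Encoding.
            errors.append(f"{encoding_id}: already used by another Capture Encoding")
        used.add(encoding_id)
    return errors

capture_group = {"VC0": "EG0", "VC1": "EG1"}
group_encodings = {"EG0": ["ENC0", "ENC1"], "EG1": ["ENC3"]}

# VC0 is requested twice, with two different Encodings of its group.
valid = configure_errors(
    [("VC0", "ENC0"), ("VC0", "ENC1"), ("VC1", "ENC3")],
    capture_group, group_encodings)
wrong_group = configure_errors([("VC1", "ENC0")], capture_group, group_encodings)
```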

1774	   When choosing which Media Streams to receive from the Provider, and
1775	   the encoding characteristics of those Media Streams, the Consumer
1776	   needs to take several things into account: its local preference,
1777	   simultaneity restrictions, and encoding limits.

1779	10.1. Local preference

1781	   A variety of local factors influence the Consumer's choice of
1782	   Media Streams to be received from the Provider:

1784	   o  if the Consumer is an Endpoint, it is likely that it would
1785	      choose, where possible, to receive video and audio Captures that
1786	      match the number of display devices and audio system it has

1788	   o  if the Consumer is an MCU, it MAY choose to receive loudest
1789	      speaker streams (in order to perform its own media composition)
1790	      and avoid pre-composed video Captures

1792	   o  user choice (for instance, selection of a new layout) MAY result
1793	      in a different set of Captures, or different encoding
1794	      characteristics, being required by the Consumer

1796	10.2. Physical simultaneity restrictions

1798	   Often there are physical simultaneity constraints of the Provider
1799	   that affect the Provider's ability to simultaneously send all of
1800	   the captures the Consumer would wish to receive.  For instance, an
1801	   MCU, when connected to a multi-camera room system, might prefer to
1802	   receive both individual video streams of the people present in the
1803	   room and an overall view of the room from a single camera.  Some
1804	   Endpoint systems might be able to provide both of these sets of
1805	   streams simultaneously, whereas others might not (if the overall
1806	   room view were produced by changing the optical zoom level on the
1807	   center camera, for instance).

1809	10.3. Encoding and encoding group limits

1811	   Each of the Provider's encoding groups has limits on bandwidth and
1812	   computational complexity, and the constituent potential encodings
1813	   have limits on the bandwidth, computational complexity, video
1814	   frame rate, and resolution that can be provided.  When choosing
1815	   the Captures to be received from a Provider, a Consumer device
1816	   MUST ensure that the encoding characteristics requested for each
1817	   individual Capture fit within the capability of the encoding it
1818	   is being configured to use, as well as ensuring that the combined
1819	   encoding characteristics for Captures fit within the capabilities
1820	   of their associated encoding groups.  In some cases, this could
1821	   cause an otherwise "preferred" choice of capture encodings to be
1822	   passed over in favor of different Capture Encodings--for instance,
1823	   if a set of three Captures could only be provided at a low
1824	   resolution then a three screen device could switch to favoring a
1825	   single, higher quality, Capture Encoding.
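
   The two checks described above, one per Individual Encoding and one
   per Encoding Group, might be sketched in Python as follows.  The
   limit names (maxWidth, maxPps and so on) follow the examples later
   in this document; the request fields and the bandwidth figures are
   hypothetical.

```python
def fits_encoding(request, encoding):
    """True if one requested Capture Encoding respects an Individual Encoding's limits."""
    pixels_per_second = request["width"] * request["height"] * request["frameRate"]
    return (request["width"] <= encoding["maxWidth"]
            and request["height"] <= encoding["maxHeight"]
            and request["frameRate"] <= encoding["maxFrameRate"]
            and pixels_per_second <= encoding["maxPps"]
            and request["bandwidth"] <= encoding["maxBandwidth"])

def fits_group(requests, max_group_bandwidth):
    """True if the combined bandwidth of all requests stays within the group limit."""
    return sum(r["bandwidth"] for r in requests) <= max_group_bandwidth

enc_720p = {"maxWidth": 1280, "maxHeight": 720, "maxFrameRate": 30,
            "maxPps": 27648000, "maxBandwidth": 4000000}
hd = {"width": 1280, "height": 720, "frameRate": 30, "bandwidth": 2500000}
full_hd = {"width": 1920, "height": 1080, "frameRate": 30, "bandwidth": 2500000}

single_ok = fits_encoding(hd, enc_720p)
too_large = fits_encoding(full_hd, enc_720p)
# Three such streams need 7.5 Mbit/s, more than a 6 Mbit/s group allows,
# so a Consumer might fall back to fewer or lower-rate Capture Encodings.
three_ok = fits_group([hd, hd, hd], 6000000)
```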

1827	11. Extensibility

1829	   One important characteristic of the Framework is its
1830	   extensibility.  The standard for interoperability and handling
1831	   multiple streams must be future-proof. The framework itself is
1832	   inherently extensible through expanding the data model types.  For
1833	   example:

1835	   o  Adding more types of media, such as telemetry, can be done by
1836	      defining additional types of Captures in addition to audio and
1837	      video.

1839	   o  Adding new functionalities, such as 3-D, may require
1840	      additional attributes describing the Captures.

1842	   The infrastructure is designed to be extended rather than
1843	   requiring new infrastructure elements.  Extension comes through
1844	   adding to defined types.

1846	12. Examples - Using the Framework (Informative)

1848	   This section gives some examples, first from the point of view of
1849	   the Provider, then the Consumer, then some multipoint scenarios.

1851	12.1. Provider Behavior

1853	   This section shows some examples in more detail of how a Provider
1854	   can use the framework to represent a typical case for telepresence
1855	   rooms.  First an endpoint is illustrated, then an MCU case is
1856	   shown.

1858	12.1.1. Three screen Endpoint Provider

1860	   Consider an Endpoint with the following description:

1862	   3 cameras, 3 displays, a 6 person table

1864	   o  Each camera can provide one Capture for each 1/3 section of the
1865	      table

1867	   o  A single Capture representing the active speaker can be provided
1868	      (voice activity based camera selection to a given encoder input
1869	      port implemented locally in the Endpoint)

1871	   o  A single Capture representing the active speaker with the other
1872	      2 Captures shown picture in picture within the stream can be
1873	      provided (again, implemented inside the endpoint)

1875	   o  A Capture showing a zoomed out view of all 6 seats in the room
1876	      can be provided

1878	   The audio and video Captures for this Endpoint can be described as
1879	   follows.

1881	   Video Captures:

1883	   o  VC0- (the left camera stream), encoding group=EG0, view=table

1885	   o  VC1- (the center camera stream), encoding group=EG1, view=table

1887	   o  VC2- (the right camera stream), encoding group=EG2, view=table

1889	   o  MCC3- (the loudest panel stream), encoding group=EG1,
1890	      view=table, MaxCaptures=1

1892	   o  MCC4- (the loudest panel stream with PiPs), encoding group=EG1,
1893	      view=room, MaxCaptures=3

1895	   o  VC5- (the zoomed out view of all people in the room), encoding
1896	      group=EG1, view=room

1898	   o  VC6- (presentation stream), encoding group=EG1, presentation

1900	   The following diagram is a top view of the room with 3 cameras, 3
1901	   displays, and 6 seats.  Each camera is capturing 2 people.  The
1902	   six seats are not all in a straight line.

1904	      ,-. d
1905	     (   )`--.__        +---+
1906	      `-' /     `--.__  |   |
1907	    ,-.  |            `-.._ |_-+Camera 2 (VC2)
1908	   (   ).'     <--(AC1)-+-''`+-+
1909	    `-' |_...---''      |   |
1910	    ,-.c+-..__          +---+
1911	   (   )|     ``--..__  |   |
1912	    `-' |             ``+-..|_-+Camera 1 (VC1)
1913	    ,-. |      <--(AC2)..--'|+-+                          ^
1914	   (   )|     __..--'   |   |                             |
1915	    `-'b|..--'          +---+                             |X
1916	    ,-. |``---..___     |   |                             |
1917	   (   )\          ```--..._|_-+Camera 0 (VC0)            |
1918	    `-'  \     <--(AC0) ..-''`-+                          |
1919	     ,-. \      __.--'' |   |                  <----------+
1920	    (   ) |..-''        +---+                     Y
1921	     `-' a                          (0,0,0) origin is under Camera 1

1923	                    Figure 4: Room Layout Top View

1925	   The two points labeled b and c are intended to be at the midpoint
1926	   between the seating positions, and where the fields of view of the
1927	   cameras intersect.

1929	   The plane of interest for VC0 is a vertical plane that intersects
1930	   points 'a' and 'b'.

1932	   The plane of interest for VC1 intersects points 'b' and 'c'. The
1933	   plane of interest for VC2 intersects points 'c' and 'd'.

1935	   This example uses an area scale of millimeters.

1937	   Areas of capture:

1939	       bottom left    bottom right  top left         top right
1940	   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
1941	   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
1942	   VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
1943	   MCC3(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1944	   MCC4(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1945	   VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1946	   VC6 none

1948	   Points of capture:
1949	   VC0 (-1678,0,800)
1950	   VC1 (0,0,800)
1951	   VC2 (1678,0,800)
1952	   MCC3 none
1953	   MCC4 none
1954	   VC5 (0,0,800)
1955	   VC6 none

1957	   In this example, the right edge of the VC0 area lines up with the
1958	   left edge of the VC1 area.  It doesn't have to be this way.  There
1959	   could be a gap or an overlap.  One additional thing to note for
1960	   this example is the distance from a to b is equal to the distance
1961	   from b to c and the distance from c to d.  All these distances are
1962	   1346 mm. This is the planar width of each area of capture for VC0,
1963	   VC1, and VC2.
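
   The stated width can be checked from the floor-level corner points
   in the table of areas of capture.  The short Python calculation
   below is only a worked check; the variable names mirror the point
   labels in Figure 4.  The outer two segments come out at roughly
   1346.4 mm and the center one at exactly 1346 mm, so the three are
   equal to the nearest millimeter.

```python
import math

# Floor-level corner points (x, y) in millimeters, from the areas of capture.
a, b, c, d = (-2011, 2850), (-673, 3000), (673, 3000), (2011, 2850)

# Planar width of the areas of capture for VC0, VC1 and VC2.
widths = [math.dist(p, q) for p, q in ((a, b), (b, c), (c, d))]
rounded = [round(w) for w in widths]
```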

1965	   Note that the text in parentheses (e.g. "the left camera stream")
1966	   is not explicitly part of the model; it is just explanatory text
1967	   for this example, and is not included in the model with the media
1968	   captures and attributes.  Also, MCC4 doesn't say anything about
1969	   how a capture is composed, so the media consumer can't tell based
1970	   on this capture that MCC4 is composed of a "loudest panel with
1971	   PiPs".

1973	   Audio Captures:

1975	   Three ceiling microphones are located between the cameras and the
1976	   table, at the same height as the cameras.  The microphones point
1977	   down at an angle toward the seating positions.

1979	   o  AC0 (left), encoding group=EG3

1981	   o  AC1 (right), encoding group=EG3

1983	   o  AC2 (center), encoding group=EG3

1985	   o  AC3 being a simple pre-mixed audio stream from the room (mono),
1986	      encoding group=EG3

1988	   o  AC4 audio stream associated with the presentation video (mono),
1989	      encoding group=EG3, presentation

1991	       Point of capture:      Point on Line of Capture:

1993	   AC0 (-1342,2000,800)       (-1342,2925,379)
1994	   AC1 ( 1342,2000,800)       ( 1342,2925,379)
1995	   AC2 (    0,2000,800)       (    0,3000,379)
1996	   AC3 (    0,2000,800)       (    0,3000,379)
1997	   AC4 none

1999	   The physical simultaneity information is:

2001	      Simultaneous transmission set #1 {VC0, VC1, VC2, MCC3, MCC4,
2002	                                        VC6}

2004	      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

2006	   This constraint indicates it is not possible to use all the VCs at
2007	   the same time.  VC5 cannot be used at the same time as VC1 or MCC3
2008	   or MCC4.  Also, using every member in the set simultaneously may
2009	   not make sense - for example MCC3 (loudest) and MCC4 (loudest with
2010	   PiPs).  (In addition, there are encoding constraints that make
2011	   choosing all of the VCs in a set impossible.  VC1, MCC3, MCC4,
2012	   VC5, VC6 all use EG1 and EG1 has only 3 ENCs.  This constraint
2013	   shows up in the encoding groups, not in the simultaneous
2014	   transmission sets.)
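
   A Consumer can test a candidate selection against the two
   simultaneous transmission sets above with a simple subset check, as
   in this Python sketch.  The function name is hypothetical.  Only
   the physical simultaneity constraint is modeled; the encoding group
   limits mentioned above would be checked separately.

```python
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},   # set #1
    {"VC0", "VC2", "VC5", "VC6"},                   # set #2
]

def can_send_together(captures):
    """True if all requested video Captures appear in one simultaneous set."""
    wanted = set(captures)
    return any(wanted <= allowed for allowed in SIMULTANEOUS_SETS)

three_cameras = can_send_together(["VC0", "VC1", "VC2"])
zoom_with_center = can_send_together(["VC1", "VC5"])
zoom_with_slides = can_send_together(["VC5", "VC6"])
```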

2016	   In this example there are no restrictions on which audio captures
2017	   can be sent simultaneously.

2019	   Encoding Groups:

2021	   This example has three encoding groups associated with the video
2022	   captures.  Each group can have 3 encodings, but with each
2023	   potential encoding having a progressively lower specification.  In
2024	   this example, 1080p60 transmission is possible (as ENC0 has a
2025	   maxPps value compatible with that).  Significantly, as up to 3
2026	   encodings are available per group, it is possible to transmit some
2027	   video captures simultaneously that are not in the same view in the
2028	   capture scene.  For example VC1 and MCC3 at the same time.

2030	   It is also possible to transmit multiple capture encodings of a
2031	   single video capture.  For example VC0 can be encoded using ENC0
2032	   and ENC1 at the same time, as long as the encoding parameters
2033	   satisfy the constraints of ENC0, ENC1, and EG0, such as one at
2034	   4000000 bps and one at 2000000 bps.

2036	   encodeGroupID=EG0, maxGroupBandwidth=6000000
2037	       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2038	                      maxPps=124416000, maxBandwidth=4000000
2039	       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2040	                      maxPps=27648000, maxBandwidth=4000000
2041	       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
2042	                      maxPps=15552000, maxBandwidth=4000000
2043	   encodeGroupID=EG1, maxGroupBandwidth=6000000
2044	       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2045	                      maxPps=124416000, maxBandwidth=4000000
2046	       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2047	                      maxPps=27648000, maxBandwidth=4000000
2048	       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
2049	                      maxPps=15552000, maxBandwidth=4000000
2050	   encodeGroupID=EG2, maxGroupBandwidth=6000000
2051	       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2052	                      maxPps=124416000, maxBandwidth=4000000
2053	       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2054	                      maxPps=27648000, maxBandwidth=4000000
2055	       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
2056	                      maxPps=15552000, maxBandwidth=4000000

2058	                Figure 5: Example Encoding Groups for Video
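
   The maxPps figures in Figure 5 can be reproduced as width times
   height times frame rate, as the Python check below shows.  Note
   that the values match 1080 and 540 visible lines rather than the
   maxHeight values of 1088 and 544; that reading is an inference from
   the numbers and is not stated in the text.  The last two lines
   check the VC0 example of two simultaneous Capture Encodings against
   the limits of ENC0, ENC1 and EG0.

```python
def pixels_per_second(width, height, frame_rate):
    """Pixel rate of a video stream with the given picture size and frame rate."""
    return width * height * frame_rate

pps_1080p60 = pixels_per_second(1920, 1080, 60)   # compare with ENC0 maxPps
pps_720p30 = pixels_per_second(1280, 720, 30)     # compare with ENC1 maxPps
pps_540p30 = pixels_per_second(960, 540, 30)      # compare with ENC2 maxPps

# VC0 sent twice: 4000000 bps via ENC0 plus 2000000 bps via ENC1.
each_within_encoding = 4000000 <= 4000000 and 2000000 <= 4000000
both_within_group = 4000000 + 2000000 <= 6000000
```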

2060	   For audio, there are five potential encodings available, so all
2061	   five audio captures can be encoded at the same time.

2063	   encodeGroupID=EG3, maxGroupBandwidth=320000
2064	       encodeID=ENC9, maxBandwidth=64000
2065	       encodeID=ENC10, maxBandwidth=64000
2066	       encodeID=ENC11, maxBandwidth=64000
2067	       encodeID=ENC12, maxBandwidth=64000
2068	       encodeID=ENC13, maxBandwidth=64000

2070	                Figure 6: Example Encoding Group for Audio

2072	   Capture Scenes:

2074	   The following table represents the capture scenes for this
2075	   provider. Recall that a capture scene is composed of alternative
2076	   capture scene views covering the same spatial region.  Capture
2077	   Scene #1 is for the main people captures, and Capture Scene #2 is
2078	   for presentation.

2080	   Each row in the table is a separate Capture Scene View.

2082	                           +------------------+
2083	                           | Capture Scene #1 |
2084	                           +------------------+
2085	                           | VC0, VC1, VC2    |
2086	                           | MCC3             |
2087	                           | MCC4             |
2088	                           | VC5              |
2089	                           | AC0, AC1, AC2    |
2090	                           | AC3              |
2091	                           +------------------+

2093	                           +------------------+
2094	                           | Capture Scene #2 |
2095	                           +------------------+
2096	                           | VC6              |
2097	                           | AC4              |
2098	                           +------------------+

2100	                Table 7: Example Capture Scene Views

2102	   Different capture scenes are distinct from each other and do not
2103	   overlap.  A consumer can choose a view from each capture scene.
2104	   In this case the three captures VC0, VC1, and VC2 are one way of
2105	   representing the video from the endpoint.  These three captures
2106	   should appear adjacent to each other.  Alternatively, another
2107	   way of representing the Capture Scene is with the capture MCC3,
2108	   which automatically shows the person who is talking.  Similarly
2109	   for the MCC4 and VC5 alternatives.

2111	   As in the video case, the different views of audio in Capture
2112	   Scene #1 represent the "same thing", in that one way to receive
2113	   the audio is with the 3 audio captures (AC0, AC1, AC2), and
2114	   another way is with the mixed AC3.  The Media Consumer can choose
2115	   an audio CSV it is capable of receiving.

2117	   The spatial ordering is conveyed by the media capture attributes
2118	   Area of Capture, Point of Capture, and Point on Line of Capture.

2120	   A Media Consumer would likely want to choose a capture scene view
2121	   to receive based in part on how many streams it can simultaneously
2122	   receive.  A consumer that can receive three people streams would
2123	   probably prefer to receive the first view of Capture Scene #1
2124	   (VC0, VC1, VC2) and not receive the other views.  A consumer that
2125	   can receive only one people stream would probably choose one of
2126	   the other views.

2128	   If the consumer can receive a presentation stream too, it would
2129	   also choose to receive the video view from Capture Scene #2 (VC6).

2131	12.1.2. Encoding Group Example

2133	   This is an example of an encoding group to illustrate how it can
2134	   express dependencies between encodings.

2136	   encodeGroupID=EG0, maxGroupBandwidth=6000000
2137	       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2138	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2139	       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2140	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2141	       encodeID=AUDENC0, maxBandwidth=96000
2142	       encodeID=AUDENC1, maxBandwidth=96000
2143	       encodeID=AUDENC2, maxBandwidth=96000

2145	   Here, the encoding group is EG0.  Although the encoding group is
2146	   capable of transmitting up to 6Mbit/s, no individual video
2147	   encoding can exceed 4Mbit/s.

2149	   This encoding group also allows up to 3 audio encodings,
2150	   AUDENC<0-2>.  It is not required that audio and video encodings
2151	   reside within the same encoding group, but if so then the group's
2152	   overall maxGroupBandwidth value is a limit on the sum of all audio
2153	   and video encodings configured by the consumer.  A system that
2154	   does not wish or need to combine bandwidth limitations in this way
2155	   should instead use separate encoding groups for audio and video so
2156	   that the bandwidth limitations on audio and video do not interact.
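
   The interaction between audio and video inside one encoding group
   can be seen with a little arithmetic.  In the Python sketch below
   the configured rates are hypothetical values chosen by a consumer;
   only the limits come from the EG0 example above.

```python
MAX_GROUP_BANDWIDTH = 6000000   # maxGroupBandwidth of EG0

# Hypothetical rates a consumer might try to configure, in bits per second.
configured = {
    "VIDENC0": 4000000,
    "VIDENC1": 2000000,
    "AUDENC0": 96000,
    "AUDENC1": 96000,
    "AUDENC2": 96000,
}

total = sum(configured.values())
exceeds_group_limit = total > MAX_GROUP_BANDWIDTH

# With all three audio encodings at their maximum, this much is left for video.
video_budget = MAX_GROUP_BANDWIDTH - 3 * 96000
```

   Each encoding is within its own maxBandwidth, yet the sum is over
   the group limit, so the consumer would have to lower at least one
   of the rates.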

2158	   Audio and video can be expressed in separate encoding groups, as
2159	   in this illustration.

2161	   encodeGroupID=EG0, maxGroupBandwidth=6000000
2162	       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2163	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2164	       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2165	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2166	   encodeGroupID=EG1, maxGroupBandwidth=500000
2167	       encodeID=AUDENC0, maxBandwidth=96000
2168	       encodeID=AUDENC1, maxBandwidth=96000
2169	       encodeID=AUDENC2, maxBandwidth=96000

2171	12.1.3. The MCU Case

2173	   This section shows how an MCU might express its Capture Scenes,
2174	   intending to offer different choices for consumers that can handle
2175	   different numbers of streams.  A single audio capture stream is
2176	   provided for all single and multi-screen configurations that can
2177	   be associated (e.g. lip-synced) with any combination of video
2178	   captures at the consumer.

2180	        +-----------------------+---------------------------------+
2181	        | Capture Scene #1      |                                 |
2182	        +-----------------------+---------------------------------+
2183	        | VC0                   | VC for a single screen consumer |
2184	        | VC1, VC2              | VCs for a two screen consumer   |
2185	        | VC3, VC4, VC5         | VCs for a three screen consumer |
2186	        | VC6, VC7, VC8, VC9    | VCs for a four screen consumer  |
2187	        | AC0                   | AC representing all participants|
2188	        | CSV(VC0)              |                                 |
2189	        | CSV(VC1,VC2)          |                                 |
2190	        | CSV(VC3,VC4,VC5)      |                                 |
2191	        | CSV(VC6,VC7,VC8,VC9)  |                                 |
2192	        | CSV(AC0)              |                                 |
2193	        +-----------------------+---------------------------------+

2195	                Table 8: MCU main Capture Scenes

2197	   If / when a presentation stream becomes active within the
2198	   conference the MCU might re-advertise the available media as:

2200	        +------------------+--------------------------------------+
2201	        | Capture Scene #2 | note                                 |
2202	        +------------------+--------------------------------------+
2203	        | VC10             | video capture for presentation       |
2204	        | AC1              | presentation audio to accompany VC10 |
2205	        | CSV(VC10)        |                                      |
2206	        | CSV(AC1)         |                                      |
2207	        +------------------+--------------------------------------+

2209	                Table 9: MCU presentation Capture Scene

2211	12.2. Media Consumer Behavior

2213	   This section gives an example of how a Media Consumer might behave
2214	   when deciding how to request streams from the three screen
2215	   endpoint described in the previous section.

2217	   The receive side of a call needs to balance its requirements,
2218	   based on number of screens and speakers, its decoding capabilities
2219	   and available bandwidth, and the provider's capabilities in order
2220	   to optimally configure the provider's streams.  Typically it would
2221	   want to receive and decode media from each Capture Scene
2222	   advertised by the Provider.

2224	   A sane, basic algorithm might be for the consumer to go through
2225	   each Capture Scene View in turn and find the collection of Video
2226	   Captures that best matches the number of screens it has (this
2227	   might include consideration of screens dedicated to presentation
2228	   video display rather than "people" video) and then decide between
2229	   alternative views in the video Capture Scenes based either on
2230	   hard-coded preferences or user choice.  Once this choice has been
2231	   made, the consumer would then decide how to configure the
2232	   provider's encoding groups in order to make best use of the
2233	   available network bandwidth and its own decoding capabilities.
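
   The first part of such an algorithm might look like the Python
   sketch below, which picks, for each Capture Scene, the video view
   that uses the most screens without exceeding the number available.
   The function name, the per-scene screen budget and the tie-break
   (first view listed) are assumptions; a real consumer would apply
   its hard-coded preferences or user choice at that point.

```python
def choose_views(scenes, screens_per_scene):
    """Pick one video Capture Scene View per Capture Scene.

    scenes: scene name -> list of views, each view a list of video Capture IDs
    screens_per_scene: scene name -> number of screens the consumer gives it
    """
    choice = {}
    for scene, views in scenes.items():
        screens = screens_per_scene.get(scene, 0)
        usable = [view for view in views if len(view) <= screens]
        if usable:
            # Prefer the view that fills the most screens; ties keep the first.
            choice[scene] = max(usable, key=len)
    return choice

# Video views of the three screen endpoint example in Section 12.1.1.
scenes = {
    "Capture Scene #1": [["VC0", "VC1", "VC2"], ["MCC3"], ["MCC4"], ["VC5"]],
    "Capture Scene #2": [["VC6"]],
}

three_plus_slides = choose_views(
    scenes, {"Capture Scene #1": 3, "Capture Scene #2": 1})
one_screen = choose_views(scenes, {"Capture Scene #1": 1})
```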

2235	12.2.1. One screen Media Consumer

2237	   MCC3, MCC4 and VC5 are all different views by themselves, not
2238	   grouped together in a single view, so the receiving device should
2239	   choose one of those.  The choice would come down to
2240	   whether to see the greatest number of participants simultaneously
2241	   at roughly equal precedence (VC5), a switched view of just the
2242	   loudest region (MCC3) or a switched view with PiPs (MCC4).  An
2243	   endpoint device with a small amount of knowledge of these
2244	   differences could offer a dynamic choice of these options,
2245	   in-call, to the user.

2247	12.2.2. Two screen Media Consumer configuring the example

2249	   Mixing systems with an even number of screens, "2n", and those
2250	   with "2n+1" cameras (and vice versa) is always likely to be the
2251	   problematic case.  In this instance, the behavior is likely to be
2252	   determined by whether a "2 screen" system is really a "2 decoder"
2253	   system, i.e., whether only one received stream can be displayed
2254	   per screen or whether more than 2 streams can be received and
2255	   spread across the available screen area.  To enumerate 3 possible
2256	   behaviors here for the 2 screen system when it learns that the far
2257	   end is "ideally" expressed via 3 capture streams:

2259	   1. Fall back to receiving just a single stream (MCC3, MCC4 or VC5
2260	      as per the 1 screen consumer case above) and either leave one
2261	      screen blank or use it for presentation if / when a
2262	      presentation becomes active.

2264	   2. Receive 3 streams (VC0, VC1 and VC2) and display across 2
2265	      screens, either with each capture being scaled to 2/3 of a
2266	      screen and the center capture being split across 2 screens or,
2267	      as would be necessary if there were large bezels on the
2268	      screens, with each stream being scaled to 1/2 the screen width
2269	      and height and there being a 4th "blank" panel.  This 4th panel
2270	      could potentially be used for any presentation that became
2271	      active during the call.

2273	   3. Receive 3 streams, decode all 3, and use control information
2274	      indicating which was the most active to switch between showing
2275	      the left and center streams (one per screen) and the center and
2276	      right streams.

2278	   For an endpoint capable of all 3 methods of working described
2279	   above, again it might be appropriate to offer the user the choice
2280	   of display mode.

2282	12.2.3. Three screen Media Consumer configuring the example

2284	   This is the most straightforward case - the Media Consumer would
2285	   look to identify a set of streams to receive that best matched its
2286	   available screens and so the VC0 plus VC1 plus VC2 should match
2287	   optimally.  The spatial ordering would give sufficient information
2288	   for the correct video capture to be shown on the correct screen,
2289	   and the consumer would either need to divide a single encoding
2290	   group's capability by 3 to determine what resolution and frame
2291	   rate to configure the provider with or to configure the individual
2292	   video captures' encoding groups with what makes most sense (taking
2293	   into account the receive side decode capabilities, overall call
2294	   bandwidth, the resolution of the screens plus any user preferences
2295	   such as motion vs sharpness).

2297	12.3. Multipoint Conference utilizing Multiple Content Captures

2299	   The use of MCCs allows the MCU to construct outgoing Advertisements
2300	   describing complex media switching and composition scenarios.
2301	   The following sections provide several examples.

2303	   Note: In the examples the identities of the CLUE elements (e.g.
2304	   Captures, Capture Scene) in the incoming Advertisements overlap.
2305	   This is because there is no co-ordination between the endpoints.
2306	   The MCU is responsible for making these unique in the outgoing
2307	   advertisement.
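
   One simple way for an MCU to make overlapping identities unique is
   to renumber them as it builds the outgoing Advertisement, keeping a
   map back to the source endpoint.  The Python sketch below is purely
   illustrative: the endpoint labels, the function name and the
   sequential numbering scheme are assumptions, and an MCU is free to
   use any scheme that yields unique identities.

```python
from itertools import count

def renumber_captures(advertisements):
    """Map (endpoint, original Capture ID) to an ID unique across the conference.

    advertisements: endpoint label -> list of Capture IDs it advertised
    """
    next_id = count(1)
    return {(endpoint, capture): f"VC{next(next_id)}"
            for endpoint, captures in advertisements.items()
            for capture in captures}

# Two hypothetical endpoints that both advertise a Capture called VC1.
received = {"A": ["VC1"], "B": ["VC1", "VC2"]}
outgoing = renumber_captures(received)
```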

2309	12.3.1. Single Media Captures and MCC in the same Advertisement

2311	   Four endpoints are involved in a Conference where CLUE is used. An
2312	   MCU acts as a middlebox between the endpoints with a CLUE channel
2313	   between each endpoint and the MCU. The MCU receives the following
2314	   Advertisements.

2316	        +-----------------------+---------------------------------+
2317	        | Capture Scene #1      | Description=AustralianConfRoom  |
2318	        +-----------------------|---------------------------------+
2319	        | VC1                   | Description=Audience            |
2320	        |                       | EncodeGroupID=1                 |
2321	        | CSV(VC1)              |                                 |
2322	        +---------------------------------------------------------+
2323	            Table 10: Advertisement received from Endpoint A

2325	        +-----------------------+---------------------------------+
2326	        | Capture Scene #1      | Description=ChinaConfRoom       |
2327	        +-----------------------|---------------------------------+
2328	        | VC1                   | Description=Speaker             |
2329	        |                       | EncodeGroupID=1                 |
2330	        | VC2                   | Description=Audience            |
2331	        |                       | EncodeGroupID=1                 |
2332	        | CSV(VC1, VC2)         |                                 |
2333	        +---------------------------------------------------------+

2335	            Table 11: Advertisement received from Endpoint B

2337	        +-----------------------+---------------------------------+
2338	        | Capture Scene #1      | Description=USAConfRoom         |
2339	        +-----------------------|---------------------------------+
2340	        | VC1                   | Description=Audience            |
2341	        |                       | EncodeGroupID=1                 |
2342	        | CSV(VC1)              |                                 |
2343	        +---------------------------------------------------------+

2345	            Table 12: Advertisement received from Endpoint C

2347	   Note: Endpoint B above indicates that it sends two streams.

2349	   If the MCU wanted to provide a Multiple Content Capture containing
2350	   a round robin switched view of the audience from the 3 endpoints
2351	   and the speaker, it could construct the following advertisement:

2353	   Advertisement sent to Endpoint F

2355	        +=======================+=================================+
2356	        | Capture Scene #1      | Description=AustralianConfRoom  |
2357	        +-----------------------|---------------------------------+
2358	        | VC1                   | Description=Audience            |
2359	        | CSV(VC1)              |                                 |
2360	        +=======================+=================================+
2361	        | Capture Scene #2      | Description=ChinaConfRoom       |
2362	        +-----------------------|---------------------------------+
2363	        | VC2                   | Description=Speaker             |
2364	        | VC3                   | Description=Audience            |
2365	        | CSV(VC2, VC3)         |                                 |
2366	        +=======================+=================================+
2367	        | Capture Scene #3      | Description=USAConfRoom         |
2368	        +-----------------------|---------------------------------+
2369	        | VC4                   | Description=Audience            |
2370	        | CSV(VC4)              |                                 |
2371	        +=======================+=================================+
2372	        | Capture Scene #4      |                                 |
2373	        +-----------------------|---------------------------------+
2374	        | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1             |
2375	        |                       | MaxCaptures=1                   |
2376	        |                       | EncodingGroup=1                 |
2377	        | CSV(MCC1)             |                                 |
2378	        +=======================+=================================+

2380	         Table 13: Advertisement sent to Endpoint F - One Encoding
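
   A minimal sketch of the round robin behaviour of MCC1 follows.  It
   assumes that the MCC carries one constituent per switching interval
   (MaxCaptures=1) and cycles through the constituents in list order;
   the interval length and the meaning of the index in
   "RoundRobin:1" are not modelled, and the function name is
   hypothetical.

```python
from itertools import cycle, islice
from typing import Iterator, List


def round_robin(constituents: List[str]) -> Iterator[str]:
    """Yield the single Capture carried by the MCC in each
    successive switching interval (assumed list order)."""
    return cycle(constituents)


mcc1 = ["VC1", "VC2", "VC3", "VC4"]
schedule = list(islice(round_robin(mcc1), 6))
print(schedule)  # ['VC1', 'VC2', 'VC3', 'VC4', 'VC1', 'VC2']
```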

2382	   Alternatively, if the MCU wanted to provide the speaker as one media
2383	   stream and the audiences as another it could assign an encoding
2384	   group to VC2 in Capture Scene 2 and provide a CSV in Capture Scene
2385	   #4 as per the example below.

2387	   Advertisement sent to Endpoint F

2389	        +=======================+=================================+
2390	        | Capture Scene #1      | Description=AustralianConfRoom  |
2391	        +-----------------------|---------------------------------+
2392	        | VC1                   | Description=Audience            |
2393	        | CSV(VC1)              |                                 |
2394	        +=======================+=================================+
2395	        | Capture Scene #2      | Description=ChinaConfRoom       |
2396	        +-----------------------|---------------------------------+
2397	        | VC2                   | Description=Speaker             |
2398	        |                       | EncodingGroup=1                 |
2399	        | VC3                   | Description=Audience            |
2400	        | CSV(VC2, VC3)         |                                 |
2401	        +=======================+=================================+
2402	        | Capture Scene #3      | Description=USAConfRoom         |
2403	        +-----------------------|---------------------------------+
2404	        | VC4                   | Description=Audience            |
2405	        | CSV(VC4)              |                                 |
2406	        +=======================+=================================+
2407	        | Capture Scene #4      |                                 |
2408	        +-----------------------|---------------------------------+
2409	        | MCC1(VC1,VC3,VC4)     | Policy=RoundRobin:1             |
2410	        |                       | MaxCaptures=1                   |
2411	        |                       | EncodingGroup=1                 |
2412	        | MCC2(VC2)             | MaxCaptures=1                   |
2413	        |                       | EncodingGroup=1                 |
2414	        | CSV2(MCC1,MCC2)       |                                 |
2415	        +=======================+=================================+

2417	        Table 14: Advertisement sent to Endpoint F - Two Encodings

2419	   Therefore a Consumer could choose whether or not to have a separate
2420	   speaker related stream and could choose which endpoints to see.  If
2421	   it wanted the second stream but not the Australian conference room
2422	   it could indicate the following captures in the Configure message:

2424	        +-----------------------+---------------------------------+
2425	        | MCC1(VC3,VC4)         | Encoding                        |
2426	        | VC2                   | Encoding                        |
2427	        +-----------------------|---------------------------------+
2428	                      Table 15: MCU case: Consumer Response
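
   The Consumer's selection above can be sketched as a filter over the
   constituents of the advertised MCC.  The helper narrow_mcc and the
   scene_of lookup are hypothetical; the sketch only shows how the
   Australian conference room could be dropped from MCC1 before the
   Configure message is built.

```python
from typing import Dict, List, Set


def narrow_mcc(
    constituents: List[str],
    scene_of: Dict[str, str],
    unwanted_scenes: Set[str],
) -> List[str]:
    """Keep only the constituents whose source scene is wanted."""
    return [vc for vc in constituents if scene_of[vc] not in unwanted_scenes]


# Source scene of each Capture in the Advertisement sent to Endpoint F.
scene_of = {
    "VC1": "AustralianConfRoom",
    "VC2": "ChinaConfRoom",
    "VC3": "ChinaConfRoom",
    "VC4": "USAConfRoom",
}
selected = narrow_mcc(["VC1", "VC3", "VC4"], scene_of, {"AustralianConfRoom"})
print(selected)  # ['VC3', 'VC4']
```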

2430	12.3.2. Several MCCs in the same Advertisement

2432	   Multiple MCCs can be used where multiple streams are used to carry
2433	   media from multiple endpoints.  For example:

2435	   A conference has three endpoints D, E and F. Each endpoint has
2436	   three video captures covering the left, middle and right regions of
2437	   each conference room.  The MCU receives the following
2438	   advertisements from D and E.

2440	        +-----------------------+---------------------------------+
2441	        | Capture Scene #1      | Description=AustralianConfRoom  |
2442	        +-----------------------|---------------------------------+
2443	        | VC1                   | CaptureArea=Left                |
2444	        |                       | EncodingGroup=1                 |
2445	        | VC2                   | CaptureArea=Centre              |
2446	        |                       | EncodingGroup=1                 |
2447	        | VC3                   | CaptureArea=Right               |
2448	        |                       | EncodingGroup=1                 |
2449	        | CSV(VC1,VC2,VC3)      |                                 |
2450	        +---------------------------------------------------------+

2452	            Table 16: Advertisement received from Endpoint D

2454	        +-----------------------+---------------------------------+
2455	        | Capture Scene #1      | Description=ChinaConfRoom       |
2456	        +-----------------------|---------------------------------+
2457	        | VC1                   | CaptureArea=Left                |
2458	        |                       | EncodingGroup=1                 |
2459	        | VC2                   | CaptureArea=Centre              |
2460	        |                       | EncodingGroup=1                 |
2461	        | VC3                   | CaptureArea=Right               |
2462	        |                       | EncodingGroup=1                 |
2463	        | CSV(VC1,VC2,VC3)      |                                 |
2464	        +---------------------------------------------------------+

2466	            Table 17: Advertisement received from Endpoint E

2468	   The MCU wants to offer Endpoint F three Capture Encodings.  Each
2469	   Capture Encoding would contain all the Captures from either
2470	   Endpoint D or Endpoint E, depending on the active speaker.
2471	   The MCU sends the following Advertisement:

2473	        +=======================+=================================+
2474	        | Capture Scene #1      | Description=AustralianConfRoom  |
2475	        +-----------------------|---------------------------------+
2476	        | VC1                   |                                 |
2477	        | VC2                   |                                 |
2478	        | VC3                   |                                 |
2479	        | CSV(VC1,VC2,VC3)      |                                 |
2480	        +=======================+=================================+
2481	        | Capture Scene #2      | Description=ChinaConfRoom       |
2482	        +-----------------------|---------------------------------+
2483	        | VC4                   |                                 |
2484	        | VC5                   |                                 |
2485	        | VC6                   |                                 |
2486	        | CSV(VC4,VC5,VC6)      |                                 |
2487	        +=======================+=================================+
2488	        | Capture Scene #3      |                                 |
2489	        +-----------------------|---------------------------------+
2490	        | MCC1(VC1,VC4)         | CaptureArea=Left                |
2491	        |                       | MaxCaptures=1                   |
2492	        |                       | SynchronisationID=1             |
2493	        |                       | EncodingGroup=1                 |
2494	        | MCC2(VC2,VC5)         | CaptureArea=Centre              |
2495	        |                       | MaxCaptures=1                   |
2496	        |                       | SynchronisationID=1             |
2497	        |                       | EncodingGroup=1                 |
2498	        | MCC3(VC3,VC6)         | CaptureArea=Right               |
2499	        |                       | MaxCaptures=1                   |
2500	        |                       | SynchronisationID=1             |
2501	        |                       | EncodingGroup=1                 |
2502	        | CSV(MCC1,MCC2,MCC3)   |                                 |
2503	        +=======================+=================================+

2505	            Table 18: Advertisement sent to Endpoint F
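
   The effect of giving MCC1, MCC2 and MCC3 the same SynchronisationID
   can be sketched as follows: whenever the active speaker changes,
   all three MCCs switch to Captures from the same source scene.  The
   data layout and the function switch_site are assumptions made for
   illustration only.

```python
from typing import Dict, List

# Constituents of each MCC in the MCU's Advertisement.
MCCS: Dict[str, List[str]] = {
    "MCC1": ["VC1", "VC4"],
    "MCC2": ["VC2", "VC5"],
    "MCC3": ["VC3", "VC6"],
}
# Capture Scene each constituent belongs to (1 = Endpoint D, 2 = Endpoint E).
SCENE_OF: Dict[str, int] = {
    "VC1": 1, "VC2": 1, "VC3": 1,
    "VC4": 2, "VC5": 2, "VC6": 2,
}


def switch_site(active_scene: int) -> Dict[str, str]:
    """Pick, for every MCC, the constituent from the active scene."""
    selection: Dict[str, str] = {}
    for mcc, constituents in MCCS.items():
        selection[mcc] = next(
            vc for vc in constituents if SCENE_OF[vc] == active_scene
        )
    return selection


print(switch_site(2))  # {'MCC1': 'VC4', 'MCC2': 'VC5', 'MCC3': 'VC6'}
```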

2507	12.3.3. Heterogeneous conference with switching and composition

2509	   Consider a conference between endpoints with the following
2510	   characteristics:

2512	      Endpoint A - 4 screens, 3 cameras

2514	      Endpoint B - 3 screens, 3 cameras

2516	      Endpoint C - 3 screens, 3 cameras

2518	      Endpoint D - 3 screens, 3 cameras

2520	      Endpoint E - 1 screen, 1 camera

2522	      Endpoint F - 2 screens, 1 camera

2524	      Endpoint G - 1 screen, 1 camera

2526	   This example focuses on what the user in one of the 3-camera multi-
2527	   screen endpoints sees.  Call this person User A, at Endpoint A.
2528	   There are 4 large display screens at Endpoint A. Whenever somebody
2529	   at another site is speaking, all the video captures from that
2530	   endpoint are shown on the large screens.  If the talker is at a 3-
2531	   camera site, then the video from those 3 cameras fills 3 of the
2532	   screens.  If the talker is at a single-camera site, then video from
2533	   that camera fills one of the screens, while the other screens show
2534	   video from other single-camera endpoints.

2536	   User A hears audio from the 4 loudest talkers.

2538	   User A can also see video from other endpoints, in addition to the
2539	   current talker, although much smaller in size.  Endpoint A has 4
2540	   screens, so one of those screens shows up to 9 other Media Captures
2541	   in a tiled fashion.  When video from a 3 camera endpoint appears in
2542	   the tiled area, video from all 3 cameras appears together across
2543	   the screen with correct spatial relationship among those 3 images.

2545	      +---+---+---+ +-------------+ +-------------+ +-------------+
2546	      |   |   |   | |             | |             | |             |
2547	      +---+---+---+ |             | |             | |             |
2548	      |   |   |   | |             | |             | |             |
2549	      +---+---+---+ |             | |             | |             |
2550	      |   |   |   | |             | |             | |             |
2551	      +---+---+---+ +-------------+ +-------------+ +-------------+
2552	                     Figure 7: Endpoint A - 4 Screen Display

2554	   User B at Endpoint B sees a similar arrangement, except there are
2555	   only 3 screens, so the 9 other Media Captures are spread out across
2556	   the bottom of the 3 displays, in a picture-in-picture (PIP) format.
2557	   When video from a 3 camera endpoint appears in the PIP area, video
2558	   from all 3 cameras appears together across a single screen with
2559	   correct spatial relationship.

2561	              +-------------+ +-------------+ +-------------+
2562	              |             | |             | |             |
2563	              |             | |             | |             |
2564	              |             | |             | |             |
2565	              | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2566	              | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2567	              +-------------+ +-------------+ +-------------+
2568	                Figure 8: Endpoint B - 3 Screen Display with PiPs

2570	   When somebody at a different endpoint becomes the current talker,
2571	   then User A and User B both see the video from the new talker
2572	   appear on their large screen area, while the previous talker takes
2573	   one of the smaller tiled or PIP areas.  The person who is the
2574	   current talker doesn't see themselves; they see the previous talker
2575	   in their large screen area.

2577	   One of the points of this example is that endpoints A and B each
2578	   want to receive 3 capture encodings for their large display areas,
2579	   and 9 encodings for their smaller areas.  A and B are able to
2580	   each send the same Configure message to the MCU, and each receive
2581	   the same conceptual Media Captures from the MCU.  The differences
2582	   are in how they are rendered and are purely a local matter at A and
2583	   B.

2585	   The Advertisements for such a scenario are described below.

2587	        +-----------------------+---------------------------------+
2588	        | Capture Scene #1      | Description=Endpoint x          |
2589	        +-----------------------|---------------------------------+
2590	        | VC1                   | EncodingGroup=1                 |
2591	        | VC2                   | EncodingGroup=1                 |
2592	        | VC3                   | EncodingGroup=1                 |
2593	        | AC1                   | EncodingGroup=2                 |
2594	        | CSV1(VC1, VC2, VC3)   |                                 |
2595	        | CSV2(AC1)             |                                 |
2596	        +---------------------------------------------------------+

2598	   Table 19: Advertisement received at the MCU from Endpoints A to D

2600	        +-----------------------+---------------------------------+
2601	        | Capture Scene #1      | Description=Endpoint y          |
2602	        +-----------------------|---------------------------------+
2603	        | VC1                   | EncodingGroup=1                 |
2604	        | AC1                   | EncodingGroup=2                 |
2605	        | CSV1(VC1)             |                                 |
2606	        | CSV2(AC1)             |                                 |
2607	        +---------------------------------------------------------+

2609	   Table 20: Advertisement received at the MCU from Endpoints E to G

2611	   Rather than considering what is displayed, CLUE concentrates more
2612	   on what the MCU sends. The MCU doesn't know anything about the
2613	   number of screens an endpoint has.

2615	   As Endpoints A to D each advertise that three Captures make up a
2616	   Capture Scene, the MCU offers these in a "site" switching mode.
2617	   That is, there are three Multiple Content Captures (and
2618	   Capture Encodings) each switching between Endpoints. The MCU
2619	   switches in the applicable media into the stream based on voice
2620	   activity. Endpoint A will not see a capture from itself.

2622	   Using the MCC concept the MCU would send the following
2623	   Advertisement to endpoint A:

2625	        +=======================+=================================+
2626	        | Capture Scene #1      | Description=Endpoint B          |
2627	        +-----------------------|---------------------------------+
2628	        | VC4                   | CaptureArea=Left                |
2629	        | VC5                   | CaptureArea=Center              |
2630	        | VC6                   | CaptureArea=Right               |
2631	        | AC1                   |                                 |
2632	        | CSV(VC4,VC5,VC6)      |                                 |
2633	        | CSV(AC1)              |                                 |
2634	        +=======================+=================================+
2635	        | Capture Scene #2      | Description=Endpoint C          |
2636	        +-----------------------|---------------------------------+
2637	        | VC7                   | CaptureArea=Left                |
2638	        | VC8                   | CaptureArea=Center              |
2639	        | VC9                   | CaptureArea=Right               |
2640	        | AC2                   |                                 |
2641	        | CSV(VC7,VC8,VC9)      |                                 |
2642	        | CSV(AC2)              |                                 |
2643	        +=======================+=================================+
2644	        | Capture Scene #3      | Description=Endpoint D          |
2645	        +-----------------------|---------------------------------+
2646	        | VC10                  | CaptureArea=Left                |
2647	        | VC11                  | CaptureArea=Center              |
2648	        | VC12                  | CaptureArea=Right               |
2649	        | AC3                   |                                 |
2650	        | CSV(VC10,VC11,VC12)   |                                 |
2651	        | CSV(AC3)              |                                 |
2652	        +=======================+=================================+
2653	        | Capture Scene #4      | Description=Endpoint E          |
2654	        +-----------------------|---------------------------------+
2655	        | VC13                  |                                 |
2656	        | AC4                   |                                 |
2657	        | CSV(VC13)             |                                 |
2658	        | CSV(AC4)              |                                 |
2659	        +=======================+=================================+
2660	        | Capture Scene #5      | Description=Endpoint F          |
2661	        +-----------------------|---------------------------------+
2662	        | VC14                  |                                 |
2663	        | AC5                   |                                 |
2664	        | CSV(VC14)             |                                 |
2665	        | CSV(AC5)              |                                 |
2666	        +=======================+=================================+
2667	        | Capture Scene #6      | Description=Endpoint G          |
2668	        +-----------------------|---------------------------------+
2669	        | VC15                  |                                 |
2670	        | AC6                   |                                 |
2671	        | CSV(VC15)             |                                 |
2672	        | CSV(AC6)              |                                 |
2673	        +=======================+=================================+

2675	         Table 21: Advertisement sent to endpoint A - Source Part

2677	   The above part of the Advertisement presents information about the
2678	   sources to the MCC. The information is effectively the same as the
2679	   received Advertisements except that there are no Capture Encodings
2680	   associated with them and the identities have been re-numbered.

2682	   In addition to the source Capture information the MCU advertises
2683	   "site" switching of Endpoints B to G in three streams.

2685	        +=======================+=================================+
2686	        | Capture Scene #7      | Description=Output3streammix    |
2687	        +-----------------------|---------------------------------+
2688	        | MCC1(VC4,VC7,VC10,    | CaptureArea=Left                |
2689	        |      VC13)            | MaxCaptures=1                   |
2690	        |                       | SynchronisationID=1             |
2691	        |                       | Policy=SoundLevel:0             |
2692	        |                       | EncodingGroup=1                 |
2693	        |                       |                                 |
2694	        | MCC2(VC5,VC8,VC11,    | CaptureArea=Center              |
2695	        |      VC14)            | MaxCaptures=1                   |
2696	        |                       | SynchronisationID=1             |
2697	        |                       | Policy=SoundLevel:0             |
2698	        |                       | EncodingGroup=1                 |
2699	        |                       |                                 |
2700	        | MCC3(VC6,VC9,VC12,    | CaptureArea=Right               |
2701	        |      VC15)            | MaxCaptures=1                   |
2702	        |                       | SynchronisationID=1             |
2703	        |                       | Policy=SoundLevel:0             |
2704	        |                       | EncodingGroup=1                 |
2705	        |                       |                                 |
2706	        | MCC4() (for audio)    | CaptureArea=whole scene         |
2707	        |                       | MaxCaptures=1                   |
2708	        |                       | Policy=SoundLevel:0             |
2709	        |                       | EncodingGroup=2                 |
2710	        |                       |                                 |
2711	        | MCC5() (for audio)    | CaptureArea=whole scene         |
2712	        |                       | MaxCaptures=1                   |
2713	        |                       | Policy=SoundLevel:1             |
2714	        |                       | EncodingGroup=2                 |
2715	        |                       |                                 |
2716	        | MCC6() (for audio)    | CaptureArea=whole scene         |
2717	        |                       | MaxCaptures=1                   |
2718	        |                       | Policy=SoundLevel:2             |
2719	        |                       | EncodingGroup=2                 |
2720	        |                       |                                 |
2721	        | MCC7() (for audio)    | CaptureArea=whole scene         |
2722	        |                       | MaxCaptures=1                   |
2723	        |                       | Policy=SoundLevel:3             |
2724	        |                       | EncodingGroup=2                 |
2725	        |                       |                                 |
2726	        | CSV(MCC1,MCC2,MCC3)   |                                 |
2727	        | CSV(MCC4,MCC5,MCC6,   |                                 |
2728	        |     MCC7)             |                                 |
2729	        +=======================+=================================+

2731	       Table 22: Advertisement sent to endpoint A - switching part

2733	   The above part describes the 3 main switched streams that relate to
2734	   site switching. MaxCaptures=1 indicates that only one Capture from
2735	   the MCC is sent at a particular time. SynchronisationID=1 indicates
2736	   that the MCCs sharing it switch sources together. The provider can
2737	   choose to group together VC13, VC14, and VC15 for the purpose of
2738	   switching according to the SynchronisationID.  Therefore when the
2739	   provider switches one of them into an MCC, it can also switch the
2740	   others even though they are not part of the same Capture Scene.

2742	   All the audio for the conference is included in this Scene #7.
2743	   There isn't necessarily a one to one relation between any audio
2744	   capture and video capture in this scene.  Typically a change in
2745	   loudest talker will cause the MCU to switch the audio streams more
2746	   quickly than switching video streams.

2748	   The MCU can also supply nine media streams showing the active and
2749	   previous eight speakers. It includes the following in the
2750	   Advertisement:

2752	        +=======================+=================================+
2753	        | Capture Scene #8      | Description=Output9stream       |
2754	        +-----------------------|---------------------------------+
2755	        | MCC8(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
2756	        |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:0             |
2757	        |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2758	        |                       |                                 |
2759	        | MCC9(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
2760	        |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:1             |
2761	        |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2762	        |                       |                                 |
2763	        |          to           |               to                |
2764	        |                       |                                 |
2765	        | MCC16(VC4,VC5,VC6,VC7,| MaxCaptures=1                   |
2766	        |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:8             |
2767	        |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2768	        |                       |                                 |
2769	        | CSV(MCC8,MCC9,MCC10,  |                                 |
2770	        |     MCC11,MCC12,MCC13,|                                 |
2771	        |     MCC14,MCC15,MCC16)|                                 |
2772	        +=======================+=================================+

2774	       Table 23: Advertisement sent to endpoint A - 9 switched part

2776	   The above part indicates that there are 9 capture encodings. Each
2777	   of the Capture Encodings may contain any captures from any source
2778	   site with a maximum of one Capture at a time. Which Capture is
2779	   present is determined by the policy.  The MCCs in this scene do not
2780	   have any spatial attributes.

2782	   Note: The Provider alternatively could provide each of the MCCs
2783	   above in its own Capture Scene.
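
   One way to picture the Policy=SoundLevel:n values used above is as
   an index into a most-recent-first list of talkers, where index 0 is
   the active speaker and higher indices are previous speakers.  That
   reading is an assumption drawn from the "active and previous eight
   speakers" wording, and the class below is a hypothetical sketch,
   not a description of how any MCU implements the policy.

```python
from typing import List, Optional


class SpeakerHistory:
    """Most-recent-first list of distinct talking sources."""

    def __init__(self) -> None:
        self._order: List[str] = []

    def note_talker(self, source: str) -> None:
        # Move the source to the front; earlier talkers shift back.
        if source in self._order:
            self._order.remove(source)
        self._order.insert(0, source)

    def capture_for_index(self, index: int) -> Optional[str]:
        """Source carried by the MCC whose policy index is `index`,
        or None if there have not been that many talkers yet."""
        return self._order[index] if index < len(self._order) else None


history = SpeakerHistory()
for talker in ["VC13", "VC5", "VC14", "VC5"]:
    history.note_talker(talker)

print(history.capture_for_index(0))  # VC5
print(history.capture_for_index(1))  # VC14
print(history.capture_for_index(2))  # VC13
print(history.capture_for_index(8))  # None
```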

2785	   If the MCU wanted to provide a composed Capture Encoding containing
2786	   all of the 9 captures it could Advertise in addition:

2788	        +=======================+=================================+
2789	        | Capture Scene #9      | Description=NineTiles           |
2790	        +-----------------------|---------------------------------+
2791	        | MCC17(MCC8,MCC9,MCC10,| MaxCaptures=9                   |
2792	        |     MCC11,MCC12,MCC13,| EncodingGroup=1                 |
2793	        |     MCC14,MCC15,MCC16)|                                 |
2794	        |                       |                                 |
2795	        | CSV(MCC17)            |                                 |
2796	        +=======================+=================================+

2798	      Table 24: Advertisement sent to endpoint A - 9 composed part

2800	   As MaxCaptures is 9, the Capture Encoding contains information
2801	   from up to 9 sources at a time.

2803	   The Advertisement to Endpoint B is identical to the above, except
2804	   that the captures from Endpoint A would be added and the captures
2805	   from Endpoint B would be removed. Whether the Captures are rendered
2806	   on a four screen display or a three screen display is up to the
2807	   Consumer to determine.  The Consumer wants to place video captures
2808	   from the same original source endpoint together, in the correct
2809	   spatial order, but the MCCs do not have spatial attributes.  So the
2810	   Consumer needs to associate incoming media packets with the
2811	   original individual captures in the advertisement (such as VC4,
2812	   VC5, and VC6) in order to know the spatial information it needs for
2813	   correct placement on the screens.

2815	   Editor's note: this is an open issue, about how to associate
2816	   incoming packets with the original capture that is a constituent of
2817	   an MCC.  This document probably should mention it in an earlier
2818	   section, after the solution is worked out in the other CLUE
2819	   documents.
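
   Once a Consumer knows which original Capture each received stream
   carries (how it learns this is the open issue noted above, and is
   simply assumed here), placement can be sketched as grouping by
   source scene and ordering by capture area.  The SOURCE_INFO table,
   the AREA_ORDER ranking and the function layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (source scene, capture area) for a few Captures from the source
# part of the Advertisement; single-camera sites have no area.
SOURCE_INFO: Dict[str, Tuple[str, str]] = {
    "VC4": ("Endpoint B", "Left"),
    "VC5": ("Endpoint B", "Center"),
    "VC6": ("Endpoint B", "Right"),
    "VC13": ("Endpoint E", ""),
}
AREA_ORDER: Dict[str, int] = {"Left": 0, "Center": 1, "Right": 2, "": 0}


def layout(received: List[str]) -> Dict[str, List[str]]:
    """Group received Captures by source scene, left to right."""
    by_scene: Dict[str, List[str]] = {}
    for vc in received:
        scene, _area = SOURCE_INFO[vc]
        by_scene.setdefault(scene, []).append(vc)
    for captures in by_scene.values():
        captures.sort(key=lambda vc: AREA_ORDER[SOURCE_INFO[vc][1]])
    return by_scene


print(layout(["VC6", "VC13", "VC4", "VC5"]))
# {'Endpoint B': ['VC4', 'VC5', 'VC6'], 'Endpoint E': ['VC13']}
```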

2821	12.3.4. Heterogeneous conference with voice activated switching

2823	   This example illustrates how multipoint "voice activated switching"
2824	   behavior can be realized, with an endpoint making its own decision
2825	   about which of its outgoing video streams is considered the "active
2826	   talker" from that endpoint.  Then an MCU can decide which is the
2827	   active talker among the whole conference.

2829	   Consider a conference between endpoints with the following
2830	   characteristics:

2832	      Endpoint A - 3 screens, 3 cameras

2834	      Endpoint B - 3 screens, 3 cameras

2836	      Endpoint C - 1 screen, 1 camera

2838	   This example focuses on what the user at endpoint C sees.  The
2839	   user would like to see the video capture of the current talker,
2840	   without composing it with any other video capture.  In this
2841	   example endpoint C is capable of receiving only a single video
2842	   stream.  The following tables describe advertisements from A and B
2843	   to the MCU, and from the MCU to C, that can be used to accomplish
2844	   this.

2846	        +-----------------------+---------------------------------+
2847	        | Capture Scene #1      | Description=Endpoint x          |
2848	        +-----------------------+---------------------------------+
2849	        | VC1                   | CaptureArea=Left                |
2850	        |                       | EncodingGroup=1                 |
2851	        | VC2                   | CaptureArea=Center              |
2852	        |                       | EncodingGroup=1                 |
2853	        | VC3                   | CaptureArea=Right               |
2854	        |                       | EncodingGroup=1                 |
2855	        | MCC1(VC1,VC2,VC3)     | MaxCaptures=1                   |
2856	        |                       | CaptureArea=whole scene         |
2857	        |                       | Policy=SoundLevel:0             |
2858	        |                       | EncodingGroup=1                 |
2859	        | AC1                   | CaptureArea=whole scene         |
2860	        |                       | EncodingGroup=2                 |
2861	        | CSV1(VC1, VC2, VC3)   |                                 |
2862	        | CSV2(MCC1)            |                                 |
2863	        | CSV3(AC1)             |                                 |
2864	        +---------------------------------------------------------+

2866	   Table 25: Advertisement received at the MCU from Endpoints A and B

2868	   Endpoints A and B are advertising each individual video capture,
2869	   and also a switched capture MCC1 which switches between the other
2870	   three based on who is the active talker.  These endpoints do not
2871	   advertise distinct audio captures associated with each individual
2872	   video capture, so it would be impossible for the MCU (as a media
2873	   consumer) to make its own determination of which video capture is
2874	   the active talker based just on information in the audio streams.

2876	        +-----------------------+---------------------------------+
2877	        | Capture Scene #1      | Description=conference          |
2878	        +-----------------------+---------------------------------+
2879	        | MCC1()                | CaptureArea=Left                |
2880	        |                       | MaxCaptures=1                   |
2881	        |                       | SynchronisationID=1             |
2882	        |                       | Policy=SoundLevel:0             |
2883	        |                       | EncodingGroup=1                 |
2884	        |                       |                                 |
2885	        | MCC2()                | CaptureArea=Center              |
2886	        |                       | MaxCaptures=1                   |
2887	        |                       | SynchronisationID=1             |
2888	        |                       | Policy=SoundLevel:0             |
2889	        |                       | EncodingGroup=1                 |
2890	        |                       |                                 |
2891	        | MCC3()                | CaptureArea=Right               |
2892	        |                       | MaxCaptures=1                   |
2893	        |                       | SynchronisationID=1             |
2894	        |                       | Policy=SoundLevel:0             |
2895	        |                       | EncodingGroup=1                 |
2896	        |                       |                                 |
2897	        | MCC4()                | CaptureArea=whole scene         |
2898	        |                       | MaxCaptures=1                   |
2899	        |                       | Policy=SoundLevel:0             |
2900	        |                       | EncodingGroup=1                 |
2901	        |                       |                                 |
2902	        | MCC5() (for audio)    | CaptureArea=whole scene         |
2903	        |                       | MaxCaptures=1                   |
2904	        |                       | Policy=SoundLevel:0             |
2905	        |                       | EncodingGroup=2                 |
2906	        |                       |                                 |
2907	        | MCC6() (for audio)    | CaptureArea=whole scene         |
2908	        |                       | MaxCaptures=1                   |
2909	        |                       | Policy=SoundLevel:1             |
2910	        |                       | EncodingGroup=2                 |
2911	        | CSV1(MCC1,MCC2,MCC3)  |                                 |
2912	        | CSV2(MCC4)            |                                 |
2913	        | CSV3(MCC5,MCC6)       |                                 |
2914	        +---------------------------------------------------------+

2916	            Table 26: Advertisement sent from the MCU to C

2918	   The MCU advertises one scene, with four video MCCs.  Three of them
2919	   in CSV1 give a left, center, right view of the conference, with
2920	   "site switching". MCC4 provides a single video capture
2921	   representing a view of the whole conference.  The MCU intends for
2922	   MCC4 to be switched between all the other original source
2923	   captures.  In this example advertisement the MCU is not giving all
2924	   the information about all the other endpoints' scenes and which of
2925	   those captures is included in the MCCs.  The MCU could include all
2926	   that information if it wants to give the consumers more
2927	   information, but it is not necessary for this example scenario.
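
   As an illustrative sketch only, the video portion of the
   advertisement in Table 26 can be written as a small Python
   structure.  The names ADVERTISEMENT and pick_view are assumptions
   made for this example and are not part of the CLUE data model.
   The helper shows one way a Consumer limited to a given number of
   video streams, such as Endpoint C, might choose a Capture Scene
   View.

```python
# Hypothetical, simplified model of the video captures in Table 26.
# Field names mirror the table; this is not the CLUE XML schema.
ADVERTISEMENT = {
    "captures": {
        "MCC1": {"CaptureArea": "Left", "MaxCaptures": 1, "SynchronisationID": 1},
        "MCC2": {"CaptureArea": "Center", "MaxCaptures": 1, "SynchronisationID": 1},
        "MCC3": {"CaptureArea": "Right", "MaxCaptures": 1, "SynchronisationID": 1},
        "MCC4": {"CaptureArea": "whole scene", "MaxCaptures": 1},
    },
    "views": {
        "CSV1": ["MCC1", "MCC2", "MCC3"],
        "CSV2": ["MCC4"],
    },
}


def pick_view(advertisement, max_streams):
    """Return the name of the largest view that fits in max_streams, or None."""
    views = advertisement["views"]
    best = None
    for name, members in views.items():
        if len(members) > max_streams:
            continue  # the Consumer cannot receive this many streams
        if best is None or len(members) > len(views[best]):
            best = name
    return best
```

   With this model, a Consumer that can receive one video stream
   would select CSV2 (and therefore MCC4), while a three-screen
   Consumer would select CSV1.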

2929	   The Provider advertises MCC5 and MCC6 for audio.  Both are
2930	   switched captures, with different SoundLevel policies indicating
2931	   they are the top two dominant talkers.  The Provider advertises
2932	   CSV3 with both MCCs, suggesting the Consumer should use both if it
2933	   can.
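
   The following Python sketch illustrates one possible way a
   Provider could fill MCC5 and MCC6.  It assumes, as the paragraph
   above suggests, that the SoundLevel index ranks talkers by
   dominance; the normative meaning of the index is given by the
   Policy attribute definition, not by this example.  The function
   name and its inputs are hypothetical.

```python
def assign_audio_mccs(sound_levels, mcc_by_index):
    """Map each MCC to the source currently at its SoundLevel rank.

    sound_levels: dict of source name -> measured level (higher is louder).
    mcc_by_index: dict of SoundLevel index -> MCC name, e.g. {0: "MCC5"}.
    Assumption for this sketch: index 0 is the most dominant talker.
    """
    # Sort loudest first; ties are broken by source name for stability.
    ranked = sorted(sound_levels, key=lambda s: (-sound_levels[s], s))
    return {
        mcc: ranked[index]
        for index, mcc in sorted(mcc_by_index.items())
        if index < len(ranked)  # skip MCCs with no source at that rank
    }
```

   For example, with measured levels of 0.2 for A and 0.7 for B, MCC5
   would carry audio from B and MCC6 audio from A.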

2935	   Endpoint C, in its configure message to the MCU, requests to
2936	   receive MCC4 for video, and MCC5 and MCC6 for audio.  In order for
2937	   the MCU to get the information it needs to construct MCC4, it has
2938	   to send configure messages to A and B asking to receive MCC1 from
2939	   each of them, along with their AC1 audio.  Now the MCU can use
2940	   audio energy information from the two incoming audio streams from
2941	   A and B to determine which of those alternatives is the current
2942	   talker.  Based on that, the MCU uses either MCC1 from A or MCC1
2943	   from B as the source of MCC4 to send to C.
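
   The MCU's selection step described above can be sketched in
   Python as follows.  This is a minimal illustration under stated
   assumptions: audio energy is available as one number per
   endpoint, and the function name select_mcc4_source is
   hypothetical.  A real MCU would likely smooth the measurements
   over time before switching.

```python
def select_mcc4_source(audio_energy, current=None):
    """Pick which endpoint's MCC1 is used as the source of MCC4.

    audio_energy: dict of endpoint name -> energy of its AC1 stream.
    current: endpoint selected so far; it is kept unless another
             endpoint is strictly louder (avoids switching on ties).
    """
    if not audio_energy:
        return current  # nothing measured, keep the existing choice
    # Iterate in name order so ties resolve deterministically.
    loudest = max(sorted(audio_energy), key=lambda e: audio_energy[e])
    if current in audio_energy and audio_energy[current] >= audio_energy[loudest]:
        return current
    return loudest
```

   Given energies of 0.1 for A and 0.6 for B, the sketch selects B,
   so the MCU would forward MCC1 from B as MCC4 to C.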

2945	13. Acknowledgements

2947	   Allyn Romanow and Brian Baldino were authors of early versions.
2948	   Mark Gorzynski also contributed much to the initial approach.
2949	   Many others also contributed, including Christian Groves, Jonathan
2950	   Lennox, Paul Kyzivat, Rob Hansen, Roni Even, Christer Holmberg,
2951	   Stephen Botzko, Mary Barnes, John Leslie, and Paul Coverdale.

2953	14. IANA Considerations

2955	   None.

2957	15. Security Considerations

2959	   Telepresence conferencing sessions, and specifically the protocols
2960	   used by CLUE, are exposed to several potential attacks.  This is
2961	   due to the natural involvement of multiple endpoints and to the
2962	   many capabilities, often user-invoked, that the systems
2963	   provide.

2965	   An MCU involved in a CLUE session can experience many of the same
2966	   attacks as a conferencing system such as the one enabled by the
2967	   XCON framework [RFC5239]. Examples of attacks include the
2968	   following: an endpoint attempting to listen to sessions in which
2969	   it is not authorized to participate, an endpoint attempting to
2970	   disconnect or mute other users, and theft of service by an
2971	   endpoint attempting to create telepresence sessions it is not
2972	   allowed to create. Thus, it is RECOMMENDED that an MCU
2973	   implementing the protocols necessary to support CLUE follow the
2974	   security recommendations specified in the conference control
2975	   protocol documents.  In the case of CLUE, SIP is the default
2976	   conferencing protocol, thus the security considerations in
2977	   [RFC4579] MUST be followed.

2979	   A primary security concern surrounding the CLUE framework
2980	   introduced in this document is securing the actual
2981	   protocols and the associated authorization mechanisms.  These
2982	   concerns apply to endpoint-to-endpoint sessions, as well as
2983	   sessions involving multiple endpoints and MCUs. Figure 2 in
2984	   section 5 provides a basic flow of information exchange for CLUE
2985	   and the protocols involved.

2987	   As described in section 5, CLUE uses SIP/SDP to establish the
2988	   session prior to exchanging any CLUE-specific information. Thus,
2989	   the security mechanisms recommended for SIP [RFC3261], including
2990	   user authentication and authorization, SHOULD be followed. In
2991	   addition, the media is based on RTP and thus existing RTP security
2992	   mechanisms, such as DTLS/SRTP, MUST be supported.

2994	   A separate data channel is established to transport the CLUE
2995	   protocol messages. The contents of the CLUE protocol messages are
2996	   based on information introduced in this document, which is
2997	   represented by an XML schema defined in the CLUE data model
2998	   [I-D.ietf-clue-data-model-schema]. Some of this information
2999	   could introduce privacy concerns, in particular the xCard
3000	   information described in section 7.1.1.11.  In addition, the (text)
3001	   description field in the Media Capture attribute (section 7.1.1.7)
3002	   could possibly reveal sensitive information or specific
3003	   identities. The same would be true for the descriptions in the
3004	   Capture Scene (section 7.3.1) and Capture Scene View (7.3.2)
3005	   attributes.   One other important consideration for the
3006	   information in the xCard as well as the description field in the
3007	   Media Capture and Capture Scene View attributes is that while the
3008	   endpoints involved in the session have been authenticated, there
3009	   is no assurance that the information in the xCard or description
3010	   fields is authentic.  Thus, this information SHOULD NOT be used to
3011	   make any authorization decisions and the participants in the
3012	   sessions SHOULD be made aware of this.

3014	   While other information in the CLUE protocol messages does not
3015	   reveal specific identities, it can reveal characteristics and
3016	   capabilities of the endpoints.  That information could possibly
3017	   uniquely identify specific endpoints.  It might also be possible
3018	   for an attacker to manipulate the information and disrupt the CLUE
3019	   sessions.  It would also be possible to mount a DoS attack on the
3020	   CLUE endpoints if a malicious agent has access to the data
3021	   channel.  Thus, it MUST be possible for the endpoints to establish
3022	   a channel which is secure against both message recovery and
3023	   message modification. Further details on this are provided in the
3024	   CLUE data channel solution document [I-D.ietf-clue-datachannel].

3026	   There are also security issues associated with the authorization
3027	   to perform actions at the CLUE endpoints to invoke specific
3028	   capabilities (e.g., re-arranging screens, sharing content, etc.).
3029	   However, the policies and security associated with these actions
3030	   are outside the scope of this document and the overall CLUE
3031	   solution.

3033	16. Changes Since Last Version

3035	   NOTE TO THE RFC-Editor: Please remove this section prior to
3036	   publication as an RFC.

3038	   Changes from 16 to 17:

3040	     1. Ticket #59 - rename Capture Scene Entry (CSE) to Capture
3041	        Scene View (CSV)

3043	     2. Ticket #60 - rename Global CSE List to Global View List

3045	     3. Ticket #61 - Proposal for describing the coordinate system.
3046	        Describe it better, without conflicts if cameras point in
3047	        different directions.

3049	     4. Minor clarifications and improved wording for Synchronisation
3050	        Identity, MCC, Simultaneous Transmission Set.

3052	     5. Add definitions for CLUE-capable device and CLUE-enabled
3053	        call, taken from the signaling draft.

3055	     6. Update definitions of Capture Device, Media Consumer, Media
3056	        Provider, Endpoint, MCU, MCC.

3058	     7. Replace "middle box" with "MCU".

3060	     8. Explicitly state there can also be Media Captures that are
3061	        not included in a Capture Scene View.

3063	     9. Explicitly state "A single Encoding Group MAY refer to
3064	        encodings for different media types."

3066	     10. In example 12.1.1 add axes and audio captures to the
3067	        diagram, and describe placement of microphones.

3069	     11. Add references to data model and signaling drafts.

3071	     12. Split references into Normative and Informative sections.
3072	        Add heading number for references section.

3074	   Changes from 15 to 16:

3076	     1. Remove Audio Channel Format attribute

3078	     2. Add Audio Capture Sensitivity Pattern attribute

3080	     3. Clarify audio spatial information regarding point of capture
3081	        and point on line of capture.  Area of capture does not apply
3082	        to audio.

3084	     4. Update section 12 example for new treatment of audio spatial
3085	        information.

3087	     5. Clean up wording of some definitions, and various places in
3088	        sections 5 and 10.

3090	     6. Remove individual encoding parameter paragraph from section
3091	        9.

3093	     7. Update Advertisement diagram.

3095	     8. Update Acknowledgements.

3097	     9. References to use cases and requirements now refer to RFCs.

3099	     10. Minor editorial changes.

3101	   Changes from 14 to 15:

3103	     1. Add "=" and "<=" qualifiers to MaxCaptures attribute, and
3104	        clarify the meaning regarding switched and composed MCC.

3106	     2. Add section 7.3.3 Global Capture Scene Entry List, and a few
3107	        other sentences elsewhere that refer to global CSE sets.

3109	     3. Clarify: The Provider MUST be capable of encoding and sending
3110	        all Captures (*that have an encoding group*) in a single
3111	        Capture Scene Entry simultaneously.

3113	     4. Add voice activated switching example in section 12.

3115	     5. Change name of attributes Participant Info/Type to Person
3116	        Info/Type.

3118	     6. Clarify the Person Info/Type attributes have the same meaning
3119	        regardless of whether or not the capture has a Presentation
3120	        attribute.

3122	     7. Update example section 12.1 to be consistent with the rest of
3123	        the document, regarding MCC and capture attributes.

3125	     8. State explicitly each CSE has a unique ID.

3127	   Changes from 13 to 14:

3129	     1. Fill in section for Security Considerations.

3131	     2. Replace Role placeholder with Participant Information,
3132	        Participant Type, and Scene Information attributes.

3134	     3. Spatial information implies nothing about how constituent
3135	        media captures are combined into a composed MCC.

3137	     4. Clean up MCC example in Section 12.3.3.  Clarify behavior of
3138	        tiled and PIP display windows.  Add audio.  Add new open
3139	        issue about associating incoming packets to original source
3140	        capture.

3142	     5. Remove editor's note and associated statement about RTP
3143	        multiplexing at end of section 5.

3145	     6. Remove editor's note and associated paragraph about
3146	        overloading media channel with both CLUE and non-CLUE usage,
3147	        in section 5.

3149	     7. In section 10, clarify intent of media encodings conforming
3150	        to SDP, even with multiple CLUE message exchanges.  Remove
3151	        associated editor's note.

3153	   Changes from 12 to 13:

3155	     1. Added the MCC concept including updates to existing sections
3156	        to incorporate the MCC concept. New MCC attributes:
3157	        MaxCaptures, SynchronisationID and Policy.

3159	     2. Removed the "composed" and "switched" Capture attributes due
3160	        to overlap with the MCC concept.

3162	     3. Removed the "Scene-switch-policy" CSE attribute, replaced by
3163	        MCC and SynchronisationID.

3165	     4. Editorial enhancements including numbering of the Capture
3166	        attribute sections, tables, figures etc.

3168	   Changes from 11 to 12:

3170	     1. Ticket #44. Remove note questioning about requiring a
3171	        Consumer to send a Configure after receiving Advertisement.

3173	     2. Ticket #43. Remove ability for consumer to choose value of
3174	        attribute for scene-switch-policy.

3176	     3. Ticket #36. Remove computational complexity parameter,
3177	        MaxGroupPps, from Encoding Groups.

3179	     4. Reword the Abstract and parts of sections 1 and 4 (now 5)
3180	        based on Mary's suggestions as discussed on the list.  Move
3181	        part of the Introduction into a new section Overview &
3182	        Motivation.

3184	     5. Add diagram of an Advertisement, in the Overview of the
3185	        Framework/Model section.

3187	     6. Change Intended Status to Standards Track.

3189	     7. Clean up RFC2119 keyword language.

3191	   Changes from 10 to 11:

3193	     1. Add description attribute to Media Capture and Capture Scene
3194	        Entry.

3196	     2. Remove contradiction and change the note about open issue
3197	        regarding always responding to Advertisement with a Configure
3198	        message.

3200	     3. Update example section, to cleanup formatting and make the
3201	        media capture attributes and encoding parameters consistent
3202	        with the rest of the document.

3204	   Changes from 09 to 10:

3206	     1. Several minor clarifications such as about SDP usage, Media
3207	        Captures, Configure message.

3209	     2. Simultaneous Set can be expressed in terms of Capture Scene
3210	        and Capture Scene Entry.

3212	     3. Removed Area of Scene attribute.

3214	     4. Add attributes from draft-groves-clue-capture-attr-01.

3216	     5. Move some of the Media Capture attribute descriptions back
3217	        into this document, but try to leave detailed syntax to the
3218	        data model.  Remove the OUTSOURCE sections, which are already
3219	        incorporated into the data model document.

3221	   Changes from 08 to 09:

3223	     1. Use "document" instead of "memo".

3225	     2. Add basic call flow sequence diagram to introduction.

3227	     3. Add definitions for Advertisement and Configure messages.

3229	     4. Add definitions for Capture and Provider.

3231	     5. Update definition of Capture Scene.

3233	     6. Update definition of Individual Encoding.

3235	     7. Shorten definition of Media Capture and add key points in the
3236	        Media Captures section.

3238	     8. Reword a bit about capture scenes in overview.

3240	     9. Reword about labeling Media Captures.

3242	     10. Remove the Consumer Capability message.

3244	     11. New example section heading for media provider behavior

3246	     12. Clarifications in the Capture Scene section.

3248	     13. Clarifications in the Simultaneous Transmission Set section.

3250	     14. Capitalize defined terms.

3252	     15. Move call flow example from introduction to overview section

3254	     16. General editorial cleanup

3256	     17. Add some editors' notes requesting input on issues
3257	     18. Summarize some sections, and propose details be outsourced
3258	        to other documents.

3260	   Changes from 06 to 07:

3262	     1. Ticket #9.  Rename Axis of Capture Point attribute to Point
3263	        on Line of Capture.  Clarify the description of this
3264	        attribute.

3266	     2. Ticket #17.  Add "capture encoding" definition.  Use this new
3267	        term throughout document as appropriate, replacing some usage
3268	        of the terms "stream" and "encoding".

3270	     3. Ticket #18.  Add Max Capture Encodings media capture
3271	        attribute.

3273	     4. Add clarification that different capture scene entries are
3274	        not necessarily mutually exclusive.

3276	   Changes from 05 to 06:

3278	   1. Capture scene description attribute is a list of text strings,
3279	      each in a different language, rather than just a single string.

3281	   2. Add new Axis of Capture Point attribute.

3283	   3. Remove appendices A.1 through A.6.

3285	   4. Clarify that the provider must use the same coordinate system
3286	      with same scale and origin for all coordinates within the same
3287	      capture scene.

3289	   Changes from 04 to 05:

3291	   1. Clarify limitations of "composed" attribute.

3293	   2. Add new section "capture scene entry attributes" and add the
3294	      attribute "scene-switch-policy".

3296	   3. Add capture scene description attribute and description
3297	      language attribute.

3299	   4. Editorial changes to examples section for consistency with the
3300	      rest of the document.

3302	   Changes from 03 to 04:

3304	   1. Remove sentence from overview - "This constitutes a significant
3305	      change ..."

3307	   2. Clarify a consumer can choose a subset of captures from a
3308	      capture scene entry or a simultaneous set (in section "capture
3309	      scene" and "consumer's choice...").

3311	   3. Reword first paragraph of Media Capture Attributes section.

3313	   4. Clarify a stereo audio capture is different from two mono audio
3314	      captures (description of audio channel format attribute).

3316	   5. Clarify what it means when coordinate information is not
3317	      specified for area of capture, point of capture, area of scene.

3319	   6. Change the term "producer" to "provider" to be consistent (it
3320	      was just in two places).

3322	   7. Change name of "purpose" attribute to "content" and refer to
3323	      RFC4796 for values.

3325	   8. Clarify simultaneous sets are part of a provider advertisement,
3326	      and apply across all capture scenes in the advertisement.

3328	   9. Remove sentence about lip-sync between all media captures in a
3329	      capture scene.

3331	   10. Combine the concepts of "capture scene" and "capture set"
3332	      into a single concept, using the term "capture scene" to
3333	      replace the previous term "capture set", and eliminating the
3334	      original separate capture scene concept.

3336	17. Normative References

3338	   [I-D.ietf-clue-datachannel]
3339	              Holmberg, C., "CLUE Protocol Data Channel", draft-
3340	              ietf-clue-datachannel-00 (work in progress), March
3341	              2014.

3343	   [I-D.ietf-clue-data-model-schema]
3344	              Presta, R., Romano, S P., "An XML Schema for the CLUE
3345	              data model", draft-ietf-clue-data-model-schema-06 (work
3346	              in progress), June 2014.

3348	   [I-D.presta-clue-protocol]
3349	              Presta, R. and S. Romano, "CLUE protocol", draft-
3350	              presta-clue-protocol-04 (work in progress), May 2014.

3352	   [I-D.ietf-clue-signaling]
3353	              Kyzivat, P., Xiao, L., Groves, C., Hansen, R., "CLUE
3354	              Signaling", draft-ietf-clue-signaling-03 (work in
3355	              progress), August 2014.

3357	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
3358	              Requirement Levels", BCP 14, RFC 2119, March 1997.

3360	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G.,
3361	              Johnston, A., Peterson, J., Sparks, R., Handley, M.,
3362	              and E. Schooler, "SIP: Session Initiation Protocol",
3363	              RFC 3261, June 2002.

3366	   [RFC3264]  Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model
3367	              with the Session Description Protocol (SDP)", RFC 3264,
3368	              June 2002.

3370	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
3371	              Jacobson, "RTP: A Transport Protocol for Real-Time
3372	              Applications", STD 64, RFC 3550, July 2003.

3374	   [RFC4579]  Johnston, A., Levin, O., "SIP Call Control -
3375	              Conferencing for User Agents", RFC 4579, August 2006.

3377	18. Informative References

3379	   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
3380	              Session Initiation Protocol (SIP)", RFC 4353,
3381	              February 2006.

3383	   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC
3384	              5117, January 2008.

3386	   [RFC7205]  Romanow, A., Botzko, S., Duckworth, M., Even, R.,
3387	              "Use Cases for Telepresence Multistreams", RFC 7205,
3388	              April 2014.

3390	   [RFC7262]  Romanow, A., Botzko, S., Barnes, M., "Requirements
3391	              for Telepresence Multistreams", RFC 7262, June 2014.

3393	19. Authors' Addresses

3395	   Mark Duckworth (editor)
3396	   Polycom
3397	   Andover, MA  01810
3398	   USA

3400	   Email: mark.duckworth@polycom.com

3402	   Andrew Pepperell
3403	   Acano
3404	   Uxbridge, England
3405	   UK

3407	   Email: apeppere@gmail.com

3409	   Stephan Wenger
3410	   Vidyo, Inc.
3411	   433 Hackensack Ave.
3412	   Hackensack, N.J. 07601
3413	   USA

3415	   Email: stewe@stewe.org