CLUE WG                                             M. Duckworth, Ed.
Internet Draft                                                Polycom
Intended status: Informational                           A. Pepperell
Expires: November 16, 2013                                      Acano
                                                             S. Wenger
                                                                 Vidyo
                                                         July 15, 2013

             Framework for Telepresence Multi-Streams
                 draft-ietf-clue-framework-11.txt

Abstract

   This document offers a framework for a protocol that enables
   devices in a telepresence conference to interoperate by specifying
   the relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on November 16, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction...................................................3
   2. Terminology....................................................5
   3. Definitions....................................................5
   4. Overview of the Framework/Model................................8
   5. Spatial Relationships.........................................13
   6. Media Captures and Capture Scenes.............................14
      6.1. Media Captures...........................................14
         6.1.1. Media Capture Attributes............................15
      6.2. Capture Scene............................................19
         6.2.1. Capture Scene attributes............................22
         6.2.2. Capture Scene Entry attributes......................22
      6.3. Simultaneous Transmission Set Constraints................24
   7. Encodings.....................................................25
      7.1. Individual Encodings.....................................25
      7.2. Encoding Group...........................................27
   8. Associating Captures with Encoding Groups.....................28
   9. Consumer's Choice of Streams to Receive from the Provider.....29
      9.1. Local preference.........................................31
      9.2. Physical simultaneity restrictions.......................31
      9.3. Encoding and encoding group limits.......................31
   10. Extensibility................................................32
   11. Examples - Using the Framework...............................32
      11.1. Provider Behavior.......................................33
         11.1.1. Three screen Endpoint Provider.....................33
         11.1.2. Encoding Group Example.............................40
         11.1.3. The MCU Case.......................................41
      11.2. Media Consumer Behavior.................................41
         11.2.1. One screen Media Consumer..........................42
         11.2.2. Two screen Media Consumer configuring the example..42
         11.2.3. Three screen Media Consumer configuring the
                 example............................................43
   12. Acknowledgements.............................................43
   13. IANA Considerations..........................................44
   14. Security Considerations......................................44
   15. Changes Since Last Version...................................44
   16. Authors' Addresses...........................................48

1. Introduction

   Current telepresence systems, though based on open standards such
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases and to meet the
   requirements in draft-ietf-clue-telepresence-requirements.

   This document conceptually distinguishes between Media Providers
   and Media Consumers.  A Media Provider provides Media in the form
   of RTP packets; a Media Consumer consumes those RTP packets.  Media
   Providers and Media Consumers can reside in Endpoints or in
   middleboxes such as Multipoint Control Units (MCUs).  A Media
   Provider in an Endpoint is usually associated with the generation
   of media for Media Captures; these Media Captures are typically
   sourced from cameras, microphones, and the like.
   Similarly, the Media Consumer in an Endpoint is usually associated
   with Renderers, such as screens and loudspeakers.  In middleboxes,
   Media Providers and Consumers can take the form of outputs and
   inputs, respectively, of RTP mixers, RTP translators, and similar
   devices.  Typically, telepresence devices such as Endpoints and
   middleboxes would perform as both Media Providers and Media
   Consumers, the former being concerned with those devices'
   transmitted media and the latter with those devices' received
   media.  In a few circumstances, a CLUE Endpoint or middlebox may
   include only Consumer or Provider functionality, such as
   recorder-type Consumers or webcam-type Providers.

   Motivations for this document (and, in fact, for the existence of
   the CLUE protocol) include:

   (1) Endpoints according to this document can, and usually do, have
   multiple Media Captures and Media Renderers, for example, multiple
   cameras and screens.  While previous system designs were able to
   set up calls that would light up all screens and cameras (or
   equivalent), what was missing was a mechanism that can associate
   the Media Captures with each other in space and time.

   (2) The mere fact that there are multiple capture and rendering
   devices, each of which may be configurable in aspects such as zoom,
   leads to the difficulty that a variable number of such devices can
   be used to capture different aspects of a region.  The Capture
   Scene concept allows for the description of multiple setups for
   those multiple capture devices that could represent sensible
   operating points of the physical capture devices in a room, chosen
   by the operator.  A Consumer can pick and choose from those
   configurations based on its rendering abilities and inform the
   Provider about its choices.  Details are provided in section 6.

   (3) In some cases, physical limitations or other reasons disallow
   the concurrent use of a device in more than one setup.  For
   example, the center camera in a typical three-camera conference
   room can set its zoom objective either to capture only the middle
   few seats or all seats of a room, but not both concurrently.  The
   Simultaneous Transmission Set concept allows a Provider to signal
   such limitations.  Simultaneous Transmission Sets are part of the
   Capture Scene description, and are discussed in section 6.3.

   (4) Often, the devices in a room do not have the computational
   complexity or connectivity to deal with multiple encoding options
   simultaneously, even if each of these options may be sensible in
   certain environments, and even if the simultaneous transmission may
   also be sensible (e.g. in the case of multicast media distribution
   to multiple endpoints).  Such constraints can be expressed by the
   Provider using the Encoding Group concept, described in section 7.

   (5) Due to the potentially large number of RTP flows required for a
   Multimedia Conference involving potentially many Endpoints, each of
   which can have many Media Captures and Media Renderers, a sensible
   system design is to multiplex multiple RTP media flows onto the
   same transport address, so as to avoid using the port number as a
   multiplexing point and the associated shortcomings such as
   NAT/firewall traversal.
   While the actual mapping of those RTP flows to the header fields of
   the RTP packets is not the subject of this specification, the large
   number of possible permutations of sensible options a Media
   Provider may make available to a Media Consumer makes it desirable
   to have a mechanism that narrows down the number of possible
   options that a SIP offer-answer exchange has to consider.  Such
   information is made available using protocol mechanisms specified
   in this document and companion documents, although it should be
   stressed that its use in an implementation is optional.  Also,
   there are aspects of the control of both Endpoints and
   middleboxes/MCUs that dynamically change during the progress of a
   call, such as audio-level based screen switching, layout changes,
   and so on, which need to be conveyed.  Note that these control
   aspects are complementary to those specified in traditional SIP
   based conference management such as BFCP.  An exemplary call flow
   can be found in section 4.

   Finally, all this information needs to be conveyed, and the notion
   of support for it needs to be established.  This is done by the
   negotiation of a "CLUE channel", a data channel negotiated early
   during the initiation of a call.  An Endpoint or MCU that rejects
   the establishment of this data channel is, by definition, not
   supporting CLUE based mechanisms, whereas an Endpoint or MCU that
   accepts it is required to use it to the extent specified in this
   document and its companion documents.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3. Definitions

   The terms defined below are used throughout this document and
   companion documents, and they are normative.  In order to easily
   identify the use of a defined term, those terms are capitalized.

   Advertisement: a CLUE message a Media Provider sends to a Media
   Consumer describing specific aspects of the content of the media,
   the formatting of the media streams it can send, and any
   restrictions it has in terms of being able to provide certain
   Streams simultaneously.

   Audio Capture: Media Capture for audio.  Denoted as ACn in the
   example cases in this document.

   Camera-Left and Right: For Media Captures, camera-left and camera-
   right are from the point of view of a person observing the rendered
   media.  They are the opposite of Stage-Left and Stage-Right.

   Capture: Same as Media Capture.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.

   Capture Encoding: A specific encoding of a Media Capture, to be
   sent by a Media Provider to a Media Consumer via RTP.

   Capture Scene: a structure representing a spatial region containing
   one or more Capture Devices, each capturing media representing a
   portion of the region.  The spatial region represented by a Capture
   Scene may or may not correspond to a real region in physical space,
   such as a room.  A Capture Scene includes attributes and one or
   more Capture Scene Entries, with each entry including one or more
   Media Captures.

   Capture Scene Entry: a list of Media Captures of the same media
   type that together form one way to represent the entire Capture
   Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Configure Message: A CLUE message a Media Consumer sends to a Media
   Provider specifying which content and media streams it wants to
   receive, based on the information in a corresponding Advertisement
   message.

   Consumer: short for Media Consumer.

   Encoding or Individual Encoding: a set of parameters representing a
   way to encode a Media Capture to become a Capture Encoding.

   Encoding Group: A set of encoding parameters representing a total
   media encoding capability to be sub-divided across potentially
   multiple Individual Encodings.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An Endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  Endpoints can be anything from
   multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras.  Going
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353]-like Mixer, without the
   [RFC4353] requirement to send media to each participant.

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   Media Capture: a source of Media, such as from one or more Capture
   Devices or constructed from other Media streams.

   Media Consumer: an Endpoint or middlebox that receives Media
   streams.

   Media Provider: an Endpoint or middlebox that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Provider: Same as Media Provider.

   Render: the process of generating a representation from media, such
   as displayed motion video or sound emitted from loudspeakers.

   Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For Media Captures, Stage-left and Stage-
   right are the opposite of Camera-left and Camera-right.  For the
   case of a person facing (and captured by) a camera, Stage-left and
   Stage-right are from the point of view of that person.

   Stream: a Capture Encoding sent from a Media Provider to a Media
   Consumer via RTP [RFC3550].

   Stream Characteristics: the media stream attributes commonly used
   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
   resolution, profile/level etc.) as well as CLUE specific
   attributes, such as the Capture ID or a spatial location.

   Video Capture: Media Capture for video.  Denoted as VCn in the
   example cases in this document.

   Video Composite: A single image that is formed, normally by an RTP
   mixer inside an MCU, by combining visual elements from separate
   sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   A Media Provider (transmitting Endpoint or MCU) describes specific
   aspects of the content of the media and the formatting of the media
   streams it can send in an Advertisement; and the Media Consumer
   responds to the Media Provider by specifying which content and
   media streams it wants to receive in a Configure message.  The
   Provider then transmits the asked-for content in the specified
   streams.

   This Advertisement and Configure exchange occurs at a minimum
   during call initiation, but may also happen at any time throughout
   the call, whenever there is a change in what the Consumer wants to
   receive or (perhaps less commonly) in what the Provider can send.

   An Endpoint or MCU typically acts as both Provider and Consumer at
   the same time, sending Advertisements and sending Configure
   messages in response to receiving Advertisements.  (It is possible
   to be just one or the other.)

   The data model is based around two main concepts: a Capture and an
   Encoding.  A Media Capture (MC), such as audio or video, describes
   the content a Provider can send.  Media Captures are described in
   terms of CLUE-defined attributes, such as spatial relationships and
   the purpose of the capture.  Providers tell Consumers which Media
   Captures they can provide, described in terms of the Media Capture
   attributes.

   A Provider organizes its Media Captures into one or more Capture
   Scenes, each representing a spatial region, such as a room.  A
   Consumer chooses which Media Captures it wants to receive from each
   Capture Scene.

   In addition, the Provider can send the Consumer a description of
   the Individual Encodings it can send, in terms of the media
   attributes of the Encodings, in particular audio and video
   parameters such as bandwidth, frame rate, and macroblocks per
   second.  Note that this is optional, and intended to minimize the
   number of options a later SDP offer-answer exchange would need to
   include in the SDP in case of complex setups, as should become
   clearer shortly when discussing an outline of the call flow.

   The Provider can also specify constraints on its ability to provide
   Media, and a sensible design choice for a Consumer is to take these
   into account when choosing the content and Capture Encodings it
   requests in the later offer-answer exchange.  Some constraints are
   due to the physical limitations of devices - for example, a camera
   may not be able to provide zoom and non-zoom views simultaneously.
   Other constraints are system based, such as maximum bandwidth and
   maximum macroblocks/second.

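   Purely as an illustration of this data model (the concrete CLUE
   message syntax is specified in companion documents; the class and
   field names below are invented for this sketch), the concepts so
   far might be written in Python as:

      # Illustrative sketch only; the normative CLUE data model and
      # message syntax are defined in companion documents.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class MediaCapture:
          capture_id: str                  # e.g. "VC0" or "AC0"
          media_type: str                  # "audio" or "video"
          attributes: Dict[str, object] = field(default_factory=dict)
          encoding_group_id: str = ""      # see sections 7 and 8

      @dataclass
      class CaptureSceneEntry:
          capture_ids: List[str]           # captures of one media type

      @dataclass
      class CaptureScene:
          entries: List[CaptureSceneEntry]
          attributes: Dict[str, object] = field(default_factory=dict)

      @dataclass
      class Advertisement:                 # Provider -> Consumer
          scenes: List[CaptureScene]       # plus simultaneous sets and
                                           # encoding groups (6.3, 7)

      @dataclass
      class Configure:                     # Consumer -> Provider
          capture_encodings: List[Dict[str, object]]  # chosen streams
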
   A very brief outline of the call flow used by a simple system (two
   Endpoints) in compliance with this document can be described as
   follows, and as shown in the following figure.

      +-----------+                     +-----------+
      | Endpoint1 |                     | Endpoint2 |
      +----+------+                     +-----+-----+
           | INVITE (BASIC SDP+CLUECHANNEL)   |
           |--------------------------------->|
           |    200 OK (BASIC SDP+CLUECHANNEL)|
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |     BASIC SDP MEDIA SESSION      |
           |<################################>|
           |                                  |
           |   CONNECT (CLUE CTRL CHANNEL)    |
           |=================================>|
           |              ...                 |
           |<================================>|
           |  CLUE CTRL CHANNEL ESTABLISHED   |
           |<================================>|
           |                                  |
           |         ADVERTISEMENT 1          |
           |*********************************>|
           |         ADVERTISEMENT 2          |
           |<*********************************|
           |                                  |
           |           CONFIGURE 1            |
           |<*********************************|
           |           CONFIGURE 2            |
           |*********************************>|
           |                                  |
           |     REINVITE (UPDATED SDP)       |
           |--------------------------------->|
           |              200 OK (UPDATED SDP)|
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |    UPDATED SDP MEDIA SESSION     |
           |<################################>|
           |                                  |
           v                                  v

   An initial offer/answer exchange establishes a basic media session,
   for example audio-only, and a CLUE channel between two Endpoints.
   With the establishment of that channel, the endpoints have
   consented to use the CLUE protocol mechanisms and have to adhere to
   them.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities by sending an Advertisement as
   specified herein (which will typically not be sufficient to set up
   all media).  The Consumer in the Endpoint receives the information
   provided by the Provider, and can use it for two purposes.  First,
   it constructs and sends a CLUE Configure message to tell the
   Provider what the Consumer wishes to receive.  Second, it can, but
   is not necessarily required to, use the information provided to
   tailor the SDP it is going to send during the following SIP
   offer/answer exchange, and its reaction to the SDP it receives in
   that step.  It is often a sensible implementation choice to do so,
   as the representation of the media information conveyed over the
   CLUE channel can dramatically cut down on the size of the SDP
   messages used in the O/A exchange that follows.  Spatial
   relationships associated with the Media can be included in the
   Advertisement, and it is often sensible for the Media Consumer to
   take those spatial relationships into account when tailoring the
   SDP.

   This CLUE exchange is followed by an SDP offer/answer exchange that
   not only establishes those aspects of the media that have not been
   "negotiated" over CLUE, but also has the side effect of setting up
   the media transmission itself, potentially involving security
   exchanges, ICE, and whatnot.  This step is plain vanilla SIP, with
   the exception that the SDP used herein can in most cases (but need
   not) be considerably smaller than the SDP a system would typically
   need to exchange if there were no pre-established knowledge about
   the Provider and Consumer characteristics.  (The need for cutting
   down SDP size may not be obvious for a point-to-point call
   involving simple endpoints; however, when considering a large
   multipoint conference involving many multi-screen/multi-camera
   endpoints, each of which can operate using multiple codecs for each
   camera and microphone, it becomes perhaps somewhat more intuitive.)

   During the lifetime of a call, further exchanges can occur over the
   CLUE channel.  In some cases, those further exchanges can lead to a
   modified system behavior of Provider or Consumer (or both) without
   any other protocol activity such as further offer/answer exchanges.
   For example, voice-activated screen switching, signaled over the
   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
   re-invites.  However, in other cases, an additional offer/answer
   exchange may become necessary after the CLUE negotiation.  For
   example, if both sides decide to upgrade the call from a single
   screen to a multi-screen call and more bandwidth is required for
   the additional video channels, that could require a new O/A
   exchange.

   Numerous optimizations may be possible, and are the implementer's
   choice.  For example, it may be sensible to establish one or more
   initial media channels during the initial offer/answer exchange,
   which would allow, for example, for a fast startup of audio.
   Depending on the system design, it may be possible to re-use this
   established channel for more advanced media negotiated only by CLUE
   mechanisms, thereby avoiding further offer/answer exchanges.

   Edt. note: The editors are not sure whether the mentioned
   overloading of established RTP channels using only CLUE messages is
   possible, or desired by the WG.  If it were, certainly there is
   need for specification work.  One possible issue: a Provider that
   thinks it can switch, say, an audio codec algorithm by CLUE only,
   talks to a Consumer that thinks it has to faithfully answer the
   Provider's Advertisement through a Configure, but does not dare set
   up its internal resources until such time as it has completed the
   authoritative O/A exchange.  Working group input is solicited.

   One aspect of the protocol outlined herein, and specified in
   normative detail in companion documents, is that it makes
   information regarding the Provider's capabilities to deliver Media,
   and attributes related to that Media such as their spatial
   relationship, available to the Consumer.  The operation of the
   Renderer inside the Consumer is unspecified in that it can choose
   to ignore some information provided by the Provider, and/or not
   render media streams available from the Provider (although it has
   to follow the CLUE protocol and, therefore, has to gracefully
   receive and respond (through a Configure) to the Provider's
   information).  All CLUE protocol mechanisms are optional in the
   Consumer in the sense that, while the Consumer must be able to
   receive (and, potentially, gracefully acknowledge) CLUE messages,
   it is free to ignore the information provided therein.  Obviously,
   this is not a particularly sensible design choice.

   Legacy devices are defined herein as those Endpoints and MCUs that
   do not support the setup and use of the CLUE channel.  The notion
   of a device being a legacy device is established during the initial
   offer/answer exchange, in which the legacy device will not
   understand the offer for the CLUE channel and, therefore, reject
   it.  This is the indication for the CLUE-implementing Endpoint or
   MCU that the other side of the communication is not compliant with
   CLUE, and that it should fall back to whatever mechanism was used
   before the introduction of CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP transported) media;
   and the mechanisms described herein and in companion documents do
   not change the aspects of setting up those RTP flows and sessions.
   In other words, the RTP media sessions conform to the negotiated
   SDP whether or not CLUE is used.  However, it should be noted that
   forms of RTP multiplexing of multiple RTP flows onto the same
   transport address are being developed concurrently with the CLUE
   suite of specifications, and it is widely expected that most, if
   not all, Endpoints or MCUs supporting CLUE will also support those
   mechanisms.  Some design choices made in this document reflect this
   coincidence in spec development timing.

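   As a minimal sketch of the legacy determination described above
   (the helper and field names are hypothetical placeholders for an
   implementation's SDP machinery, not part of CLUE):

      # Hypothetical sketch; offer_answer() and its result are
      # placeholders for an implementation's SDP offer/answer logic.
      def negotiate(offer_answer):
          answer = offer_answer(offer_clue_data_channel=True)
          if answer.clue_channel_accepted:
              # Peer consented to CLUE: proceed with the
              # Advertisement/Configure exchange over the CLUE channel.
              return "clue"
          # Peer rejected the CLUE data channel: by definition a
          # legacy device, so fall back to pre-CLUE behavior.
          return "legacy"
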
5. Spatial Relationships

   In order for a Consumer to perform a proper rendering, it is often
   necessary, or at least helpful, for the Consumer to have received
   spatial information about the streams it is receiving.  CLUE
   defines a coordinate system that allows Media Providers to describe
   the spatial relationships of their Media Captures to enable proper
   scaling and spatially sensible rendering of their streams.  The
   coordinate system is based on a few principles:

   o  Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o  Coordinates can either be in real, physical units (millimeters),
      have an unknown scale, or have no physical scale.  Systems which
      know their physical dimensions (for example professionally
      installed Telepresence room systems) should always provide those
      real-world measurements.  Systems which don't know specific
      physical dimensions but still know relative distances should use
      'unknown scale'.  'No scale' is intended to be used where Media
      Captures from different devices (with potentially different
      scales) will be forwarded alongside one another (e.g. in the
      case of a middlebox).

      *  "Millimeters" means the scale is in millimeters.

      *  "Unknown" means the scale is not necessarily millimeters, but
         the scale is the same for every Capture in the Capture Scene.

      *  "No Scale" means the scale could be different for each
         capture - an MCU provider that advertises two adjacent
         captures and picks sources (which can change quickly) from
         different endpoints might use this value; the scale could be
         different and changing for each capture.  But the areas of
         capture still represent a spatial relation between captures.

   o  The coordinate system is Cartesian X, Y, Z with the origin at a
      spatial location of the provider's choosing.  The Provider must
      use the same coordinate system, with the same scale and origin,
      for all coordinates within the same Capture Scene.

   The direction of increasing coordinate values is:
      X increases from Camera-Left to Camera-Right
      Y increases from Front to back
      Z increases from low to high

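   As an illustration only (the type and constant names below are
   invented for this sketch, not CLUE syntax), the conventions above
   might look like this in code:

      # Sketch of the CLUE coordinate conventions; names invented.
      from dataclasses import dataclass

      MILLIMETERS, UNKNOWN_SCALE, NO_SCALE = "mm", "unknown", "noscale"

      @dataclass
      class Point:
          x: float  # increases Camera-Left -> Camera-Right
          y: float  # increases Front -> back
          z: float  # increases low -> high

      def directly_comparable(same_capture_scene: bool) -> bool:
          # One coordinate system, scale and origin per Capture Scene:
          # only coordinates from the same scene can be compared
          # directly; across scenes no common origin may be assumed.
          return same_capture_scene
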
6. Media Captures and Capture Scenes

   This section describes how Providers can describe the content of
   media to Consumers.

6.1. Media Captures

   Media Captures are the fundamental representations of streams that
   a device can transmit.  What a Media Capture actually represents is
   flexible:

   o  It can represent the immediate output of a physical source (e.g.
      camera, microphone) or 'synthetic' source (e.g. laptop computer,
      DVD player).

   o  It can represent the output of an audio mixer or video composer.

   o  It can represent a concept such as 'the loudest speaker'.

   o  It can represent a conceptual position such as 'the leftmost
      stream'.

   To identify and distinguish between multiple instances, video and
   audio captures are labeled.  For instance: VC1, VC2 and AC1, AC2,
   where VC1 and VC2 refer to two different video captures and AC1
   and AC2 refer to two different audio captures.

   Some key points about Media Captures:

   .  A Media Capture is of a single media type (e.g. audio or video).
   .  A Media Capture is associated with exactly one Capture Scene.
   .  A Media Capture is associated with one or more Capture Scene
      Entries.
   .  A Media Capture has exactly one set of spatial information.
   .  A Media Capture may be the source of one or more Capture
      Encodings.

   Each Media Capture can be associated with attributes to describe
   what it represents.

6.1.1. Media Capture Attributes

   Media Capture Attributes describe information about the Captures.
   A Provider can use the Media Capture Attributes to describe the
   Captures for the benefit of the Consumer in the Advertisement
   message.  Media Capture Attributes include:

   .  spatial information, such as point of capture, point on line
      of capture, and area of capture, all of which, in combination,
      define the capture field of, for example, a camera;
   .  capture multiplexing information (composed/switched video,
      mono/stereo audio, maximum number of simultaneous encodings
      per Capture and so on);
   .  other descriptive information to help the Consumer choose
      between captures (description, presentation, view, priority,
      language, role); and
   .  control information for use inside the CLUE protocol suite.

   Point of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes the spatial location of the capturing device (such as a
   camera).

   Point on Line of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes a position in space of a second point on the axis of the
   capturing device; the first point being the Point of Capture (see
   above).

   Together, the Point of Capture and Point on Line of Capture define
   an axis of the capturing device, for example the optical axis of a
   camera.  The Media Consumer can use this information to adjust how
   it renders the received media if it so chooses.

   Area of Capture:

   A field with a set of four (X, Y, Z) points as a value which
   describes the spatial location of what is being "captured".  By
   comparing the Area of Capture for different Media Captures within
   the same Capture Scene, a consumer can determine the spatial
   relationships between them and render them correctly.

   The four points should be co-planar, forming a quadrilateral, which
   defines the Plane of Interest for the particular media capture.

   If the Area of Capture is not specified, it means the Media Capture
   is not spatially related to any other Media Capture.

   For a switched capture that switches between different sections
   within a larger area, the area of capture should use coordinates
   for the larger potential area.

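   To make these fields concrete, here is a small illustrative sketch
   (helper names invented, not CLUE syntax) of how a Consumer might
   use them:

      # Illustrative only; helper names are invented for this sketch.
      def capture_axis(point_of_capture, point_on_line_of_capture):
          # Direction of the device axis (e.g. a camera's optical
          # axis): the vector from the first point through the second.
          (x0, y0, z0) = point_of_capture
          (x1, y1, z1) = point_on_line_of_capture
          return (x1 - x0, y1 - y0, z1 - z0)

      def centroid(area_of_capture):
          # area_of_capture: four roughly co-planar (x, y, z) corners.
          xs, ys, zs = zip(*area_of_capture)
          return (sum(xs) / 4.0, sum(ys) / 4.0, sum(zs) / 4.0)

      def camera_left_to_right(captures):
          # Order captures of one Capture Scene by centroid X (camera-
          # left has the smallest X), e.g. to lay them out on a row of
          # screens in a spatially sensible way.
          return sorted(captures, key=lambda c: centroid(c["area"])[0])
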
   Mobility of Capture:

   This attribute indicates whether the point of capture, point on
   line of capture, and area of capture values will stay the same, or
   are expected to change frequently.  Possible values are static,
   dynamic, and highly dynamic.

   For example, a camera may be placed at different positions in order
   to provide the best angle to capture a work task, or the capture
   may come from a camera worn by a participant.  This would have the
   effect of changing the capture point, capture axis and area of
   capture.  In order that the Consumer can choose to render the
   capture appropriately, the Provider can include this attribute to
   indicate whether or not the camera location is dynamic.

   The capture point of a static capture does not move for the life of
   the conference.  The capture point of a dynamic capture is
   characterised by a change in position followed by a reasonable
   period of stability.  A highly dynamic capture is characterised by
   a capture point that is constantly moving.  If the "area of
   capture", "capture point" and "line of capture" attributes are
   included with dynamic or highly dynamic captures, they indicate the
   spatial information at the time of the Advertisement.  No
   information regarding future spatial information should be assumed.

   Composed:

   A boolean field which indicates whether or not the Media Capture is
   a mix (audio) or composition (video) of streams.

   This attribute is useful for a media consumer to avoid nesting a
   composed video capture into another composed capture or rendering.
   This attribute is not intended to describe the layout a media
   provider uses when composing video streams.

   Switched:

   A boolean field which indicates whether or not the Media Capture
   represents the (dynamic) most appropriate subset of a 'whole'.
   What is 'most appropriate' is up to the provider and could be the
   active speaker, a lecturer or a VIP.

   Audio Channel Format:

   A field with enumerated values which describes the method of
   encoding used for audio.  A value of 'mono' means the Audio Capture
   has one channel.  'Stereo' means the Audio Capture has two audio
   channels, left and right.

   This attribute applies only to Audio Captures.  A single stereo
   capture is different from two mono captures that have a left-right
   spatial relationship.  A stereo capture maps to a single Capture
   Encoding, while each mono audio capture maps to a separate Capture
   Encoding.

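   A minimal sketch of that rule (data layout invented for the
   example):

      # Sketch; layout invented.  Each Audio Capture maps to exactly
      # one Capture Encoding, regardless of its channel count.
      stereo_room = [{"id": "AC0", "channel_format": "stereo"}]
      mono_pair = [{"id": "AC1", "channel_format": "mono"},   # left
                   {"id": "AC2", "channel_format": "mono"}]   # right

      def capture_encodings_needed(audio_captures):
          return len(audio_captures)  # one Capture Encoding each

      assert capture_encodings_needed(stereo_room) == 1
      assert capture_encodings_needed(mono_pair) == 2
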
   Max Capture Encodings:

   An optional attribute indicating the maximum number of Capture
   Encodings that can be simultaneously active for the Media Capture.
   The number of simultaneous Capture Encodings is also limited by the
   restrictions of the Encoding Group for the Media Capture.

   Description:

   Human-readable description of the Capture, which could be in
   multiple languages.

   Presentation:

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data etc.  Where more information is known about the capture, it
   may be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

   View:

   A field with enumerated values, indicating what type of view the
   capture relates to.  The Consumer can use this information to help
   choose which Media Captures it wishes to receive.  The value can be
   one of:

   Room - Captures the entire scene

   Table - Captures the conference table with seated participants

   Individual - Captures an individual participant

   Lectern - Captures the region of the lectern including the
   presenter in a classroom style conference

   Audience - Captures a region showing the audience in a classroom
   style conference

   Language:

   This attribute indicates one or more languages used in the content
   of the media capture.  Captures may be offered in different
   languages in the case of multilingual and/or accessible
   conferences, so a Consumer can use this attribute to differentiate
   between them.

   This indicates which language is associated with the capture.  For
   example, it may provide a language associated with an audio capture
   or a language associated with a video capture when sign
   interpretation or text is used.

   Role:

   Edt. Note -- this is a placeholder for a role attribute, as
   discussed in draft-groves-clue-capture-attr.  We expect to continue
   discussing the role attribute in the context of that draft, and
   follow-on drafts, before adding it to this framework document.

   Priority:

   This attribute indicates a relative priority between different
   Media Captures.  The Provider sets this priority, and the Consumer
   may use the priority to help decide which captures it wishes to
   receive.

   The "priority" attribute is an integer which indicates a relative
   priority between captures.  For example, it is possible to assign a
   priority between two presentation captures that would allow a
   remote endpoint to determine which presentation is more important.
   Priority is assigned at the individual capture level.  It
   represents the Provider's view of the relative priority between
   captures with a priority.  The same priority number may be used
   across multiple captures, indicating that they are equally
   important.  If no priority is assigned, no assumptions regarding
   the relative importance of the capture can be made.

   Embedded Text:

   This attribute indicates that a capture provides embedded textual
   information.  For example, the video capture may contain speech-to-
   text information composed with the video image.  This attribute is
   only applicable to video captures and presentation streams with
   visual information.

   Related To:

   This attribute indicates that the capture contains additional
   complementary information related to another capture.  The value
   indicates the other capture to which this capture is providing
   additional information.

   For example, a conference can utilise translators or facilitators
   that provide an additional audio stream (i.e. a translation or
   description or commentary of the conference).  Where multiple
   captures are available, it may be advantageous for a Consumer to
   select a complementary capture instead of, or in addition to, the
   capture it relates to.

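   As an illustration (data layout and helper name invented for this
   sketch) of how a Consumer might use these descriptive attributes
   when deciding what to request:

      # Illustrative Consumer-side selection; layout invented.
      def pick_presentation(captures, preferred_language="en"):
          candidates = [c for c in captures
                        if c.get("presentation")
                        and preferred_language in
                            c.get("language", [preferred_language])]
          # "priority" is a relative integer set by the Provider; this
          # sketch assumes a larger value means a more important
          # capture, and ranks captures without a priority lowest.
          return max(candidates,
                     key=lambda c: c.get("priority", 0),
                     default=None)
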
6.2. Capture Scene

   In order for a Provider's individual Captures to be used
   effectively by a Consumer, the Provider organizes the Captures into
   one or more Capture Scenes, with the structure and contents of
   these Capture Scenes being sent from the Provider to the Consumer
   in the Advertisement.

   A Capture Scene is a structure representing a spatial region
   containing one or more Capture Devices, each capturing media
   representing a portion of the region.  A Capture Scene includes one
   or more Capture Scene Entries, with each entry including one or
   more Media Captures.  A Capture Scene represents, for example, the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs in the Capture Scene Entries.  A middlebox
   may also express Capture Scenes that it constructs from media
   Streams it receives.

   A Provider may advertise multiple Capture Scenes or just a single
   Capture Scene.  What constitutes an entire Capture Scene is up to
   the Provider.  A Provider might typically use one Capture Scene for
   participant media (live video from the room cameras) and another
   Capture Scene for a computer generated presentation.  In more
   complex systems, the use of additional Capture Scenes is also
   sensible.  For example, a classroom may advertise two Capture
   Scenes involving live video, one including only the camera
   capturing the instructor (and associated audio), the other
   including camera(s) capturing students (and associated audio).

   A Capture Scene may (and typically will) include more than one type
   of media.  For example, a Capture Scene can include several Capture
   Scene Entries for Video Captures, and several Capture Scene Entries
   for Audio Captures.  A particular Capture may be included in more
   than one Capture Scene Entry.

   A Provider can express spatial relationships between Captures that
   are included in the same Capture Scene.  However, there is not
   necessarily the same spatial relationship between Media Captures
   that are in different Capture Scenes.  In other words, Capture
   Scenes can each use their own spatial measurement system, as
   outlined above in section 5.

   A Provider arranges Captures in a Capture Scene to help the
   Consumer choose which captures it wants.  The Capture Scene Entries
   in a Capture Scene are different alternatives the Provider is
   suggesting for representing the Capture Scene.  The order of
   Capture Scene Entries within a Capture Scene has no significance.
   The Media Consumer can choose to receive all Media Captures from
   one Capture Scene Entry for each media type (e.g. audio and video),
   or it can pick and choose Media Captures regardless of how the
   Provider arranges them in Capture Scene Entries.  Different Capture
   Scene Entries of the same media type are not necessarily mutually
   exclusive alternatives.  Also note that the presence of multiple
   Capture Scene Entries (with potentially multiple encoding options
   in each entry) in a given Capture Scene does not necessarily imply
   that a Provider is able to serve all the associated media
   simultaneously (although the construction of such an over-rich
   Capture Scene is probably not sensible in many cases).  What a
   Provider can send simultaneously is determined through the
   Simultaneous Transmission Set mechanism, described in section 6.3.

   Captures within the same Capture Scene Entry must be of the same
   media type - it is not possible to mix audio and video captures in
   the same Capture Scene Entry, for instance.  The Provider must be
   capable of encoding and sending all Captures in a single Capture
   Scene Entry simultaneously.  The order of Captures within a Capture
   Scene Entry has no significance.
   A Consumer may decide to receive all the Captures in a single
   Capture Scene Entry, but a Consumer could also decide to receive
   just a subset of those captures.  A Consumer can also decide to
   receive Captures from different Capture Scene Entries, all subject
   to the constraints set by Simultaneous Transmission Sets, as
   discussed in section 6.3.

   When a Provider advertises a Capture Scene with multiple entries,
   it is essentially signaling that there are multiple representations
   of the same Capture Scene available.  In some cases, these multiple
   representations would typically be used simultaneously (for
   instance a "video entry" and an "audio entry").  In some cases the
   entries would conceptually be alternatives (for instance an entry
   consisting of three Video Captures covering the whole room versus
   an entry consisting of just a single Video Capture covering only
   the center of a room).  In this latter example, one sensible choice
   for a Consumer would be to indicate (through its Configure and
   possibly through an additional offer/answer exchange) the Captures
   of that Capture Scene Entry that most closely matched the
   Consumer's number of display devices or screen layout.

   The following is an example of 4 potential Capture Scene Entries
   for an endpoint-style Provider:

   1. (VC0, VC1, VC2) - left, center and right camera Video Captures

   2. (VC3) - Video Capture associated with loudest room segment

   3. (VC4) - Video Capture zoomed out view of all people in the room

   4. (AC0) - main audio

   The first entry in this Capture Scene example is a list of Video
   Captures which have a spatial relationship to each other.
   Determination of the order of these captures (VC0, VC1 and VC2) for
   rendering purposes is accomplished through use of their Area of
   Capture attributes.  The second entry (VC3) and the third entry
   (VC4) are alternative representations of the same room's video,
   which might be better suited to some Consumers' rendering
   capabilities.  The inclusion of the Audio Capture in the same
   Capture Scene indicates that AC0 is associated with all of those
   Video Captures, meaning it comes from the same spatial region.
   Therefore, if audio were to be rendered at all, this audio would be
   the correct choice irrespective of which Video Captures were
   chosen.

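   Written out as data (layout invented for this sketch), the example
   scene and one plausible Consumer policy for choosing a video entry
   might look like:

      # The four example entries above, as illustrative data.
      scene_entries = {
          "video": [["VC0", "VC1", "VC2"],  # left, center, right
                    ["VC3"],                # loudest room segment
                    ["VC4"]],               # zoomed-out view
          "audio": [["AC0"]],               # main audio
      }

      def pick_video_entry(entries, num_screens):
          # One sensible policy: the entry whose number of captures
          # best matches the Consumer's number of display devices.
          return min(entries, key=lambda e: abs(len(e) - num_screens))

      assert pick_video_entry(scene_entries["video"], 3) \
          == ["VC0", "VC1", "VC2"]
      assert pick_video_entry(scene_entries["video"], 1) == ["VC3"]
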
6.2.1. Capture Scene attributes

   Attributes can be applied to Capture Scenes as well as to
   individual Media Captures.  Attributes specified at this level
   apply to all constituent Captures.  Capture Scene attributes
   include:

   .  Human-readable description of the Capture Scene, which could
      be in multiple languages;
   .  Scale information (millimeters, unknown, no scale), as
      described in section 5.

6.2.2. Capture Scene Entry attributes

   A Capture Scene can include one or more Capture Scene Entries in
   addition to the Capture Scene wide attributes described above.
   Capture Scene Entry attributes apply to the Capture Scene Entry as
   a whole, i.e. to all Captures that are part of the Capture Scene
   Entry.

   Capture Scene Entry attributes include:

   .  Human-readable description of the Capture Scene Entry, which
      could be in multiple languages;
   .  Scene-switch-policy: {site-switch, segment-switch}

   A media provider uses this scene-switch-policy attribute to
   indicate its support for different switching policies.  In the
   provider's Advertisement, this attribute can have multiple values,
   which means the provider supports each of the indicated policies.
   The consumer, when it requests media captures from this Capture
   Scene Entry, should also include this attribute, but with only the
   single value (from among the values indicated by the provider)
   indicating the Consumer's choice of which policy it wants the
   provider to use.  The Consumer must choose the same value for all
   the Media Captures in the Capture Scene Entry.  If the provider
   does not support any of these policies, it should omit this
   attribute.

   The "site-switch" policy means all captures are switched at the
   same time, to keep captures from the same endpoint site together.
   Let's say the speaker is at site A and everyone else is at a
   "remote" site.

   When the room at site A is shown, all the camera images from site A
   are forwarded to the remote sites.  Therefore, at each receiving
   remote site, all the screens display camera images from site A.
   This can be used to preserve full size image display, and also to
   provide the full visual context of the displayed far end, site A.
   In site switching, there is a fixed relation between the cameras in
   each room and the displays in remote rooms.  The room or
   participants being shown is switched from time to time based on who
   is speaking or by manual control.

   The "segment-switch" policy means different captures can switch at
   different times, and can be coming from different endpoints.  Still
   using site A as where the speaker is, and "remote" to refer to all
   the other sites, in segment switching, rather than sending all the
   images from site A, only the image containing the speaker at site A
   is shown.  The camera images of the current speaker and previous
   speakers (if any) are forwarded to the other sites in the
   conference.

   Therefore, the screens in each site are usually displaying images
   from different remote sites - the current speaker at site A and the
   previous ones.  This strategy can be used to preserve full size
   image display, and also to capture the non-verbal communication
   between the speakers.  In segment switching, the display depends on
   the activity in the remote rooms - generally, but not necessarily,
   based on audio / speech detection.

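   A small sketch (helper name invented) of the negotiation rule
   above, in which the Consumer echoes exactly one of the policies the
   Provider advertised:

      # Sketch of the scene-switch-policy rule; helper name invented.
      def choose_switch_policy(advertised, preferred="segment-switch"):
          # The Consumer picks a single value from the Provider's list
          # and must use it for all Media Captures in the Capture
          # Scene Entry.
          if not advertised:
              return None  # Provider omitted the attribute
          return preferred if preferred in advertised else advertised[0]

      assert choose_switch_policy(["site-switch", "segment-switch"]) \
          == "segment-switch"
      assert choose_switch_policy(["site-switch"]) == "site-switch"
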
6.3. Simultaneous Transmission Set Constraints

   The Provider may have constraints or limitations on its ability to
   send Captures.  One type is caused by the physical limitations of
   capture mechanisms; these constraints are represented by a
   Simultaneous Transmission Set.  The second type of limitation
   reflects the encoding resources available - bandwidth and
   macroblocks/second.  This type of constraint is captured by
   Encoding Groups, discussed below.

   Some Endpoints or MCUs can send multiple Captures simultaneously;
   however, sometimes there are constraints that limit which Captures
   can be sent simultaneously with other Captures.  A device may not
   be able to be used in different ways at the same time.  Provider
   Advertisements are made so that the Consumer can choose one of
   several possible mutually exclusive usages of the device.  This
   type of constraint is expressed in a Simultaneous Transmission Set,
   which lists all the Captures of a particular media type (e.g.
   audio, video, text) that can be sent at the same time.  There are
   different Simultaneous Transmission Sets for each media type in the
   Advertisement.  This is easier to show in an example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons each - VC0, VC1, VC2.  The middle camera can also zoom out
   (using an optical zoom lens) and show all six persons, VC3.  But
   the middle camera cannot be used in both modes at the same time -
   it has to either show the space where two participants sit or the
   whole six seats, but not both at the same time.

   Simultaneous Transmission Sets are expressed as sets of the Media
   Captures that the Provider could transmit at the same time (though
   it may not make sense to do so).  In this example, the two
   simultaneous sets are shown in Table 1.  If a Provider advertises
   one or more mutually exclusive Simultaneous Transmission Sets, then
   for each media type the Consumer must ensure that it chooses Media
   Captures that lie wholly within one of those Simultaneous
   Transmission Sets.

      +-------------------+
      | Simultaneous Sets |
      +-------------------+
      |  {VC0, VC1, VC2}  |
      |  {VC0, VC3, VC2}  |
      +-------------------+

      Table 1: Two Simultaneous Transmission Sets

   A Provider optionally can include the simultaneous sets in its
   Advertisement.  These simultaneous set constraints apply across all
   the Capture Scenes in the Advertisement.  It is a syntax
   conformance requirement that the Simultaneous Transmission Sets
   must allow all the Media Captures in any particular Capture Scene
   Entry to be used simultaneously.

   For shorthand convenience, a Provider may describe a Simultaneous
   Transmission Set in terms of Capture Scene Entries and Capture
   Scenes.  If a Capture Scene Entry is included in a Simultaneous
   Transmission Set, then all Media Captures in the Capture Scene
   Entry are included in the Simultaneous Transmission Set.  If a
   Capture Scene is included in a Simultaneous Transmission Set, then
   all its Capture Scene Entries (of the corresponding media type) are
   included in the Simultaneous Transmission Set.  The end result
   reduces to a set of Media Captures in any case.

   If an Advertisement does not include Simultaneous Transmission
   Sets, then all Capture Scenes can be provided simultaneously.  If
   multiple Capture Scene Entries are in a Capture Scene, then the
   Consumer chooses at most one Capture Scene Entry per Capture Scene
   for each media type.

   If an Advertisement includes multiple Capture Scene Entries in a
   Capture Scene, then the Consumer should choose one Capture Scene
   Entry for each media type, but may choose individual Captures based
   on the Simultaneous Transmission Sets.

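   The Consumer-side check implied by these rules can be sketched as
   follows (data layout invented): a selection of captures for one
   media type is valid only if it lies wholly within at least one
   advertised Simultaneous Transmission Set.

      # Sketch; layout invented.  The two video sets from Table 1:
      simultaneous_sets = [{"VC0", "VC1", "VC2"},
                           {"VC0", "VC3", "VC2"}]

      def selection_allowed(chosen, sets):
          # Chosen captures must fit wholly inside one set.
          return any(set(chosen) <= s for s in sets)

      assert selection_allowed(["VC0", "VC1"], simultaneous_sets)
      # VC1 and VC3 both need the middle camera, so this must fail:
      assert not selection_allowed(["VC1", "VC3"], simultaneous_sets)
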
7. Encodings

   Individual Encodings and Encoding Groups are CLUE's mechanisms
   allowing a Provider to signal its limitations for sending Captures,
   or combinations of Captures, to a Consumer.  Consumers can map the
   Captures they want to receive onto the Encodings, with the encoding
   parameters they want.  As for the relationship between the CLUE-
   specified mechanisms based on Encodings and the SIP Offer-Answer
   exchange, please refer to section 4.

7.1. Individual Encodings

   An Individual Encoding represents a way to encode a Media Capture
   to become a Capture Encoding, to be sent as an encoded media stream
   from the Provider to the Consumer.  An Individual Encoding has a
   set of parameters characterizing how the media is encoded.

   Different media types have different parameters, and different
   encoding algorithms may have different parameters.  An Individual
   Encoding can be assigned to at most one Capture Encoding at any
   given time.

   The parameters of an Individual Encoding represent the maximum
   values for certain aspects of the encoding.  A particular
   instantiation into a Capture Encoding might use lower values than
   these maximums.

   In general, the parameters of an Individual Encoding have been
   chosen to represent those negotiable parameters of media codecs of
   the media type that greatly influence computational complexity,
   while abstracting from the details of the particular media codecs
   used.  The parameters have been chosen with those media codecs in
   mind that have seen wide deployment in the video conferencing and
   Telepresence industry.

   For video codecs (using H.26x compression technologies), those
   parameters include:

   .  Maximum bitrate;
   .  Maximum picture size in pixels;
   .  Maximum number of pixels to be processed per second; and
   .  CLUE-protocol internal information.

   For audio codecs, so far only one parameter has been identified:

   .  Maximum bitrate.

   Edt. note: the maximum number of pixels per second is currently
   expressed as H.264maxmbps.

   Edt. note: it would be desirable to make the computational
   complexity mechanism codec independent, so as to allow for
   expressing that, say, H.264 codecs are less complex than H.265
   codecs, and, therefore, the same hardware can process higher pixel
   rates for H.264 than for H.265.  To be discussed in the WG.

7.2. Encoding Group

   An Encoding Group includes a set of one or more Individual
   Encodings, and parameters that apply to the group as a whole.  By
   grouping multiple Individual Encodings together, an Encoding Group
   describes additional constraints on bandwidth and other parameters
   for the group.

   The Encoding Group data structure contains:

   .  Maximum bitrate for all encodings in the group combined;
   .  Maximum number of pixels per second for all video encodings of
      the group combined; and
   .  A list of identifiers for the audio and video encodings,
      respectively, belonging to the group.

   When the Individual Encodings in a group are instantiated into
   Capture Encodings, each Capture Encoding has a bitrate that must be
   less than or equal to the max bitrate for the particular Individual
   Encoding.  The "maximum bitrate for all encodings in the group"
   parameter gives the additional restriction that the sum of all the
   individual Capture Encoding bitrates must be less than or equal to
   this group value.

   Likewise, the sum of the pixels per second of each instantiated
   encoding in the group must not exceed the group value.

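   These group-wide limits amount to a simple validation over the
   Capture Encodings a Consumer asks to instantiate, sketched below
   (names invented for the example):

      # Sketch; names invented.  Each requested encoding carries the
      # actual bitrate and pixel rate the Consumer configured.
      def group_limits_ok(requested, group_max_bitrate, group_max_pps):
          # Per-encoding maximums are checked against the Individual
          # Encodings; here only the group-wide sums are verified.
          total_bitrate = sum(r["bitrate"] for r in requested)
          total_pps = sum(r.get("pixels_per_second", 0)
                          for r in requested)
          return (total_bitrate <= group_max_bitrate
                  and total_pps <= group_max_pps)
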
1174 The following diagram illustrates one example of the structure of a
1175 media provider's Encoding Groups and their contents.

1177 ,-------------------------------------------------.
1178 | Media Provider                                  |
1179 |                                                 |
1180 | ,--------------------------------------.        |
1181 | | ,--------------------------------------.      |
1182 | | | ,--------------------------------------.    |
1183 | | | | Encoding Group                       |    |
1184 | | | | ,-----------.                        |    |
1185 | | | | |           | ,---------.            |    |
1186 | | | | |           | |         | ,---------.|    |
1187 | | | | | Encoding1 | |Encoding2| |Encoding3||    |
1188 | `.| | |           | |         | `---------'|    |
1189 |   `.| `-----------' `---------'            |    |
1190 |     `--------------------------------------'    |
1191 `-------------------------------------------------'

1193 Figure 1: Encoding Group Structure

1195 A Provider advertises one or more Encoding Groups. Each Encoding
1196 Group includes one or more Individual Encodings. Each Individual
1197 Encoding can represent a different way of encoding media. For
1198 example, one Individual Encoding may be 1080p60 video, another
1199 could be 720p30, with a third being CIF, all in, for example,
1200 H.264 format.

1202 While a typical three codec/display system might have one Encoding
1203 Group per "codec box" (physical codec, connected to one camera and
1204 one screen), there are many possibilities for the number of
1205 Encoding Groups a Provider may be able to offer and for the
1206 encoding values in each Encoding Group.

1208 There is no requirement for all Encodings within an Encoding Group
1209 to be instantiated at the same time.

1211 8. Associating Captures with Encoding Groups

1213 Every Capture is associated with an Encoding Group, which is used
1214 to instantiate that Capture into one or more Capture Encodings.
1215 More than one Capture may use the same Encoding Group.

1217 The maximum number of streams that can result from a particular
1218 Encoding Group constraint is equal to the number of Individual
1219 Encodings in the group. The actual number of Capture Encodings
1220 used at any time may be less than this maximum. Any of the
1221 Captures that use a particular Encoding Group can be encoded
1222 according to any of the Individual Encodings in the group. If
1223 there are multiple Individual Encodings in the group, then the
1224 Consumer can configure the Provider, via a Configure message, to
1225 encode a single Media Capture into multiple different Capture
1226 Encodings at the same time, subject to the Max Capture Encodings
1227 constraint, with each Capture Encoding following the constraints
1228 of a different Individual Encoding.

1230 It is a protocol conformance requirement that the Encoding Groups
1231 must allow all the Captures in a particular Capture Scene Entry to
1232 be used simultaneously.
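The association can be pictured as a simple mapping. The following
non-normative sketch (in Python, reusing identifiers in the style
of the examples in Section 11; it is illustrative only) derives
which Individual Encodings may instantiate a given Capture:

   # Non-normative sketch: each Capture names one Encoding Group; any
   # Individual Encoding in that group may instantiate the Capture.
   CAPTURE_TO_GROUP = {"VC0": "EG0", "VC1": "EG1", "VC3": "EG1"}
   GROUP_TO_ENCODINGS = {"EG0": ["ENC0", "ENC1", "ENC2"],
                         "EG1": ["ENC3", "ENC4", "ENC5"]}

   def allowed_encodings(capture_id):
       return GROUP_TO_ENCODINGS[CAPTURE_TO_GROUP[capture_id]]

   assert allowed_encodings("VC3") == ["ENC3", "ENC4", "ENC5"]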
1235 9. Consumer's Choice of Streams to Receive from the Provider

1237 After receiving the Provider's Advertisement message (which
1238 includes media captures and associated constraints), the Consumer
1239 composes its reply to the Provider in the form of a Configure
1240 message. The Consumer is free to use the information in the
1241 Advertisement as it chooses, but there are a few obviously
1242 sensible design choices, which are outlined below.

1243 If multiple Providers connect to the same Consumer (i.e., in an
1244 MCU-less multiparty call), it is the responsibility of the
1245 Consumer to compose Configures for each Provider that fulfill both
1246 that Provider's constraints, as expressed in the Advertisement,
1247 and the Consumer's own capabilities.

1249 In an MCU-based multiparty call, the MCU can logically terminate
1250 the Advertisement/Configure negotiation in that it can hide the
1251 characteristics of the receiving endpoint and rely on its own
1252 capabilities (transcoding/transrating/...) to create Media Streams
1253 that can be decoded at the Endpoint Consumers. The timing of an
1254 MCU's sending of Advertisements (for its outgoing ports) and
1255 Configures (for its incoming ports, in response to Advertisements
1256 received there) is up to the MCU and implementation dependent.

1258 As a general outline, a Consumer can choose, based on the
1259 Advertisement it has received, which Captures it wishes to
1260 receive, and which Individual Encodings it wants the Provider to
1261 use to encode the Captures. Each Capture has an Encoding Group ID
1262 attribute which specifies which Individual Encodings are available
1263 to be used for that Capture.

1265 A Configure Message includes a list of Capture Encodings. These
1266 are the Capture Encodings the Consumer wishes to receive from the
1267 Provider. Each Capture Encoding refers to one Media Capture and
1268 one Individual Encoding, and includes the encoding parameter
1269 values. For each Media Capture in the message, the Consumer may
1270 also specify the value of any attributes for which the Provider
1271 has offered a choice, for example the value for the
1272 Scene-switch-policy attribute. A Configure Message does not
1273 include references to Capture Scenes or Capture Scene Entries.

1275 For each Capture the Consumer wants to receive, it configures one
1276 or more of the encodings in that Capture's Encoding Group. The
1277 Consumer does this by telling the Provider, in its Configure
1278 Message, parameters such as the resolution, frame rate, bandwidth,
1279 etc., for each Capture Encoding of its chosen Captures. Upon
1280 receipt of this Configure from the Consumer, common knowledge is
1281 established between Provider and Consumer regarding sensible
1282 choices for the media streams and their parameters. The setup of
1283 the actual media channels, at least in the simplest case, is left
1284 to a following offer-answer exchange. Optimized implementations
1285 may speed up the reaction to the offer-answer exchange by
1286 reserving the resources at the time of finalization of the CLUE
1287 handshake. Even more advanced devices may choose to establish
1288 media streams without an offer-answer exchange, for example by
1289 overloading existing 5-tuple connections with the negotiated media.

1291 The Consumer must have received at least one Advertisement from
1292 the Provider to be able to create and send a Configure.

1294 In addition, the Consumer can send a Configure at any time during
1295 the call. The Configure must be valid according to the most
1296 recently received Advertisement. The Consumer can send a
1297 Configure either in response to a new Advertisement from the
1298 Provider or on its own, for example because of a local change in
1299 conditions (people leaving the room, connectivity changes,
1300 multipoint-related considerations).

1302 Edt. note: The editors solicit input from the working group as to
1303 whether or not a Consumer must respond to every Advertisement with
1304 a new Configure message. We expect this to be decided in the
1305 context of the signaling document; once decided, it should be
1306 mentioned here.

1308 When choosing which Media Streams to receive from the Provider,
1309 and the encoding characteristics of those Media Streams, the
1310 Consumer advantageously takes several things into account: its
1311 local preference, simultaneity restrictions, and encoding limits.
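To make the shape of this message concrete, the following
non-normative sketch (in Python; the field names are illustrative
assumptions, not the normative CLUE syntax) shows the essential
content of a Configure listing a single Capture Encoding:

   # Non-normative sketch of the essential content of a Configure
   # message: a list of Capture Encodings, each naming one Media
   # Capture, one Individual Encoding, and the chosen parameters.
   configure = {
       "captureEncodings": [
           {"captureID": "VC0",      # which Media Capture
            "encodingID": "ENC0",    # which Individual Encoding
            "width": 1920, "height": 1080,
            "frameRate": 30, "bitrate": 4000000},
       ]
   }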
1313 9.1. Local preference

1315 A variety of local factors influence the Consumer's choice of
1316 Media Streams to be received from the Provider:

1318 o if the Consumer is an Endpoint, it is likely that it would
1319 choose, where possible, to receive video and audio Captures that
1320 match the number of display devices and the audio system it has;

1322 o if the Consumer is a middle box such as an MCU, it may choose to
1323 receive loudest-speaker streams (in order to perform its own
1324 media composition) and avoid pre-composed video Captures;

1326 o user choice (for instance, selection of a new layout) may result
1327 in a different set of Captures, or different encoding
1328 characteristics, being required by the Consumer.

1330 9.2. Physical simultaneity restrictions

1332 There may be physical simultaneity constraints imposed by the
1333 Provider that affect the Provider's ability to simultaneously send
1334 all of the Captures the Consumer would wish to receive. For
1335 instance, a middle box such as an MCU, when connected to a multi-
1336 camera room system, might prefer to receive both individual video
1337 streams of the people present in the room and an overall view of
1338 the room from a single camera. Some Endpoint systems might be
1339 able to provide both of these sets of streams simultaneously,
1340 whereas others may not (if the overall room view were produced by
1341 changing the optical zoom level on the center camera, for
1342 instance).

1344 9.3. Encoding and encoding group limits

1346 Each of the Provider's Encoding Groups has limits on bandwidth and
1347 computational complexity, and the constituent potential encodings
1348 have limits on the bandwidth, computational complexity, video
1349 frame rate, and resolution that can be provided. When choosing
1350 the Captures to be received from a Provider, a Consumer device
1351 must ensure that the encoding characteristics requested for each
1352 individual Capture fit within the capability of the encoding it
1353 is being configured to use, as well as ensuring that the combined
1354 encoding characteristics for all its Captures fit within the
1355 capabilities of their associated Encoding Groups. In some cases,
1356 this could cause an otherwise "preferred" choice of Capture
1357 Encodings to be passed over in favour of different Capture
1358 Encodings - for instance, if a set of three Captures could only be
1359 provided at a low resolution, then a three-screen device could
1360 switch to favoring a single, higher-quality Capture Encoding.
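The following non-normative sketch (in Python; the threshold and
the entry structure are illustrative assumptions, not part of
CLUE) captures that trade-off in its simplest form:

   # Non-normative sketch: prefer a multi-Capture entry only if the
   # encoding limits still allow an acceptable resolution per Capture.
   MIN_ACCEPTABLE_HEIGHT = 720   # hypothetical local policy

   def pick_entry(entries):
       # entries: list of (captures, achievable_height) pairs, most
       # preferred first, as derived from the encoding group limits
       for captures, achievable_height in entries:
           if achievable_height >= MIN_ACCEPTABLE_HEIGHT:
               return captures
       return entries[-1][0]   # fall back to the least preferred entry

   # Three captures only achievable at 544 lines: pick the single
   # higher-quality capture instead.
   assert pick_entry([(["VC0", "VC1", "VC2"], 544),
                      (["VC5"], 1080)]) == ["VC5"]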
1362 10. Extensibility

1364 One of the most important characteristics of the Framework is its
1365 extensibility. Telepresence is a relatively new industry, and
1366 while we can foresee certain directions, we do not know everything
1367 about how it will develop. The standard for interoperability and
1368 handling multiple streams must be future-proof. The framework
1369 itself is inherently extensible through expanding the data model
1370 types. For example:

1372 o Adding more types of media, such as telemetry, can be done by
1373 defining additional types of Captures in addition to audio and
1374 video.

1376 o Adding new functionalities, such as 3-D video, may require
1377 additional attributes describing the Captures.

1379 o Adding new codecs, such as H.265, can be accomplished by
1380 defining new encoding variables.

1382 The infrastructure is designed to be extended rather than
1383 requiring new infrastructure elements. Extension comes through
1384 adding to defined types.

1386 11. Examples - Using the Framework

1388 Edt. note: these examples are currently out of date with respect
1389 to H264Mbps codepoints; this will be fixed in the next release,
1390 once agreement on codec computational complexity has been reached.
1391 Other than that, the examples are still valid.

1393 Edt. note: remove syntax-like details in these examples, and focus
1394 on concepts for this document. Syntax examples with XML should be
1395 in the data model document or a dedicated example document.

1397 This section gives some examples, first from the point of view of
1398 the Provider, then the Consumer.

1400 11.1. Provider Behavior

1402 This section shows, in more detail, some examples of how a
1403 Provider can use the framework to represent a typical case for
1404 telepresence rooms. First an Endpoint is illustrated, then an MCU
1405 case is shown.

1407 11.1.1. Three screen Endpoint Provider

1409 Consider an Endpoint with the following description:

1411 3 cameras, 3 displays, a 6-person table

1413 o Each camera can provide one Capture for each 1/3 section of the
1414 table

1416 o A single Capture representing the active speaker can be provided
1417 (voice-activity-based camera selection to a given encoder input
1418 port, implemented locally in the Endpoint)

1420 o A single Capture representing the active speaker with the other
1421 2 Captures shown picture-in-picture within the stream can be
1422 provided (again, implemented inside the Endpoint)

1424 o A Capture showing a zoomed-out view of all 6 seats in the room
1425 can be provided

1427 The audio and video Captures for this Endpoint can be described as
1428 follows.

1430 Video Captures:

1432 o VC0 - (the camera-left camera stream), encoding group=EG0,
1433 switched=false, view=table

1435 o VC1 - (the center camera stream), encoding group=EG1,
1436 switched=false, view=table

1438 o VC2 - (the camera-right camera stream), encoding group=EG2,
1439 switched=false, view=table

1441 o VC3 - (the loudest panel stream), encoding group=EG1,
1442 switched=true, view=table

1444 o VC4 - (the loudest panel stream with PiPs), encoding group=EG1,
1445 composed=true, switched=true, view=room

1447 o VC5 - (the zoomed-out view of all people in the room), encoding
1448 group=EG1, composed=false, switched=false, view=room

1450 o VC6 - (presentation stream), encoding group=EG1, presentation,
1451 switched=false

1453 The following diagram is a top view of the room with 3 cameras, 3
1454 displays, and 6 seats. Each camera captures 2 people. The six
1455 seats are not all in a straight line.

1457  ,-. d
1458 ( )`--.__                 +---+
1459  `-'      / `--.__        |   |
1460  ,-. |           `-.._    |_-+Camera 2 (VC2)
1461 ( ).'        ___..-+-''`+-+
1462  `-' |_...---''           |   |
1463  ,-.c+-..__               +---+
1464 ( )|        ``--..__      |   |
1465  `-' |              ``+-..|_-+Camera 1 (VC1)
1466  ,-. |       __..--'|+-+
1467 ( )|    __..--'           |   |
1468  `-'b|..--'               +---+
1469  ,-. |``---..___          |   |
1470 ( )\             ```--..._|_-+Camera 0 (VC0)
1471  `-' \               _..-''`-+
1472  ,-.  \      __.--''      |   |
1473 ( )    |..-''             +---+
1474  `-' a

1476 The two points labeled b and c are intended to be at the midpoints
1477 between the seating positions, where the fields of view of the
1478 cameras intersect.

1480 The plane of interest for VC0 is a vertical plane that intersects
1481 points 'a' and 'b'.

1483 The plane of interest for VC1 intersects points 'b' and 'c'. The
1484 plane of interest for VC2 intersects points 'c' and 'd'.

1486 This example uses an area scale of millimeters.
1488 Areas of capture:

1490       bottom left       bottom right      top left          top right
1491 VC0 (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)  (-673,3000,757)
1492 VC1 ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)  ( 673,3000,757)
1493 VC2 (  673,3000,0)   (2011,2850,0)    (  673,3000,757)  (2011,3000,757)
1494 VC3 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1495 VC4 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1496 VC5 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1497 VC6 none

1499 Points of capture:
1500 VC0 (-1678,0,800)
1501 VC1 (0,0,800)
1502 VC2 (1678,0,800)
1503 VC3 none
1504 VC4 none
1505 VC5 (0,0,800)
1506 VC6 none

1508 In this example, the right edge of the VC0 area lines up with the
1509 left edge of the VC1 area, but it doesn't have to be this way -
1510 there could be a gap or an overlap. One additional thing to note
1511 for this example is that the distance from a to b is equal to the
1512 distance from b to c and to the distance from c to d. All these
1513 distances are 1346 mm; this is the planar width of each area of
1514 capture for VC0, VC1, and VC2.
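This can be checked directly from the advertised coordinates. The
following non-normative sketch (in Python) computes the planar
width of each area of capture from its bottom edge:

   # Non-normative sketch: computing the planar width of each area of
   # capture from the advertised bottom-edge coordinates (in mm).
   import math

   def planar_width(bottom_left, bottom_right):
       dx = bottom_right[0] - bottom_left[0]
       dy = bottom_right[1] - bottom_left[1]
       return math.hypot(dx, dy)

   print(planar_width((-2011, 2850, 0), (-673, 3000, 0)))  # VC0: ~1346.4
   print(planar_width((-673, 3000, 0), (673, 3000, 0)))    # VC1: 1346.0
   print(planar_width((673, 3000, 0), (2011, 2850, 0)))    # VC2: ~1346.4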
1516 Note that the text in parentheses (e.g., "the camera-left camera
1517 stream") is not explicitly part of the model; it is just
1518 explanatory text for this example and is not included in the
1519 model with the media captures and attributes. Also, the
1520 "composed" boolean attribute doesn't say anything about how a
1521 capture is composed, so the Media Consumer cannot tell, based on
1522 this attribute, that VC4 is composed of a "loudest panel with
1523 PiPs".

1525 Audio Captures:

1527 o AC0 (camera-left), encoding group=EG3, content=main, channel
1528 format=mono

1530 o AC1 (camera-right), encoding group=EG3, content=main, channel
1531 format=mono

1533 o AC2 (center), encoding group=EG3, content=main, channel
1534 format=mono

1536 o AC3, a simple pre-mixed audio stream from the room (mono),
1537 encoding group=EG3, content=main, channel format=mono

1539 o AC4, the audio stream associated with the presentation video
1540 (mono), encoding group=EG3, content=slides, channel format=mono

1542 Areas of capture:

1544       bottom left       bottom right      top left          top right
1546 AC0 (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)  (-673,3000,757)
1547 AC1 (  673,3000,0)   (2011,2850,0)    (  673,3000,757)  (2011,3000,757)
1548 AC2 ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)  ( 673,3000,757)
1549 AC3 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1550 AC4 none

1552 The physical simultaneity information is:

1554 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1556 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1558 This constraint indicates that it is not possible to use all the
1559 VCs at the same time: VC5 cannot be used at the same time as VC1,
1560 VC3, or VC4. Also, using every member of a set simultaneously may
1561 not make sense - for example, VC3 (loudest) and VC4 (loudest with
1562 PiPs). (In addition, there are encoding constraints that make
1563 choosing all of the VCs in a set impossible: VC1, VC3, VC4, VC5,
1564 and VC6 all use EG1, and EG1 has only 3 ENCs. This constraint
1565 shows up in the encoding groups, not in the simultaneous
1566 transmission sets.)

1568 In this example there are no restrictions on which audio captures
1569 can be sent simultaneously.

1571 Encoding Groups:

1573 This example has three encoding groups associated with the video
1574 captures. Each group can have 3 encodings, but with each
1575 potential encoding having a progressively lower specification. In
1576 this example, 1080p60 transmission is possible (as ENC0 has a
1577 maxPps value compatible with that) as long as it is the only
1578 active encoding in the group (as maxGroupPps for the entire
1579 encoding group is also 124416000). Significantly, as up to 3
1580 encodings are available per group, it is possible to transmit
1581 some video captures simultaneously that are not in the same entry
1582 in the capture scene - for example, VC1 and VC3 at the same time.

1584 It is also possible to transmit multiple capture encodings of a
1585 single video capture: for example, VC0 can be encoded using ENC0
1586 and ENC1 at the same time, as long as the encoding parameters
1587 satisfy the constraints of ENC0, ENC1, and EG0 - such as one
1588 encoding at 1080p30 and the other at 720p30.

1590 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1591   encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1592     maxPps=124416000, maxBandwidth=4000000
1593   encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1594     maxPps=27648000, maxBandwidth=4000000
1595   encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1596     maxPps=15552000, maxBandwidth=4000000
1597 encodeGroupID=EG1, maxGroupPps=124416000, maxGroupBandwidth=6000000
1598   encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1599     maxPps=124416000, maxBandwidth=4000000
1600   encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1601     maxPps=27648000, maxBandwidth=4000000
1602   encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1603     maxPps=15552000, maxBandwidth=4000000
1604 encodeGroupID=EG2, maxGroupPps=124416000, maxGroupBandwidth=6000000
1605   encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1606     maxPps=124416000, maxBandwidth=4000000
1607   encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1608     maxPps=27648000, maxBandwidth=4000000
1609   encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1610     maxPps=15552000, maxBandwidth=4000000

1612 Figure 2: Example Encoding Groups for Video

1614 For audio, there are five potential encodings available, so all
1615 five audio captures can be encoded at the same time.

1617 encodeGroupID=EG3, maxGroupPps=0, maxGroupBandwidth=320000
1618   encodeID=ENC9, maxBandwidth=64000
1619   encodeID=ENC10, maxBandwidth=64000
1620   encodeID=ENC11, maxBandwidth=64000
1621   encodeID=ENC12, maxBandwidth=64000
1622   encodeID=ENC13, maxBandwidth=64000

1624 Figure 3: Example Encoding Group for Audio

1626 Capture Scenes:

1628 The following table represents the Capture Scenes for this
1629 Provider. Recall that a Capture Scene is composed of alternative
1630 Capture Scene Entries covering the same spatial region. Capture
1631 Scene #1 is for the main people captures, and Capture Scene #2 is
1632 for presentation.

1634 Each row in the table is a separate Capture Scene Entry.

1635 +------------------+
1636 | Capture Scene #1 |
1637 +------------------+
1638 | VC0, VC1, VC2    |
1639 | VC3              |
1640 | VC4              |
1641 | VC5              |
1642 | AC0, AC1, AC2    |
1643 | AC3              |
1644 +------------------+

1646 +------------------+
1647 | Capture Scene #2 |
1648 +------------------+
1649 | VC6              |
1650 | AC4              |
1651 +------------------+

1653 Different Capture Scenes are unique to each other and non-
1654 overlapping. A Consumer can choose an entry from each Capture
1655 Scene. In this case the three captures VC0, VC1, and VC2 are one
1656 way of representing the video from the Endpoint; these three
1657 captures should appear adjacent to each other.
1658 Alternatively, another way of representing the Capture Scene is
1659 with the capture VC3, which automatically shows the person who is
1660 talking. Similarly for the VC4 and VC5 alternatives.

1662 As in the video case, the different entries of audio in Capture
1663 Scene #1 represent the "same thing", in that one way to receive
1664 the audio is with the 3 audio captures (AC0, AC1, AC2), and
1665 another way is with the mixed AC3. The Media Consumer can choose
1666 an audio capture entry it is capable of receiving.

1668 The spatial ordering is understood from the media capture
1669 attributes Area of Capture and Point of Capture.

1671 A Media Consumer would likely want to choose a Capture Scene Entry
1672 to receive based in part on how many streams it can simultaneously
1673 receive. A Consumer that can receive three people streams would
1674 probably prefer to receive the first entry of Capture Scene #1
1675 (VC0, VC1, VC2) and not receive the other entries. A Consumer
1676 that can receive only one people stream would probably choose one
1677 of the other entries.

1679 If the Consumer can receive a presentation stream too, it would
1680 also choose to receive the only entry from Capture Scene #2 (VC6).

1682 11.1.2. Encoding Group Example

1684 This is an example of an encoding group, to illustrate how it can
1685 express dependencies between encodings.

1687 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1688   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1689     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1690   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1691     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1692   encodeID=AUDENC0, maxBandwidth=96000
1693   encodeID=AUDENC1, maxBandwidth=96000
1694   encodeID=AUDENC2, maxBandwidth=96000

1696 Here, the encoding group is EG0. It can transmit up to two
1697 1080p30 capture encodings (the Pps value for 1080p30 is
1698 62208000). Each video encoding advertises a maxFrameRate of 60
1699 frames per second (fps), but to achieve the maximum resolution
1700 (1920 x 1088) the frame rate is limited to 30 fps; 60 fps can be
1701 achieved at a lower resolution if required by the Consumer.
1702 Although the encoding group is capable of transmitting up to
1703 6 Mbit/s, no individual video encoding can exceed 4 Mbit/s.

1705 This encoding group also allows up to 3 audio encodings,
1706 AUDENC<0-2>. It is not required that audio and video encodings
1707 reside within the same encoding group, but if they do, then the
1708 group's overall maxBandwidth value is a limit on the sum of all
1709 audio and video encodings configured by the Consumer. A system
1710 that does not wish or need to combine bandwidth limitations in
1711 this way should instead use separate encoding groups for audio
1712 and video, so that the bandwidth limitations on audio and video
1713 do not interact.

1714 Audio and video can be expressed in separate encoding groups, as
1715 in this illustration:

1717 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1718   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1719     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1720   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1721     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1722 encodeGroupID=EG1, maxGroupPps=0, maxGroupBandwidth=500000
1723   encodeID=AUDENC0, maxBandwidth=96000
1724   encodeID=AUDENC1, maxBandwidth=96000
1725   encodeID=AUDENC2, maxBandwidth=96000
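The arithmetic behind "up to two 1080p30 capture encodings" can be
made explicit. The following non-normative sketch (in Python,
using 1920 x 1080 for the pixel count, as the example's Pps
figures do) verifies it against the values above:

   # Non-normative sketch: two 1080p30 encodings exactly fill EG0's
   # maxGroupPps budget, whereas a single 1080p60 encoding would
   # exceed the per-encoding maxPps of VIDENC0/VIDENC1.
   PPS_1080P30 = 1920 * 1080 * 30   # 62208000 pixels per second
   MAX_ENC_PPS = 62208000           # maxPps of VIDENC0 and VIDENC1
   MAX_GROUP_PPS = 124416000        # maxGroupPps of EG0

   assert 2 * PPS_1080P30 <= MAX_GROUP_PPS   # two 1080p30 streams fit
   assert 2 * PPS_1080P30 > MAX_ENC_PPS      # 1080p60 (twice the pixel
                                             # rate) exceeds one encoding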
1727 11.1.3. The MCU Case

1729 This section shows how an MCU might express its Capture Scenes,
1730 intending to offer different choices for Consumers that can handle
1731 different numbers of streams. A single audio capture stream is
1732 provided for all single- and multi-screen configurations; it can
1733 be associated (e.g., lip-synced) with any combination of video
1734 captures at the Consumer.

1736 +--------------------+----------------------------------------------+
1737 | Capture Scene #1   | note                                         |
1739 +--------------------+----------------------------------------------+
1740 | VC0                | video capture for single screen consumer    |
1742 | VC1, VC2           | video capture for 2 screen consumer         |
1744 | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1746 | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1748 | AC0                | audio capture representing all participants |
1750 +--------------------+----------------------------------------------+

1752 If/when a presentation stream becomes active within the
1753 conference, the MCU might re-advertise the available media as:

1755 +------------------+--------------------------------------+
1756 | Capture Scene #2 | note                                 |
1757 +------------------+--------------------------------------+
1758 | VC10             | video capture for presentation       |
1759 | AC1              | presentation audio to accompany VC10 |
1760 +------------------+--------------------------------------+

1762 11.2. Media Consumer Behavior

1764 This section gives an example of how a Media Consumer might behave
1765 when deciding how to request streams from the three-screen
1766 Endpoint described in the previous section.

1768 The receive side of a call needs to balance its requirements
1769 (based on its number of screens and speakers), its decoding
1770 capabilities, and the available bandwidth against the Provider's
1771 capabilities in order to optimally configure the Provider's
1772 streams. Typically it would want to receive and decode media from
1773 each Capture Scene advertised by the Provider.

1775 A sane, basic algorithm might be for the Consumer to go through
1776 each Capture Scene in turn and find the collection of Video
1777 Captures that best matches the number of screens it has (this
1778 might include consideration of screens dedicated to presentation
1779 video display rather than "people" video), and then decide
1780 between alternative entries in the video Capture Scenes based
1781 either on hard-coded preferences or user choice. Once this choice
1782 has been made, the Consumer would then decide how to configure the
1783 Provider's encoding groups in order to make best use of the
1784 available network bandwidth and its own decoding capabilities.

1786 11.2.1. One screen Media Consumer

1788 VC3, VC4, and VC5 are each in a different entry by themselves, not
1789 grouped together in a single entry, so the receiving device should
1790 choose one of them. The choice would come down to whether to see
1791 the greatest number of participants simultaneously at roughly
1792 equal precedence (VC5), a switched view of just the loudest region
1793 (VC3), or a switched view with PiPs (VC4). An endpoint device
1794 with even a small amount of knowledge of these differences could
1795 offer a dynamic choice of these options, in-call, to the user.

1798 11.2.2. Two screen Media Consumer configuring the example

1800 Mixing systems with an even number of screens, "2n", and those
1801 with "2n+1" cameras (and vice versa) is always likely to be the
1802 problematic case.
In this instance, the behavior is likely to be
1803 determined by whether a "2 screen" system is really a "2 decoder"
1804 system, i.e., whether only one received stream can be displayed
1805 per screen, or whether more than 2 streams can be received and
1806 spread across the available screen area. To enumerate 3 possible
1807 behaviors here for the 2 screen system when it learns that the far
1808 end is "ideally" expressed via 3 capture streams:

1810 1. Fall back to receiving just a single stream (VC3, VC4, or VC5,
1811 as per the 1 screen consumer case above) and either leave one
1812 screen blank or use it for presentation if/when a presentation
1813 becomes active.

1815 2. Receive 3 streams (VC0, VC1, and VC2) and display them across 2
1816 screens, either with each capture being scaled to 2/3 of a
1817 screen and the center capture being split across the 2 screens,
1818 or - as would be necessary if there were large bezels on the
1819 screens - with each stream being scaled to 1/2 the screen width
1820 and height and there being a 4th "blank" panel. This 4th panel
1821 could potentially be used for any presentation that became
1822 active during the call.

1824 3. Receive 3 streams, decode all 3, and use control information
1825 indicating which was the most active to switch between showing
1826 the left and center streams (one per screen) and the center and
1827 right streams.

1829 For an endpoint capable of all 3 methods of working described
1830 above, it might again be appropriate to offer the user the choice
1831 of display mode.

1833 11.2.3. Three screen Media Consumer configuring the example

1835 This is the most straightforward case: the Media Consumer would
1836 look to identify a set of streams to receive that best matches its
1837 available screens, so VC0 plus VC1 plus VC2 would match optimally.
1838 The spatial ordering would give sufficient information for the
1839 correct video capture to be shown on the correct screen, and the
1840 Consumer would need either to divide a single encoding group's
1841 capability by 3 to determine what resolution and frame rate to
1842 configure the Provider with, or to configure the individual video
1843 captures' encoding groups with what makes most sense (taking into
1844 account the receive-side decode capabilities, overall call
1845 bandwidth, the resolution of the screens, plus any user
1846 preferences such as motion vs. sharpness).

1848 12. Acknowledgements

1850 Allyn Romanow and Brian Baldino were authors of early versions.
1851 Mark Gorzyinski contributed much to the approach. We want to
1852 thank Stephen Botzko for helpful discussions on audio.

1854 13. IANA Considerations

1856 None.

1858 14. Security Considerations

1860 TBD

1862 15. Changes Since Last Version

1864 NOTE TO THE RFC-Editor: Please remove this section prior to
1865 publication as an RFC.

1867 Changes from 10 to 11:

1869 1. Add description attribute to Media Capture and Capture Scene
1870 Entry.

1872 2. Remove contradiction and change the note about the open issue
1873 regarding always responding to an Advertisement with a Configure
1874 message.

1876 3. Update example section to clean up formatting and make the
1877 media capture attributes and encoding parameters consistent
1878 with the rest of the document.

1880 Changes from 09 to 10:

1882 1. Several minor clarifications, such as about SDP usage, Media
1883 Captures, and the Configure message.

1885 2. Simultaneous Sets can be expressed in terms of Capture Scenes
1886 and Capture Scene Entries.

1888 3. Removed Area of Scene attribute.
1890 4. Add attributes from draft-groves-clue-capture-attr-01.

1892 5. Move some of the Media Capture attribute descriptions back
1893 into this document, but try to leave detailed syntax to the
1894 data model. Remove the OUTSOURCE sections, which are already
1895 incorporated into the data model document.

1897 Changes from 08 to 09:

1899 1. Use "document" instead of "memo".

1901 2. Add basic call flow sequence diagram to introduction.

1903 3. Add definitions for Advertisement and Configure messages.

1905 4. Add definitions for Capture and Provider.

1907 5. Update definition of Capture Scene.

1909 6. Update definition of Individual Encoding.

1911 7. Shorten definition of Media Capture and add key points in the
1912 Media Captures section.

1914 8. Reword a bit about capture scenes in overview.

1916 9. Reword about labeling Media Captures.

1918 10. Remove the Consumer Capability message.

1920 11. New example section heading for media provider behavior.

1922 12. Clarifications in the Capture Scene section.

1924 13. Clarifications in the Simultaneous Transmission Set section.

1926 14. Capitalize defined terms.

1928 15. Move call flow example from introduction to overview section.

1930 16. General editorial cleanup.

1932 17. Add some editors' notes requesting input on issues.

1934 18. Summarize some sections, and propose details be outsourced
1935 to other documents.

1937 Changes from 06 to 07:

1939 1. Ticket #9. Rename Axis of Capture Point attribute to Point
1940 on Line of Capture. Clarify the description of this
1941 attribute.

1943 2. Ticket #17. Add "capture encoding" definition. Use this new
1944 term throughout document as appropriate, replacing some usage
1945 of the terms "stream" and "encoding".

1947 3. Ticket #18. Add Max Capture Encodings media capture
1948 attribute.

1950 4. Add clarification that different capture scene entries are
1951 not necessarily mutually exclusive.

1953 Changes from 05 to 06:

1955 1. Capture scene description attribute is a list of text strings,
1956 each in a different language, rather than just a single string.

1958 2. Add new Axis of Capture Point attribute.

1960 3. Remove appendices A.1 through A.6.

1962 4. Clarify that the provider must use the same coordinate system,
1963 with the same scale and origin, for all coordinates within the
1964 same capture scene.

1966 Changes from 04 to 05:

1968 1. Clarify limitations of "composed" attribute.

1970 2. Add new section "capture scene entry attributes" and add the
1971 attribute "scene-switch-policy".

1973 3. Add capture scene description attribute and description
1974 language attribute.

1976 4. Editorial changes to examples section for consistency with the
1977 rest of the document.

1979 Changes from 03 to 04:

1981 1. Remove sentence from overview - "This constitutes a significant
1982 change ..."

1984 2. Clarify a consumer can choose a subset of captures from a
1985 capture scene entry or a simultaneous set (in sections "capture
1986 scene" and "consumer's choice...").

1988 3. Reword first paragraph of Media Capture Attributes section.

1990 4. Clarify a stereo audio capture is different from two mono audio
1991 captures (description of audio channel format attribute).

1993 5. Clarify what it means when coordinate information is not
1994 specified for area of capture, point of capture, area of scene.

1996 6. Change the term "producer" to "provider" to be consistent (it
1997 was just in two places).

1999 7. Change name of "purpose" attribute to "content" and refer to
2000 RFC4796 for values.
2002 8. Clarify simultaneous sets are part of a provider advertisement,
2003 and apply across all capture scenes in the advertisement.

2005 9. Remove sentence about lip sync between all media captures in a
2006 capture scene.

2008 10. Combine the concepts of "capture scene" and "capture set"
2009 into a single concept, using the term "capture scene" to
2010 replace the previous term "capture set", and eliminating the
2011 original separate capture scene concept.

2013 Informative References

2015 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2016 Requirement Levels", BCP 14, RFC 2119, March 1997.

2018 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G.,
2019 Johnston, A., Peterson, J., Sparks, R., Handley, M.,
2020 and E. Schooler, "SIP: Session Initiation Protocol",
2021 RFC 3261, June 2002.

2024 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
2025 Jacobson, "RTP: A Transport Protocol for Real-Time
2026 Applications", STD 64, RFC 3550, July 2003.

2028 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the
2029 Session Initiation Protocol (SIP)", RFC 4353,
2030 February 2006.

2032 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies",
2033 RFC 5117, January 2008.

2036 16. Authors' Addresses

2038 Mark Duckworth (editor)
2039 Polycom
2040 Andover, MA 01810
2041 USA

2043 Email: mark.duckworth@polycom.com

2045 Andrew Pepperell
2046 Acano
2047 Uxbridge, England
2048 UK

2050 Email: apeppere@gmail.com

2052 Stephan Wenger
2053 Vidyo, Inc.
2054 433 Hackensack Ave.
2055 Hackensack, N.J. 07601
2056 USA

2058 Email: stewe@stewe.org