idnits 2.17.1 

draft-ietf-clue-framework-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1533 has weird spacing: '...om left    bot...'

  == Line 1587 has weird spacing: '...om left    bot...'

  -- The document date (October 19, 2013) is 3841 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC4566' is mentioned on line 1197, but not defined

  ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866)

  == Unused Reference: 'RFC4579' is defined on line 2098, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 5117
     (Obsoleted by RFC 7667)


     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	CLUE WG                                              M. Duckworth, Ed.
2	Internet Draft                                                  Polycom
3	Intended status: Standards Track                           A. Pepperell
4	Expires: April 19, 2014                                           Acano
5	                                                              S. Wenger
6	                                                                  Vidyo
7	                                                       October 19, 2013

9	                Framework for Telepresence Multi-Streams
10	                    draft-ietf-clue-framework-12.txt

12	Abstract

14	   This document defines a framework for a protocol to enable devices
15	   in a telepresence conference to interoperate.  The protocol enables
16	   communication of information about multiple media streams so a
17	   sending system and receiving system can make reasonable decisions
18	   about transmitting, selecting and rendering the media streams.
19	   This protocol is used in addition to SIP signaling for setting up a
20	   telepresence session.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current
30	   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six
33	   months and may be updated, replaced, or obsoleted by other
34	   documents at any time.  It is inappropriate to use Internet-Drafts
35	   as reference material or to cite them other than as "work in
36	   progress."

38	   This Internet-Draft will expire on April 19, 2014.

40	Copyright Notice

42	   Copyright (c) 2013 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (http://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with
50	   respect to this document.  Code Components extracted from this
51	   document must include Simplified BSD License text as described in
52	   Section 4.e of the Trust Legal Provisions and are provided without
53	   warranty as described in the Simplified BSD License.

55	Table of Contents

57	   1. Introduction
58	   2. Terminology
59	   3. Definitions
60	   4. Overview & Motivation
61	   5. Overview of the Framework/Model
62	   6. Spatial Relationships
63	   7. Media Captures and Capture Scenes
64	      7.1. Media Captures
65	         7.1.1. Media Capture Attributes
66	      7.2. Capture Scene
67	         7.2.1. Capture Scene attributes
68	         7.2.2. Capture Scene Entry attributes
69	      7.3. Simultaneous Transmission Set Constraints
70	   8. Encodings
71	      8.1. Individual Encodings
72	      8.2. Encoding Group
73	   9. Associating Captures with Encoding Groups
74	   10. Consumer's Choice of Streams to Receive from the Provider
75	      10.1. Local preference
76	      10.2. Physical simultaneity restrictions
77	      10.3. Encoding and encoding group limits
78	   11. Extensibility
79	   12. Examples - Using the Framework (Informative)
80	      12.1. Provider Behavior
81	         12.1.1. Three screen Endpoint Provider
82	         12.1.2. Encoding Group Example
83	         12.1.3. The MCU Case
84	      12.2. Media Consumer Behavior
85	         12.2.1. One screen Media Consumer
86	         12.2.2. Two screen Media Consumer configuring the example
87	         12.2.3. Three screen Media Consumer configuring the example
88	   13. Acknowledgements
89	   14. IANA Considerations
90	   15. Security Considerations
91	   16. Changes Since Last Version
92	   17. Authors' Addresses

94	1. Introduction

96	   Current telepresence systems, though based on open standards such
97	   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
98	   each other.  A major factor limiting the interoperability of
99	   telepresence systems is the lack of a standardized way to describe
100	   and negotiate the use of the multiple streams of audio and video
101	   comprising the media flows.  This document provides a framework for
102	   protocols to enable interoperability by handling multiple streams
103	   in a standardized way.  The framework is intended to support the
104	   use cases described in draft-ietf-clue-telepresence-use-cases and
105	   to meet the requirements in draft-ietf-clue-telepresence-
106	   requirements.

108	   The basic session setup for the use cases is based on SIP [RFC3261]
109	   and SDP offer/answer [RFC3264]. In addition to basic SIP & SDP
110	   offer/answer, CLUE specific signaling is required to exchange the
111	   information describing the multiple media streams. The motivation
112	   for this framework, an overview of the signaling, and information
113	   required to be exchanged is described in subsequent sections of
114	   this document.  The signaling details and data model are provided
115	   in subsequent documents.

117	2. Terminology

119	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
120	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
121	   this document are to be interpreted as described in RFC 2119
122	   [RFC2119].

124	3. Definitions

126	   The terms defined below are used throughout this document and
127	   companion documents and they are normative.  In order to easily
128	   identify the use of a defined term, those terms are capitalized.

130	   Advertisement: a CLUE message a Media Provider sends to a Media
131	   Consumer describing specific aspects of the content of the media,
132	   the formatting of the media streams it can send, and any
133	   restrictions it has in terms of being able to provide certain
134	   Streams simultaneously.

136	   Audio Capture: Media Capture for audio.  Denoted as ACn in the
137	   example cases in this document.

139	   Camera-Left and Right: For Media Captures, camera-left and camera-
140	   right are from the point of view of a person observing the rendered
141	   media.  They are the opposite of Stage-Left and Stage-Right.

143	   Capture: Same as Media Capture.

145	   Capture Device: A device that converts audio and video input into
146	   an electrical signal, in most cases to be fed into a media encoder.

148	   Capture Encoding: A specific encoding of a Media Capture, to be
149	   sent by a Media Provider to a Media Consumer via RTP.

151	   Capture Scene: a structure representing a spatial region containing
152	   one or more Capture Devices, each capturing media representing a
153	   portion of the region. The spatial region represented by a Capture
154	   Scene may or may not correspond to a real region in physical space,
155	   such as a room.  A Capture Scene includes attributes and one or
156	   more Capture Scene Entries, with each entry including one or more
157	   Media Captures.

159	   Capture Scene Entry: a list of Media Captures of the same media
160	   type that together form one way to represent the entire Capture
161	   Scene.

163	   Conference: used as defined in [RFC4353], A Framework for
164	   Conferencing within the Session Initiation Protocol (SIP).

166	   Configure Message: A CLUE message a Media Consumer sends to a Media
167	   Provider specifying which content and media streams it wants to
168	   receive, based on the information in a corresponding Advertisement
169	   message.

171	   Consumer: short for Media Consumer.

173	   Encoding or Individual Encoding: a set of parameters representing a
174	   way to encode a Media Capture to become a Capture Encoding.

176	   Encoding Group: A set of encoding parameters representing a total
177	   media encoding capability to be sub-divided across potentially
178	   multiple Individual Encodings.

180	   Endpoint: The logical point of final termination through receiving,
181	   decoding and rendering, and/or initiation through capturing,
182	   encoding, and sending of media streams.  An endpoint consists of
183	   one or more physical devices which source and sink media streams,
184	   and exactly one [RFC4353] Participant (which, in turn, includes
185	   exactly one SIP User Agent).  Endpoints can be anything from
186	   multiscreen/multicamera rooms to handheld devices.

188	   Front: the portion of the room closest to the cameras.  Moving
189	   towards the back means moving away from the cameras.

191	   MCU: Multipoint Control Unit (MCU) - a device that connects two or
192	   more endpoints together into one single multimedia conference
193	   [RFC5117].  An MCU includes an [RFC4353]-like Mixer, without the
194	   [RFC4353] requirement to send media to each participant.

196	   Media: Any data that, after suitable encoding, can be conveyed over
197	   RTP, including audio, video or timed text.

199	   Media Capture: a source of Media, such as from one or more Capture
200	   Devices or constructed from other Media streams.

202	   Media Consumer: an Endpoint or middlebox that receives Media
203	   streams.

205	   Media Provider: an Endpoint or middlebox that sends Media streams.

207	   Model: a set of assumptions a telepresence system of a given vendor
208	   adheres to and expects the remote telepresence system(s) also to
209	   adhere to.

211	   Plane of Interest: The spatial plane containing the most relevant
212	   subject matter.

214	   Provider: Same as Media Provider.

216	   Render: the process of generating a representation from Media,
217	   such as displayed motion video or sound emitted from loudspeakers.

219	   Simultaneous Transmission Set: a set of Media Captures that can be
220	   transmitted simultaneously from a Media Provider.

222	   Spatial Relation: The arrangement in space of two objects, in
223	   contrast to relation in time or other relationships.  See also
224	   Camera-Left and Right.

226	   Stage-Left and Right: For Media Captures, Stage-left and Stage-
227	   right are the opposite of Camera-left and Camera-right.  For the
228	   case of a person facing (and captured by) a camera, Stage-left and
229	   Stage-right are from the point of view of that person.

231	   Stream: a Capture Encoding sent from a Media Provider to a Media
232	   Consumer via RTP [RFC3550].

234	   Stream Characteristics: the media stream attributes commonly used
235	   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
236	   resolution, profile/level etc.) as well as CLUE specific
237	   attributes, such as the Capture ID or a spatial location.

239	   Video Capture: Media Capture for video.  Denoted as VCn in the
240	   example cases in this document.

242	   Video Composite: A single image that is formed, normally by an RTP
243	   mixer inside an MCU, by combining visual elements from separate
244	   sources.

246	4. Overview & Motivation

248	   This section provides an overview of the functional elements
249	   defined in this document to represent a telepresence system.  The
250	   motivations for the framework described in this document are also
251	   provided.

253	   Two key concepts introduced in this document are the terms "Media
254	   Provider" and "Media Consumer". A Media Provider represents the
255	   entity that is sending the media and a Media Consumer represents
256	   the entity that is receiving the media. A Media Provider provides
257	   Media in the form of RTP packets; a Media Consumer consumes those
258	   RTP packets.  Media Providers and Media Consumers can reside in
259	   Endpoints or in middleboxes such as Multipoint Control Units
260	   (MCUs).  A Media Provider in an Endpoint is usually associated
261	   with the generation of media for Media Captures; these Media
262	   Captures are typically sourced from cameras, microphones, and the
263	   like.  Similarly, the Media Consumer in an Endpoint is usually
264	   associated with renderers, such as screens and loudspeakers.  In
265	   middleboxes, Media Providers and Consumers can have the form of
266	   outputs and inputs, respectively, of RTP mixers, RTP translators,
267	   and similar devices.  Typically, telepresence devices such as
268	   Endpoints and middleboxes would perform as both Media Providers
269	   and Media Consumers, the former being concerned with those
270	   devices' transmitted media and the latter with those devices'
271	   received media.  In a few circumstances, a CLUE Endpoint or middlebox
272	   includes only Consumer or Provider functionality, such as
273	   recorder-type Consumers or webcam-type Providers.

275	   The motivations for the framework outlined in this document
276	   include the following:

278	   (1) Endpoints in telepresence systems typically have multiple Media
279	   Capture and Media Render devices, e.g., multiple cameras and
280	   screens. While previous system designs were able to set up calls
281	   that would capture media using all cameras and display media on all
282	   screens, for example, there is no mechanism that can associate
283	   these Media Captures with each other in space and time.

285	   (2) The mere fact that there are multiple capture and rendering
286	   devices, each of which may be configurable in aspects such as zoom,
287	   leads to the difficulty that a variable number of such devices can
288	   be used to capture different aspects of a region.  The Capture
289	   Scene concept allows for the description of multiple setups for
290	   those multiple capture devices that could represent sensible
291	   operation points of the physical capture devices in a room, chosen
292	   by the operator.  A Consumer can pick and choose from those
293	   configurations based on its rendering abilities and inform the
294	   Provider about its choices.  Details are provided in section 7.

296	   (3) In some cases, physical limitations or other reasons disallow
297	   the concurrent use of a device in more than one setup.  For
298	   example, the center camera in a typical three-camera conference
299	   room can set its zoom objective either to capture only the middle
300	   few seats, or all seats of a room, but not both concurrently.  The
301	   Simultaneous Transmission Set concept allows a Provider to signal
302	   such limitations.  Simultaneous Transmission Sets are part of the
303	   Capture Scene description, and discussed in section 7.3.

305	   (4) Often, the devices in a room do not have the computational
306	   capacity or connectivity to deal with multiple encoding options
307	   simultaneously, even if each of these options is sensible in
308	   certain scenarios, and even if the simultaneous transmission is
309	   also sensible (e.g., in the case of multicast media distribution
310	   to multiple endpoints).  Such constraints can be expressed by the
311	   Provider using the Encoding Group concept, described in section 8.

313	   (5) Due to the potentially large number of RTP flows required for a
314	   Multimedia Conference involving potentially many Endpoints, each of
315	   which can have many Media Captures and media renderers, it has
316	   become common to multiplex multiple RTP media flows onto the same
317	   transport address, so as to avoid using the port number as a
318	   multiplexing point and the associated shortcomings such as
319	   NAT/firewall traversal.  While the actual mapping of those RTP
320	   flows to the header fields of the RTP packets is not the subject of
321	   this specification, the large number of possible permutations of
322	   sensible options a Media Provider can make available to a Media
323	   Consumer makes desirable a mechanism that narrows down the
324	   number of possible options that a SIP offer-answer exchange has to
325	   consider.  Such information is made available using protocol
326	   mechanisms specified in this document and companion documents,
327	   although it should be stressed that its use in an implementation is
328	   OPTIONAL.  Also, there are aspects of the control of both Endpoints
329	   and middleboxes/MCUs that dynamically change during the progress of
330	   a call, such as audio-level based screen switching, layout changes,
331	   and so on, which need to be conveyed.  Note that these control
332	   aspects are complementary to those specified in traditional SIP
333	   based conference management such as BFCP.  An exemplary call flow
334	   can be found in section 5.

336	   Finally, all this information needs to be conveyed, and the notion
337	   of support for it needs to be established.  This is done by the
338	   negotiation of a "CLUE channel", a data channel negotiated early
339	   during the initiation of a call.  An Endpoint or MCU that rejects
340	   the establishment of this data channel, by definition, is not
341	   supporting CLUE based mechanisms, whereas an Endpoint or MCU that
342	   accepts it is REQUIRED to use it to the extent specified in this
343	   document and its companion documents.

345	5. Overview of the Framework/Model

347	   The CLUE framework specifies how multiple media streams are to be
348	   handled in a telepresence conference.

350	   A Media Provider (transmitting Endpoint or MCU) describes specific
351	   aspects of the content of the media and the formatting of the media
352	   streams it can send in an Advertisement; and the Media Consumer
353	   responds to the Media Provider by specifying which content and
354	   media streams it wants to receive in a Configure message.  The
355	   Provider then transmits the asked-for content in the specified
356	   streams.

358	   This Advertisement and Configure MUST occur during call initiation
359	   but MAY also happen at any time throughout the call, whenever there
360	   is a change in what the Consumer wants to receive or (perhaps less
361	   common) the Provider can send.

363	   An Endpoint or MCU typically acts as both Provider and Consumer at
364	   the same time, sending Advertisements and sending Configure
365	   messages in response to receiving Advertisements.  (It is possible
366	   to be just one or the other.)

368	   The data model is based around two main concepts: a Capture and an
369	   Encoding.  A Media Capture (MC), such as audio or video, describes
370	   the content a Provider can send.  Media Captures are described in
371	   terms of CLUE-defined attributes, such as spatial relationships and
372	   purpose of the capture.  Providers tell Consumers which Media
373	   Captures they can provide, described in terms of the Media Capture
374	   attributes.

376	   A Provider organizes its Media Captures into one or more Capture
377	   Scenes, each representing a spatial region, such as a room.  A
378	   Consumer chooses which Media Captures it wants to receive from each
379	   Capture Scene.

381	   In addition, the Provider can send the Consumer a description of
382	   the Individual Encodings it can send in terms of the media
383	   attributes of the Encodings, in particular, audio and video
384	   parameters such as bandwidth, frame rate, macroblocks per second.
385	   Note that this is OPTIONAL, and intended to minimize the number of
386	   options a later SDP offer-answer would have to include in the SDP
387	   in case of complex setups, as should become clearer shortly when
388	   discussing an outline of the call flow.

390	   The Provider can also specify constraints on its ability to provide
391	   Media, and a sensible design choice for a Consumer is to take these
392	   into account when choosing the content and Capture Encodings it
393	   requests in the later offer-answer exchange.  Some constraints are
394	   due to the physical limitations of devices--for example, a camera
395	   may not be able to provide zoom and non-zoom views simultaneously.
396	   Other constraints are system based, such as maximum bandwidth and
397	   maximum video coding performance measured in macroblocks/second.
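
As an illustration only, the two kinds of constraint described above could be checked by a Consumer along the lines of the Python sketch below. The names `EncodingRequest` and `choice_is_feasible`, the capture identifiers, and the use of a single total bandwidth budget are assumptions made for this sketch; they are not defined by this framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingRequest:
    """One Capture Encoding a Consumer would like (hypothetical shape)."""
    capture_id: str       # e.g. "VC0"; identifiers are illustrative only
    bandwidth_bps: int    # bandwidth the requested encoding would use


def choice_is_feasible(requests, simultaneous_sets, group_max_bandwidth_bps):
    """Return True if the requested captures fit the Provider's constraints.

    requests: list of EncodingRequest
    simultaneous_sets: list of sets of capture ids; all captures within one
        set can be transmitted at the same time (physical constraint).
    group_max_bandwidth_bps: assumed single bandwidth budget (system
        constraint).
    """
    wanted = {r.capture_id for r in requests}
    # Physical constraint: every wanted capture must appear in one common set.
    fits_a_set = any(wanted <= allowed for allowed in simultaneous_sets)
    # System constraint: the summed bandwidth must stay within the budget.
    total = sum(r.bandwidth_bps for r in requests)
    return fits_a_set and total <= group_max_bandwidth_bps
```

For example, if a camera offers its zoomed view as "VC1" and its non-zoomed view as "VC3", a Provider could list the sets {"VC0", "VC1", "VC2"} and {"VC0", "VC3", "VC2"}; a request for both "VC1" and "VC3" then fails the check.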

399	   The following diagram illustrates the information contained in an
400	   Advertisement.

402	   ...................................................................
403	   .  Provider Advertisement                                         .
404	   .                                                                 .
405	   .        +------------------------+   +--------------------+      .
406	   .        |       Capture Scene N  |   | Simultaneous       |      .
407	   .      +-+----------------------+ |   +--------------------+      .
408	   .      |       Capture Scene 2  | |                               .
409	   .    +-+----------------------+ | |      +----------------------+ .
410	   .    |  Capture Scene 1       | | |      |  Encoding Group N    | .
411	   .    |    +---------------+   | | |    +-+--------------------+ | .
412	   .    |    | Attributes    |   | | |    |   Encoding Group 2   | | .
413	   .    |    +---------------+   | | |  +-+--------------------+ | | .
414	   .    |                        | | |  |   Encoding Group 1   | | | .
415	   .    |    +----------------+  | | |  |     parameters       | | | .
416	   .    |    | E n t r i e s  |  | | |  |                      | | | .
417	   .    |    |  +---------+   |  | | |  | +-------------------+| | | .
418	   .    |    |  |Attribute|   |  | | |  | | V i d e o         || | | .
419	   .    |    |  +---------+   |  | | |  | | E n c o d i n g s || | | .
420	   .    |    |                |  | | |  | | Encoding 1        || | | .
421	   .    |    | Entry 1        |  | | |  | | (parameters)      || | | .
422	   .    |    |  (list of MCs) |  | |-+  | +-------------------+| | | .
423	   .    |    +----|-|--|------+  |-+    |                      | | | .
424	   .    +---------|-|--|---------+      | +-------------------+| | | .
425	   .              | |  |                | | A u d i o         || | | .
426	   .              | |  |                | | E n c o d i n g s || | | .
427	   .              v |  |                | | Encoding 1        || | | .
428	   .      +---------|--|--------+       | | (ID,maxBandwidth) || | | .
429	   .      | Media Capture N     |------>| +-------------------+| | | .
430	   .    +-+---------v--|------+ |       |                      | | | .
431	   .    | Media Capture 2     | |       |                      | |-+ .
432	   .  +-+--------------v----+ |-------->|                      | |   .
433	   .  | Media Capture  1    | | |       |                      |-+   .
434	   .  |  +----------------+ |---------->|                      |     .
435	   .  |  | Attributes     | | |_+       +----------------------+     .
436	   .  |  +----------------+ |_+                                      .
437	   .  +---------------------+                                        .
438	   .                                                                 .
439	   ...................................................................
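
The containment relationships in the diagram can be sketched as plain data structures. The Python below is a rough model under stated assumptions: every class and field name is hypothetical, attributes are reduced to string dictionaries, and the actual data model is left to companion documents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MediaCapture:
    capture_id: str                       # e.g. "VC0" or "AC0"
    media_type: str                       # "video" or "audio"
    attributes: Dict[str, str] = field(default_factory=dict)
    encoding_group_id: str = ""           # Encoding Group the capture maps to


@dataclass
class CaptureSceneEntry:
    capture_ids: List[str]                # Media Captures of one media type
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class CaptureScene:
    entries: List[CaptureSceneEntry]
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Encoding:
    encoding_id: str
    max_bandwidth_bps: int


@dataclass
class EncodingGroup:
    group_id: str
    max_bandwidth_bps: int
    video_encodings: List[Encoding] = field(default_factory=list)
    audio_encodings: List[Encoding] = field(default_factory=list)


@dataclass
class Advertisement:
    captures: Dict[str, MediaCapture]
    scenes: List[CaptureScene]
    simultaneous_sets: List[Set[str]]
    encoding_groups: Dict[str, EncodingGroup]

    def dangling_references(self) -> List[str]:
        """List capture or group ids that are referenced but not defined."""
        missing = []
        for scene in self.scenes:
            for entry in scene.entries:
                missing += [c for c in entry.capture_ids
                            if c not in self.captures]
        for cap in self.captures.values():
            group = cap.encoding_group_id
            if group and group not in self.encoding_groups:
                missing.append(group)
        return missing
```

The `dangling_references` helper is one possible sanity check a receiver might run before acting on an Advertisement; it is not required by this document.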

441	   A very brief outline of the call flow used by a simple system (two
442	   Endpoints) in compliance with this document can be described as
443	   follows, and as shown in the following figure.

445	         +-----------+                     +-----------+
446	         | Endpoint1 |                     | Endpoint2 |
447	         +----+------+                     +-----+-----+
448	              | INVITE (BASIC SDP+CLUECHANNEL)   |
449	              |--------------------------------->|
450	              |    200 OK (BASIC SDP+CLUECHANNEL)|
451	              |<---------------------------------|
452	              | ACK                              |
453	              |--------------------------------->|
454	              |                                  |
455	              |<################################>|
456	              |     BASIC SDP MEDIA SESSION      |
457	              |<################################>|
458	              |                                  |
459	              |    CONNECT (CLUE CTRL CHANNEL)   |
460	              |=================================>|
461	              |            ...                   |
462	              |<================================>|
463	              |   CLUE CTRL CHANNEL ESTABLISHED  |
464	              |<================================>|
465	              |                                  |
466	              | ADVERTISEMENT 1                  |
467	              |*********************************>|
468	              |                  ADVERTISEMENT 2 |
469	              |<*********************************|
470	              |                                  |
471	              |                      CONFIGURE 1 |
472	              |<*********************************|
473	              | CONFIGURE 2                      |
474	              |*********************************>|
475	              |                                  |
476	              | REINVITE (UPDATED SDP)           |
477	              |--------------------------------->|
478	              |              200 OK (UPDATED SDP)|
479	              |<---------------------------------|
480	              | ACK                              |
481	              |--------------------------------->|
482	              |                                  |
483	              |<################################>|
484	              |   UPDATED SDP MEDIA SESSION      |
485	              |<################################>|
486	              |                                  |
487	              v                                  v

489	   An initial offer/answer exchange establishes a basic media session,
490	   for example audio-only, and a CLUE channel between two Endpoints.
491	   With the establishment of that channel, the endpoints have
492	   consented to use the CLUE protocol mechanisms and, therefore, MUST
493	   adhere to the CLUE protocol suite as outlined herein.

495	   Over this CLUE channel, the Provider in each Endpoint conveys its
496	   characteristics and capabilities by sending an Advertisement as
497	   specified herein.  The Advertisement is typically not sufficient to
498	   set up all media.  The Consumer in the Endpoint receives the
499	   information provided by the Provider, and can use it for two
500	   purposes.  First, it MUST construct and send a CLUE Configure
501	   message to tell the Provider what the Consumer wishes to receive.
502	   Second, it MAY, but is not necessarily REQUIRED to, use the
503	   information provided to tailor the SDP it is going to send during
504	   the following SIP offer/answer exchange, and its reaction to SDP it
505	   receives in that step.  It is often a sensible implementation
506	   choice to do so, as the representation of the media information
507	   conveyed over the CLUE channel can dramatically cut down on the
508	   size of SDP messages used in the O/A exchange that follows.
509	   Spatial relationships associated with the Media can be included in
510	   the Advertisement, and it is often sensible for the Media Consumer
511	   to take those spatial relationships into account when tailoring the
512	   SDP.

514	   This CLUE exchange MUST be followed by an SDP offer/answer exchange
515	   that not only establishes those aspects of the media that have not
516	   been "negotiated" over CLUE, but also has the side effect of
517	   setting up the media transmission itself, potentially involving
518	   security exchanges, ICE, and so on.  This step is plain vanilla
519	   SIP, with the exception that the SDP used herein, in most (but not
520	   necessarily all) cases can be considerably smaller than the SDP a
521	   system would typically need to exchange if there were no pre-
522	   established knowledge about the Provider and Consumer
523	   characteristics.  (The need for cutting down SDP size is not quite
524	   obvious for a point-to-point call involving simple endpoints;
525	   however, when considering a large multipoint conference involving
526	   many multi-screen/multi-camera endpoints, each of which can operate
527	   using multiple codecs for each camera and microphone, it becomes
528	   perhaps somewhat more intuitive.)
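
The ordering of the steps outlined so far, seen from one Provider/Consumer direction, can be sketched as a small phase checker. This Python is illustrative only: the phase names and the transition table are assumptions drawn from the call flow outline in this section, including the assumption that Advertisements and Configure messages may repeat during a call.

```python
from enum import Enum, auto


class Phase(Enum):
    INITIAL_OFFER_ANSWER = auto()   # INVITE / 200 OK / ACK with basic SDP
    CLUE_CHANNEL_OPEN = auto()      # CLUE control channel established
    ADVERTISED = auto()             # Advertisement received from the Provider
    CONFIGURED = auto()             # Configure sent by the Consumer
    UPDATED_OFFER_ANSWER = auto()   # re-INVITE with updated SDP completed


# Allowed next phases for each current phase (None means "call not started").
_NEXT = {
    None: {Phase.INITIAL_OFFER_ANSWER},
    Phase.INITIAL_OFFER_ANSWER: {Phase.CLUE_CHANNEL_OPEN},
    Phase.CLUE_CHANNEL_OPEN: {Phase.ADVERTISED},
    Phase.ADVERTISED: {Phase.ADVERTISED, Phase.CONFIGURED},
    Phase.CONFIGURED: {Phase.ADVERTISED, Phase.CONFIGURED,
                       Phase.UPDATED_OFFER_ANSWER},
    Phase.UPDATED_OFFER_ANSWER: {Phase.ADVERTISED, Phase.CONFIGURED},
}


def follows_outline(phases):
    """Return True if the phases occur in an order the outline permits."""
    current = None
    for phase in phases:
        if phase not in _NEXT[current]:
            return False
        current = phase
    return True
```

A Configure that arrives before any Advertisement is rejected by this checker, since a Configure is based on the information in a corresponding Advertisement.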

530	   During the lifetime of a call, further exchanges MAY occur over the
531	   CLUE channel.  In some cases, those further exchanges lead to a
532	   modified system behavior of Provider or Consumer (or both) without
533	   any other protocol activity such as further offer/answer exchanges.

535	   For example, voice-activated screen switching, signaled over the
536	   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
537	   re-invites.  However, in other cases, after the CLUE negotiation an
538	   additional offer/answer exchange becomes necessary.  For example,
539	   if both sides decide to upgrade the call from a single screen to a
540	   multi-screen call and more bandwidth is required for the additional
541	   video channels compared to what was previously negotiated using
542	   offer/answer, a new O/A exchange is REQUIRED.

544	   Numerous optimizations are possible, and are the implementer's
545	   choice.  For example, it can be sensible to establish one or more
546	   initial media channels during the initial offer/answer exchange,
547	   which would allow, for example, for a fast startup of audio.
548	   Depending on the system design, it can be possible to re-use this
549	   established channel for more advanced media negotiated only by CLUE
550	   mechanisms, thereby avoiding further offer/answer exchanges.

552	   Edt. note: The editors are not sure whether the mentioned
553	   overloading of established RTP channels using only CLUE messages is
554	   possible, or desired by the WG.  If it were, certainly there is
555	   need for specification work.  One possible issue: a Provider which
556	   thinks that it can switch, say, an audio codec algorithm by CLUE
557	   only, talks to a Consumer which thinks that it has to faithfully
558	   answer the Provider's Advertisement through a Configure, but does
559	   not dare to set up its internal resources until it has received
560	   its authoritative O/A exchange.  Working group input is
561	   solicited.

563	   One aspect of the protocol outlined herein and specified in more
564	   detail in companion documents is that it makes available
565	   information regarding the Provider's capabilities to deliver Media,
566	   and attributes related to that Media such as their spatial
567	   relationship, to the Consumer.  The operation of the renderer
568	   inside the Consumer is unspecified in that it can choose to ignore
569	   some information provided by the Provider, and/or not render media
570	   streams available from the Provider (although it MUST follow the
571	   CLUE protocol and, therefore, MUST gracefully receive and respond
572	   (through a Configure) to the Provider's information).  All CLUE
573	   protocol mechanisms are OPTIONAL in the Consumer in the sense that,
574	   while the Consumer MUST be able to receive (and, potentially,
575	   gracefully acknowledge) CLUE messages, it is free to ignore the
576	   information provided therein.  Obviously, this is not a
577	   particularly sensible design choice in almost all conceivable
578	   cases.

580	   A CLUE-implementing device interoperates with a device that does
581	   not support CLUE because the non-CLUE device, by definition, does
582	   not understand the offer of a CLUE channel in the initial
583	   offer/answer exchange and, therefore, will reject it.  This
584	   rejection MUST be used as the indication to the CLUE-implementing
585	   device that the other side of the communication is not compliant
586	   with CLUE, and to fall back to behavior that does not require CLUE.

588	   As for the media, Provider and Consumer have an end-to-end
589	   communication relationship with respect to (RTP transported) media;
590	   and the mechanisms described herein and in companion documents do
591	   not change the aspects of setting up those RTP flows and sessions.
592	   In other words, the RTP media sessions conform to the negotiated
593	   SDP whether or not CLUE is used.

595	   Edt. note (StW): what's written below is likely correct, but is not
596	   the result of the introduction of CLUE, but rather the result of a
597	   generational overhaul of RTP usage that would have happened with or
598	   without CLUE.  Suggest deleting the sentences below, up to the start
599	   of section 6.  Is having a CLUE RTP Mapping document still the plan?
600	   If yes, we should have a real draft and a real reference.

602	   However, some form of RTP multiplexing is likely to be used by CLUE
603	   devices.  More information about relating RTP flows to CLUE
604	   entities is in the CLUE RTP Mapping document.

606	6. Spatial Relationships

608	   In order for a Consumer to perform a proper rendering, it is often
609	   necessary or at least helpful for the Consumer to have received
610	   spatial information about the streams it is receiving.  CLUE
611	   defines a coordinate system that allows Media Providers to describe
612	   the spatial relationships of their Media Captures to enable proper
613	   scaling and spatially sensible rendering of their streams.  The
614	   coordinate system is based on a few principles:

616	   o  Simple systems which do not have multiple Media Captures to
617	      associate spatially need not use the coordinate model.

619	   o  Coordinates can either be in real, physical units (millimeters),
620	      have an unknown scale or have no physical scale.  Systems which
621	      know their physical dimensions (for example professionally
622	      installed Telepresence room systems) MUST always provide those
623	      real-world measurements.  Systems which don't know specific
624	      physical dimensions but still know relative distances MUST use
625	      'unknown scale'.  'No scale' is intended to be used where Media
626	      Captures from different devices (with potentially different
627	      scales) will be forwarded alongside one another (e.g. in the
628	      case of a middle box).

630	      *  "Millimeters" means the scale is in millimeters.

632	      *  "Unknown" means the scale is not necessarily millimeters, but
633	         the scale is the same for every Capture in the Capture Scene.

635	      *  "No Scale" means the scale could be different for each
636	         capture.  An MCU provider that advertises two adjacent
637	         captures and picks sources (which can change quickly) from
638	         different endpoints might use this value; the scale could be
639	         different and changing for each capture.  But the areas of
640	         capture still represent a spatial relation between captures.

642	   o  The coordinate system is Cartesian X, Y, Z with the origin at a
643	      spatial location of the provider's choosing.  The Provider MUST
644	      use the same coordinate system with same scale and origin for
645	      all coordinates within the same Capture Scene.

647	   The direction of increasing coordinate values is:
648	   X increases from Camera-Left to Camera-Right
649	   Y increases from front to back
650	   Z increases from low to high (i.e. floor to ceiling)

652	7. Media Captures and Capture Scenes

654	   This section describes how Providers can describe the content of
655	   media to Consumers.

657	7.1. Media Captures

659	   Media Captures are the fundamental representations of streams that
660	   a device can transmit.  What a Media Capture actually represents is
661	   flexible:

663	   o  It can represent the immediate output of a physical source (e.g.
664	      camera, microphone) or 'synthetic' source (e.g. laptop computer,
665	      DVD player).

667	   o  It can represent the output of an audio mixer or video composer

669	   o  It can represent a concept such as 'the loudest speaker'

671	   o  It can represent a conceptual position such as 'the leftmost
672	      stream'

674	   To identify and distinguish between multiple instances, video and
675	   audio captures are labeled.  For instance: VC1, VC2 and AC1, AC2,
676	   where  VC1 and VC2 refer to two different video captures and AC1
677	   and AC2 refer to two different audio captures.

679	   Some key points about Media Captures:

681	     . A Media Capture is of a single media type (e.g. audio or
682	        video)
683	     . A Media Capture is associated with exactly one Capture Scene
684	     . A Media Capture is associated with one or more Capture Scene
685	        Entries
686	     . A Media Capture has exactly one set of spatial information
687	     . A Media Capture can be the source of one or more Capture
688	        Encodings

690	   Each Media Capture can be associated with attributes to describe
691	   what it represents.

693	7.1.1. Media Capture Attributes

695	   Media Capture Attributes describe information about the Captures.
696	   A Provider can use the Media Capture Attributes to describe the
697	   Captures for the benefit of the Consumer in the Advertisement
698	   message.  Media Capture Attributes include:

700	     . spatial information, such as point of capture, point on line
701	        of capture, and area of capture, all of which, in combination
702	        define the capture field of, for example, a camera;
703	     . Capture multiplexing information (composed/switched video,
704	        mono/stereo audio, maximum number of simultaneous encodings
705	        per Capture and so on); and

707	     . Other descriptive information to help the Consumer choose
708	        between captures (description, presentation, view, priority,
709	        language, role).
710	     . Control information for use inside the CLUE protocol suite.

712	   Point of Capture:

714	   A field with a single Cartesian (X, Y, Z) point value which
715	   describes the spatial location of the capturing device (such as
716	   camera).

718	   Point on Line of Capture:

720	   A field with a single Cartesian (X, Y, Z) point value which
721	   describes a position in space of a second point on the axis of the
722	   capturing device; the first point being the Point of Capture (see
723	   above).

725	   Together, the Point of Capture and Point on Line of Capture define
726	   an axis of the capturing device, for example the optical axis of a
727	   camera.  The Media Consumer can use this information to adjust how
728	   it renders the received media if it so chooses.
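
As an illustration only, the axis defined by these two points can be
computed as a unit vector.  The following Python sketch is not part of
the CLUE data model: the function name capture_axis, the use of plain
(x, y, z) tuples, and the rule that the two points must differ are all
assumptions made for this example.

```python
import math


def capture_axis(point_of_capture, point_on_line):
    """Return the unit vector pointing from the Point of Capture
    towards the Point on Line of Capture (both are (x, y, z) tuples)."""
    delta = [b - a for a, b in zip(point_of_capture, point_on_line)]
    length = math.sqrt(sum(d * d for d in delta))
    if length == 0:
        # Assumption: identical points cannot define an axis.
        raise ValueError("the two points must differ to define an axis")
    return tuple(d / length for d in delta)


# Hypothetical camera at the origin whose axis runs along increasing Y.
axis = capture_axis((0, 0, 0), (0, 2000, 0))
```

A Consumer that chooses to use this information could, for example,
compare the axes of several cameras to decide how to angle the
rendered images.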

730	   Area of Capture:

732	   A field with a set of four (X, Y, Z) points as a value which
733	   describe the spatial location of what is being "captured".  By
734	   comparing the Area of Capture for different Media Captures within
735	   the same Capture Scene a consumer can determine the spatial
736	   relationships between them and render them correctly.

738	   The four points MUST be co-planar, forming a quadrilateral, which
739	   defines the Plane of Interest for the particular media capture.

741	   If the Area of Capture is not specified, it means the Media Capture
742	   is not spatially related to any other Media Capture.

744	   For a switched capture that switches between different sections
745	   within a larger area, the area of capture MUST use coordinates for
746	   the larger potential area.
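
The co-planarity requirement on the four Area of Capture points could
be checked with a scalar triple product.  The sketch below is
illustrative only: the name is_coplanar, the tuple representation of
points, and the absolute tolerance are assumptions, and a real
implementation would likely scale the tolerance to the coordinate
units in use.

```python
def is_coplanar(p0, p1, p2, p3, tol=1e-6):
    """Return True if four (x, y, z) points lie in one plane.

    The scalar triple product u . (v x w) of the three edge vectors
    leaving p0 is zero exactly when the four points are co-planar.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    u, v, w = sub(p1, p0), sub(p2, p0), sub(p3, p0)
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    triple = u[0] * cross[0] + u[1] * cross[1] + u[2] * cross[2]
    return abs(triple) <= tol


# Hypothetical Area of Capture: a vertical plane 3000 units in front.
flat = [(-2000, 3000, 0), (2000, 3000, 0),
        (-2000, 3000, 1500), (2000, 3000, 1500)]
# Same area with one corner pushed back, so it is no longer planar.
bent = flat[:3] + [(2000, 3500, 1500)]
```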

748	   Mobility of Capture:

750	   This attribute indicates whether or not the point of capture, point
751	   on line of capture, and area of capture values stay the same over
752	   time, or are expected to change (potentially frequently).  Possible
753	   values are static, dynamic, and highly dynamic.

755	   An example of "dynamic" is a camera mounted on a stand which is
756	   occasionally hand-carried and placed at different positions in
757	   order to provide the best angle to capture a work task.  A camera
758	   worn by a participant who moves around the room is an example of
759	   "highly dynamic".  In either case, the effect is that the capture
760	   point, capture axis and area of capture change with time.

762	   The capture point of a static capture MUST NOT move for the life of
763	   the conference.  The capture point of dynamic captures is
764	   characterized by a change in position followed by a reasonable
765	   period of stability, on the order of minutes.  Highly dynamic
766	   captures are characterized by a capture point that is constantly
767	   moving.  If the "area of capture", "point of capture" and "point on
768	   line of capture" attributes are included with dynamic or highly
769	   dynamic captures they indicate spatial information at the time of
770	   the Advertisement.

772	   Composed:

774	   A boolean field which indicates whether or not the Media Capture is
775	   a mix (audio) or composition (video) of streams.

777	   This attribute is useful for a media consumer to avoid nesting a
778	   composed video capture into another composed capture or rendering.
779	   This attribute is not intended to describe the layout a media
780	   provider uses when composing video streams.

782	   Switched:

784	   A boolean field which indicates whether or not the Media Capture
785	   represents the (dynamic) most appropriate subset of a 'whole'.
786	   What is 'most appropriate' is up to the provider and could be the
787	   active speaker, a lecturer or a VIP.

789	   Audio Channel Format:

791	   A field with enumerated values which describes the method of
792	   encoding used for audio. A value of 'mono' means the Audio Capture
793	   has one channel.  'stereo' means the Audio Capture has two audio
794	   channels, left and right.

796	   This attribute applies only to Audio Captures.  A single stereo
797	   capture is different from two mono captures that have a left-right
798	   spatial relationship.  A stereo capture maps to a single Capture
799	   Encoding, while each mono audio capture maps to a separate Capture
800	   Encoding.

802	   Max Capture Encodings:

804	   An optional attribute indicating the maximum number of Capture
805	   Encodings that can be simultaneously active for the Media Capture.
806	   The number of simultaneous Capture Encodings is also limited by the
807	   restrictions of the Encoding Group for the Media Capture.

809	   Description:

811	   Human-readable description of the Capture, which could be in
812	   multiple languages.

814	   Presentation:

816	   This attribute indicates that the capture originates from a
817	   presentation device, that is one that provides supplementary
818	   information to a conference through slides, video, still images,
819	   data etc.  Where more information is known about the capture it MAY
820	   be expanded hierarchically to indicate the different types of
821	   presentation media, e.g. presentation.slides, presentation.image
822	   etc.

824	   Note: It is expected that a number of keywords will be defined that
825	   provide more detail on the type of presentation.

827	   View:

829	   A field with enumerated values, indicating what type of view the
830	   capture relates to.  The Consumer can use this information to help
831	   choose which Media Captures it wishes to receive.  The value MUST
832	   be one of:

834	   Room - Captures the entire scene

836	   Table - Captures the conference table with seated participants

838	   Individual - Captures an individual participant
839	   Lectern - Captures the region of the lectern including the
840	   presenter, for example in a classroom style conference room

842	   Audience - Captures a region showing the audience in a classroom
843	   style conference room

845	   Language:

847	   This attribute indicates one or more languages used in the content
848	   of the media capture.  Captures MAY be offered in different
849	   languages in case of multilingual and/or accessible conferences.  A
850	   Consumer can use this attribute to differentiate between them and
851	   pick the appropriate one.

853	   Note that the Language attribute is defined and meaningful both
854	   for audio and video captures.  In case of audio captures, the
855	   meaning is obvious.  For a video capture, "Language" could, for
856	   example, be sign interpretation or text.

858	   Role:

860	   Edt. Note -- this is a placeholder for a role attribute, as
861	   discussed in draft-groves-clue-capture-attr.  We expect to continue
862	   discussing the role attribute in the context of that draft, and
863	   follow-on drafts, before adding it to this framework document.

865	   Priority:

867	   This attribute indicates a relative priority between different
868	   Media Captures.  The Provider sets this priority, and the Consumer
869	   MAY use the priority to help decide which captures it wishes to
870	   receive.

872	   The "priority" attribute is an integer which indicates a relative
873	   priority between captures. For example it is possible to assign a
874	   priority between two presentation captures that would allow a
875	   remote endpoint to determine which presentation is more important.
876	   Priority is assigned at the individual capture level. It represents
877	   the Provider's view of the relative priority between captures with
878	   a priority. The same priority number MAY be used across multiple
879	   captures, indicating that they are equally important.  If no
880	   priority is assigned, no assumptions can be made regarding the
881	   relative importance of the capture.

883	   Embedded Text:

885	   This attribute indicates that a capture provides embedded textual
886	   information. For example the video capture MAY contain speech to
887	   text information composed with the video image. This attribute is
888	   only applicable to video captures and presentation streams with
889	   visual information.

891	   Related To:

893	   This attribute indicates the capture contains additional
894	   complementary information related to another capture.  The value
895	   indicates the other capture to which this capture is providing
896	   additional information.

898	   For example, a conference can utilize translators or facilitators
899	   that provide an additional audio stream (i.e. a translation or
900	   description or commentary of the conference).  Where multiple
901	   captures are available, it may be advantageous for a Consumer to
902	   select a complementary capture instead of or in addition to a
903	   capture it relates to.

905	7.2. Capture Scene

907	   In order for a Provider's individual Captures to be used
908	   effectively by a Consumer, the provider organizes the Captures into
909	   one or more Capture Scenes, with the structure and contents of
910	   these Capture Scenes being sent from the Provider to the Consumer
911	   in the Advertisement.

913	   A Capture Scene is a structure representing a spatial region
914	   containing one or more Capture Devices, each capturing media
915	   representing a portion of the region.  A Capture Scene includes one
916	   or more Capture Scene entries, with each entry including one or
917	   more Media Captures.  A Capture Scene represents, for example, the
918	   video image of a group of people seated next to each other, along
919	   with the sound of their voices, which could be represented by some
920	   number of VCs and ACs in the Capture Scene Entries.  A middle box
921	   can also describe in Capture Scenes what it constructs from media
922	   Streams it receives.

924	   A Provider MAY advertise one or more Capture Scenes.  What
925	   constitutes an entire Capture Scene is up to the Provider.  A
926	   simple Provider might typically use one Capture Scene for
927	   participant media (live video from the room cameras) and another
928	   Capture Scene for a computer generated presentation.  In more
929	   complex systems, the use of additional Capture Scenes is also
930	   sensible.  For example, a classroom may advertise two Capture
931	   Scenes involving live video, one including only the camera
932	   capturing the instructor (and associated audio), the other
933	   including camera(s) capturing students (and associated audio).

935	   A Capture Scene MAY (and typically will) include more than one type
936	   of media.  For example, a Capture Scene can include several Capture
937	   Scene Entries for Video Captures, and several Capture Scene Entries
938	   for Audio Captures.  A particular Capture MAY be included in more
939	   than one Capture Scene Entry.

941	   A provider MAY express spatial relationships between Captures that
942	   are included in the same Capture Scene.  However, there is not
943	   necessarily the same spatial relationship between Media Captures
944	   that are in different Capture Scenes.  In other words, Capture
945	   Scenes can use their own spatial measurement system as outlined
946	   above in section 6.

948	   A Provider arranges Captures in a Capture Scene to help the
949	   Consumer choose which captures it wants to render.  The Capture
950	   Scene Entries in a Capture Scene are different alternatives the
951	   Provider is suggesting for representing the Capture Scene.  The
952	   order of Capture Scene Entries within a Capture Scene has no
953	   significance.  The Media Consumer can choose to receive all Media
954	   Captures from one Capture Scene Entry for each media type (e.g.
955	   audio and video), or it can pick and choose Media Captures
956	   regardless of how the Provider arranges them in Capture Scene
957	   Entries.  Different Capture Scene Entries of the same media type
958	   are not necessarily mutually exclusive alternatives.  Also note
959	   that the presence of multiple Capture Scene Entries (with
960	   potentially multiple encoding options in each entry) in a given
961	   Capture Scene does not necessarily imply that a Provider is able to
962	   serve all the associated media simultaneously (although the
963	   construction of such an over-rich Capture Scene is probably not
964	   sensible in many cases).  What a Provider can send simultaneously
965	   is determined through the Simultaneous Transmission Set mechanism,
966	   described in section 7.3.

968	   Captures within the same Capture Scene entry MUST be of the same
969	   media type - it is not possible to mix audio and video captures in
970	   the same Capture Scene Entry, for instance.  The Provider MUST be
971	   capable of encoding and sending all Captures in a single Capture
972	   Scene Entry simultaneously.  The order of Captures within a Capture
973	   Scene Entry has no significance.  A Consumer can decide to receive
974	   all the Captures in a single Capture Scene Entry, but a Consumer
975	   could also decide to receive just a subset of those captures.  A
976	   Consumer can also decide to receive Captures from different Capture
977	   Scene Entries, all subject to the constraints set by Simultaneous
978	   Transmission Sets, as discussed in section 7.3.

980	   When a Provider advertises a Capture Scene with multiple entries,
981	   it is essentially signaling that there are multiple representations
982	   of the same Capture Scene available.  In some cases, these multiple
983	   representations would typically be used simultaneously (for
984	   instance a "video entry" and an "audio entry").  In some cases the
985	   entries would conceptually be alternatives (for instance an entry
986	   consisting of three Video Captures covering the whole room versus
987	   an entry consisting of just a single Video Capture covering only
988	   the center of a room).  In this latter example, one sensible choice
989	   for a Consumer would be to indicate (through its Configure and
990	   possibly through an additional offer/answer exchange) the Captures
991	   of that Capture Scene Entry that most closely match the
992	   Consumer's number of display devices or screen layout.

994	   The following is an example of 4 potential Capture Scene Entries
995	   for an endpoint-style Provider:

997	   1.  (VC0, VC1, VC2) - left, center and right camera Video Captures

999	   2.  (VC3) - Video Capture associated with loudest room segment

1001	   3.  (VC4) - Video Capture zoomed out view of all people in the room

1003	   4.  (AC0) - main audio

1005	   The first entry in this Capture Scene example is a list of Video
1006	   Captures which have a spatial relationship to each other.
1007	   Determination of the order of these captures (VC0, VC1 and VC2) for
1008	   rendering purposes is accomplished through use of their Area of
1009	   Capture attributes.  The second entry (VC3) and the third entry
1010	   (VC4) are alternative representations of the same room's video,
1011	   which might be better suited to some Consumers' rendering
1012	   capabilities.  The inclusion of the Audio Capture in the same
1013	   Capture Scene indicates that AC0 is associated with all of those
1014	   Video Captures, meaning it comes from the same spatial region.
1015	   Therefore, if audio were to be rendered at all, this audio would be
1016	   the correct choice irrespective of which Video Captures were
1017	   chosen.
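
The example above can be modeled in a few lines to show how a Consumer
might choose among the entries.  This Python sketch is one possible
reading, not the CLUE data model: the class and function names, the
area_x field (a simplified stand-in for the X extent of the Area of
Capture) and its values, and the "closest capture count" selection
rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Capture:
    name: str
    media: str                                   # "video" or "audio"
    area_x: Optional[Tuple[int, int]] = None     # hypothetical X extent


@dataclass
class SceneEntry:
    captures: List[Capture]

    def media(self) -> str:
        # Captures within one entry MUST be of the same media type.
        kinds = {c.media for c in self.captures}
        if len(kinds) != 1:
            raise ValueError("an entry must hold a single media type")
        return kinds.pop()


def pick_video_entry(entries, screens):
    """Pick the video entry whose capture count is closest to `screens`."""
    video = [e for e in entries if e.media() == "video"]
    return min(video, key=lambda e: abs(len(e.captures) - screens))


def camera_left_to_right(entry):
    """Order captures by the low-X edge of their area (X grows from
    Camera-Left to Camera-Right)."""
    return sorted(entry.captures, key=lambda c: c.area_x[0])


ENTRIES = [
    # Listed out of order on purpose: order within an entry has no
    # significance, so rendering order comes from the areas instead.
    SceneEntry([Capture("VC2", "video", (1000, 3000)),
                Capture("VC0", "video", (-3000, -1000)),
                Capture("VC1", "video", (-1000, 1000))]),
    SceneEntry([Capture("VC3", "video")]),
    SceneEntry([Capture("VC4", "video")]),
    SceneEntry([Capture("AC0", "audio")]),
]

three_screen = pick_video_entry(ENTRIES, screens=3)
ordered = [c.name for c in camera_left_to_right(three_screen)]
```

With three screens this sketch selects the first entry and orders its
captures by area; with one screen it returns the first single-capture
entry it finds, although a real Consumer would also weigh attributes
such as Switched or View when choosing between VC3 and VC4.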

1019	7.2.1. Capture Scene attributes

1021	   Capture Scene Attributes can be applied to Capture Scenes as well
1022	   as to individual media captures.  Attributes specified at this
1023	   level apply to all constituent Captures.  Capture Scene attributes
1024	   include

1026	     . Human-readable description of the Capture Scene, which could
1027	        be in multiple languages;
1028	     . Scale information (millimeters, unknown, no scale), as
1029	        described in Section 6.

1031	7.2.2. Capture Scene Entry attributes

1033	   A Capture Scene can include one or more Capture Scene Entries in
1034	   addition to the Capture Scene wide attributes described above.
1035	   Capture Scene Entry attributes apply to the Capture Scene Entry as
1036	   a whole, i.e. to all Captures that are part of the Capture Scene
1037	   Entry.

1039	   Capture Scene Entry attributes include:

1041	     . Human-readable description of the Capture Scene Entry, which
1042	        could be in multiple languages;
1043	     . Scene-switch-policy: {site-switch, segment-switch}

1045	   A media provider uses this scene-switch-policy attribute to
1046	   indicate its support for different switching policies.  If a
1047	   provider supports both policies, it MAY advertise separate Capture
1048	   Scene Entries containing separate Captures, each entry with a
1049	   separate scene-switch-policy value.  If the provider does not
1050	   support any of these policies, it MUST omit this attribute.

1052	   The "site-switch" policy means all captures are switched at the
1053	   same time to keep captures from the same endpoint site together.
1054	   Let's say the speaker is at site A and everyone else is at a
1055	   "remote" site.

1057	   When the room at site A is shown, all the camera images from site A
1058	   are forwarded to the remote sites.  Therefore at each receiving
1059	   remote site, all the screens display camera images from site A.

1061	   This can be used to preserve full size image display, and also
1062	   provide full visual context of the displayed far end, site A. In
1063	   site switching, there is a fixed relation between the cameras in
1064	   each room and the displays in remote rooms.  The room or
1065	   participants being shown can be switched from time to time based
1066	   on, for example, who is speaking or by manual control.

1068	   The "segment-switch" policy means different captures can switch at
1069	   different times, and can be coming from different endpoints.  Still
1070	   using site A as where the speaker is, and "remote" to refer to all
1071	   the other sites, in segment switching, rather than sending all the
1072	   images from site A, only the image containing the speaker at site A
1073	   is shown.  The camera images of the current speaker and previous
1074	   speakers (if any) are forwarded to the other sites in the
1075	   conference.

1077	   Therefore the screens in each site are usually displaying images
1078	   from different remote sites - the current speaker at site A and the
1079	   previous ones.  This strategy can be used to preserve full size
1080	   image display, and also capture the non-verbal communication
1081	   between the speakers.  In segment switching, the display depends on
1082	   the activity in the remote rooms - generally, but not necessarily
1083	   based on audio / speech detection.

1085	7.3. Simultaneous Transmission Set Constraints

1087	   In many practical cases, a Provider has constraints or limitations
1088	   on its ability to send Captures simultaneously.  One type of
1089	   limitation is caused by the physical limitations of capture
1090	   mechanisms; these constraints are represented by a simultaneous
1091	   transmission set.  The second type of limitation reflects the
1092	   encoding resources available, such as bandwidth or video encoding
1093	   throughput (macroblocks/second).  This type of constraint is
1094	   captured by encoding groups, discussed below.

1096	   Some Endpoints or MCUs can send multiple Captures simultaneously;
1097	   however, sometimes there are constraints that limit which Captures
1098	   can be sent simultaneously with other Captures.  A device may not
1099	   be able to be used in different ways at the same time.  Provider
1100	   Advertisements are made so that the Consumer can choose one of
1101	   several possible mutually exclusive usages of the device.  This
1102	   type of constraint is expressed in a Simultaneous Transmission Set,
1103	   which lists all the Captures of a particular media type (e.g.
1104	   audio, video, text) that can be sent at the same time.  There are
1105	   different Simultaneous Transmission Sets for each media type in the
1106	   Advertisement.  This is easier to show in an example.

1108	   Consider the example of a room system where there are three cameras,
1109	   each of which can send a separate capture covering two persons:
1110	   VC0, VC1, VC2.  The middle camera can also zoom out (using an
1111	   optical zoom lens) and show all six persons, VC3.  But the middle
1112	   camera cannot be used in both modes at the same time - it has to
1113	   either show the space where two participants sit or the whole six
1114	   seats, but not both at the same time.  As a result, VC1 and VC3
1115	   cannot be sent simultaneously.

1117	   Simultaneous transmission sets are expressed as sets of the Media
1118	   Captures that the Provider could transmit at the same time (though,
1119	   in some cases, it is not intuitive to do so).  In this example the
1120	   two simultaneous sets are shown in Table 1.  If a Provider
1121	   advertises one or more mutually exclusive Simultaneous Transmission
1122	   Sets, then for each media type the Consumer MUST ensure that it
1123	   chooses Media Captures that lie wholly within one of those
1124	   Simultaneous Transmission Sets.

1126	                           +-------------------+
1127	                           | Simultaneous Sets |
1128	                           +-------------------+
1129	                           | {VC0, VC1, VC2}   |
1130	                           | {VC0, VC3, VC2}   |
1131	                           +-------------------+

1133	                Table 1: Two Simultaneous Transmission Sets
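
The Consumer-side rule for Table 1 amounts to a subset test: the
chosen Media Captures of one media type have to fit entirely inside at
least one advertised set.  The following Python sketch assumes the
sets are available as plain sets of capture names; the names
SIMULTANEOUS_SETS and choice_is_allowed are hypothetical.

```python
# The two video Simultaneous Transmission Sets from Table 1.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2"},
    {"VC0", "VC3", "VC2"},
]


def choice_is_allowed(chosen, simultaneous_sets):
    """Return True if every chosen capture lies within one single set."""
    chosen = set(chosen)
    return any(chosen <= allowed for allowed in simultaneous_sets)
```

For instance, choosing VC0 and VC3 is allowed because both appear in
the second set, while choosing VC1 and VC3 is not, since no single set
contains both uses of the middle camera.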

1135	   A Provider OPTIONALLY can include the simultaneous sets in its
1136	   provider Advertisement.  These simultaneous set constraints apply
1137	   across all the Capture Scenes in the Advertisement.  It is a syntax
1138	   conformance requirement that the simultaneous transmission sets
1139	   MUST allow all the media captures in any particular Capture Scene
1140	   Entry to be used simultaneously.

1142	   For shorthand convenience, a Provider MAY describe a Simultaneous
1143	   Transmission Set in terms of Capture Scene Entries and Capture
1144	   Scenes.  If a Capture Scene Entry is included in a Simultaneous
1145	   Transmission Set, then all Media Captures in the Capture Scene
1146	   Entry are included in the Simultaneous Transmission Set.  If a
1147	   Capture Scene is included in a Simultaneous Transmission Set, then
1148	   all its Capture Scene Entries (of the corresponding media type) are
1149	   included in the Simultaneous Transmission Set.  The end result
1150	   reduces to a set of Media Captures in either case.
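
The shorthand expansion described above can be sketched as follows.
This is an illustration under stated assumptions, not a defined CLUE
syntax: the dictionary layout, the scene and entry names, the function
name expand, and the assumption that entry names are unique across
scenes are all hypothetical.

```python
# Hypothetical advertisement: scene name -> entry name -> (media, captures).
SCENES = {
    "Scene1": {
        "Entry1": ("video", ["VC0", "VC1", "VC2"]),
        "Entry2": ("video", ["VC3"]),
        "Entry3": ("audio", ["AC0"]),
    },
}


def expand(members, scenes, media_type):
    """Reduce a set written with scene, entry, or capture names to the
    plain set of Media Capture names it stands for."""
    entries = {name: entry
               for scene in scenes.values()
               for name, entry in scene.items()}
    captures = set()
    for member in members:
        if member in scenes:
            # A whole Capture Scene: all its entries of this media type.
            for kind, caps in scenes[member].values():
                if kind == media_type:
                    captures.update(caps)
        elif member in entries:
            # A Capture Scene Entry: all Media Captures in that entry.
            captures.update(entries[member][1])
        else:
            # Anything else is taken to be a Media Capture name.
            captures.add(member)
    return captures
```

Under these assumptions, expand(["Scene1"], SCENES, "video") yields
VC0, VC1, VC2 and VC3, and leaves out AC0 because it is an audio
capture.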

1152	   If an Advertisement does not include Simultaneous Transmission
1153	   Sets, then the Provider MUST be able to provide all Capture Scenes
1154	   simultaneously.  If multiple Capture Scene Entries are in a Capture
1155	   Scene then the Consumer chooses at most one Capture Scene Entry per
1156	   Capture Scene for each media type.

1158	   If an Advertisement includes multiple Capture Scene Entries in a
1159	   Capture Scene then the Consumer MAY choose one Capture Scene Entry
1160	   for each media type, or MAY choose individual Captures based on the
1161	   Simultaneous Transmission Sets.

1163	8. Encodings

1165	   Individual encodings and encoding groups are CLUE's mechanisms
1166	   allowing a Provider to signal its limitations for sending Captures,
1167	   or combinations of Captures, to a Consumer.  Consumers can map the
1168	   Captures they want to receive onto the Encodings, with encoding
1169	   parameters they want.  As for the relationship between the CLUE-
1170	   specified mechanisms based on Encodings and the SIP Offer-Answer
1171	   exchange, please refer to section 4.

1173	8.1. Individual Encodings

1175	   An Individual Encoding represents a way to encode a Media Capture
1176	   to become a Capture Encoding, to be sent as an encoded media stream
1177	   from the Provider to the Consumer.  An Individual Encoding has a
1178	   set of parameters characterizing how the media is encoded.

1180	   Different media types have different parameters, and different
1181	   encoding algorithms may have different parameters.  An Individual
1182	   Encoding can be assigned to at most one Capture Encoding at any
1183	   given time.

1185	   The parameters of an Individual Encoding represent the maximum
1186	   values for certain aspects of the encoding.  A particular
1187	   instantiation into a Capture Encoding MAY use lower values than
1188	   these maximums if that is applicable for the media in question.
1189	   For example, most video codec specifications require a conformant
1190	   decoder to decode resolutions and frame rates smaller than what has
1191	   been negotiated as a maximum, so downgrading the CLUE maximum
1192	   values for macroblocks/second is appropriate.  On the other hand,
1193	   downgrading the sample rate of G.711 audio below 8kHz is not
1194	   specified in G.711 and therefore not applicable in the sense
1195	   described here.

1197	   Individual Encoding parameters are represented in SDP [RFC4566],
1198	   not in CLUE messages.  For example, for a video encoding using
1199	   H.26x compression technologies, this can include parameters such
1200	   as:

1202	     . Maximum bandwidth;
1203	     . Maximum picture size in pixels;
1204	     . Maximum number of pixels to be processed per second.

1206	   The bandwidth parameter is the only one that specifically relates
1207	   to a CLUE Advertisement, as it can be further constrained by the
1208	   maximum group bandwidth in an Encoding Group.

1210	8.2. Encoding Group

1212	   An Encoding Group includes a set of one or more Individual
1213	   Encodings, and parameters that apply to the group as a whole.  By
1214	   grouping multiple individual Encodings together, an Encoding Group
1215	   describes additional constraints on bandwidth for the group.

1217	   The Encoding Group data structure contains:

1219	     . Maximum bitrate for all encodings in the group combined;
1220	     . A list of identifiers for audio and video encodings,
1221	        respectively, belonging to the group.

1223	   When the Individual Encodings in a group are instantiated into
1224	   Capture Encodings, each Capture Encoding has a bitrate that MUST be
1225	   less than or equal to the max bitrate for the particular individual
1226	   encoding.  The "maximum bitrate for all encodings in the group"
1227	   parameter gives the additional restriction that the sum of all the
1228	   individual capture encoding bitrates MUST be less than or equal to
1229	   this group value.
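
   The two bitrate rules above can be sketched as follows.  This is
   a non-normative illustration: the function name, the dictionary
   layout and the numeric values are assumptions made for the
   example.

```python
# Illustrative only: identifiers and values are hypothetical.

def bitrates_allowed(requested, max_per_encoding, max_group_bitrate):
    """requested and max_per_encoding map an encoding ID to bits/second."""
    within_each = all(
        rate <= max_per_encoding[enc_id] for enc_id, rate in requested.items()
    )
    within_group = sum(requested.values()) <= max_group_bitrate
    return within_each and within_group

limits = {"ENC0": 4_000_000, "ENC1": 4_000_000}
fits = bitrates_allowed(
    {"ENC0": 4_000_000, "ENC1": 2_000_000}, limits, 6_000_000)
too_much = bitrates_allowed(
    {"ENC0": 4_000_000, "ENC1": 3_000_000}, limits, 6_000_000)
```

   In the second call each Capture Encoding respects its own maximum,
   but the sum (7 Mbit/s) exceeds the group value, so the request is
   not allowed.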

1231	   The following diagram illustrates one example of the structure of a
1232	   media provider's Encoding Groups and their contents.

1234	   ,-------------------------------------------------.
1235	   |             Media Provider                      |
1236	   |                                                 |
1237	   |  ,--------------------------------------.       |
1238	   |  | ,--------------------------------------.     |
1239	   |  | | ,--------------------------------------.   |
1240	   |  | | |          Encoding Group              |   |
1241	   |  | | | ,-----------.                        |   |
1242	   |  | | | |           | ,---------.            |   |
1243	   |  | | | |           | |         | ,---------.|   |
1244	   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
1245	   |  `.| | |           | |         | `---------'|   |
1246	   |    `.| `-----------' `---------'            |   |
1247	   |      `--------------------------------------'   |
1248	   `-------------------------------------------------'

1250	                    Figure 1: Encoding Group Structure

1252	   A Provider advertises one or more Encoding Groups.  Each Encoding
1253	   Group includes one or more Individual Encodings.  Each Individual
1254	   Encoding can represent a different way of encoding media.  For
1255	   example one Individual Encoding may be 1080p60 video, another could
1256	   be 720p30, with a third being CIF, all in, for example, H.264
1257	   format.

1259	   While a typical three codec/display system might have one Encoding
1260	   Group per "codec box" (physical codec, connected to one camera and
1261	   one screen), there are many possibilities for the number of
1262	   Encoding Groups a Provider may be able to offer and for the
1263	   encoding values in each Encoding Group.

1265	   There is no requirement for all Encodings within an Encoding Group
1266	   to be instantiated at the same time.

1268	9. Associating Captures with Encoding Groups

1270	   Every Capture MUST be associated with at least one Encoding Group,
1271	   which is used to instantiate that Capture into one or more Capture
1272	   Encodings.  More than one Capture MAY use the same Encoding Group.

1274	   The maximum number of streams that can result from a particular
1275	   Encoding Group constraint is equal to the number of individual
1276	   Encodings in the group.  The actual number of Capture Encodings
1277	   used at any time MAY be less than this maximum.  Any of the
1278	   Captures that use a particular Encoding Group can be encoded
1279	   according to any of the Individual Encodings in the group.  If
1280	   there are multiple Individual Encodings in the group, then the
1281	   Consumer can configure the Provider, via a Configure message, to
1282	   encode a single Media Capture into multiple different Capture
1283	   Encodings at the same time, subject to the Max Capture Encodings
1284	   constraint, with each capture encoding following the constraints of
1285	   a different Individual Encoding.

1287	   It is a protocol conformance requirement that the Encoding Groups
1288	   MUST allow all the Captures in a particular Capture Scene Entry to
1289	   be used simultaneously.
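
   A simplified check of this requirement might look like the
   following sketch.  It assumes, for illustration only, that each
   Capture is associated with exactly one Encoding Group and needs
   one Individual Encoding from it; bandwidth limits are ignored.
   All names and values are hypothetical.

```python
from collections import Counter

# Illustrative only: assumes one Encoding Group per Capture.

def entry_fits_encoding_groups(entry, capture_group, encodings_per_group):
    """True if every Capture in the entry can get its own encoding."""
    needed = Counter(capture_group[capture] for capture in entry)
    return all(
        count <= encodings_per_group[group]
        for group, count in needed.items()
    )

capture_group = {"VC0": "EG0", "VC1": "EG1", "VC2": "EG1"}
encodings_per_group = {"EG0": 1, "EG1": 1}

valid = entry_fits_encoding_groups(
    ["VC0", "VC1"], capture_group, encodings_per_group)
invalid = entry_fits_encoding_groups(
    ["VC1", "VC2"], capture_group, encodings_per_group)
```

   An entry containing VC1 and VC2 would need two encodings from a
   group that offers only one, so an Advertisement with that entry
   would not satisfy the requirement.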

1291	10. Consumer's Choice of Streams to Receive from the Provider

1293	   After receiving the Provider's Advertisement message (that includes
1294	   media captures and associated constraints), the Consumer composes
1295	   its reply to the Provider in the form of a Configure message.  The
1296	   Consumer is free to use the information in the Advertisement as it
1297	   chooses, but there are a few obviously sensible design choices,
1298	   which are outlined below.

1300	   If multiple Providers connect to the same Consumer (i.e., in an
1301	   MCU-less multiparty call), it is the responsibility of the Consumer
1302	   to compose a Configure for each Provider that fulfills both that
1303	   Provider's constraints, as expressed in its Advertisement, and the
1304	   Consumer's own capabilities.

1306	   In an MCU-based multiparty call, the MCU can logically terminate
1307	   the Advertisement/Configure negotiation in that it can hide the
1308	   characteristics of the receiving endpoint and rely on its own
1309	   capabilities (transcoding/transrating/...) to create Media Streams
1310	   that can be decoded at the Endpoint Consumers.  The timing of an
1311	   MCU's sending of Advertisements (for its outgoing ports) and
1312	   Configures (for its incoming ports, in response to Advertisements
1313	   received there) is up to the MCU and implementation dependent.

1315	   As a general outline, a Consumer can choose, based on the
1316	   Advertisement it has received, which Captures it wishes to receive,
1317	   and which Individual Encodings it wants the Provider to use to
1318	   encode the Captures.  Each Capture has an Encoding Group ID
1319	   attribute which specifies which Individual Encodings are available
1320	   to be used for that Capture.

1322	   A Configure Message includes a list of Capture Encodings.  These
1323	   are the Capture Encodings the Consumer wishes to receive from the
1324	   Provider.  Each Capture Encoding refers to one Media Capture, one
1325	   Individual Encoding, and includes the encoding parameter values.  A
1326	   Configure Message does not include references to Capture Scenes or
1327	   Capture Scene Entries.

1329	   For each Capture the Consumer wants to receive, it configures one
1330	   or more of the encodings in that capture's encoding group.  The
1331	   Consumer does this by telling the Provider, in its Configure
1332	   Message, parameters such as the resolution, frame rate, bandwidth,
1333	   etc. for each Capture Encoding of its chosen Captures.  Upon
1334	   receipt of this Configure from the Consumer, common knowledge is
1335	   established between Provider and Consumer regarding sensible
1336	   choices for the media streams and their parameters.  The setup of
1337	   the actual media channels, at least in the simplest case, is left
1338	   to a following offer-answer exchange.  Optimized implementations
1339	   MAY speed up the reaction to the offer-answer exchange by reserving
1340	   the resources at the time of finalization of the CLUE handshake.

1342	   Edt. Note (StW): is the sentence below still correct?

1344	   Even more advanced devices MAY choose to establish media streams
1345	   without an offer-answer exchange, for example by overloading
1346	   existing 5-tuple connections with the negotiated media.

1348	   In order to meaningfully create and send an initial Configure, the
1349	   Consumer needs to have received at least one Advertisement from the
1350	   Provider.

1352	   In addition, the Consumer can send a Configure at any time during
1353	   the call.  The Configure MUST be valid according to the most
1354	   recently received Advertisement.  The Consumer can send a Configure
1355	   either in response to a new Advertisement from the Provider or on
1356	   its own, for example because of a local change in conditions
1357	   (people leaving the room, connectivity changes, multipoint related
1358	   considerations).

1360	   When choosing which Media Streams to receive from the Provider, and
1361	   the encoding characteristics of those Media Streams, the Consumer
1362	   advantageously takes several things into account: its local
1363	   preference, simultaneity restrictions, and encoding limits.

1365	10.1. Local preference

1367	   A variety of local factors influence the Consumer's choice of
1368	   Media Streams to be received from the Provider:

1370	   o  if the Consumer is an Endpoint, it is likely that it would
1371	      choose, where possible, to receive video and audio Captures that
1372	      match the number of display devices and audio system it has

1374	   o  if the Consumer is a middle box such as an MCU, it MAY choose to
1375	      receive loudest speaker streams (in order to perform its own
1376	      media composition) and avoid pre-composed video Captures

1378	   o  user choice (for instance, selection of a new layout) MAY result
1379	      in a different set of Captures, or different encoding
1380	      characteristics, being required by the Consumer

1382	10.2. Physical simultaneity restrictions

1384	   Often there are physical simultaneity constraints of the Provider
1385	   that affect the Provider's ability to simultaneously send all of
1386	   the captures the Consumer would wish to receive.  For instance, a
1387	   middle box such as an MCU, when connected to a multi-camera room
1388	   system, might prefer to receive both individual video streams of
1389	   the people present in the room and an overall view of the room
1390	   from a single camera.  Some Endpoint systems might be able to
1391	   provide both of these sets of streams simultaneously, whereas
1392	   others might not (if the overall room view were produced by
1393	   changing the optical zoom level on the center camera, for
1394	   instance).

1396	10.3. Encoding and encoding group limits

1398	   Each of the Provider's encoding groups has limits on bandwidth and
1399	   computational complexity, and the constituent potential encodings
1400	   have limits on the bandwidth, computational complexity, video
1401	   frame rate, and resolution that can be provided.  When choosing
1402	   the Captures to be received from a Provider, a Consumer device
1403	   MUST ensure that the encoding characteristics requested for each
1404	   individual Capture fit within the capability of the encoding it
1405	   is being configured to use, as well as ensuring that the combined
1406	   encoding characteristics for Captures fit within the capabilities
1407	   of their associated encoding groups.  In some cases, this could
1408	   cause an otherwise "preferred" choice of capture encodings to be
1409	   passed over in favor of different Capture Encodings--for instance,
1410	   if a set of three Captures could only be provided at a low
1411	   resolution then a three screen device could switch to favoring a
1412	   single, higher quality, Capture Encoding.
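
   The per-encoding part of this check can be sketched as below.
   The parameter names follow the notation used in the examples of
   Section 12; treating the pixel rate as width times height times
   frame rate is an assumption made for this illustration, as are
   the function name and the values.

```python
# Illustrative only: values are hypothetical.

def video_request_fits(request, encoding):
    """Compare a requested Capture Encoding with an encoding's maxima."""
    pixels_per_second = (
        request["width"] * request["height"] * request["frame_rate"]
    )
    return (
        request["width"] <= encoding["maxWidth"]
        and request["height"] <= encoding["maxHeight"]
        and request["frame_rate"] <= encoding["maxFrameRate"]
        and pixels_per_second <= encoding["maxPps"]
        and request["bandwidth"] <= encoding["maxBandwidth"]
    )

enc = {"maxWidth": 1280, "maxHeight": 720, "maxFrameRate": 30,
       "maxPps": 27648000, "maxBandwidth": 4000000}
hd_30 = {"width": 1280, "height": 720, "frame_rate": 30,
         "bandwidth": 2000000}
hd_60 = {"width": 1280, "height": 720, "frame_rate": 60,
         "bandwidth": 2000000}
```

   With these values, the 30 frames per second request fits the
   encoding, while the 60 frames per second request exceeds both the
   frame rate and the pixel rate maxima.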

1414	11. Extensibility

1416	   One important characteristic of the Framework is its
1417	   extensibility.  Telepresence is a relatively new industry and
1418	   while we can foresee certain directions, we also do not know
1419	   everything about how it will develop.  The standard for
1420	   interoperability and handling multiple streams must be future-
1421	   proof. The framework itself is inherently extensible through
1422	   expanding the data model types.  For example:

1424	   o  Adding more types of media, such as telemetry, can be done by
1425	      defining additional types of Captures in addition to audio and
1426	      video.

1428	   o  Adding new functionalities, such as 3-D, may require
1429	      additional attributes describing the Captures.

1431	   o  Adding new codecs, such as H.265, can be accomplished by
1432	      defining new encoding variables.

1434	   The infrastructure is designed to be extended rather than
1435	   requiring new infrastructure elements.  Extension comes through
1436	   adding to defined types.

1438	12. Examples - Using the Framework (Informative)

1440	   This section gives some examples, first from the point of view of
1441	   the Provider, then the Consumer.

1443	12.1. Provider Behavior

1445	   This section shows some examples in more detail of how a Provider
1446	   can use the framework to represent a typical case for telepresence
1447	   rooms.  First an endpoint is illustrated, then an MCU case is
1448	   shown.

1450	12.1.1. Three screen Endpoint Provider

1452	   Consider an Endpoint with the following description:

1454	   3 cameras, 3 displays, a 6 person table

1456	   o  Each camera can provide one Capture for each 1/3 section of the
1457	      table

1459	   o  A single Capture representing the active speaker can be provided
1460	      (voice activity based camera selection to a given encoder input
1461	      port implemented locally in the Endpoint)

1463	   o  A single Capture representing the active speaker with the other
1464	      2 Captures shown picture in picture within the stream can be
1465	      provided (again, implemented inside the endpoint)

1467	   o  A Capture showing a zoomed out view of all 6 seats in the room
1468	      can be provided

1470	   The audio and video Captures for this Endpoint can be described as
1471	   follows.

1473	   Video Captures:

1475	   o  VC0- (the camera-left camera stream), encoding group=EG0,
1476	      switched=false, view=table

1478	   o  VC1- (the center camera stream), encoding group=EG1,
1479	      switched=false, view=table

1481	   o  VC2- (the camera-right camera stream), encoding group=EG2,
1482	      switched=false, view=table

1484	   o  VC3- (the loudest panel stream), encoding group=EG1,
1485	      switched=true, view=table

1487	   o  VC4- (the loudest panel stream with PiPs), encoding group=EG1,
1488	      composed=true, switched=true, view=room

1490	   o  VC5- (the zoomed out view of all people in the room), encoding
1491	      group=EG1, composed=false, switched=false, view=room

1493	   o  VC6- (presentation stream), encoding group=EG1, presentation,
1494	      switched=false

1496	   The following diagram is a top view of the room with 3 cameras, 3
1497	   displays, and 6 seats.  Each camera is capturing 2 people.  The
1498	   six seats are not all in a straight line.

1500	      ,-. d
1501	     (   )`--.__        +---+
1502	      `-' /     `--.__  |   |
1503	    ,-.  |            `-.._ |_-+Camera 2 (VC2)
1504	   (   ).'        ___..-+-''`+-+
1505	    `-' |_...---''      |   |
1506	    ,-.c+-..__          +---+
1507	   (   )|     ``--..__  |   |
1508	    `-' |             ``+-..|_-+Camera 1 (VC1)
1509	    ,-. |            __..--'|+-+
1510	   (   )|     __..--'   |   |
1511	    `-'b|..--'          +---+
1512	    ,-. |``---..___     |   |
1513	   (   )\          ```--..._|_-+Camera 0 (VC0)
1514	    `-'  \             _..-''`-+
1515	     ,-. \      __.--'' |   |
1516	    (   ) |..-''        +---+
1517	     `-' a

1519	   The two points labeled b and c are intended to be at the midpoint
1520	   between the seating positions, and where the fields of view of the
1521	   cameras intersect.

1523	   The plane of interest for VC0 is a vertical plane that intersects
1524	   points 'a' and 'b'.

1526	   The plane of interest for VC1 intersects points 'b' and 'c'. The
1527	   plane of interest for VC2 intersects points 'c' and 'd'.

1529	   This example uses an area scale of millimeters.

1531	   Areas of capture:

1533	       bottom left    bottom right  top left         top right
1534	   VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
1535	   VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
1536	   VC2 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
1537	   VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1538	   VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1539	   VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1540	   VC6 none

1542	   Points of capture:
1543	   VC0 (-1678,0,800)
1544	   VC1 (0,0,800)
1545	   VC2 (1678,0,800)
1546	   VC3 none
1547	   VC4 none
1548	   VC5 (0,0,800)
1549	   VC6 none

1551	   In this example, the right edge of the VC0 area lines up with the
1552	   left edge of the VC1 area.  It doesn't have to be this way.  There
1553	   could be a gap or an overlap.  One additional thing to note for
1554	   this example is the distance from a to b is equal to the distance
1555	   from b to c and the distance from c to d.  All these distances are
1556	   1346 mm. This is the planar width of each area of capture for VC0,
1557	   VC1, and VC2.
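
   The 1346 mm figure can be reproduced from the bottom corners of
   the areas of capture listed above.  The short calculation below is
   only a worked check; the variable and function names are
   illustrative.

```python
import math

# (x, y) coordinates in millimeters of points a, b, c and d, taken from
# the bottom corners of the areas of capture for VC0, VC1 and VC2.
a, b, c, d = (-2011, 2850), (-673, 3000), (673, 3000), (2011, 2850)

def planar_width(p, q):
    """Euclidean distance between two points in the horizontal plane."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

# Rounded to the nearest millimeter: a-b, b-c and c-d.
widths = [round(planar_width(p, q)) for p, q in ((a, b), (b, c), (c, d))]
```

   Each of the three rounded widths comes out as 1346.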

1559	   Note that the text in parentheses (e.g., "the camera-left camera
1560	   stream") is not part of the model; it is explanatory text for
1561	   this example only and is not included with the media captures
1562	   and attributes.  Also, the
1563	   "composed" boolean attribute doesn't say anything about how a
1564	   capture is composed, so the media consumer can't tell based on
1565	   this attribute that VC4 is composed of a "loudest panel with
1566	   PiPs".

1568	   Audio Captures:

1570	   o  AC0 (camera-left), encoding group=EG3, content=main, channel
1571	      format=mono

1573	   o  AC1 (camera-right), encoding group=EG3, content=main, channel
1574	      format=mono

1576	   o  AC2 (center) encoding group=EG3, content=main, channel
1577	      format=mono

1579	   o  AC3 being a simple pre-mixed audio stream from the room (mono),
1580	      encoding group=EG3, content=main, channel format=mono

1582	   o  AC4 audio stream associated with the presentation video (mono)
1583	      encoding group=EG3, content=slides, channel format=mono

1585	   Areas of capture:

1587	       bottom left    bottom right  top left         top right

1589	   AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
1590	   AC1 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,2850,757)
1591	   AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
1592	   AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,2850,757)
1593	   AC4 none

1595	   The physical simultaneity information is:

1597	      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1599	      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1601	   This constraint indicates it is not possible to use all the VCs at
1602	   the same time.  VC5 cannot be used at the same time as VC1 or VC3
1603	   or VC4.  Also, using every member in the set simultaneously may
1604	   not make sense - for example VC3 (loudest) and VC4 (loudest with
1605	   PiP).  (In addition, there are encoding constraints that make
1606	   choosing all of the VCs in a set impossible.  VC1, VC3, VC4, VC5,
1607	   VC6 all use EG1 and EG1 has only 3 ENCs.  This constraint shows up
1608	   in the encoding groups, not in the simultaneous transmission
1609	   sets.)

1611	   In this example there are no restrictions on which audio captures
1612	   can be sent simultaneously.

1614	   Encoding Groups:

1616	   This example has three encoding groups associated with the video
1617	   captures.  Each group can have 3 encodings, but with each
1618	   potential encoding having a progressively lower specification.  In
1619	   this example, 1080p60 transmission is possible (as ENC0 has a
1620	   maxPps value compatible with that).  Significantly, as up to 3
1621	   encodings are available per group, it is possible to transmit some
1622	   video captures simultaneously that are not in the same entry in
1623	   the capture scene.  For example VC1 and VC3 at the same time.

1625	   It is also possible to transmit multiple capture encodings of a
1626	   single video capture.  For example VC0 can be encoded using ENC0
1627	   and ENC1 at the same time, as long as the encoding parameters
1628	   satisfy the constraints of ENC0, ENC1, and EG0, such as one at
1629	   4000000 bps and one at 2000000 bps.

1631	   encodeGroupID=EG0, maxGroupBandwidth=6000000
1632	       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1633	                      maxPps=124416000, maxBandwidth=4000000

1635	       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1636	                      maxPps=27648000, maxBandwidth=4000000
1637	       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1638	                      maxPps=15552000, maxBandwidth=4000000
1639	   encodeGroupID=EG1, maxGroupBandwidth=6000000
1640	       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1641	                      maxPps=124416000, maxBandwidth=4000000
1642	       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1643	                      maxPps=27648000, maxBandwidth=4000000
1644	       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1645	                      maxPps=15552000, maxBandwidth=4000000
1646	   encodeGroupID=EG2, maxGroupBandwidth=6000000
1647	       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1648	                      maxPps=124416000, maxBandwidth=4000000
1649	       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1650	                      maxPps=27648000, maxBandwidth=4000000
1651	       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1652	                      maxPps=15552000, maxBandwidth=4000000

1654	                Figure 2: Example Encoding Groups for Video
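
   As a worked check, the maxPps values in Figure 2 equal the pixel
   rates of 1920x1080 at 60 frames per second, 1280x720 at 30, and
   960x540 at 30.  Note that this uses heights of 1080 and 540 rather
   than the maxHeight values of 1088 and 544; the reason is presumed
   here (an assumption, not stated in the text) to be that maxHeight
   is rounded up to a multiple of 16.

```python
# Illustrative arithmetic only; the function name is hypothetical.

def pixel_rate(width, height, fps):
    """Pixels per second for a given picture size and frame rate."""
    return width * height * fps

rates = {
    "ENC0": pixel_rate(1920, 1080, 60),
    "ENC1": pixel_rate(1280, 720, 30),
    "ENC2": pixel_rate(960, 540, 30),
}
```

   The resulting values are 124416000, 27648000 and 15552000, which
   match the maxPps entries for ENC0, ENC1 and ENC2.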

1656	   For audio, there are five potential encodings available, so all
1657	   five audio captures can be encoded at the same time.

1659	   encodeGroupID=EG3, maxGroupBandwidth=320000
1660	       encodeID=ENC9, maxBandwidth=64000
1661	       encodeID=ENC10, maxBandwidth=64000
1662	       encodeID=ENC11, maxBandwidth=64000
1663	       encodeID=ENC12, maxBandwidth=64000
1664	       encodeID=ENC13, maxBandwidth=64000

1666	                Figure 3: Example Encoding Group for Audio

1668	   Capture Scenes:

1670	   The following table represents the capture scenes for this
1671	   provider. Recall that a capture scene is composed of alternative
1672	   capture scene entries covering the same spatial region.  Capture
1673	   Scene #1 is for the main people captures, and Capture Scene #2 is
1674	   for presentation.

1676	   Each row in the table is a separate Capture Scene Entry.

1678	                           +------------------+
1679	                           | Capture Scene #1 |
1680	                           +------------------+
1681	                           | VC0, VC1, VC2    |
1682	                           | VC3              |
1683	                           | VC4              |
1684	                           | VC5              |
1685	                           | AC0, AC1, AC2    |
1686	                           | AC3              |
1687	                           +------------------+

1689	                           +------------------+
1690	                           | Capture Scene #2 |
1691	                           +------------------+
1692	                           | VC6              |
1693	                           | AC4              |
1694	                           +------------------+

1696	   Different capture scenes are distinct from each other and do not
1697	   overlap.  A consumer can choose an entry from each capture
1698	   scene.  In this case the three captures VC0, VC1, and VC2 are one
1699	   way of representing the video from the endpoint.  These three
1700	   captures should appear adjacent to each other.
1701	   Alternatively, another way of representing the Capture Scene is
1702	   with the capture VC3, which automatically shows the person who is
1703	   talking.  Similarly for the VC4 and VC5 alternatives.

1705	   As in the video case, the different entries of audio in Capture
1706	   Scene #1 represent the "same thing", in that one way to receive
1707	   the audio is with the 3 audio captures (AC0, AC1, AC2), and
1708	   another way is with the mixed AC3.  The Media Consumer can choose
1709	   an audio capture entry it is capable of receiving.

1711	   The spatial ordering is understood by the media capture attributes
1712	   Area of Capture and Point of Capture.

1714	   A Media Consumer would likely want to choose a capture scene entry
1715	   to receive based in part on how many streams it can simultaneously
1716	   receive.  A consumer that can receive three people streams would
1717	   probably prefer to receive the first entry of Capture Scene #1
1718	   (VC0, VC1, VC2) and not receive the other entries.  A consumer
1719	   that can receive only one people stream would probably choose one
1720	   of the other entries.

1722	   If the consumer can receive a presentation stream too, it would
1723	   also choose to receive the only entry from Capture Scene #2 (VC6).

1725	12.1.2. Encoding Group Example

1727	   This is an example of an encoding group to illustrate how it can
1728	   express dependencies between encodings.

1730	   encodeGroupID=EG0 maxGroupBandwidth=6000000
1731	       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1732	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1733	       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1734	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1735	       encodeID=AUDENC0, maxBandwidth=96000
1736	       encodeID=AUDENC1, maxBandwidth=96000
1737	       encodeID=AUDENC2, maxBandwidth=96000

1739	   Here, the encoding group is EG0.  Although the encoding group is
1740	   capable of transmitting up to 6Mbit/s, no individual video
1741	   encoding can exceed 4Mbit/s.

1743	   This encoding group also allows up to 3 audio encodings, AUDENC0
1744	   through AUDENC2.  It is not required that audio and video encodings
1745	   reside within the same encoding group, but if so then the group's
1746	   maxGroupBandwidth value is a limit on the sum of all audio and video
1747	   encodings configured by the consumer.  A system that does not wish
1748	   or need to combine bandwidth limitations in this way should
1749	   instead use separate encoding groups for audio and video in order
1750	   for the bandwidth limitations on audio and video to not interact.

1752	   Audio and video can be expressed in separate encoding groups, as
1753	   in this illustration.

1755	   encodeGroupID=EG0 maxGroupBandwidth=6000000
1756	       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1757	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1758	       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1759	         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1760	   encodeGroupID=EG1 maxGroupBandwidth=500000
1761	       encodeID=AUDENC0, maxBandwidth=96000
1762	       encodeID=AUDENC1, maxBandwidth=96000
1763	       encodeID=AUDENC2, maxBandwidth=96000

1765	12.1.3. The MCU Case

1767	   This section shows how an MCU might express its Capture Scenes,
1768	   intending to offer different choices for consumers that can handle
1769	   different numbers of streams.  A single audio capture stream is
1770	   provided for all single and multi-screen configurations that can
1771	   be associated (e.g. lip-synced) with any combination of video
1772	   captures at the consumer.

1774	   +--------------------+---------------------------------------------+
1775	   | Capture Scene #1   | note                                        |
1777	   +--------------------+---------------------------------------------+
1778	   | VC0                | video capture for single screen consumer    |
1780	   | VC1, VC2           | video capture for 2 screen consumer         |
1782	   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1784	   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1786	   | AC0                | audio capture representing all participants |
1788	   +--------------------+---------------------------------------------+

1790	   If / when a presentation stream becomes active within the
1791	   conference the MCU might re-advertise the available media as:

1793	        +------------------+--------------------------------------+
1794	        | Capture Scene #2 | note                                 |
1795	        +------------------+--------------------------------------+
1796	        | VC10             | video capture for presentation       |
1797	        | AC1              | presentation audio to accompany VC10 |
1798	        +------------------+--------------------------------------+

1800	12.2. Media Consumer Behavior

1802	   This section gives an example of how a Media Consumer might behave
1803	   when deciding how to request streams from the three screen
1804	   endpoint described in the previous section.

1806	   The receive side of a call needs to balance its requirements,
1807	   based on number of screens and speakers, its decoding capabilities
1808	   and available bandwidth, and the provider's capabilities in order
1809	   to optimally configure the provider's streams.  Typically it would
1810	   want to receive and decode media from each Capture Scene
1811	   advertised by the Provider.

1813	   A sane, basic algorithm might be for the consumer to go through
1814	   each Capture Scene in turn and find the collection of Video
1815	   Captures that best matches the number of screens it has (this
1816	   might include consideration of screens dedicated to presentation
1817	   video display rather than "people" video) and then decide between
1818	   alternative entries in the video Capture Scenes based either on
1819	   hard-coded preferences or user choice.  Once this choice has been
1820	   made, the consumer would then decide how to configure the
1821	   provider's encoding groups in order to make best use of the
1822	   available network bandwidth and its own decoding capabilities.
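
   The first step of the algorithm above can be sketched as follows.
   This is only an illustration under stated assumptions: the
   SceneEntry class and pick_entry function are hypothetical names
   that do not come from this document, and the entry list holds only
   the video entries of the three screen endpoint that this section
   mentions.  The tie-break stands in for "hard-coded preferences or
   user choice".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneEntry:
    """Hypothetical model of one Capture Scene Entry (an alternative)."""
    captures: tuple  # capture identifiers, e.g. ("VC0", "VC1", "VC2")


def pick_entry(entries, screens, preferred=None):
    """Return the entry whose capture count best matches `screens`.

    Entries that need more captures than there are screens rank below
    every entry that fits. Among equally good entries, `preferred` wins
    if it is one of them; otherwise the first listed entry is used.
    """
    def score(entry):
        n = len(entry.captures)
        return (n > screens, abs(screens - n))

    best = min(score(e) for e in entries)
    candidates = [e for e in entries if score(e) == best]
    if preferred in candidates:
        return preferred
    return candidates[0]


# Video entries mentioned in this section for the three screen endpoint.
video_entries = [
    SceneEntry(("VC0", "VC1", "VC2")),
    SceneEntry(("VC3",)),
    SceneEntry(("VC4",)),
    SceneEntry(("VC5",)),
]

three_screen_choice = pick_entry(video_entries, screens=3)
one_screen_choice = pick_entry(video_entries, screens=1,
                               preferred=video_entries[3])
two_screen_choice = pick_entry(video_entries, screens=2)
```

   With these inputs a two screen consumer falls back to a single
   capture entry, which is only one of several reasonable behaviors.
   Configuring the encoding groups is a separate, later step that
   this sketch does not cover.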

1824	12.2.1. One screen Media Consumer

1826	   VC3, VC4 and VC5 are each separate entries, not grouped together
1827	   in a single entry, so the receiving device should choose one of
1828	   them.  The choice would come down to whether to see the greatest
1829	   number of participants simultaneously at roughly equal precedence
1830	   (VC5), a switched view of just the loudest region (VC3) or a
1831	   switched view with PiPs (VC4).  An endpoint device with a small
1832	   amount of knowledge of these differences could offer the user a
1833	   dynamic, in-call choice between these options.

1836	12.2.2. Two screen Media Consumer configuring the example

1838	   Mixing systems with an even number of screens, "2n", and those
1839	   with "2n+1" cameras (and vice versa) is always likely to be the
1840	   problematic case.  In this instance, the behavior is likely to be
1841	   determined by whether a "2 screen" system is really a "2 decoder"
1842	   system, i.e., whether only one received stream can be displayed
1843	   per screen or whether more than 2 streams can be received and
1844	   spread across the available screen area.  Three possible
1845	   behaviors for the 2 screen system, when it learns that the far
1846	   end is "ideally" expressed via 3 capture streams, are:

1848	   1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as
1849	      per the 1 screen consumer case above) and either leave one
1850	      screen blank or use it for presentation if / when a
1851	      presentation becomes active.

1853	   2. Receive 3 streams (VC0, VC1 and VC2) and display across 2
1854	      screens, either with each capture being scaled to 2/3 of a
1855	      screen and the center capture being split across 2 screens or,
1856	      as would be necessary if there were large bezels on the
1857	      screens, with each stream being scaled to 1/2 the screen width
1858	      and height and there being a 4th "blank" panel.  This 4th panel
1859	      could potentially be used for any presentation that became
1860	      active during the call.

1862	   3. Receive 3 streams, decode all 3, and use control information
1863	      indicating which was the most active to switch between showing
1864	      the left and center streams (one per screen) and the center and
1865	      right streams.

1867	   For an endpoint capable of all 3 methods of working described
1868	   above, again it might be appropriate to offer the user the choice
1869	   of display mode.
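
   The geometry of behavior 2 (first variant) and the switching rule
   of behavior 3 can be sketched as follows.  This is a minimal
   illustration, not a prescribed method: the function names are
   hypothetical, and the code assumes that VC0, VC1 and VC2 are the
   left, center and right captures respectively.  Horizontal
   positions are expressed in units of one screen width, so the two
   screens together span the interval from 0 to 2.

```python
from fractions import Fraction


def spread_three_over_two():
    """Behavior 2: place three captures side by side across two screens.

    Each capture is scaled to 2/3 of a screen width. Returns a mapping
    from capture name to its (left, right) horizontal extent.
    """
    width = Fraction(2, 3)
    names = ("VC0", "VC1", "VC2")  # assumed left, center, right
    return {name: (i * width, (i + 1) * width)
            for i, name in enumerate(names)}


def visible_pair(most_active, current):
    """Behavior 3: choose which two decoded captures to show.

    `most_active` is the capture that control information reports as
    most active. `current` is the pair shown now; it is kept when the
    center capture is the active one, since both pairs include it.
    """
    if most_active == "VC0":
        return ("VC0", "VC1")
    if most_active == "VC2":
        return ("VC1", "VC2")
    return current


layout = spread_three_over_two()
after_right_speaks = visible_pair("VC2", current=("VC0", "VC1"))
after_center_speaks = visible_pair("VC1", current=after_right_speaks)
```

   In the computed layout the center capture covers 2/3 to 4/3, so it
   straddles the boundary between the two screens at position 1.
   This is why large bezels make the variant unattractive.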

1871	12.2.3. Three screen Media Consumer configuring the example

1873	   This is the most straightforward case.  The Media Consumer would
1874	   look to identify a set of streams to receive that best matches its
1875	   available screens, so VC0 plus VC1 plus VC2 should match
1876	   optimally.  The spatial ordering would give sufficient information
1877	   for the correct video capture to be shown on the correct screen.
1878	   The consumer would then need either to divide a single encoding
1879	   group's capability by 3 to determine what resolution and frame
1880	   rate to configure the provider with, or to configure the individual
1881	   video captures' encoding groups with what makes most sense (taking
1882	   into account the receive side decode capabilities, overall call
1883	   bandwidth, the resolution of the screens plus any user preferences
1884	   such as motion vs sharpness).
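
   The "divide by 3" option can be illustrated with a small worked
   example.  It is a sketch only.  It assumes the capability being
   divided is a bandwidth limit; the function name, parameter names
   and all figures are hypothetical and are not taken from this
   document.  Mapping the resulting budget to a resolution and frame
   rate is left to the consumer.

```python
def per_capture_budget(group_max_bps, captures, decoder_max_bps,
                       call_max_bps):
    """Split one encoding group's bandwidth evenly across the captures.

    The group limit is first capped by the overall call bandwidth, then
    divided evenly. The per-capture share is finally capped by what the
    receive side can decode for a single stream.
    """
    share = min(group_max_bps, call_max_bps) // captures
    return min(share, decoder_max_bps)


# Hypothetical figures: a 6 Mbit/s encoding group, a 4.5 Mbit/s call
# and a decoder limited to 4 Mbit/s per stream.
budget = per_capture_budget(group_max_bps=6_000_000, captures=3,
                            decoder_max_bps=4_000_000,
                            call_max_bps=4_500_000)
```

   Here the call bandwidth, not the encoding group, is the binding
   limit, so each of VC0, VC1 and VC2 would be configured for
   1.5 Mbit/s.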

1886	13. Acknowledgements

1888	   Allyn Romanow and Brian Baldino were authors of early versions.
1889	   Mark Gorzynski contributed much to the approach.  We want to
1890	   thank Stephen Botzko for helpful discussions on audio.

1892	14. IANA Considerations

1894	   None.

1896	15. Security Considerations

1898	   TBD

1900	16. Changes Since Last Version

1902	   NOTE TO THE RFC-Editor: Please remove this section prior to
1903	   publication as an RFC.

1905	   Changes from 11 to 12:

1907	     1. Ticket #44. Remove note questioning whether to require a
1908	        Consumer to send a Configure after receiving Advertisement.

1910	     2. Ticket #43. Remove ability for consumer to choose value of
1911	        attribute for scene-switch-policy.

1913	     3. Ticket #36. Remove computational complexity parameter,
1914	        MaxGroupPps, from Encoding Groups.

1916	     4. Reword the Abstract and parts of sections 1 and 4 (now 5)
1917	        based on Mary's suggestions as discussed on the list.  Move
1918	        part of the Introduction into a new section Overview &
1919	        Motivation.

1921	     5. Add diagram of an Advertisement, in the Overview of the
1922	        Framework/Model section.

1924	     6. Change Intended Status to Standards Track.

1926	     7. Clean up RFC2119 keyword language.

1928	   Changes from 10 to 11:

1930	     1. Add description attribute to Media Capture and Capture Scene
1931	        Entry.

1933	     2. Remove contradiction and change the note about open issue
1934	        regarding always responding to Advertisement with a Configure
1935	        message.

1937	     3. Update example section, to cleanup formatting and make the
1938	        media capture attributes and encoding parameters consistent
1939	        with the rest of the document.

1941	   Changes from 09 to 10:

1943	     1. Several minor clarifications such as about SDP usage, Media
1944	        Captures, Configure message.

1946	     2. Simultaneous Set can be expressed in terms of Capture Scene
1947	        and Capture Scene Entry.

1949	     3. Removed Area of Scene attribute.

1951	     4. Add attributes from draft-groves-clue-capture-attr-01.

1953	     5. Move some of the Media Capture attribute descriptions back
1954	        into this document, but try to leave detailed syntax to the
1955	        data model.  Remove the OUTSOURCE sections, which are already
1956	        incorporated into the data model document.

1958	   Changes from 08 to 09:

1960	     1. Use "document" instead of "memo".

1962	     2. Add basic call flow sequence diagram to introduction.

1964	     3. Add definitions for Advertisement and Configure messages.

1966	     4. Add definitions for Capture and Provider.

1968	     5. Update definition of Capture Scene.

1970	     6. Update definition of Individual Encoding.

1972	     7. Shorten definition of Media Capture and add key points in the
1973	        Media Captures section.

1975	     8. Reword a bit about capture scenes in overview.

1977	     9. Reword about labeling Media Captures.

1979	     10. Remove the Consumer Capability message.

1981	     11. New example section heading for media provider behavior

1983	     12. Clarifications in the Capture Scene section.

1985	     13. Clarifications in the Simultaneous Transmission Set section.

1987	     14. Capitalize defined terms.

1989	     15. Move call flow example from introduction to overview section

1991	     16. General editorial cleanup

1993	     17. Add some editors' notes requesting input on issues

1995	     18. Summarize some sections, and propose details be outsourced
1996	        to other documents.

1998	   Changes from 06 to 07:

2000	     1. Ticket #9.  Rename Axis of Capture Point attribute to Point
2001	        on Line of Capture.  Clarify the description of this
2002	        attribute.

2004	     2. Ticket #17.  Add "capture encoding" definition.  Use this new
2005	        term throughout document as appropriate, replacing some usage
2006	        of the terms "stream" and "encoding".

2008	     3. Ticket #18.  Add Max Capture Encodings media capture
2009	        attribute.

2011	     4. Add clarification that different capture scene entries are
2012	        not necessarily mutually exclusive.

2014	   Changes from 05 to 06:

2016	   1. Capture scene description attribute is a list of text strings,
2017	      each in a different language, rather than just a single string.

2019	   2. Add new Axis of Capture Point attribute.

2021	   3. Remove appendices A.1 through A.6.

2023	   4. Clarify that the provider must use the same coordinate system
2024	      with same scale and origin for all coordinates within the same
2025	      capture scene.

2027	   Changes from 04 to 05:

2029	   1. Clarify limitations of "composed" attribute.

2031	   2. Add new section "capture scene entry attributes" and add the
2032	      attribute "scene-switch-policy".

2034	   3. Add capture scene description attribute and description
2035	      language attribute.

2037	   4. Editorial changes to examples section for consistency with the
2038	      rest of the document.

2040	   Changes from 03 to 04:

2042	   1. Remove sentence from overview - "This constitutes a significant
2043	      change ..."

2045	   2. Clarify a consumer can choose a subset of captures from a
2046	      capture scene entry or a simultaneous set (in section "capture
2047	      scene" and "consumer's choice...").

2049	   3. Reword first paragraph of Media Capture Attributes section.

2051	   4. Clarify a stereo audio capture is different from two mono audio
2052	      captures (description of audio channel format attribute).

2054	   5. Clarify what it means when coordinate information is not
2055	      specified for area of capture, point of capture, area of scene.

2057	   6. Change the term "producer" to "provider" to be consistent (it
2058	      was just in two places).

2060	   7. Change name of "purpose" attribute to "content" and refer to
2061	      RFC4796 for values.

2063	   8. Clarify simultaneous sets are part of a provider advertisement,
2064	      and apply across all capture scenes in the advertisement.

2066	   9. Remove sentence about lip-sync between all media captures in a
2067	      capture scene.

2069	   10. Combine the concepts of "capture scene" and "capture set"
2070	      into a single concept, using the term "capture scene" to
2071	      replace the previous term "capture set", and eliminating the
2072	      original separate capture scene concept.

2074	   Informative References
2075	   Edt. Note: Decide which of these really are Normative References.

2077	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2078	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2080	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
2082	              A., Peterson, J., Sparks, R., Handley, M., and E.
2083	              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
2084	              June 2002.

2086	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
2087	              with the Session Description Protocol (SDP)", RFC 3264,
2088	              June 2002.

2090	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
2091	              Jacobson, "RTP: A Transport Protocol for Real-Time
2092	              Applications", STD 64, RFC 3550, July 2003.

2094	   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
2095	              Session Initiation Protocol (SIP)", RFC 4353,
2096	              February 2006.

2098	   [RFC4579]  Johnston, A. and O. Levin, "SIP Call Control -
2099	              Conferencing for User Agents", RFC 4579, August 2006.

2101	   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies",
2102	              RFC 5117, January 2008.

2105	17. Authors' Addresses

2107	   Mark Duckworth (editor)
2108	   Polycom
2109	   Andover, MA  01810
2110	   USA

2112	   Email: mark.duckworth@polycom.com

2114	   Andrew Pepperell
2115	   Acano
2116	   Uxbridge, England
2117	   UK

2119	   Email: apeppere@gmail.com

2121	   Stephan Wenger
2122	   Vidyo, Inc.
2123	   433 Hackensack Ave.
2124	   Hackensack, N.J. 07601
2125	   USA

2127	   Email: stewe@stewe.org