CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                              M. Duckworth
Expires: January 4, 2012                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            M. Gorzynski
                                                 HP Visual Collaboration
                                                            July 3, 2011

               Framework for Telepresence Multi-Streams
                  draft-romanow-clue-framework-00.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 4, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Two Necessary Functions
   5.  Protocol Features
   6.  Stream Content
       6.1.  Media capture
       6.2.  Attributes
       6.3.  Capture Set
   7.  Choosing Streams
       7.1.  Physical Simultaneity
       7.2.  Encoding Groups
           7.2.1.  Sample video encoding group specification #1
           7.2.2.  Sample video encoding group specification #2
   8.  Media provider behavior
   9.  Putting it together - using the Capture Set
   10. Media consumer behavior
       10.1.  One screen receiver configuring the example
              capture-side device above
       10.2.  Two screen receiver configuring the example
              capture-side device above
       10.3.  Three screen receiver configuring the example
              capture-side device above
       10.4.  Configuration of sender streams by a receiver
       10.5.  Advertisement of capabilities sent by receiver to
              sender
   11. Acknowledgements
   12. IANA Considerations
   13. Security Considerations
   14. Informative References
   Appendix A.  Attributes
       A.1.  Purpose
           A.1.1.  Main
           A.1.2.  Presentation
       A.2.  Audio mixed
       A.3.  Audio Channel Format
           A.3.1.  Linear Array
           A.3.2.  Stereo
           A.3.3.  Mono
       A.4.  Audio Linear Position
       A.5.  Video Scale
       A.6.  Video composed
       A.7.  Video Auto-switched
   Appendix B.  Spatial Relationship
       B.1.  Spatial relationship of audio with video
   Appendix C.  Capture sets for the MCU Case
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP and SIP, cannot easily interoperate with each other.  A major
   factor limiting the interoperability of telepresence systems is the
   lack of a standardized way to describe and negotiate the use of the
   multiple streams of audio and video comprising the media flows.
   This draft provides a framework for a protocol to enable
   interoperability by handling multiple streams in a standardized
   way.  It is intended to support the use cases described in
   draft-ietf-clue-telepresence-use-cases-00 and to meet the
   requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.
   However, the highest priority has been given to creating an
   extensible framework, to make it easy to add new information needed
   to accommodate future conferencing functionality.

   The purpose of this effort is to make it possible to handle
   multiple streams of media in such a way that a satisfactory user
   experience is possible even when participants are on different
   vendor equipment and when they are using devices with different
   types of communication capabilities.  Information about the
   relationships between media streams must be communicated so that
   audio/video rendering can be done in the best possible manner.  In
   addition, it is necessary to choose which media streams are sent.

   This first draft of the CLUE framework introduces the basic
   approach.  The draft is deliberately as simple as possible, in
   order to focus discussion on the basic approach.  Some of the more
   descriptive material has been put into appendices in this version,
   in order to keep the framework material from being overwhelmed by
   detail.  In addition, only the basic mechanism is described here.
   In subsequent drafts, additional mechanisms consistent with the
   basic approach will be added to handle more use cases.

   Several important use cases require such additional mechanisms.
   Nonetheless, we feel that it is better to go step by step, and we
   are deferring that material until the next version of the model.
   It will provide a good illustration of how to use the extensibility
   of the framework to handle new use cases.

   If you look at this framework from the perspective of trying to
   catch it out and see where it breaks down in a special case, you
   will easily be able to succeed.  But we urge you to hold that
   perspective temporarily, in order to concentrate on how this model
   works in common cases and how it can be expanded to other use
   cases.

   [Edt.  Similarly, some of the wording is not as precise and
   accurate as it might be.  Although this is of course very
   important, it might be useful to postpone definition issues
   temporarily where possible, in order to concentrate on the
   framework.]

   After the following definitions, two short sections introduce key
   concepts.  The body of the text comprises three sections that deal
   in turn with stream content, choosing streams, and an
   implementation example.  Media provider and media consumer behavior
   are described in separate sections as well.  Several appendices
   describe further details for using the framework.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.
   Capture Device: a device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   Capture Set: a Capture Set includes Media Captures that all
   represent some aspect of the same Capture Scene.  The items (rows)
   in a Capture Set represent different alternatives for representing
   the same Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Encoding Group: a set of encoding parameters representing one or
   more media encoders.  An Encoding Group describes constraints on
   encoding parameters used for mapping Media Captures to encoded
   Streams.

   Endpoint: the logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, and the spatial location and mixing parameters of
   microphones.  Endpoint characteristics are not specific to
   individual media streams sent by the endpoint.

   Left: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].  (Edt. note: needs more clarification)

   MCU: Multipoint Control Unit - a device that connects two or more
   endpoints together into one single multimedia conference [RFC5117].
   An MCU includes an [RFC4353] Mixer.  (Edt. note: RFC 4353 is tardy
   in requiring that media from the mixer be sent to EACH participant.
   I think we have practical use cases where this is not the case.
   But the bug (if it is one) is in RFC 4353 and not herein.)

   Media: any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture may be the source of one or more Media
   streams.  A Media Capture may also be constructed from other Media
   streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.
   *Media Consumer: an Endpoint or middle box that receives Media
   streams.

   *Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Right: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].  (Edt. note: needs more clarification)

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Sender.

   Spatial Relation: the arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also Left
   and Right.

   *Stream: an RTP stream as in RFC 3550.

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as media codec, bit
   rate, resolution, profile/level, etc.) as well as CLUE specific
   attributes (which could include, for example, and depending on the
   solution found, the ID or spatial location of the capture device a
   stream originates from).

   Telepresence: an environment that gives non-co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: a single image that is formed from combining
   visual elements from separate sources.

4.  Two Necessary Functions

   In simplified terms, here is a description of the functions in a
   telepresence conference:

   1.  Capture media

   2.  FIGURE OUT WHICH MEDIA STREAMS TO SEND (CHOOSING STREAMS)

   3.  Encode it

   4.  ADD SOME NOTES (STREAM CONTENT)

   5.  Package it

   6.  Send it

   7.  Unpack it

   8.  Decode it

   9.  Understand the notes

   10. Render the stream content according to the notes

   This gross oversimplification shows clearly that there are only two
   functions the CLUE protocol needs to accomplish: choose which
   streams the sender should send to the receiver, and add the right
   information to the streams that are sent.  The framework/model we
   are presenting can be understood as addressing these two issues.

5.  Protocol Features

   Central to the framework are media stream providers and media
   stream consumers.  The provider's job is to advertise its
   capabilities (as described here) to the consumer, whose job it is
   to configure the provider's encodings (described below).  Both
   providers and consumers can send and receive information; that is,
   we do not have one party exclusively as the sender and one as the
   receiver, but all parties have both sending and receiving parts to
   them.  Most devices function as both a media provider and a media
   consumer.  For two devices to communicate bidirectionally, with
   media flowing in both directions, both devices act as both a media
   provider and a media consumer.  The protocol exchange shown later
   in the "Choosing Streams" section, including hints, announcement
   and request messages, happens twice, independently, between the two
   bidirectional devices.
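   As a purely illustrative sketch, the three message types could be
   modeled as in the following Python fragment.  The data layout and
   all field names here are assumptions of this sketch, not part of
   the framework or of any wire format:

   from dataclasses import dataclass, field
   from typing import Dict, List, Set

   @dataclass
   class Hints:                      # optional, consumer -> provider
       num_screens: int = 1
       max_video_streams: int = 1

   @dataclass
   class Advertisement:              # provider -> consumer (announce)
       captures: Dict[str, dict]     # e.g. {"VC0": {attributes...}}
       capture_sets: List[list]      # rows of capture names
       simultaneous_sets: List[Set[str]]
       encoding_groups: Dict[str, dict]

   @dataclass
   class Configure:                  # consumer -> provider (request)
       # chosen encoder -> capture and encode parameters
       chosen: Dict[str, dict] = field(default_factory=dict)

   In a bidirectional call, each device constructs its own
   Advertisement and receives a Configure from the other side.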
   For brevity we will sometimes refer to the media stream provider as
   the "sender" and the media stream consumer as the "receiver".

   Both endpoints and MCUs, or more generally "middleboxes", can be
   media senders and receivers.

   The protocol resulting from the framework will be declarative
   rather than negotiated.  What this means here is that information
   is passed in either direction, but there is no formalized or
   explicit agreement between participants in the protocol.

6.  Stream Content

   This section describes the structure for communicating information
   between senders and receivers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct is
   discussed in the sections below.  This diagram is for reference.

   Diagram for Stream Content

                         +---------------+
                         |               |
                         |  Capture Set  |
                         |               |
                         +-------+-------+
                     _..-'       |       ``-._
                 _.-'            |            ``-._
             _.-'                |                 ``-._
   +----------------+   +----------------+   +----------------+
   | Media Capture  |   | Media Capture  |   | Media Capture  |
   | Audio or Video |   | Audio or Video |   | Audio or Video |
   +----------------+   +----------------+   +----------------+
                           .'          `.
                         .'              `.
                     ,-----.          ,---------.
                    ,'Encode`.       ,'           `.
                   (  Group   )     (  Attributes   )
                    `.       ,'      `.            ,'
                     `-----'          `---------'

6.1.  Media capture

   A media capture (defined in Section 3) is a fundamental concept of
   the model.  Media can be captured in different ways, for example by
   various arrangements of cameras and microphones.  The model uses
   the terms "video capture" (VC) and "audio capture" (AC) to refer to
   sources of media streams.  To distinguish between multiple
   instances they are numbered; for example, VC1, VC2, and VC3 could
   refer to three different video captures that can be used
   simultaneously.

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A sender can advertise a new list
   of captures at any time.  Both the media sender and the media
   receiver can send their messages (i.e., capture set advertisements
   and stream configurations) any number of times during a call, and
   the other end is always required to act on any new information
   received (e.g., stopping streams it had previously configured that
   are no longer valid).

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic, dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture is described by attributes and associated with an
   encoding group.  Audio and video captures are aggregated into
   capture sets.

6.2.  Attributes

   Audio and video capture attributes carry the information about
   streams and their relationships that a sender or receiver wants to
   communicate.  [Edt: We do not mean to duplicate SDP; if an SDP
   description can be used, great.]

   The attributes of media streams refer to the current state of a
   stream, rather than to the capabilities of a video capture device,
   which are described in the encoding capabilities, as described
   below.

   The mechanism of attributes makes the framework extensible.
   Although we are defining some attributes now, based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.
   If the model does not do something you want it to, chances are that
   defining an attribute will handle your case.

   We describe attributes as variables and their values.  The current
   attributes are listed below.  The variable is shown in parentheses,
   and the possible values follow the colon:

   o  (Purpose): main audio, main video, presentation

   o  (Audio mixed): true, false

   o  (Audio Channel Format): linear array, mono, stereo, tbd

   o  (Audio linear position): integer 0 to 100

   o  (Video scale): integer indicating scale

   o  (Video composed): true, false

   o  (Video auto-switched): true, false

   The attributes listed here are discussed in Appendix A, in order to
   keep the emphasis of this draft on the overall approach rather than
   on the more specific details.

6.3.  Capture Set

   A sender describes its ability to send alternative representations
   of media streams by defining capture sets.

   A capture set is a list of media captures expressed in rows.  Each
   row of the capture set consists of either a single capture or a
   group of captures.  A group means the individual captures in the
   group are spatially related, and the order of the captures within
   the group, along with attribute values, defines the spatial
   ordering of the captures.  Spatial relationships are discussed in
   detail in Appendix B.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.  Each row of the capture set
   contains either a single media capture or one group of media
   captures.

   The following example shows a capture set for an endpoint media
   sender where:

   o  (VC0 - left camera capture, VC1 - center camera capture, VC2 -
      right camera capture)

   o  (VC3 - capture associated with the loudest speaker)

   o  (VC4 - zoomed out view of all people in the room)

   o  (AC0 - room audio)

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other: VC1 is to the
   left of VC2, and VC0 is to the left of VC1.  VC3 and VC4 are other
   alternatives for capturing the same room in different ways.  The
   audio capture is included in the same capture set to indicate that
   AC0 is associated with those video captures, meaning the audio
   should be rendered along with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio/video streams, or might be a
   presentation supplied by a laptop, perhaps with accompanying audio
   commentary).  Spatial ordering of media captures is imposed here by
   the simplicity of a left-to-right ordering among media captures in
   a group in the set.

   A media receiver could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   receiver could choose the first video row plus the audio row, while
   a single stream receiver could choose the second or third video row
   plus the audio row.  An MCU receiver might choose to receive
   multiple rows.
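   As an illustration only, such a capture set could be represented in
   a program as a list of rows, as in the Python sketch below.  The
   representation is an assumption of this sketch, not a normative
   encoding:

   # The capture set from the example above: each row is either a
   # spatially ordered (left-to-right) group or a single capture.
   capture_set = [
       ["VC0", "VC1", "VC2"],  # left, center, right camera captures
       ["VC3"],                # capture associated with the loudest
       ["VC4"],                # zoomed out view of the whole room
       ["AC0"],                # room audio, tied to the video above
   ]

   def video_rows(capture_set):
       """The alternative video rows; a consumer picks one of them."""
       return [row for row in capture_set if row[0].startswith("VC")]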
   The simultaneity groups and encoding groups discussed in the next
   section apply to the media captures listed in capture sets.  The
   simultaneity groups and encoding groups MUST allow all the Media
   Captures in a particular group to be used simultaneously.

7.  Choosing Streams

   The following diagram shows the flow of information messages
   between a media provider and a media consumer.  The provider sends
   information about its capabilities (as specified in this section),
   then the consumer chooses which streams it wants, which we refer to
   as "configure".  Optionally, the consumer may send hints to the
   provider about its own capabilities, in which case the provider
   might tailor its announcements to the consumer.

   Diagram for Choosing Streams

   Media Receiver                               Media Sender
   --------------                               ------------
         |                                            |
         |------------- Hints ----------------------->|
         |                                            |
         |                                            |
         |<---- Capabilities (announce) --------------|
         |                                            |
         |                                            |
         |------ Configure (request) ---------------->|
         |                                            |

   In order for appropriate streams to be sent from senders to
   receivers, certain characteristics of the multiple streams must be
   understood by both senders and receivers.  Two separate aspects of
   streams suffice to describe the necessary information to be shared:
   the first aspect we call "physical simultaneity", and the other we
   refer to as "encoding groups".  These are described in the
   following sections.

7.1.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to the fact that a device
   may not be able to be used in different ways at the same time.
   This shapes the way that offers are made by the sender.  The offers
   are made so that the receiver will choose one of several possible
   usages of the device.  This is easier to show with an example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons: VC0, VC1, VC2.  The middle camera can also zoom out and
   show all six persons, VC3.  But the middle camera cannot be used in
   both modes at the same time - it has to show either the space where
   two participants sit or the whole six seats.  We refer to this as a
   physical device simultaneity constraint.

   The following illustration shows the three cameras with four video
   streams.  The middle camera can be used as main video zoomed in on
   two people, or it can be used in zoomed out mode to capture the
   whole endpoint.  The point is that the middle camera cannot be used
   for both the zoomed in and the zoomed out captures simultaneously.
   This is a constraint imposed by the physical limitations of the
   devices.

   Diagram for Simultaneity

      `-. +--------+  VC2
      .-'|Camera 3|---------->
         +--------+
                        VC3
                    ---------->
      `-. +--------+ /
      .-'|Camera 2|<
         +--------+ \  VC1
                    ---------->

      `-. +--------+  VC0
      .-'|Camera 1|---------->
         +--------+

   VC0 - video zoomed in on 2 people  VC2 - video zoomed in on 2 people
   VC1 - video zoomed in on 2 people  VC3 - video zoomed out, 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not always make sense to do so.  In this example the two
   simultaneous sets are:

   o  {VC0, VC1, VC2}

   o  {VC0, VC3, VC2}

   Either VC0, VC1 and VC2 can be sent, or VC0, VC3 and VC2; only one
   set can be transmitted at a time.  These are physical capabilities
   describing what can physically be sent at the same time, not what
   might make sense to send.  For example, in the second set both VC0
   and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send the list of its simultaneity
   groups to the consumer.
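   A receiver can check a proposed combination of captures against
   these sets mechanically, as the following Python sketch shows; the
   representation is illustrative only:

   # A chosen combination of captures is transmittable only if it is
   # a subset of at least one simultaneous transmission set.
   SIMULTANEOUS_SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]

   def transmittable(chosen):
       return any(set(chosen) <= s for s in SIMULTANEOUS_SETS)

   # transmittable({"VC0", "VC2"}) -> True (fits either set)
   # transmittable({"VC1", "VC3"}) -> False: VC1 and VC3 would need
   #     the middle camera in two modes at once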
7.2.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   senders and receivers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible streams that can be
   sent.  Just as constraints are imposed on the multiple streams by
   physical limitations, there are also constraints due to encoding
   limitations.  These are described in an encoding group as follows.

   An encoding group is an attribute of a video capture (VC), as
   discussed above.

   An encoding group has the variables shown in the following table.

   +--------------+---------------------------------------------------+
   | Name         | Description                                       |
   +--------------+---------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a   |
   |              | single video encoding                             |
   | maxMbps      | Maximum number of macroblocks per second relating |
   |              | to a single video encoding: ((width + 15) / 16) * |
   |              | ((height + 15) / 16) * framesPerSecond            |
   | maxWidth     | Video resolution's maximum supported width,       |
   |              | expressed in pixels                               |
   | maxHeight    | Video resolution's maximum supported height,      |
   |              | expressed in pixels                               |
   | maxFrameRate | Maximum supported frame rate                      |
   +--------------+---------------------------------------------------+

   An encoding group is the basic method of describing encoding
   capability.  There may be multiple encoding groups per endpoint.
   For example, each video capture device might have an associated
   encoding group that describes the video streams that can result
   from that capture.

   An encoding group EG comprises one or more potential encodings
   (ENC).  For example:

   EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the macroblocks per second for 1080p is 244800), and it
   is capable of a maxFrameRate of 60 frames per second (fps).  To
   achieve the maximum resolution (1920 x 1088) the frame rate is
   limited to 30 fps, but 60 fps can be achieved at a lower resolution
   if required by the receiver.  Although the encoding group is
   capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to three audio encodings,
   AUDIO_ENC<0-2>.
   It is not required that audio and video encodings reside within the
   same encoding group, but if they do, then the group's overall
   maxBandwidth value is a limit on the sum of all audio and video
   encodings configured by the receiver.  A system that does not wish
   or need to combine bandwidth limitations in this way should instead
   use separate encoding groups for audio and video, so that the
   bandwidth limitations on audio and video do not interact.

   Here is an example written with separate audio and video encoding
   groups:

   VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
   AUDIO_EG0: maxBandwidth=500000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000

   The following two sections describe further examples of encoding
   group specifications.

7.2.1.  Sample video encoding group specification #1

   An endpoint that has three similar video capture devices would
   advertise three encoding groups, each able to transmit up to two
   1080p30 encodings, as follows:

   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings
   such that:

   o  The parameter values configured for each active ENC do not cause
      that encoding's maxWidth, maxHeight or maxFrameRate to be
      exceeded

   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of the configured
      encodings does not exceed the maxMbps of the encoding group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the receiver.

   Depending on the sender's encoding methods, the receiver may be
   able to request fixed encode values or to choose encode values in a
   range below the maximum offered.  We discuss receiver behavior in
   more detail in a section below.
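   The constraint checks above are mechanical; the following Python
   sketch shows them applied to EG0 from specification #1.  The data
   layout and function names are illustrative assumptions, not part of
   the framework:

   def mbps(width, height, fps):
       # macroblocks per second, as defined for maxMbps above
       return ((width + 15) // 16) * ((height + 15) // 16) * fps

   EG0 = {"maxMbps": 489600, "maxBandwidth": 6000000,
          "encodings": {
              "ENC0": {"maxWidth": 1920, "maxHeight": 1088,
                       "maxFrameRate": 60, "maxMbps": 244800,
                       "maxBandwidth": 4000000},
              "ENC1": {"maxWidth": 1920, "maxHeight": 1088,
                       "maxFrameRate": 60, "maxMbps": 244800,
                       "maxBandwidth": 4000000}}}

   def valid_configuration(group, config):
       """config maps an ENC name to (width, height, fps, bandwidth)."""
       total_mbps = total_bw = 0
       for name, (w, h, fps, bw) in config.items():
           enc = group["encodings"][name]
           if (w > enc["maxWidth"] or h > enc["maxHeight"]
                   or fps > enc["maxFrameRate"]
                   or bw > enc["maxBandwidth"]
                   or mbps(w, h, fps) > enc["maxMbps"]):
               return False              # a per-encoding limit is hit
           total_mbps += mbps(w, h, fps)
           total_bw += bw
       return (total_mbps <= group["maxMbps"]
               and total_bw <= group["maxBandwidth"])

   Two 1080p30 encodings fit (2 * 244800 = 489600 macroblocks per
   second, exactly the group limit), while a single 1080p60 encoding
   does not, since 489600 exceeds the per-encoding maxMbps of 244800.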
7.2.2.  Sample video encoding group specification #2

   An endpoint that has three similar video capture devices would
   advertise three encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings,
   subject to the same constraints as described in Section 7.2.1.

8.  Media provider behavior

   In summary, the sender's capabilities announce message includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of physical simultaneity groups

   o  the list of encoding groups

9.  Putting it together - using the Capture Set

   This section shows how to use the framework to represent a typical
   case for telepresence rooms.

   Appendix C includes an additional example showing the MCU case.
   [Edt.  It is in the appendix just to allow the body of the document
   to focus on the basic ideas.  It can be brought into the main text
   in a later draft.]

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The encoding group specifications can be found above in
   Section 7.2.2, Sample video encoding group specification #2.

   Video Captures:

   1.  VC0 - (the left camera stream), encoding group: EG0,
       attributes: purpose=main; auto-switched=false

   2.  VC1 - (the center camera stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=false

   3.  VC2 - (the right camera stream), encoding group: EG2,
       attributes: purpose=main; auto-switched=false
   4.  VC3 - (the loudest panel stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=true

   5.  VC4 - (the loudest panel stream with PiPs), encoding group:
       EG1, attributes: purpose=main; composed=true; auto-switched=true

   6.  VC5 - (the zoomed out view of all people in the room), encoding
       group: EG1, attributes: purpose=main; auto-switched=false

   7.  VC6 - (presentation stream), encoding group: EG1, attributes:
       purpose=presentation; auto-switched=false

   Summary of video captures: three codecs, where the center one (EG1)
   is used for the center camera stream, the presentation stream, the
   auto-switched streams, and the zoomed out view.  [Edt.  It is
   arbitrary that for this example the alternative views are on EG1 -
   they could have been spread out; it was not a necessary choice.]

   Audio Captures:

   o  AC0 (left), attributes: purpose=main; channel format=linear
      array; linear position=0

   o  AC1 (right), attributes: purpose=main; channel format=linear
      array; linear position=100

   o  AC2 (center), attributes: purpose=main; channel format=linear
      array; linear position=50

   o  AC3, a simple pre-mixed audio stream from the room (mono),
      attributes: purpose=main; channel format=linear array; linear
      position=50; mixed=true

   o  AC4, the audio stream associated with the presentation video
      (mono), attributes: purpose=presentation; channel format=linear
      array; linear position=50

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   Any selection of captures within one set can physically be sent at
   the same time.  This is strictly what is possible from the devices.
   However, using every member of a set simultaneously may not make
   sense - for example VC3 (loudest) and VC4 (loudest with PiPs).  (In
   addition, there are encoding constraints that make choosing all of
   the VCs in a set impossible: VC1, VC3, VC4, VC5 and VC6 all use
   EG1, and EG1 has only three ENCs.  This constraint shows up in the
   capture list, not in the physical simultaneity list.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following tables represent the capture sets for this sender.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

                        +----------------+
                        | Capture Set #1 |
                        +----------------+
                        | VC0, VC1, VC2  |
                        | VC3            |
                        | VC4            |
                        | VC5            |
                        | AC0, AC1, AC2  |
                        | AC3            |
                        +----------------+

                        +----------------+
                        | Capture Set #2 |
                        +----------------+
                        | VC6            |
                        | AC4            |
                        +----------------+

   Different capture sets are unique to each other and non-
   overlapping.  A receiver chooses a capture row from each capture
   set.  In this case the three captures VC0, VC1, and VC2 are one way
   of representing the video from the endpoint; these three captures
   should appear adjacent to each other.  Alternatively, another way
   of representing the Capture Scene is with the capture VC3, which
   automatically shows the person who is talking.  Similarly for the
   VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the three linear position audio captures (AC0, AC1, AC2), and
   another way is with the single channel monaural format AC3.  The
   Media Consumer would choose the one audio capture row it is capable
   of receiving.
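   Pulled together, the provider's announce message for this endpoint
   contains all of the pieces above.  The Python sketch below is
   illustrative only; the field names and layout are assumptions of
   this sketch, not a wire format:

   advertisement = {
       "captures": {
           "VC0": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG0"},
           "VC1": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG1"},
           "VC2": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG2"},
           "VC3": {"purpose": "main", "auto-switched": True,
                   "encoding group": "EG1"},
           "VC4": {"purpose": "main", "auto-switched": True,
                   "composed": True, "encoding group": "EG1"},
           "VC5": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG1"},
           "VC6": {"purpose": "presentation",
                   "encoding group": "EG1"},
           "AC0": {"purpose": "main", "linear position": 0},
           "AC1": {"purpose": "main", "linear position": 100},
           "AC2": {"purpose": "main", "linear position": 50},
           "AC3": {"purpose": "main", "linear position": 50,
                   "mixed": True},
           "AC4": {"purpose": "presentation", "linear position": 50},
       },
       "simultaneous_sets": [
           {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},
           {"VC0", "VC2", "VC5", "VC6"},
       ],
       "capture_sets": [
           [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"],
            ["AC0", "AC1", "AC2"], ["AC3"]],    # set #1: people
           [["VC6"], ["AC4"]],                  # set #2: presentation
       ],
   }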
   The spatial ordering is understood from the left-to-right ordering
   among the VC<n>s on the same row of the table.

   The receiver finds a row in each capture set section of the table
   that it wants.  It configures the streams according to the encoding
   group for that row.

   A Media Receiver would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   receiver that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A receiver that can receive only
   one people stream would probably choose one of the other rows.

   If the receiver can also receive a presentation stream, it would
   choose to receive the only row from Capture Set #2 (VC6).

10.  Media consumer behavior

   The receive side of a call needs to balance its requirements -
   based on its number of screens and speakers, its decoding
   capabilities and the available bandwidth - against the sender's
   capabilities in order to optimally configure the sender's streams.
   Typically it would want to receive and decode media from each
   capture set advertised by the sender.

   A sane, basic algorithm might be for the receiver to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   receiver would then decide how to configure the sender's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
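   A Python sketch of that basic row-selection step is shown below.
   The heuristic (largest row that fits the screen or audio capture
   count) is an assumption of this sketch; a real consumer might apply
   preferences or user choice instead:

   def choose_rows(capture_set, num_screens, max_audio_captures):
       video = [r for r in capture_set if r[0].startswith("VC")]
       audio = [r for r in capture_set if r[0].startswith("AC")]
       # largest video row that still fits the available screens
       video_choice = max(
           (r for r in video if len(r) <= num_screens),
           key=len, default=None)
       audio_choice = max(
           (r for r in audio if len(r) <= max_audio_captures),
           key=len, default=None)
       return video_choice, audio_choice

   With Capture Set #1 above, a three screen consumer obtains
   (VC0, VC1, VC2) and (AC0, AC1, AC2), while a single stream consumer
   obtains one of the single-capture rows and AC3 (the sketch simply
   returns the first of the equally sized rows VC3, VC4 and VC5; a
   real consumer might let preferences or the user decide).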
10.1.  One screen receiver configuring the example capture-side device
       above

   A single screen receiver would fall back to receiving just a single
   people stream - one of VC3, VC4 or VC5 from Capture Set #1 -
   together with an audio row it is capable of receiving, and would
   configure the chosen capture's encoding group accordingly.

10.2.  Two screen receiver configuring the example capture-side device
       above

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen receiver case above) and either leave one
       screen blank or use it for presentation if/when a presentation
       becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens (either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across 2 screens, or,
       as would be necessary if there were large bezels on the
       screens, with each stream being scaled to 1/2 the screen width
       and height and there being a 4th "blank" panel).  This 4th
       panel could potentially be used for any presentation that
       became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, it might again be appropriate to offer the user the choice
   of display mode.

10.3.  Three screen receiver configuring the example capture-side
       device above

   This is the most straightforward case: the receiver would look to
   identify a set of streams to receive that best matches its
   available screens, and so VC0 plus VC1 plus VC2 would match
   optimally.  The spatial ordering would give sufficient information
   for the correct video capture to be shown on the correct screen,
   and the receiver would either need to divide a single encoding
   group's capability by 3 to determine what resolution and frame rate
   to configure the sender with, or to configure the individual video
   captures' encoding groups with what makes most sense (taking into
   account the receive side decode capabilities, the overall call
   bandwidth, the resolution of the screens, plus any user preferences
   such as motion vs. sharpness).

10.4.  Configuration of sender streams by a receiver

   After receiving a set of video capture information from a sender
   and making its choice of what media streams to receive - based on
   the receiver's own capabilities and any sender-side simultaneity
   restrictions - the receiver needs to essentially configure the
   sender to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups, and the potential encoders within those groups,
   that the receiver wishes to be active (this may well be a subset of
   the complete set available).  For each such encoder within an
   encoding group, the receiver would specify the video capture (i.e.,
   the VC<n>) to be encoded, along with the encode parameters to use.
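   As an illustration only, such a configure message might carry one
   entry per encoder the consumer wants active, as in the Python
   sketch below; the field names are assumptions of this sketch, not a
   normative format:

   # One entry per active encoder: the capture to encode and the
   # encode parameters, all within the limits advertised for that
   # encoder (see Section 7.2.1).
   configure = {
       "EG0.ENC0": {"capture": "VC0", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
       "EG1.ENC0": {"capture": "VC1", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
       "EG2.ENC0": {"capture": "VC2", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
   }
   # Encoders not listed (e.g. EG0.ENC1) simply remain inactive.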
Appendix A.  Attributes

   This section discusses the attributes and their values in more
   detail; many have additional details provided elsewhere in the
   draft.  In general, the way to extend the solution to handle new
   features is by adding attributes and/or values.

A.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It can be applied to any media type.  Possible
   values: main, presentation, others TBD.

A.1.1.  Main

   The audio or video capture is of one or more people participating
   in a conference (or of where they would be if they were there).  It
   is of part or all of the Capture Scene.

A.1.2.  Presentation

A.2.  Audio mixed

A.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how
   the meaning of the channels is determined.  It is an enumerated
   variable describing the type of audio channel or channels in the
   Audio Capture.  The possible values of the "channel format"
   attribute are:

   o  linear array (linear position)

   o  mono

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value
   of the "channel format" attribute.

A.3.1.  Linear Array

   An AC with channel format = "linear array" has exactly one audio
   channel.  For the "linear array" channel format there is another
   required attribute to specify position within the array: the
   "linear position" attribute, an integer value within the range 0 to
   100.  0 means leftmost and 100 means rightmost, with other values
   spaced equally between; a value of 50 means spatially in the
   center.  Any AC can have any value, and multiple ACs in a capture
   set row can even have the same value.  The 0-100 linear position is
   intentionally dimensionless, since we presume that receivers will
   use different sized video displays, and the audio spatial location
   can be adjusted at the receiving side to correspond to the
   displays.

   The linear position value is fixed until the receiver asks for a
   different AC from the capture set, which may be triggered by the
   provider sending an updated capture set.

   The streams being sent might be correlated (that is, someone
   talking might be heard in multiple captures from the same room).
   Echo cancellation and stream synchronization in receivers should
   take this into account.

   For example, with three audio channels representing left, center,
   and right:

   AC0 - channel format = linear array; linear position = 0

   AC1 - channel format = linear array; linear position = 50

   AC2 - channel format = linear array; linear position = 100
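   Since the linear position is dimensionless, the mapping onto a
   renderer's audio field is left to the receiver.  The Python sketch
   below shows one such mapping, onto a stereo pan range of -1.0 (full
   left) to +1.0 (full right); the mapping itself is an assumption of
   this sketch, not part of the framework:

   def pan_for(linear_position):
       # 0 -> -1.0 (leftmost), 50 -> 0.0 (center), 100 -> 1.0 (right)
       return (linear_position - 50) / 50.0

   A receiver would scale such values to whatever loudspeaker layout
   it has, so that the audio field covers roughly the same horizontal
   extent as the rendered video (see Appendix B).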
A.3.2.  Stereo

   An AC with channel format = "stereo" has exactly two audio
   channels, left and right, as part of the same AC.  [Edt: should we
   mention RFC 3551 here?  The channel format may be related to how
   Audio Captures are mapped to RTP streams.  This stereo is not the
   same as the effect produced from two mono ACs, one from the left
   and one from the right.]

A.3.3.  Mono

   An AC with channel format = "mono" has one audio channel.  This can
   be represented by an audio linear position with a single member at
   a single integer location.  [Edt: Mono can be represented as a
   particular case of linear array (n=1).]

A.4.  Audio Linear Position

   An integer valued variable from 0 to 100, where 0 signifies the
   left and 100 signifies the right.

A.5.  Video Scale

   An optional integer valued variable indicating the spatial scale of
   the video capture, for example in centimeters of horizontal image
   width.

A.6.  Video composed

   An optional Boolean variable indicating whether the VC is
   constructed by composing multiple other video captures together,
   i.e., whether the stream incorporates multiple composed panes.
   (This could indicate, for example, a continuous presence view of
   multiple images in a grid, or a large image with smaller
   picture-in-picture images in it.)

A.7.  Video Auto-switched

   A Boolean variable.  When true, the offered VC varies depending on
   some rule; it is auto-switched between possible VCs.  The most
   common example of this is sending the video capture associated with
   the "loudest" speaker according to an audio detection algorithm.

Appendix B.  Spatial Relationship

   Here is an example of a simple capture set with three video
   captures and three audio channels, each group in a separate row:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three ACs together in a row indicate those channels are
   spatially related to each other, and spatially related to the VCs
   in the same capture set.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically, multiple Video Captures should
   be rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other, for example a group of Video Captures
   can be associated with a group of Audio Captures, meaning they
   should be rendered together.

   Media Captures of the same media type are associated with each
   other by grouping them together in a single row of a Capture Set.
   Media Captures of different media types are associated with each
   other by putting them in different rows of the same Capture Set.

   For video, the spatial relationship is horizontal adjacency in one
   dimension, so Video Captures can be described as being adjacent to
   each other, in a horizontal row, ordered left to right.  When VCs
   are grouped together in a capture set row, it means they are
   horizontally adjacent to each other, such that when more than one
   of them is rendered together they should be rendered next to each
   other in the proper order.  The first VC in the group is the
   leftmost (from the point of view of a person looking at the
   rendered images), and so on towards the right.

   [Edt: Additional attributes can be added, such as the ability to
   handle a two dimensional array instead of just a one dimensional
   row of video images.]

   Audio Captures that are in the same Capture Set as Video Captures
   are spatially related to them, such that the multiple audio
   channels should be rendered so that the overall audio field covers
   roughly the same horizontal extent as the rendered video.  This
   gives a reasonable spatial correlation between audio and video.  A
   more exact relationship is out of scope of this framework.

B.1.  Spatial relationship of audio with video

   A row of audio is spatially related to a row of video in the same
   capture set.  The audio and video should be rendered such that they
   appear spatially coincident.  Audio with a linear position of 0
   corresponds to the leftmost side of the group of VCs in the same
   capture set, audio with a linear position of 50 corresponds to the
   center of the group of VCs, and audio with a linear position of 100
   corresponds to the rightmost side of the group of VCs.

   Likewise, for stereo audio, the spatial extent of the audio should
   be coincident with the spatial extent of the corresponding video.
Appendix C.  Capture sets for the MCU Case

   This section shows how an MCU might express its capture sets,
   intending to offer different choices for receivers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations; it can be
   associated (e.g., lip-synced) with any combination of video
   captures at the receiver.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen receiver    |
   | VC1, VC2           | video capture for 2 screen receiver         |
   | VC3, VC4, VC5      | video capture for 3 screen receiver         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen receiver         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If/when a presentation stream becomes active within the conference,
   the MCU might re-advertise the available media as:

   +----------------+--------------------------------------+
   | Capture Set #2 | note                                 |
   +----------------+--------------------------------------+
   | VC10           | video capture for presentation       |
   | AC1            | presentation audio to accompany VC10 |
   +----------------+--------------------------------------+

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Cisco Systems
   Langley, England
   UK

   Email: apeppere@cisco.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com

   Mark Gorzynski
   HP Visual Collaboration
   Corvallis, OR
   USA

   Email: mark.gorzynski@hp.com