CLUE WG                                              M. Duckworth, Ed.
Internet Draft                                                  Polycom
Intended status: Informational                             A. Pepperell
Expires: June, 2013                                         Silverflare
                                                               S. Wenger
                                                                   Vidyo
                                                       December 24, 2012

             Framework for Telepresence Multi-Streams
                  draft-ietf-clue-framework-08.txt

Abstract

   This memo offers a framework for a protocol that enables devices
   in a telepresence conference to interoperate by specifying the
   relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on June 24, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction...................................................3
   2.
Terminology....................................................6 56 3. Definitions....................................................6 57 4. Overview of the Framework/Model................................9 58 5. Spatial Relationships.........................................11 59 6. Media Captures and Capture Scenes.............................12 60 6.1. Media Captures...........................................12 61 6.1.1. Media Capture Attributes............................12 62 6.2. Capture Scene............................................15 63 6.2.1. Capture scene attributes............................17 64 6.2.2. Capture scene entry attributes......................18 65 6.3. Simultaneous Transmission Set Constraints................19 66 7. Encodings.....................................................20 67 7.1. Individual Encodings.....................................21 68 7.2. Encoding Group...........................................22 69 8. Associating Media Captures with Encoding Groups...............24 70 9. Consumer's Choice of Streams to Receive from the Provider.....25 71 9.1. Local preference.........................................26 72 9.2. Physical simultaneity restrictions.......................26 73 9.3. Encoding and encoding group limits.......................26 74 9.4. Message Flow.............................................27 75 10. Extensibility................................................28 76 11. Examples - Using the Framework...............................28 77 11.1. Three screen endpoint media provider....................28 78 11.2. Encoding Group Example..................................35 79 11.3. The MCU Case............................................36 80 11.4. Media Consumer Behavior.................................37 81 11.4.1. One screen consumer................................37 82 11.4.2. Two screen consumer configuring the example........38 83 11.4.3. Three screen consumer configuring the example......38 84 12. Acknowledgements.............................................39 85 13. IANA Considerations..........................................39 86 14. Security Considerations......................................39 87 15. Changes Since Last Version...................................39 88 16. Authors' Addresses...........................................42 90 1. Introduction 92 Current telepresence systems, though based on open standards such 93 as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate 94 with each other. A major factor limiting the interoperability of 95 telepresence systems is the lack of a standardized way to describe 96 and negotiate the use of the multiple streams of audio and video 97 comprising the media flows. This draft provides a framework for a 98 protocol to enable interoperability by handling multiple streams 99 in a standardized way. It is intended to support the use cases 100 described in draft-ietf-clue-telepresence-use-cases-02 and to meet 101 the requirements in draft-ietf-clue-telepresence-requirements-01. 103 Conceptually distinguished are Media Providers and Media 104 Consumers. A Media Provider provides Media in the form of RTP 105 packets, a Media Consumer consumes those RTP packets. Media 106 Providers and Media Consumers can reside in Endpoints or in 107 middleboxes such as Multipoint Control Units (MCUs). A Media 108 Provider in an Endpoint is usually associated with the generation 109 of media for Media Captures; these Media Captures are typically 110 sourced from cameras, microphones, and the like. 
Similarly, the Media Consumer in an Endpoint is usually associated with Renderers, such as screens and loudspeakers. In middleboxes, Media Providers and Consumers can have the form of outputs and inputs, respectively, of RTP mixers, RTP translators, and similar devices. Typically, telepresence devices such as Endpoints and middleboxes would perform as both Media Providers and Media Consumers, the former being concerned with those devices' transmitted media and the latter with those devices' received media. In a few circumstances, a CLUE Endpoint or middlebox may include only Consumer or Provider functionality, such as recorder-type Consumers or webcam-type Providers.

One initial motivation for this memo and its companion documents has been that Endpoints according to this memo can, and usually do, have multiple Media Captures and Media Renderers. While previous system designs can deal with such a situation, what was missing was a mechanism that can associate the Media Captures with each other in space and time. Further, due to the potentially large number of RTP flows required for a Multimedia Conference involving potentially many Endpoints, each of which can have many Media Captures and Media Renderers, a sensible system design is to multiplex multiple RTP media flows onto the same transport address, so as to avoid using the port number as a multiplexing point and the shortcomings associated with that, such as NAT/firewall traversal issues.

While the actual mapping of those RTP flows to the header fields of the RTP packets is not the subject of this specification, the large number of possible permutations of sensible options a Media Provider may make available to a Media Consumer makes it desirable to have a mechanism that narrows down the number of possible options a SIP offer-answer exchange has to consider. Such information is made available using protocol mechanisms specified in this memo and companion documents, although it should be stressed that its use in an implementation is optional. Also, there are aspects of the control of both Endpoints and middleboxes/MCUs that dynamically change during the progress of a call, such as audio-level based screen switching, layout changes, and so on, which need to be conveyed. Note that these control aspects are complementary to those specified in traditional SIP-based conference management such as BFCP. Finally, all this information needs to be conveyed, and the notion of support for it needs to be established. This is done by the negotiation of a "CLUE channel", a data channel negotiated early during the initiation of a call. An Endpoint or MCU that rejects the establishment of this data channel, by definition, does not support CLUE-based mechanisms, whereas an Endpoint or MCU that accepts it is required to use it to the extent specified in this memo and its companion documents.

A very brief outline of the call flow used by a simple system in compliance with this memo can be described as follows.

An initial offer/answer exchange establishes a CLUE channel between two Endpoints. With the establishment of that channel, the endpoints have consented to use the CLUE protocol mechanisms and have to adhere to them.
168 Over this CLUE channel, the Provider in each Endpoint conveys its 169 characteristics and capabilities as specified herein (which will 170 typically not be sufficient to set up all media). The Consumer in 171 the Endpoint receives the information provided by the Provider, 172 and can use it for two purposes. First, it can, but is not 173 necessarily required to, use the information provided to tailor 174 the SDP it is going to send during the following SIP offer/answer 175 exchange, and its reaction to SDP it receives in that step. It is 176 often a sensible implementation choice to do so, as the 177 representation of the media information conveyed over the CLUE 178 channel can dramatically cut down on the size of SDP messages used 179 in the O/A exchange that follows. Second, it takes note of the 180 spatial relationship associated with the Media that are described. 182 It is often sensible to take that spatial relationship into 183 account when tailoring the SDP. 185 This CLUE exchange is followed by an SDP offer answer exchange 186 that not only establishes those aspects of the media that have not 187 been "negotiated" over CLUE, but has also the side effect of 188 setting up the media transmission itself, involving potentially 189 security exchanges, ICE, and whatnot. This step is plain vanilla 190 SIP, with the exception that the SDP used herein, in most cases 191 can (but not necessarily must) be considerably smaller than the 192 SDP a system would typically need to exchange if there were no 193 pre-established knowledge about the Provider and Consumer 194 characteristics. 196 During the lifetime of a call, further exchanges can occur over 197 the CLUE channel. In some cases, those further exchanges can be 198 dealt with by Provider or Consumer without any other protocol 199 activity. For example, voice-activated screen switching, signaled 200 over the CLUE channel, ought not to lead to heavy-handed 201 mechanisms like SIP re-invites. However, in other cases, after 202 the CLUE negotiation an additional offer/answer exchange may 203 become necessary. For example, if both sides decide to upgrade 204 the call from a single screen to a multi-screen call and more 205 bandwidth is required for the additional video channels, that 206 could require a new O/A exchange. 208 Numerous optimizations may be possible, and are the implementer's 209 choice. For example, it may be sensible to establish one or more 210 initial media channels during the initial offer/answer exchange, 211 which would allow, for example, for a fast startup of audio. 212 Depending on the system design, it may be possible to re-use this 213 established channel using only CLUE mechanisms, thereby avoiding 214 further offer/answer exchanges. 216 One aspect of the protocol outlined herein and specified in 217 normative detail in companion documents is that it makes available 218 information regarding the Provider's capabilities to deliver 219 Media, and attributes related to that media such as their spatial 220 relationship, to the Media Consumer. The operation of the 221 Renderer inside the Consumer is unspecified in that it can choose 222 to ignore some information provided by the Provider, and/or not 223 render media streams available from the Provider (although it has 224 to follow the CLUE protocol and, therefore, has to "accept" the 225 Provider's information). 
All CLUE protocol mechanisms are optional in the Consumer in the sense that, while the Consumer must be able to receive (and, potentially, gracefully acknowledge) CLUE messages, it is free to ignore the information provided therein. Obviously, ignoring all of that information is not a particularly sensible design choice.

Legacy devices are defined herein as those Endpoints and MCUs that do not support the setup and use of the CLUE channel. The notion of a device being a legacy device is established during the initial offer/answer exchange, in which the legacy device will not understand the offer for the CLUE channel and will, therefore, reject it. This indicates to the CLUE-implementing Endpoint or MCU that the other side of the communication is not compliant with CLUE, and that it should fall back to whatever mechanism was used before the introduction of CLUE.

As for the media, Provider and Consumer have an end-to-end communication relationship with respect to (RTP-transported) media, and the mechanisms described herein and in companion documents do not change the aspects of setting up those RTP flows and sessions. However, it should be noted that forms of RTP multiplexing of multiple RTP flows onto the same transport address are being developed concurrently with the CLUE suite of specifications, and it is widely expected that most, if not all, Endpoints or MCUs supporting CLUE will also support those mechanisms. Some design choices made in this memo reflect this coincidence in specification development timing.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

The terms defined below are used throughout this memo and companion documents, and they are normative. In order to easily identify the use of a defined term, those terms are capitalized.

Audio Capture: Media Capture for audio. Denoted as ACn.

Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media. They are the opposite of stage-left and stage-right.

Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder. Cameras and microphones are examples of capture devices.

Capture Encoding: A specific encoding of a media capture, to be sent by a media provider to a media consumer via RTP.

Capture Scene: a structure representing the scene that is captured by a collection of capture devices. A capture scene includes attributes and one or more capture scene entries, with each entry including one or more media captures.

Capture Scene Entry: a list of media captures of the same media type that together form one way to represent the capture scene.

Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding. The attributes include maximum bandwidth and, for video, maximum macroblocks per second (for H.264), maximum width, maximum height, and maximum frame rate.
300 Encoding Group: A set of encoding parameters representing a total 301 media encoding capability to be sub-divided across potentially 302 multiple Individual Encodings. 304 Endpoint: The logical point of final termination through 305 receiving, decoding and rendering, and/or initiation through 306 capturing, encoding, and sending of media streams. An endpoint 307 consists of one or more physical devices which source and sink 308 media streams, and exactly one [RFC4353] Participant (which, in 309 turn, includes exactly one SIP User Agent). In contrast to an 310 endpoint, an MCU may also send and receive media streams, but it 311 is not the initiator nor the final terminator in the sense that 312 Media is Captured or Rendered. Endpoints can be anything from 313 multiscreen/multicamera rooms to handheld devices. 315 Front: the portion of the room closest to the cameras. In going 316 towards back you move away from the cameras. 318 MCU: Multipoint Control Unit (MCU) - a device that connects two or 319 more endpoints together into one single multimedia conference 320 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 321 tardy in requiring that media from the mixer be sent to EACH 322 participant. I think we have practical use cases where this is 323 not the case. But the bug (if it is one) is in 4353 and not 324 herein.] 326 Media: Any data that, after suitable encoding, can be conveyed 327 over RTP, including audio, video or timed text. 329 Media Capture: a source of Media, such as from one or more Capture 330 Devices. A Media Capture (MC) may be the source of one or more 331 capture encodings. A Media Capture may also be constructed from 332 other Media streams. A middle box can express Media Captures that 333 it constructs from Media streams it receives. 335 Media Consumer: an Endpoint or middle box that receives media 336 streams 338 Media Provider: an Endpoint or middle box that sends Media streams 340 Model: a set of assumptions a telepresence system of a given 341 vendor adheres to and expects the remote telepresence system(s) 342 also to adhere to. 344 Plane of Interest: The spatial plane containing the most relevant 345 subject matter. 347 Render: the process of generating a representation from a media, 348 such as displayed motion video or sound emitted from loudspeakers. 350 Simultaneous Transmission Set: a set of media captures that can be 351 transmitted simultaneously from a Media Provider. 353 Spatial Relation: The arrangement in space of two objects, in 354 contrast to relation in time or other relationships. See also 355 Camera-Left and Right. 357 Stage-Left and Right: For media captures, stage-left and stage- 358 right are the opposite of camera-left and camera-right. For the 359 case of a person facing (and captured by) a camera, stage-left and 360 stage-right are from the point of view of that person. 362 Stream: a capture encoding sent from a media provider to a media 363 consumer via RTP [RFC3550]. 365 Stream Characteristics: the media stream attributes commonly used 366 in non-CLUE SIP/SDP environments (such as: media codec, bit rate, 367 resolution, profile/level etc.) as well as CLUE specific 368 attributes, such as the ID of a capture or a spatial location. 370 Telepresence: an environment that gives non co-located users or 371 user groups a feeling of (co-located) presence - the feeling that 372 a Local user is in the same room with other Local users and the 373 Remote parties. 
The inclusion of Remote parties is achieved 374 through multimedia communication including at least audio and 375 video signals of high fidelity. 377 Video Capture: Media Capture for video. Denoted as VCn. 379 Video composite: A single image that is formed from combining 380 visual elements from separate sources. 382 4. Overview of the Framework/Model 384 The CLUE framework specifies how multiple media streams are to be 385 handled in a telepresence conference. 387 The main goals include: 389 o Interoperability 391 o Extensibility 393 o Flexibility 395 Interoperability is achieved by the media provider describing the 396 relationships between media streams in constructs that are 397 understood by the consumer, who can then render the media. 398 Extensibility is achieved through abstractions and the generality 399 of the model, making it easy to add new parameters. Flexibility 400 is achieved largely by having the consumer choose what content and 401 format it wants to receive from what the provider is capable of 402 sending. 404 A transmitting endpoint or MCU describes specific aspects of the 405 content of the media and the formatting of the media streams it 406 can send (advertisement); and the receiving end responds to the 407 provider by specifying which content and media streams it wants to 408 receive (configuration). The provider then transmits the asked 409 for content in the specified streams. 411 This advertisement and configuration occurs at call initiation but 412 may also happen at any time throughout the conference, whenever 413 there is a change in what the consumer wants or the provider can 414 send. 416 An endpoint or MCU typically acts as both provider and consumer at 417 the same time, sending advertisements and sending configurations 418 in response to receiving advertisements. (It is possible to be 419 just one or the other.) 421 The data model is based around two main concepts: a capture and an 422 encoding. A media capture (MC), such as audio or video, describes 423 the content a provider can send. Media captures are described in 424 terms of CLUE-defined attributes, such as spatial relationships 425 and purpose of the capture. Providers tell consumers which media 426 captures they can provide, described in terms of the media capture 427 attributes. 429 A provider organizes its media captures that represent the same 430 scene into capture scenes. A consumer chooses which media 431 captures it wants to receive according to the capture scenes sent 432 by the provider. 434 In addition, the provider sends the consumer a description of the 435 individual encodings it can send in terms of the media attributes 436 of the encodings, in particular, well-known audio and video 437 parameters such as bandwidth, frame rate, macroblocks per second. 439 The provider also specifies constraints on its ability to provide 440 media, and the consumer must take these into account in choosing 441 the content and capture encodings it wants. Some constraints are 442 due to the physical limitations of devices - for example, a camera 443 may not be able to provide zoom and non-zoom views simultaneously. 444 Other constraints are system based constraints, such as maximum 445 bandwidth and maximum macroblocks/second. 447 The following sections discuss these constructs and processes in 448 detail, followed by use cases showing how the framework 449 specification can be used. 451 5. 
Spatial Relationships 453 In order for a consumer to perform a proper rendering, it is often 454 necessary to provide spatial information about the streams it is 455 receiving. CLUE defines a coordinate system that allows media 456 providers to describe the spatial relationships of their media 457 captures to enable proper scaling and spatial rendering of their 458 streams. The coordinate system is based on a few principles: 460 o Simple systems which do not have multiple Media Captures to 461 associate spatially need not use the coordinate model. 463 o Coordinates can either be in real, physical units 464 (millimeters), have an unknown scale or have no physical scale. 465 Systems which know their physical dimensions should always 466 provide those real-world measurements. Systems which don't 467 know specific physical dimensions but still know relative 468 distances should use 'unknown scale'. 'No scale' is intended 469 to be used where Media Captures from different devices (with 470 potentially different scales) will be forwarded alongside one 471 another (e.g. in the case of a middle box). 473 * "millimeters" means the scale is in millimeters 475 * "Unknown" means the scale is not necessarily millimeters, 476 but the scale is the same for every capture in the capture 477 scene. 479 * "No Scale" means the scale could be different for each 480 capture- an MCU provider that advertises two adjacent 481 captures and picks sources (which can change quickly) from 482 different endpoints might use this value; the scale could be 483 different and changing for each capture. But the areas of 484 capture still represent a spatial relation between captures. 486 o The coordinate system is Cartesian X, Y, Z with the origin at a 487 spot of the provider's choosing. The provider must use the 488 same coordinate system with same scale and origin for all 489 coordinates within the same capture scene. 491 The direction of increasing coordinate values is: 492 X increases from camera left to camera right 493 Y increases from front to back 494 Z increases from low to high 496 6. Media Captures and Capture Scenes 498 This section describes how media providers can describe the 499 content of media to consumers. 501 6.1. Media Captures 503 Media captures are the fundamental representations of streams that 504 a device can transmit. What a Media Capture actually represents 505 is flexible: 507 o It can represent the immediate output of a physical source 508 (e.g. camera, microphone) or 'synthetic' source (e.g. laptop 509 computer, DVD player). 511 o It can represent the output of an audio mixer or video composer 513 o It can represent a concept such as 'the loudest speaker' 515 o It can represent a conceptual position such as 'the leftmost 516 stream' 518 To distinguish between multiple instances, video and audio 519 captures are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 520 refer to two different video captures and AC1 and AC2 refer to two 521 different audio captures. 523 Each Media Capture can be associated with attributes to describe 524 what it represents. 526 6.1.1. Media Capture Attributes 528 Media Capture Attributes describe static information about the 529 captures. A provider uses the media capture attributes to 530 describe the media captures to the consumer. The consumer will 531 select the captures it wants to receive. Attributes are defined 532 by a variable and its value. 
The currently defined attributes and 533 their values are: 535 Content: {slides, speaker, sl, main, alt} 536 A field with enumerated values which describes the role of the 537 media capture and can be applied to any media type. The 538 enumerated values are defined by [RFC4796]. The values for this 539 attribute are the same as the mediacnt values for the content 540 attribute in [RFC4796]. This attribute can have multiple values, 541 for example content={main, speaker}. 543 Composed: {true, false} 545 A field with a Boolean value which indicates whether or not the 546 Media Capture is a mix (audio) or composition (video) of streams. 548 This attribute is useful for a media consumer to avoid nesting a 549 composed video capture into another composed capture or rendering. 550 This attribute is not intended to describe the layout a media 551 provider uses when composing video streams. 553 Audio Channel Format: {mono, stereo} A field with enumerated 554 values which describes the method of encoding used for audio. 556 A value of 'mono' means the Audio Capture has one channel. 558 A value of 'stereo' means the Audio Capture has two audio 559 channels, left and right. 561 This attribute applies only to Audio Captures. A single stereo 562 capture is different from two mono captures that have a left-right 563 spatial relationship. A stereo capture maps to a single RTP 564 stream, while each mono audio capture maps to a separate RTP 565 stream. 567 Switched: {true, false} 569 A field with a Boolean value which indicates whether or not the 570 Media Capture represents the (dynamic) most appropriate subset of 571 a 'whole'. What is 'most appropriate' is up to the provider and 572 could be the active speaker, a lecturer or a VIP. 574 Point of Capture: {(X, Y, Z)} 576 A field with a single Cartesian (X, Y, Z) point value which 577 describes the spatial location, virtual or physical, of the 578 capturing device (such as camera). 580 When the Point of Capture attribute is specified, it must include 581 X, Y and Z coordinates. If the point of capture is not specified, 582 it means the consumer should not assume anything about the spatial 583 location of the capturing device. Even if the provider specifies 584 an area of capture attribute, it does not need to specify the 585 point of capture. 587 Point on Line of Capture: {(X,Y,Z)} 589 A field with a single Cartesian (X, Y, Z) point value (virtual or 590 physical) which describes a position in space of a second point on 591 the axis of the capturing device; the first point being the Point 592 of Capture (see above). This point MUST lie between the Point of 593 Capture and the Area of Capture. 595 The Point on Line of Capture MUST be ignored if the Point of 596 Capture is not present for this capture device. When the Point on 597 Line of Capture attribute is specified, it must include X, Y and Z 598 coordinates. These coordinates MUST NOT be identical to the Point 599 of Capture coordinates. If the Point on Line of Capture is not 600 specified, no assumptions are made about the axis of the capturing 601 device. 603 Area of Capture: 605 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, 606 Y3, Z3), top right(X4, Y4, Z4)} 608 A field with a set of four (X, Y, Z) points as a value which 609 describe the spatial location of what is being "captured". By 610 comparing the Area of Capture for different Media Captures within 611 the same capture scene a consumer can determine the spatial 612 relationships between them and render them correctly. 
The four points should be co-planar. The four points form a quadrilateral, not necessarily a rectangle.

The quadrilateral described by the four (X, Y, Z) points defines the plane of interest for the particular media capture.

If the area of capture attribute is specified, it must include X, Y and Z coordinates for all four points. If the area of capture is not specified, it means the media capture is not spatially related to any other media capture (but this can change in a subsequent provider advertisement).

For a switched capture that switches between different sections within a larger area, the area of capture should use coordinates for the larger potential area.

EncodingGroup: {encodeGroupID}

A field with a value equal to the encodeGroupID of the encoding group associated with the media capture.

Max Capture Encodings: {unsigned integer}

An optional attribute indicating the maximum number of capture encodings that can be simultaneously active for the media capture. If absent, this parameter defaults to 1. The minimum value for this attribute is 1. The number of simultaneous capture encodings is also limited by the restrictions of the encoding group for the media capture.

6.2. Capture Scene

In order for a provider's individual media captures to be used effectively by a consumer, the provider organizes the media captures into capture scenes, with the structure and contents of these capture scenes being sent from the provider to the consumer.

A capture scene is a structure representing the scene that is captured by a collection of capture devices. A capture scene includes one or more capture scene entries, with each entry including one or more media captures. A capture scene represents, for example, the video image of a group of people seated next to each other, along with the sound of their voices, which could be represented by some number of VCs and ACs in the capture scene entries. A middle box may also express capture scenes that it constructs from media streams it receives.

A provider may advertise multiple capture scenes or just a single capture scene. A media provider might typically use one capture scene for main participant media and another capture scene for a computer-generated presentation. A capture scene may include more than one type of media. For example, a capture scene can include several capture scene entries for video captures, and several capture scene entries for audio captures.

A provider can express spatial relationships between media captures that are included in the same capture scene, but there is no spatial relationship between media captures that are in different capture scenes.

A media provider arranges media captures in a capture scene to help the media consumer choose which captures it wants. The capture scene entries in a capture scene are different alternatives the provider is suggesting for representing the capture scene. The media consumer can choose to receive all media captures from one capture scene entry for each media type (e.g. audio and video), or it can pick and choose media captures regardless of how the provider arranges them in capture scene entries. Different capture scene entries of the same media type are not necessarily mutually exclusive alternatives.
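As a non-normative illustration of this structure, the following sketch (Python, with illustrative names only; the actual advertisement syntax is defined in companion documents) models a capture scene as a set of entries, each holding media captures with a few of the attributes defined above:

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # an (X, Y, Z) coordinate

   @dataclass
   class MediaCapture:
       capture_id: str                  # e.g. "VC0" or "AC0"
       media_type: str                  # "video" or "audio"
       encoding_group: str              # value of the EncodingGroup attribute
       content: List[str] = field(default_factory=lambda: ["main"])
       switched: bool = False
       composed: bool = False
       point_of_capture: Optional[Point] = None
       area_of_capture: Optional[List[Point]] = None   # four corner points

   @dataclass
   class CaptureSceneEntry:
       media_type: str                  # all captures in an entry share one type
       captures: List[MediaCapture]

   @dataclass
   class CaptureScene:
       description: Optional[str] = None
       scale: str = "unknown"           # "millimeters", "unknown" or "no scale"
       entries: List[CaptureSceneEntry] = field(default_factory=list)

   # Example: one scene with a three-camera video entry and a main audio entry.
   scene = CaptureScene(
       description="Main room",
       scale="millimeters",
       entries=[
           CaptureSceneEntry("video", [MediaCapture("VC0", "video", "EG0"),
                                       MediaCapture("VC1", "video", "EG1"),
                                       MediaCapture("VC2", "video", "EG2")]),
           CaptureSceneEntry("audio", [MediaCapture("AC0", "audio", "EG3")]),
       ])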
685 Media captures within the same capture scene entry must be of the 686 same media type - it is not possible to mix audio and video 687 captures in the same capture scene entry, for instance. The 688 provider must be capable of encoding and sending all media 689 captures in a single entry simultaneously. A consumer may decide 690 to receive all the media captures in a single capture scene entry, 691 but a consumer could also decide to receive just a subset of those 692 captures. A consumer can also decide to receive media captures 693 from different capture scene entries. 695 When a provider advertises a capture scene with multiple entries, 696 it is essentially signaling that there are multiple 697 representations of the same scene available. In some cases, these 698 multiple representations would typically be used simultaneously 699 (for instance a "video entry" and an "audio entry"). In some 700 cases the entries would conceptually be alternatives (for instance 701 an entry consisting of 3 video captures versus an entry consisting 702 of just a single video capture). In this latter example, the 703 provider would in the simple case end up providing to the consumer 704 the entry containing the number of video captures that most 705 closely matched the media consumer's number of display devices. 707 The following is an example of 4 potential capture scene entries 708 for an endpoint-style media provider: 710 1. (VC0, VC1, VC2) - left, center and right camera video captures 712 2. (VC3) - video capture associated with loudest room segment 714 3. (VC4) - video capture zoomed out view of all people in the 715 room 716 4. (AC0) - main audio 718 The first entry in this capture scene example is a list of video 719 captures with a spatial relationship to each other. Determination 720 of the order of these captures (VC0, VC1 and VC2) for rendering 721 purposes is accomplished through use of their Area of Capture 722 attributes. The second entry (VC3) and the third entry (VC4) are 723 additional alternatives of how to capture the same room in 724 different ways. The inclusion of the audio capture in the same 725 capture scene indicates that AC0 is associated with those video 726 captures, meaning it comes from the same scene. The audio should 727 be rendered in conjunction with any rendered video captures from 728 the same capture scene. 730 6.2.1. Capture scene attributes 732 Attributes can be applied to capture scenes as well as to 733 individual media captures. Attributes specified at this level 734 apply to all constituent media captures. 736 Description attribute - list of {, } 739 The optional description attribute is a list of human readable 740 text strings which describe the capture scene. If there is more 741 than one string in the list, then each string in the list should 742 contain the same description, but in a different language. A 743 provider that advertises multiple capture scenes can provide 744 descriptions for each of them. This attribute can contain text in 745 any number of languages. 747 The language tag identifies the language of the corresponding 748 description text. The possible values for a language tag are the 749 values of the 'Subtag' column for the "Type: language" entries in 750 the "Language Subtag Registry" at [IANA-Lan] originally defined in 751 [RFC5646]. A particular language tag value MUST NOT be used more 752 than once in the description attribute list. 
Area of Scene attribute

The area of scene attribute for a capture scene has the same format as the area of capture attribute for a media capture. The area of scene is for the entire scene, which is captured by the one or more media captures in the capture scene entries. If the provider does not specify the area of scene, but does specify areas of capture, then the consumer may assume the area of scene is greater than or equal to the outer extents of the individual areas of capture.

Scale attribute

An optional attribute indicating whether the numbers used for area of scene, area of capture and point of capture are in terms of millimeters, an unknown scale factor, or no scale at all, as described in Section 5. If any media captures have an area of capture attribute or point of capture attribute, then this scale attribute must also be defined. The possible values for this attribute are:

"millimeters"

"unknown"

"no scale"

6.2.2. Capture scene entry attributes

Attributes can be applied to capture scene entries. Attributes specified at this level apply to the capture scene entry as a whole.

Scene-switch-policy: {site-switch, segment-switch}

A media provider uses this scene-switch-policy attribute to indicate its support for different switching policies. In the provider's advertisement, this attribute can have multiple values, which means the provider supports each of the indicated policies. The consumer, when it requests media captures from this capture scene entry, should also include this attribute, but with only the single value (from among the values indicated by the provider) indicating the consumer's choice of which policy it wants the provider to use. If the provider does not support any of these policies, it should omit this attribute.

The "site-switch" policy means all captures are switched at the same time to keep captures from the same endpoint site together. Let's say the speaker is at site A and everyone else is at a "remote" site.

When the room at site A is shown, all the camera images from site A are forwarded to the remote sites. Therefore, at each receiving remote site, all the screens display camera images from site A. This can be used to preserve full-size image display, and also provide full visual context of the displayed far end, site A. In site switching, there is a fixed relation between the cameras in each room and the displays in remote rooms. The room or participants being shown is switched from time to time based on who is speaking or by manual control.

The "segment-switch" policy means different captures can switch at different times, and can be coming from different endpoints. Still using site A as the site where the speaker is, and "remote" to refer to all the other sites, in segment switching, rather than sending all the images from site A, only the image containing the speaker at site A is shown. The camera images of the current speaker and previous speakers (if any) are forwarded to the other sites in the conference.

Therefore, the screens at each site are usually displaying images from different remote sites - the current speaker at site A and the previous ones. This strategy can be used to preserve full-size image display, and also capture the non-verbal communication between the speakers.
In segment switching, the display depends 828 on the activity in the remote rooms - generally, but not 829 necessarily based on audio / speech detection. 831 6.3. Simultaneous Transmission Set Constraints 833 The provider may have constraints or limitations on its ability to 834 send media captures. One type is caused by the physical 835 limitations of capture mechanisms; these constraints are 836 represented by a simultaneous transmission set. The second type 837 of limitation reflects the encoding resources available - 838 bandwidth and macroblocks/second. This type of constraint is 839 captured by encoding groups, discussed below. 841 An endpoint or MCU can send multiple captures simultaneously, 842 however sometimes there are constraints that limit which captures 843 can be sent simultaneously with other captures. A device may not 844 be able to be used in different ways at the same time. Provider 845 advertisements are made so that the consumer will choose one of 846 several possible mutually exclusive usages of the device. This 847 type of constraint is expressed in a Simultaneous Transmission 848 Set, which lists all the media captures that can be sent at the 849 same time. This is easier to show in an example. 851 Consider the example of a room system where there are 3 cameras 852 each of which can send a separate capture covering 2 persons each- 853 VC0, VC1, VC2. The middle camera can also zoom out and show all 6 854 persons, VC3. But the middle camera cannot be used in both modes 855 at the same time - it has to either show the space where 2 856 participants sit or the whole 6 seats, but not both at the same 857 time. 859 Simultaneous transmission sets are expressed as sets of the MCs 860 that could physically be transmitted at the same time, (though it 861 may not make sense to do so). In this example the two 862 simultaneous sets are shown in Table 1. The consumer must make 863 sure that it chooses one and not more of the mutually exclusive 864 sets. A consumer may choose any subset of the media captures in a 865 simultaneous set, it does not have to choose all the captures in a 866 simultaneous set if it does not want to receive all of them. 868 +-------------------+ 869 | Simultaneous Sets | 870 +-------------------+ 871 | {VC0, VC1, VC2} | 872 | {VC0, VC3, VC2} | 873 +-------------------+ 875 Table 1: Two Simultaneous Transmission Sets 877 A media provider includes the simultaneous sets in its provider 878 advertisement. These simultaneous set constraints apply across 879 all the captures scenes in the advertisement. The simultaneous 880 transmission sets MUST allow all the media captures in a 881 particular capture scene entry to be used simultaneously. 883 7. Encodings 885 We have considered how providers can describe the content of media 886 to consumers. We will now consider how the providers communicate 887 information about their abilities to send streams. We introduce 888 two constructs - individual encodings and encoding groups. 889 Consumers will then map the media captures they want onto the 890 encodings with encoding parameters they want. This process is 891 then described. 893 7.1. Individual Encodings 895 An individual encoding represents a way to encode a media capture 896 to become a capture encoding, to be sent as an encoded media 897 stream from the media provider to the media consumer. An 898 individual encoding has a set of parameters characterizing how the 899 media is encoded. 
Different media types have different parameters, and different encoding algorithms may have different parameters. An individual encoding can be assigned to only one capture encoding at a time.

The parameters of an individual encoding represent the maximum values for certain aspects of the encoding. A particular instantiation into a capture encoding might use lower values than these maximums.

The following tables show the variables for audio and video encoding.

   +--------------+-------------------------------------------------+
   | Name         | Description                                     |
   +--------------+-------------------------------------------------+
   | encodeID     | A unique identifier for the individual encoding |
   | maxBandwidth | Maximum number of bits per second               |
   | maxH264Mbps  | Maximum number of macroblocks per second:       |
   |              | ((width + 15) / 16) * ((height + 15) / 16) *    |
   |              | framesPerSecond                                 |
   | maxWidth     | Video resolution's maximum supported width,     |
   |              | expressed in pixels                             |
   | maxHeight    | Video resolution's maximum supported height,    |
   |              | expressed in pixels                             |
   | maxFrameRate | Maximum supported frame rate                    |
   +--------------+-------------------------------------------------+

           Table 2: Individual Video Encoding Parameters

   +--------------+-----------------------------------+
   | Name         | Description                       |
   +--------------+-----------------------------------+
   | maxBandwidth | Maximum number of bits per second |
   +--------------+-----------------------------------+

           Table 3: Individual Audio Encoding Parameters

7.2. Encoding Group

An encoding group includes a set of one or more individual encodings, plus some parameters that apply to the group as a whole. By grouping multiple individual encodings together, an encoding group describes additional constraints on bandwidth and other parameters for the group. Table 4 shows the parameters and individual encoding sets that are part of an encoding group.

   +-------------------+---------------------------------------------+
   | Name              | Description                                 |
   +-------------------+---------------------------------------------+
   | encodeGroupID     | A unique identifier for the encoding group  |
   | maxGroupBandwidth | Maximum number of bits per second relating  |
   |                   | to all encodings combined                   |
   | maxGroupH264Mbps  | Maximum number of macroblocks per second    |
   |                   | relating to all video encodings combined    |
   | videoEncodings[]  | Set of potential encodings (list of         |
   |                   | encodeIDs)                                  |
   | audioEncodings[]  | Set of potential encodings (list of         |
   |                   | encodeIDs)                                  |
   +-------------------+---------------------------------------------+

                      Table 4: Encoding Group

When the individual encodings in a group are instantiated into capture encodings, each capture encoding has a bandwidth that must be less than or equal to the maxBandwidth for the particular individual encoding. The maxGroupBandwidth parameter gives the additional restriction that the sum of all the individual capture encoding bandwidths must be less than or equal to the maxGroupBandwidth value.

Likewise, the sum of the macroblocks per second of each instantiated encoding in the group must not exceed the maxGroupH264Mbps value.
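The following non-normative sketch (Python; the names mirror the parameters in Tables 2 and 4, and the function itself is purely illustrative) shows how a consumer might check a set of requested capture encodings against these individual and group limits:

   from dataclasses import dataclass
   from typing import Dict, List

   @dataclass
   class IndividualEncoding:
       encode_id: str
       max_bandwidth: int          # maxBandwidth, bits per second
       max_h264_mbps: int = 0      # maxH264Mbps, macroblocks/second (video)

   @dataclass
   class EncodingGroup:
       encode_group_id: str
       max_group_bandwidth: int    # maxGroupBandwidth
       max_group_h264_mbps: int    # maxGroupH264Mbps
       encodings: Dict[str, IndividualEncoding]

   def fits_group(group: EncodingGroup, requested: List[dict]) -> bool:
       """Each requested item describes one capture encoding, e.g.
       {"encode_id": "ENC0", "bandwidth": 4000000, "h264_mbps": 244800}."""
       # An individual encoding can be assigned to only one capture
       # encoding at a time.
       ids = [r["encode_id"] for r in requested]
       if len(ids) != len(set(ids)):
           return False
       # Each capture encoding must stay within its individual maximums.
       for r in requested:
           enc = group.encodings[r["encode_id"]]
           if r["bandwidth"] > enc.max_bandwidth:
               return False
           if r.get("h264_mbps", 0) > enc.max_h264_mbps:
               return False
       # The sums must stay within the group-wide maximums.
       if sum(r["bandwidth"] for r in requested) > group.max_group_bandwidth:
           return False
       return sum(r.get("h264_mbps", 0)
                  for r in requested) <= group.max_group_h264_mbps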
1002 The following diagram illustrates the structure of a media 1003 provider's Encoding Groups and their contents. 1005 ,-------------------------------------------------. 1006 | Media Provider | 1007 | | 1008 | ,--------------------------------------. | 1009 | | ,--------------------------------------. | 1010 | | | ,--------------------------------------. | 1011 | | | | Encoding Group | | 1012 | | | | ,-----------. | | 1013 | | | | | | ,---------. | | 1014 | | | | | | | | ,---------.| | 1015 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1016 | `.| | | | | | `---------'| | 1017 | `.| `-----------' `---------' | | 1018 | `--------------------------------------' | 1019 `-------------------------------------------------' 1021 Figure 1: Encoding Group Structure 1023 A media provider advertises one or more encoding groups. Each 1024 encoding group includes one or more individual encodings. Each 1025 individual encoding can represent a different way of encoding 1026 media. For example one individual encoding may be 1080p60 video, 1027 another could be 720p30, with a third being CIF. 1029 While a typical 3 codec/display system might have one encoding 1030 group per "codec box", there are many possibilities for the number 1031 of encoding groups a provider may be able to offer and for the 1032 encoding values in each encoding group. 1034 There is no requirement for all encodings within an encoding group 1035 to be instantiated at once. 1037 8. Associating Media Captures with Encoding Groups 1039 Every media capture is associated with an encoding group, which is 1040 used to instantiate that media capture into one or more capture 1041 encodings. Each media capture has an encoding group attribute. 1042 The value of this attribute is the encodeGroupID for the encoding 1043 group with which it is associated. More than one media capture 1044 may use the same encoding group. 1046 The maximum number of streams that can result from a particular 1047 encoding group constraint is equal to the number of individual 1048 encodings in the group. The actual number of capture encodings 1049 used at any time may be less than this maximum. Any of the media 1050 captures that use a particular encoding group can be encoded 1051 according to any of the individual encodings in the group. If 1052 there are multiple individual encodings in the group, then the 1053 media consumer can configure the media provider to encode a single 1054 media capture into multiple different capture encodings at the 1055 same time, subject to the Max Capture Encodings constraint, with 1056 each capture encoding following the constraints of a different 1057 individual encoding. 1059 The Encoding Groups MUST allow all the media captures in a 1060 particular capture scene entry to be used simultaneously. 1062 9. Consumer's Choice of Streams to Receive from the Provider 1064 After receiving the provider's advertised media captures and 1065 associated constraints, the consumer must choose which media 1066 captures it wishes to receive, and which individual encodings from 1067 the provider it wants to use to encode the captures. Each media 1068 capture has an encoding group ID attribute which specifies which 1069 individual encodings are available to be used for that media 1070 capture. 1072 For each media capture the consumer wants to receive, it 1073 configures one or more of the encodings in that capture's encoding 1074 group. The consumer does this by telling the provider the 1075 resolution, frame rate, bandwidth, etc. 
when asking for capture 1076 encodings for its chosen captures. Upon receipt of this 1077 configuration command from the consumer, the provider generates a 1078 stream for each such configured capture encoding and sends those 1079 streams to the consumer. 1081 The consumer must have received at least one capture advertisement 1082 from the provider to be able to configure the provider's 1083 generation of media streams. 1085 The consumer is able to change its configuration of the provider's 1086 encodings any number of times during the call, either in response 1087 to a new capture advertisement from the provider or autonomously. 1088 The consumer need not send a new configure message to the provider 1089 when it receives a new capture advertisement from the provider 1090 unless the contents of the new capture advertisement cause the 1091 consumer's current configure message to become invalid. 1093 When choosing which streams to receive from the provider, and the 1094 encoding characteristics of those streams, the consumer needs to 1095 take several things into account: its local preference, 1096 simultaneity restrictions, and encoding limits. 1098 9.1. Local preference 1100 A variety of local factors will influence the consumer's choice of 1101 streams to be received from the provider: 1103 o if the consumer is an endpoint, it is likely that it would 1104 choose, where possible, to receive video and audio captures 1105 that match the number of display devices and audio system it 1106 has 1108 o if the consumer is a middle box such as an MCU, it may choose 1109 to receive loudest speaker streams (in order to perform its own 1110 media composition) and avoid pre-composed video captures 1112 o user choice (for instance, selection of a new layout) may 1113 result in a different set of media captures, or different 1114 encoding characteristics, being required by the consumer 1116 9.2. Physical simultaneity restrictions 1118 There may be physical simultaneity constraints imposed by the 1119 provider that affect the provider's ability to simultaneously send 1120 all of the captures the consumer would wish to receive. For 1121 instance, a middle box such as an MCU, when connected to a multi- 1122 camera room system, might prefer to receive both individual camera 1123 streams of the people present in the room and an overall view of 1124 the room from a single camera. Some endpoint systems might be 1125 able to provide both of these sets of streams simultaneously, 1126 whereas others may not (if the overall room view were produced by 1127 changing the zoom level on the center camera, for instance). 1129 9.3. Encoding and encoding group limits 1131 Each of the provider's encoding groups has limits on bandwidth and 1132 macroblocks per second, and the constituent potential encodings 1133 have limits on the bandwidth, macroblocks per second, video frame 1134 rate, and resolution that can be provided. When choosing the 1135 media captures to be received from a provider, a consumer device 1136 must ensure that the encoding characteristics requested for each 1137 individual media capture fits within the capability of the 1138 encoding it is being configured to use, as well as ensuring that 1139 the combined encoding characteristics for media captures fit 1140 within the capabilities of their associated encoding groups. 
In some cases, this could cause an otherwise "preferred" choice of capture encodings to be passed over in favour of different capture encodings - for instance, if a set of 3 media captures could only be provided at a low resolution, then a 3-screen device could switch to favoring a single, higher quality, capture encoding.

9.4. Message Flow

The following diagram shows the basic flow of messages between a media provider and a media consumer. The usage of the "capture advertisement" and "configure encodings" messages is described above. The consumer also sends its own capability message to the provider, which may contain information about its own capabilities or restrictions.

Diagram for Message Flow

      Media Consumer                          Media Provider
      --------------                          --------------
            |                                       |
            |----- Consumer Capability ------------>|
            |                                       |
            |<---- Capture advertisement -----------|
            |                                       |
            |------ Configure encodings ----------->|
            |                                       |

In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, it is potentially useful for the consumer, at the start of CLUE, to inform the provider of its capabilities. One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand, so that the provider can tailor the capture scene it advertises to the consumer.

TBD - the content of the consumer capability message needs to be better defined. The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

One of the most important characteristics of the Framework is its extensibility. Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop. The standard for interoperability and handling multiple streams must be future-proof. The framework itself is inherently extensible through expanding the data model types. For example:

o Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

o Adding new functionality, such as 3-D, will require additional attributes describing the captures.

o Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

The infrastructure is designed to be extended rather than requiring new infrastructure elements. Extension comes through adding to defined types.

Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

This section shows in more detail some examples of how to use the framework to represent a typical case for telepresence rooms. First an endpoint is illustrated, then an MCU case is shown.

11.1.
Three screen endpoint media provider 1218 Consider an endpoint with the following description: 1220 3 cameras, 3 displays, a 6 person table 1222 o Each video device can provide one capture for each 1/3 section 1223 of the table 1225 o A single capture representing the active speaker can be 1226 provided 1228 o A single capture representing the active speaker with the other 1229 2 captures shown picture in picture within the stream can be 1230 provided 1232 o A capture showing a zoomed out view of all 6 seats in the room 1233 can be provided 1235 The audio and video captures for this endpoint can be described as 1236 follows. 1238 Video Captures: 1240 o VC0- (the camera-left camera stream), encoding group=EG0, 1241 content=main, switched=false 1243 o VC1- (the center camera stream), encoding group=EG1, 1244 content=main, switched=false 1246 o VC2- (the camera-right camera stream), encoding group=EG2, 1247 content=main, switched=false 1249 o VC3- (the loudest panel stream), encoding group=EG1, 1250 content=main, switched=true 1252 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1253 content=main, composed=true, switched=true 1255 o VC5- (the zoomed out view of all people in the room), encoding 1256 group=EG1, content=main, composed=false, switched=false 1258 o VC6- (presentation stream), encoding group=EG1, content=slides, 1259 switched=false 1261 The following diagram is a top view of the room with 3 cameras, 3 1262 displays, and 6 seats. Each camera is capturing 2 people. The 1263 six seats are not all in a straight line. 1265 ,-. D 1266 ( )`--.__ +---+ 1267 `-' / `--.__ | | 1268 ,-. | `-.._ |_-+Camera 2 (VC2) 1269 ( ).' ___..-+-''`+-+ 1270 `-' |_...---'' | | 1271 ,-.c+-..__ +---+ 1272 ( )| ``--..__ | | 1273 `-' | ``+-..|_-+Camera 1 (VC1) 1274 ,-. | __..--'|+-+ 1275 ( )| __..--' | | 1276 `-'b|..--' +---+ 1277 ,-. |``---..___ | | 1278 ( )\ ```--..._|_-+Camera 0 (VC0) 1279 `-' \ _..-''`-+ 1280 ,-. \ __.--'' | | 1281 ( ) |..-'' +---+ 1282 `-' a 1284 The two points labeled b and c are intended to be at the midpoint 1285 between the seating positions, and where the fields of view of the 1286 cameras intersect. 1288 The plane of interest for VC0 is a vertical plane that intersects 1289 points 'a' and 'b'. 1291 The plane of interest for VC1 intersects points 'b' and 'c'. The 1292 plane of interest for VC2 intersects points 'c' and 'd'. 1294 This example uses an area scale of millimeters. 1296 Areas of capture: 1298 bottom left bottom right top left top right 1299 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1300 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1301 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1302 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1303 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1304 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1305 VC6 none 1307 Points of capture: 1308 VC0 (-1678,0,800) 1309 VC1 (0,0,800) 1310 VC2 (1678,0,800) 1311 VC3 none 1312 VC4 none 1313 VC5 (0,0,800) 1314 VC6 none 1316 In this example, the right edge of the VC0 area lines up with the 1317 left edge of the VC1 area. It doesn't have to be this way. There 1318 could be a gap or an overlap. One additional thing to note for 1319 this example is the distance from a to b is equal to the distance 1320 from b to c and the distance from c to d. All these distances are 1321 1346 mm. This is the planar width of each area of capture for VC0, 1322 VC1, and VC2. 
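The framework itself does not define a concrete syntax for conveying this information. Purely for illustration, and assuming a simple hypothetical in-memory representation rather than any defined encoding, a consumer might hold the main camera captures above as follows (coordinate values are those from the example, in millimeters):

      # Illustrative only; the framework does not define this representation.
      VC0 = {"capture_id": "VC0", "encoding_group": "EG0", "content": "main",
             "switched": False,
             "point_of_capture": (-1678, 0, 800),
             "area_of_capture": {"bottom_left":  (-2011, 2850, 0),
                                 "bottom_right": (-673, 3000, 0),
                                 "top_left":     (-2011, 2850, 757),
                                 "top_right":    (-673, 3000, 757)}}

      VC1 = {"capture_id": "VC1", "encoding_group": "EG1", "content": "main",
             "switched": False,
             "point_of_capture": (0, 0, 800),
             "area_of_capture": {"bottom_left":  (-673, 3000, 0),
                                 "bottom_right": (673, 3000, 0),
                                 "top_left":     (-673, 3000, 757),
                                 "top_right":    (673, 3000, 757)}}

      # VC2 is analogous, with encoding group EG2, point of capture
      # (1678, 0, 800), and an area of capture spanning X = 673 to 2011.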
1324 Note that the text in parentheses (e.g. "the camera-left camera 1325 stream") is not part of the model; it is just 1326 explanatory text for this example and is not included in the 1327 model with the media captures and attributes. Also, the 1328 "composed" boolean attribute doesn't say anything about how a 1329 capture is composed, so the media consumer can't tell based on 1330 this attribute that VC4 is composed of a "loudest panel with 1331 PiPs".

1333 Audio Captures:

1335 o AC0 (camera-left), encoding group=EG3, content=main, channel 1336 format=mono

1338 o AC1 (camera-right), encoding group=EG3, content=main, channel 1339 format=mono

1341 o AC2 (center), encoding group=EG3, content=main, channel 1342 format=mono

1344 o AC3 (a simple pre-mixed mono audio stream from the room), 1345 encoding group=EG3, content=main, channel format=mono

1347 o AC4 (the mono audio stream associated with the presentation video), 1348 encoding group=EG3, content=slides, channel format=mono

1350 Areas of capture:

           bottom left      bottom right     top left           top right
      AC0  (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)   (-673,3000,757)
      AC1  ( 673,3000,0)    (2011,2850,0)    ( 673,3000,757)    (2011,3000,757)
      AC2  ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)   ( 673,3000,757)
      AC3  (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)   (2011,3000,757)
      AC4  none

1360 The physical simultaneity information is:

1362 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1364 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1366 This constraint indicates it is not possible to use all the VCs at 1367 the same time. VC5 cannot be used at the same time as VC1 or VC3 1368 or VC4. Also, using every member in a set simultaneously may 1369 not make sense - for example VC3 (loudest) and VC4 (loudest with 1370 PiPs). (In addition, there are encoding constraints that make 1371 choosing all of the VCs in a set impossible. VC1, VC3, VC4, VC5, and 1372 VC6 all use EG1, and EG1 has only 3 ENCs. This constraint shows up 1373 in the encoding groups, not in the simultaneous transmission 1374 sets.)

1376 In this example there are no restrictions on which audio captures 1377 can be sent simultaneously.

1379 Encoding Groups:

1381 This example has three encoding groups associated with the video 1382 captures. Each group can have 3 encodings, but with each 1383 potential encoding having a progressively lower specification. In 1384 this example, 1080p60 transmission is possible (as ENC0 has a 1385 maxMbps value compatible with that) as long as it is the only 1386 active encoding in the group (as maxMbps for the entire encoding 1387 group is also 489600). Significantly, as up to 3 encodings are 1388 available per group, it is possible to transmit some video 1389 captures simultaneously that are not in the same entry in the 1390 capture scene - for example, VC1 and VC3 at the same time.

1392 It is also possible to transmit multiple capture encodings of a 1393 single video capture. For example, VC0 can be encoded using ENC0 1394 and ENC1 at the same time, as long as the encoding parameters 1395 satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1396 1080p30 and one at 720p30.
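The maxH264Mbps values used in this example follow from simple macroblock arithmetic (an H.264 macroblock covers 16x16 pixels). As an illustration only, the following sketch reproduces that arithmetic and checks the 1080p30 plus 720p30 combination mentioned above against the EG0 limits listed in Figure 2 below:

      # Illustrative arithmetic; not part of the framework.
      def h264_mbps(width, height, frame_rate):
          # Macroblocks per second for a given resolution and frame rate.
          return (width // 16) * (height // 16) * frame_rate

      assert h264_mbps(1920, 1088, 60) == 489600  # ENC0 and the EG0 group limit
      assert h264_mbps(1280, 720, 30) == 108000   # ENC1
      assert h264_mbps(960, 544, 30) == 61200     # ENC2

      # VC0 sent as two capture encodings on EG0: one 1080p30, one 720p30.
      combined = h264_mbps(1920, 1088, 30) + h264_mbps(1280, 720, 30)
      assert combined == 352800 and combined <= 489600  # within maxGroupH264Mbps
      # The configured bit rates must similarly sum to no more than
      # maxGroupBandwidth (6000000), for example 4000000 + 2000000.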
      encodeGroupID=EG0, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000
      encodeGroupID=EG1, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000
      encodeGroupID=EG2, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000

1423 Figure 2: Example Encoding Groups for Video

1425 For audio, there are five potential encodings available, so all 1426 five audio captures can be encoded at the same time.

      encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
        encodeID=ENC9, maxBandwidth=64000
        encodeID=ENC10, maxBandwidth=64000
        encodeID=ENC11, maxBandwidth=64000
        encodeID=ENC12, maxBandwidth=64000
        encodeID=ENC13, maxBandwidth=64000

1435 Figure 3: Example Encoding Group for Audio

1437 Capture Scenes:

1439 The following table represents the capture scenes for this 1440 provider. Recall that a capture scene is composed of alternative 1441 capture scene entries covering the same scene. Capture Scene #1 1442 is for the main people captures, and Capture Scene #2 is for 1443 presentation.

1445 Each row in the tables below is a separate entry in the capture scene.

      +------------------+
      | Capture Scene #1 |
      +------------------+
      | VC0, VC1, VC2    |
      | VC3              |
      | VC4              |
      | VC5              |
      | AC0, AC1, AC2    |
      | AC3              |
      +------------------+

      +------------------+
      | Capture Scene #2 |
      +------------------+
      | VC6              |
      | AC4              |
      +------------------+

1465 Different capture scenes are distinct from each other and non- 1466 overlapping. A consumer can choose an entry from each capture 1467 scene. In this case, the three captures VC0, VC1, and VC2 are one 1468 way of representing the video from the endpoint. These three 1469 captures should appear adjacent to each other. 1470 Another way of representing the capture scene is 1471 with the capture VC3, which automatically shows the person who is 1472 talking. Similarly for the VC4 and VC5 alternatives.

1474 As in the video case, the different entries of audio in Capture 1475 Scene #1 represent the "same thing", in that one way to receive 1476 the audio is with the 3 audio captures (AC0, AC1, AC2), and 1477 another way is with the mixed AC3. The Media Consumer can choose 1478 an audio capture entry it is capable of receiving.

1480 The spatial ordering is conveyed by the media capture attributes 1481 area of capture and point of capture.
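As an illustration only (the framework does not prescribe how a consumer implements this), the left-to-right arrangement of the first video entry can be recovered by ordering the captures along the X axis of their spatial attributes:

      # Points of capture from the example above (millimeter scale);
      # illustrative sketch only.
      point_of_capture = {"VC0": (-1678, 0, 800),
                          "VC1": (0, 0, 800),
                          "VC2": (1678, 0, 800)}

      # Sorting by the X coordinate reproduces the camera-left to
      # camera-right arrangement of the example: VC0, VC1, VC2.
      ordered = sorted(point_of_capture, key=lambda c: point_of_capture[c][0])
      print(ordered)   # ['VC0', 'VC1', 'VC2']

The same ordering can be derived from the area of capture coordinates, since in this example the right edge of each area lines up with the left edge of the next.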
1483 A Media Consumer would likely want to choose a capture scene entry 1484 to receive based in part on how many streams it can simultaneously 1485 receive. A consumer that can receive three people streams would 1486 probably prefer to receive the first entry of Capture Scene #1 1487 (VC0, VC1, VC2) and not receive the other entries. A consumer 1488 that can receive only one people stream would probably choose one 1489 of the other entries. 1491 If the consumer can receive a presentation stream too, it would 1492 also choose to receive the only entry from Capture Scene #2 (VC6). 1494 11.2. Encoding Group Example 1496 This is an example of an encoding group to illustrate how it can 1497 express dependencies between encodings. 1499 encodeGroupID=EG0, maxGroupH264Mbps=489600, 1500 maxGroupBandwidth=6000000 1501 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 1502 maxFrameRate=60, 1503 maxH264Mbps=244800, maxBandwidth=4000000 1504 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 1505 maxFrameRate=60, 1506 maxH264Mbps=244800, maxBandwidth=4000000 1507 encodeID=AUDENC0, maxBandwidth=96000 1508 encodeID=AUDENC1, maxBandwidth=96000 1509 encodeID=AUDENC2, maxBandwidth=96000 1511 Here, the encoding group is EG0. It can transmit up to two 1512 1080p30 capture encodings (Mbps for 1080p = 244800), but it is 1513 capable of transmitting a maxFrameRate of 60 frames per second 1514 (fps). To achieve the maximum resolution (1920 x 1088) the frame 1515 rate is limited to 30 fps. However 60 fps can be achieved at a 1516 lower resolution if required by the consumer. Although the 1517 encoding group is capable of transmitting up to 6Mbit/s, no 1518 individual video encoding can exceed 4Mbit/s. 1520 This encoding group also allows up to 3 audio encodings, AUDENC<0- 1521 2>. It is not required that audio and video encodings reside 1522 within the same encoding group, but if so then the group's overall 1523 maxBandwidth value is a limit on the sum of all audio and video 1524 encodings configured by the consumer. A system that does not wish 1525 or need to combine bandwidth limitations in this way should 1526 instead use separate encoding groups for audio and video in order 1527 for the bandwidth limitations on audio and video to not interact. 1529 Audio and video can be expressed in separate encoding groups, as 1530 in this illustration. 1532 encodeGroupID=EG0, maxGroupH264Mbps=489600, 1533 maxGroupBandwidth=6000000 1534 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 1535 maxFrameRate=60, 1536 maxH264Mbps=244800, maxBandwidth=4000000 1537 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 1538 maxFrameRate=60, 1539 maxH264Mbps=244800, maxBandwidth=4000000 1540 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000 1541 encodeID=AUDENC0, maxBandwidth=96000 1542 encodeID=AUDENC1, maxBandwidth=96000 1543 encodeID=AUDENC2, maxBandwidth=96000 1545 11.3. The MCU Case 1547 This section shows how an MCU might express its Capture Scenes, 1548 intending to offer different choices for consumers that can handle 1549 different numbers of streams. A single audio capture stream is 1550 provided for all single and multi-screen configurations that can 1551 be associated (e.g. lip-synced) with any combination of video 1552 captures at the consumer. 
      +--------------------+---------------------------------------------+
      | Capture Scene #1   | note                                        |
      +--------------------+---------------------------------------------+
      | VC0                | video capture for single screen consumer    |
      | VC1, VC2           | video capture for 2 screen consumer         |
      | VC3, VC4, VC5      | video capture for 3 screen consumer         |
      | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
      | AC0                | audio capture representing all participants |
      +--------------------+---------------------------------------------+

1573 If or when a presentation stream becomes active within the 1574 conference, the MCU might re-advertise the available media as:

      +------------------+--------------------------------------+
      | Capture Scene #2 | note                                 |
      +------------------+--------------------------------------+
      | VC10             | video capture for presentation       |
      | AC1              | presentation audio to accompany VC10 |
      +------------------+--------------------------------------+

1583 11.4. Media Consumer Behavior

1585 This section gives an example of how a media consumer might behave 1586 when deciding how to request streams from the three screen 1587 endpoint described in the previous section.

1589 The receive side of a call needs to balance its requirements, 1590 based on the number of screens and speakers, its decoding capabilities 1591 and available bandwidth, and the provider's capabilities in order 1592 to optimally configure the provider's streams. Typically it would 1593 want to receive and decode media from each capture scene 1594 advertised by the provider.

1596 A sane, basic algorithm might be for the consumer to go through 1597 each capture scene in turn and find the collection of video 1598 captures that best matches the number of screens it has (this 1599 might include consideration of screens dedicated to presentation 1600 video display rather than "people" video) and then decide between 1601 alternative entries in the video capture scenes based either on 1602 hard-coded preferences or user choice. Once this choice has been 1603 made, the consumer would then decide how to configure the 1604 provider's encoding groups in order to make best use of the 1605 available network bandwidth and its own decoding capabilities.

1607 11.4.1. One screen consumer

1609 VC3, VC4 and VC5 are all different entries by themselves, not 1610 grouped together in a single entry, so the receiving device should 1611 choose one of those. The choice would come down to 1612 whether to see the greatest number of participants simultaneously 1613 at roughly equal precedence (VC5), a switched view of just the 1614 loudest region (VC3), or a switched view with PiPs (VC4). An 1615 endpoint device with a small amount of knowledge of these 1616 differences could offer a dynamic choice of these options, 1617 in-call, to the user.

1619 11.4.2. Two screen consumer configuring the example

1621 Mixing systems with an even number of screens, "2n", and those 1622 with "2n+1" cameras (and vice versa) is always likely to be the 1623 problematic case. In this instance, the behavior is likely to be 1624 determined by whether a "2 screen" system is really a "2 decoder" 1625 system, i.e., whether only one received stream can be displayed 1626 per screen or whether more than 2 streams can be received and 1627 spread across the available screen area.
To enumerate 3 possible 1628 behaviors here for the 2 screen system when it learns that the far 1629 end is "ideally" expressed via 3 capture streams:

1631 1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as 1632 per the 1 screen consumer case above) and either leave one 1633 screen blank or use it for presentation if or when a 1634 presentation becomes active.

1636 2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2 1637 screens, either with each capture scaled to 2/3 of a 1638 screen and the center capture split across 2 screens or, 1639 as would be necessary if there were large bezels on the 1640 screens, with each stream scaled to 1/2 the screen width 1641 and height and a 4th "blank" panel. This 4th panel 1642 could potentially be used for any presentation that became 1643 active during the call.

1645 3. Receive 3 streams, decode all 3, and use control information 1646 indicating which was the most active to switch between showing 1647 the left and center streams (one per screen) and the center and 1648 right streams.

1650 For an endpoint capable of all 3 methods of working described 1651 above, it might again be appropriate to offer the user the choice 1652 of display mode.

1654 11.4.3. Three screen consumer configuring the example

1656 This is the most straightforward case - the consumer would look to 1657 identify a set of streams to receive that best matches its 1658 available screens, and so VC0 plus VC1 plus VC2 should match 1659 optimally. The spatial ordering would give sufficient information 1660 for the correct video capture to be shown on the correct screen, 1661 and the consumer would need either to divide a single encoding 1662 group's capability by 3 to determine what resolution and frame 1663 rate to configure the provider with, or to configure the individual 1664 video captures' encoding groups with what makes most sense (taking 1665 into account the receive side decode capabilities, overall call 1666 bandwidth, the resolution of the screens, plus any user preferences 1667 such as motion vs. sharpness).

1669 12. Acknowledgements

1671 Mark Gorzynski contributed much to the approach. We want to 1672 thank Stephen Botzko for helpful discussions on audio.

1674 13. IANA Considerations

1676 TBD

1678 14. Security Considerations

1680 TBD

1682 15. Changes Since Last Version

1684 NOTE TO THE RFC-Editor: Please remove this section prior to 1685 publication as an RFC.

1687 Changes from 06 to 07:

1689 1. Ticket #9. Rename Axis of Capture Point attribute to Point on 1690 Line of Capture. Clarify the description of this attribute.

1692 2. Ticket #17. Add "capture encoding" definition. Use this new 1693 term throughout document as appropriate, replacing some usage 1694 of the terms "stream" and "encoding".

1696 3. Ticket #18. Add Max Capture Encodings media capture attribute.

1698 4. Add clarification that different capture scene entries are not 1699 necessarily mutually exclusive.

1701 Changes from 05 to 06:

1703 1. Capture scene description attribute is a list of text strings, 1704 each in a different language, rather than just a single string.

1706 2. Add new Axis of Capture Point attribute.

1708 3. Remove appendices A.1 through A.6.

1710 4. Clarify that the provider must use the same coordinate system 1711 with same scale and origin for all coordinates within the same 1712 capture scene.

1714 Changes from 04 to 05:

1716 1. Clarify limitations of "composed" attribute.

1718 2.
Add new section "capture scene entry attributes" and add the 1719 attribute "scene-switch-policy".

1721 3. Add capture scene description attribute and description 1722 language attribute.

1724 4. Editorial changes to examples section for consistency with the 1725 rest of the document.

1727 Changes from 03 to 04:

1729 1. Remove sentence from overview - "This constitutes a significant 1730 change ..."

1732 2. Clarify a consumer can choose a subset of captures from a 1733 capture scene entry or a simultaneous set (in section "capture 1734 scene" and "consumer's choice...").

1736 3. Reword first paragraph of Media Capture Attributes section.

1738 4. Clarify a stereo audio capture is different from two mono audio 1739 captures (description of audio channel format attribute).

1741 5. Clarify what it means when coordinate information is not 1742 specified for area of capture, point of capture, area of scene.

1744 6. Change the term "producer" to "provider" to be consistent (it 1745 was just in two places).

1747 7. Change name of "purpose" attribute to "content" and refer to 1748 RFC4796 for values.

1750 8. Clarify simultaneous sets are part of a provider advertisement, 1751 and apply across all capture scenes in the advertisement.

1753 9. Remove sentence about lip-sync between all media captures in a 1754 capture scene.

1756 10. Combine the concepts of "capture scene" and "capture set" 1757 into a single concept, using the term "capture scene" to 1758 replace the previous term "capture set", and eliminating the 1759 original separate capture scene concept.

1761 Informative References

1763 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

1766 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

1772 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

1776 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol (SIP)", RFC 4353, February 2006.

1780 [RFC4796] Hautakorpi, J. and G. Camarillo, "The Session Description Protocol (SDP) Content Attribute", RFC 4796, February 2007.

1785 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, January 2008.

1789 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.

1792 [IANA-Lan] IANA, "Language Subtag Registry", <http://www.iana.org/assignments/language-subtag-registry>.

1797 16. Authors' Addresses

1799 Mark Duckworth (editor)
     Polycom
     Andover, MA 01810
     USA

1804 Email: mark.duckworth@polycom.com

1806 Andrew Pepperell
     Silverflare
     Uxbridge, England
     UK

1811 Email: apeppere@gmail.com

1813 Stephan Wenger
     Vidyo, Inc.
     433 Hackensack Ave.
     Hackensack, N.J. 07601
     USA

1819 Email: stewe@stewe.org