CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                        M. Duckworth, Ed.
Expires: May 3, 2012                                             Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                        October 31, 2011

              Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-01.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Framework Features
   5.  Stream Information
     5.1.  Overview of the Model
     5.2.  Media capture -- Audio and Video
     5.3.  Attributes for Media Captures
       5.3.1.  Purpose
       5.3.2.  Composed
       5.3.3.  Audio Channel Format
       5.3.4.  Area of capture
       5.3.5.  Point of capture
       5.3.6.  Auto-switched
     5.4.  Capture Set
     5.5.  Attributes for Capture Sets
       5.5.1.  Area of Scene
       5.5.2.  Area Scale Millimeters
   6.  Choosing Streams
     6.1.  Message Flow
       6.1.1.  Consumer Capability Message
       6.1.2.  Provider Capabilities Announcement
       6.1.3.  Consumer Configure Request
     6.2.  Physical Simultaneity
     6.3.  Encoding Groups
       6.3.1.  Encoding Group Structure
       6.3.2.  Individual Encodes
       6.3.3.  More on Encoding Groups
       6.3.4.  Examples of Encoding Groups
   7.  Extensibility
   8.  Other aspects of the framework
   9.  Using the Framework
     9.1.  The MCU Case
     9.2.  Media Consumer Behavior
       9.2.1.  One screen consumer
       9.2.2.  Two screen consumer configuring the example
       9.2.3.  Three screen consumer configuring the example
   10.  Acknowledgements
   11.  IANA Considerations
   12.  Security Considerations
   13.  Informative References
   Appendix A.  Open Issues
     A.1.  Video layout arrangements and centralized composition
     A.2.  Source is selectable
     A.3.  Media Source Selection
     A.4.  Endpoint requesting many streams from MCU
     A.5.  VAD (voice activity detection) tagging of audio streams
     A.6.  Private Information
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases-00 and to meet
   the requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.  At the
   same time, the highest priority has been given to creating an
   extensible framework, to make it easy to accommodate future
   conferencing functionality as it evolves.

   The purpose of this effort is to make it possible to handle multiple
   streams of media in such a way that a satisfactory user experience
   is possible even when participants are on different vendor
   equipment and when they are using devices with different types of
   communication capabilities.  Information about the relationship of
   media streams must be communicated so that audio/video rendering
   can be done in the best possible manner.  In addition, it is
   necessary to choose which media streams are sent.

   There is no attempt here to dictate to the renderer what it should
   do.  What the renderer does is up to the renderer.

   After the following definitions, a short section introduces key
   concepts.  The body of the text comprises three sections that deal,
   in turn, with stream content, choosing streams, and an
   implementation example.  The media provider and media consumer
   behavior are described in separate sections as well.  Several
   appendices describe topics that are under discussion for adding to
   the document.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-
   right are from the point of view of a person observing the rendered
   media.  They are the opposite of stage-left and stage-right.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   Capture Set: A Capture Set includes Media Captures that all
   represent some aspect of the same Capture Scene.  The items (rows)
   in a Capture Set represent different alternatives for representing
   the same Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encode: A variable with a set of attributes that
   describes the maximum values of a single audio or video capture
   encoding.  The attributes include: maximum bandwidth and, for
   video, maximum macroblocks, maximum width, maximum height, and
   maximum frame rate.  [Edt. These are based on H.264.]

   *Encoding Group: A set of encoding parameters representing a
   device's complete encoding capabilities or a subdivision of them.
   Media stream providers formed of multiple physical units, in each
   of which resides some encoding capability, would typically
   advertise themselves to the remote media stream consumer as being
   formed of multiple encoding groups.  Within each encoding group,
   multiple potential actual encodings are possible, with the sum of
   those encodings' characteristics constrained to being less than or
   equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, spatial location, and mixing parameters of microphones.
   Endpoint characteristics are not specific to individual media
   streams sent by the endpoint.

   Front: the portion of the room closest to the cameras.  Moving
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt. RFC4353 is
   tardy in requiring that media from the mixer be sent to EACH
   participant.  I think we have practical use cases where this is not
   the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture (MC) may be the source of one or more
   Media streams.  A Media Capture may also be constructed from other
   Media streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media
   streams.

   *Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For media captures, stage-left and stage-
   right are the opposite of camera-left and camera-right.  For the
   case of a person facing (and captured by) a camera, stage-left and
   stage-right are from the point of view of that person.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as: media codec, bit
   rate, resolution, profile/level etc.) as well as CLUE specific
   attributes (which could include, for example and depending on the
   solution found, the ID or spatial location of a capture device a
   stream originates from).

   Telepresence: an environment that gives non-co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining
   visual elements from separate sources.

4.  Framework Features

   Two key functions must be accomplished so that multiple media
   streams can be handled in a telepresence conference.  These are:

   o  How to choose which streams the provider should send to the
      consumer

   o  What information needs to be added to the streams to allow a
      rendering of the capture scene

   The framework/model we present here can be understood as specifying
   these two functions.

   Media stream providers and consumers are central to the framework.
   The provider's job is to advertise its capabilities (as described
   here) to the consumer, whose job it is to configure the provider's
   encoding capabilities as described below.  Both providers and
   consumers can send and receive information; that is, we do not have
   one party exclusively as the provider and one as the consumer, but
   all parties have both sending and receiving roles.  Most devices
   function as both a media provider and a media consumer.

   For two devices to communicate bidirectionally, with media flowing
   in both directions, both devices act as both a media provider and a
   media consumer.  The protocol exchange shown later in the "Choosing
   Streams" section happens twice, independently, between the two
   bidirectional devices.

   Both endpoints and MCUs, or more generally "middleboxes", can be
   media providers and consumers.

5.  Stream Information

   This section describes the structure for communicating information
   between providers and consumers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct
   illustrated in the diagram is discussed in the sections below.

   Diagram for Stream Content

                          +---------------+
                          |               |
                          |  Capture Set  |
                          |               |
                          +-------+-------+
                     _..-'        |        ``-._
                 _.-'             |             ``-._
             _.-'                 |                  ``-._
   +----------------+   +----------------+   +----------------+
   | Media Capture  |   | Media Capture  |   | Media Capture  |
   | Audio or Video |   | Audio or Video |   | Audio or Video |
   +----------------+   +----------------+   +----------------+
        .'        `.               `-..__
      .'            `.                   ``-..__
     ,-----.       ,---------.                 ``,----------.
   ,' Encode`.   ,'           `.               ,'Simultaneous`.
  (   Group   ) (  Attributes  )              ( Transmission   )
   `.        ,'  `.           ,'               `.    Sets     ,'
     `-----'       `---------'                   `----------'

5.1.  Overview of the Model

   The basic method of operation is that a provider describes to a
   consumer what streams it has to offer.  It describes them both in
   terms of attributes of the media (e.g. audio and video) captures
   and in terms of the encoding characteristics of the streams for
   these captures.  The consumer then tells the provider which streams
   it wants to receive.  Prior to this exchange, the consumer sends
   information about itself to the provider, which the provider may
   use in determining what to advertise to the consumer.

   A media provider provides media for one or more capture scenes.  As
   defined, a capture scene is the source scene that is captured by
   media devices.  An endpoint is likely to have more than one capture
   scene, for example one for people and one for presentation.  Each
   capture scene is represented by a capture set, which describes all
   the collections of media captures for that scene.  A capture set
   consists of one or more rows of media captures, where each row
   represents a way of capturing the scene.

   A media capture, typically audio or video, is the basic data
   structure, as defined in the definitions and described below in
   Section 5.2.  Media captures have attributes that describe them,
   such as their spatial properties and relationships.  These
   attributes are described in Section 5.3 and Section 5.5.

   Media Captures are also associated with data constructs that
   capture encoding aspects of the streams - that is, simultaneous
   transmission sets and encoding groups, described in Section 6.2 and
   Section 6.3.

   Generally, the provider is capable of sending alternate captures of
   a capture scene - a different number of captures for the scene, or
   captures with differing characteristics like bandwidth or
   resolution.  These are described by the provider as capabilities,
   using the capture set and media capture model mentioned above, and
   chosen by the consumer.  The message exchange to accomplish this is
   described in Section 6.1.

   There are some additional separate aspects of the framework
   mentioned in Section 8.
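
   As an aid to understanding (and not as part of the framework
   itself), the relationship between capture sets, rows, and media
   captures sketched above can be expressed as a small data model.
   The following Python fragment is a minimal sketch; all class and
   field names are illustrative assumptions of this example, not a
   proposed encoding.

      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class MediaCapture:
          # One audio or video capture, e.g. "VC0" or "AC1".
          name: str                 # e.g. "VC0"
          media_type: str           # "video" or "audio"
          attributes: Dict[str, object] = field(default_factory=dict)
          encoding_group: str = ""  # name of the associated encoding group

      @dataclass
      class CaptureSet:
          # Each row is one alternative way of capturing the same scene;
          # captures within a row are spatially related to each other.
          rows: List[List[MediaCapture]] = field(default_factory=list)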

5.2.  Media capture -- Audio and Video

   A media capture, as defined in the definitions, is a fundamental
   concept of the model.  Media can be captured in different ways, for
   example by various arrangements of cameras and microphones.  The
   model uses the terms "video capture" (VC) and "audio capture" (AC)
   to refer to sources of media streams.  To distinguish between
   multiple instances, they are numbered; for example, VC1, VC2, and
   VC3 could refer to three different video captures which can be used
   simultaneously.

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture can also come from synthetically generated sources,
   such as a computer generated audiovisual presentation, or from the
   playback of a recording.  Any media type that can be carried over
   RTP can be represented by a media capture.

   A media capture is described by Attributes and associated with an
   Encoding Group and a Simultaneous Transmission Set.

   Media captures are aggregated into Capture Sets as described below.

5.3.  Attributes for Media Captures

   Media capture attributes describe information about streams and
   their relationships.  [Edt: We do not mean to duplicate SDP; if an
   SDP description can be used, great.]  The attributes of media
   captures refer to static aspects of those captures that can be used
   by the consumer for selecting the captures offered by the provider.

   The mechanism of Attributes makes the framework extensible.
   Although we are defining some attributes now, based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.  In general, the way to extend the solution to handle
   new features is by adding attributes and/or values.

   We describe attributes by variables and their values.  The current
   attributes are listed below and then described.  The variable is
   shown in parentheses, and the values follow after the colon:

   o  (Purpose): main, presentation

   o  (Composed): true, false

   o  (Audio Channel Format): mono, stereo, tbd

   o  (Area of Capture): A set of 'Ranges' describing the relevant
      area being captured by a capture device

   o  (Point of Capture): A 'Point' describing the location of the
      capture device or pseudo-device

   o  (Auto-switched): true, false
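
   Reusing the hypothetical MediaCapture sketch from Section 5.1, a
   provider might populate the attributes of an auto-switched video
   capture as follows.  The dictionary keys mirror the attribute list
   above; their exact spelling is an assumption of this example.

      vc3 = MediaCapture(
          name="VC3",
          media_type="video",
          attributes={
              "purpose": "main",                  # main or presentation
              "composed": False,                  # not a mix of other MCs
              "auto_switched": True,              # e.g. follows the talker
              "area_of_capture": {"x": (0, 99)},  # one Range per dimension
          },
      )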

5.3.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It could be applied to any media type.
   Possible values: main, presentation, others TBD.

   Main:

   The audio or video capture is of one or more people participating
   in a conference (or where they would be if they were there).  It is
   of part or all of the Capture Scene.

   Presentation:

   The capture provides a presentation, e.g., from a connected laptop
   or other input device.

5.3.2.  Composed

   A Boolean variable to indicate whether the MC is a mix or
   composition of other MCs or Streams.  (This could indicate, for
   example, a continuous presence view of multiple images in a grid,
   or a large image with smaller picture-in-picture images in it.
   When applied to an audio capture, it indicates a composition of ACs
   by some mixing algorithm.)

   This attribute is not intended to differentiate between different
   ways of composing or mixing images.  For possible extension of the
   framework, additional attributes could be defined to distinguish
   between different ways of composing or mixing captures, for
   example, different video layout arrangements for composing multiple
   images into one, or different audio mixing algorithms.

5.3.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how
   the meaning of the channels is determined.  It is an enumerated
   variable describing the type of audio channel or channels in the
   Audio Capture.  The possible values of the "channel format"
   attribute are:

   o  mono

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value
   of the "channel format" attribute.

   There can be multiple ACs of a particular type, or even different
   types.  These multiple ACs could each have an area of capture
   attribute to indicate they represent different areas of the capture
   scene.

   If there are multiple audio streams, they might be correlated (that
   is, someone talking might be heard in multiple captures from the
   same room).  Echo cancellation and stream synchronization in
   consumers should take this into account.

   Mono:

   An AC with channel format="mono" has one audio channel.

   Stereo:

   An AC with channel format="stereo" has exactly two audio channels,
   left and right, as part of the same AC.  [Edt: should we mention
   RFC 3551 here?  The channel format may be related to how Audio
   Captures are mapped to RTP streams.  This stereo is not the same as
   the effect produced from two mono ACs, one from the left and one
   from the right.]

5.3.4.  Area of capture

   The area_of_capture attribute is used to describe the relevant area
   that a media capture is "capturing".  By comparing the areas of
   capture for different media captures, a consumer can determine the
   spatial relationships of the captures on the provider so that they
   can be rendered correctly.  The attribute consists of a set of
   'Ranges', one range for each spatial dimension, where each range
   has a Begin and an End coordinate.  It is not necessary to fill out
   all of the dimensions if they are not relevant (i.e., if an
   endpoint's captures only span a single dimension, only the 'x'
   coordinate need be used).  There is no need to pre-define a
   possible range for this coordinate system; a device may choose what
   is most appropriate for describing its captures.  However, it is
   specified that as numbers move from lower to higher, the location
   is going from: camera-left to camera-right (in the case of the 'x'
   dimension), front to back (in the case of the 'y' dimension), or
   low to high (in the case of the 'z' dimension).

5.3.5.  Point of capture

   The point_of_capture attribute can be used to describe the location
   of a capture device or pseudo-device.  If there are multiple
   captures which share the same 'area_of_capture' value, then it is
   useful to know the location from which they are capturing that area
   (e.g., a device which has multiview).  Point of capture is
   expressed as a single {x, y, z} coordinate where, as with
   area_of_capture, only the necessary dimensions need be expressed.

5.3.6.  Auto-switched

   A Boolean variable that may be used for audio and/or video streams.
   In this case the offered AC or VC varies depending on some rule; it
   is auto-switched between possible VCs, or between possible ACs.
   The most common example of this is sending the video capture
   associated with the "loudest" speaker according to an audio
   detection algorithm.
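
   The coordinate conventions above can be made concrete with a short
   sketch.  Continuing with the hypothetical MediaCapture
   representation from Section 5.1, a consumer might order video
   captures from camera-left to camera-right by their 'x' ranges, and
   pair audio with video through overlapping areas; the helper names
   are inventions of this example.

      def x_range(capture):
          # area_of_capture holds one (begin, end) range per dimension.
          return capture.attributes["area_of_capture"]["x"]

      def left_to_right(captures):
          # Lower x values are camera-left, higher values camera-right.
          return sorted(captures, key=lambda c: x_range(c)[0])

      def areas_overlap(a, b):
          # True if the x ranges overlap, e.g. to associate an AC
          # with a VC covering the same part of the scene.
          (a0, a1), (b0, b1) = x_range(a), x_range(b)
          return a0 < b1 and b0 < a1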

5.4.  Capture Set

   A capture set describes the alternative media streams that the
   provider offers to send to the consumer.  As shown in the content
   diagram above, the capture set is an aggregation of all audio and
   video captures for a particular scene that a provider is willing to
   send.

   A provider can have more than one capture set, each representing a
   different scene.  For example, one capture set can be for main
   people audio and video, and another capture set can be for a
   computer generated presentation.

   A provider describes its ability to send alternative media streams
   in the capture set, which lists the media captures in rows, as
   shown below.  Each row of the capture set consists of either a
   single capture or a group of captures.  A group means the
   individual captures in the group are spatially related, with the
   specific ordering of the captures described through the use of
   attributes.

   Here is an example of a simple capture set with three video
   captures and three audio captures:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three VCs together in a row indicate those captures are
   spatially related to each other; similarly for the three ACs in the
   second row.  The ACs and VCs in the same capture set are spatially
   related to each other.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically, multiple Video Captures should
   be rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other; for example, a group of Video Captures
   can be associated with a group of Audio Captures, meaning they
   should be rendered together.

   Media Captures of the same media type are associated with each
   other by grouping them together in a single row of a Capture Set.
   Media Captures of different media types are associated with each
   other by putting them in different rows of the same Capture Set.

   Since all captures have an area_of_capture associated with them, a
   consumer can determine the spatial relationships of captures by
   comparing the locations of their areas of capture with one another.

   Association between audio and video can be made by finding audio
   and video captures which share overlapping areas of capture.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.  Each row of the Capture Set
   contains either a single media capture or one group of media
   captures.

   The following example shows a capture set for an endpoint media
   provider where:

   o  (VC0, VC1, VC2) - camera-left video capture, center video
      capture, camera-right video capture

   o  (VC3) - capture associated with the loudest speaker

   o  (VC4) - zoomed out view of all people in the room

   o  (AC0) - main audio

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other.  These are VC0,
   VC1, and VC2.  VC3 and VC4 are additional alternatives for
   capturing the same room in different ways.  The audio capture is
   included in the same capture set to indicate AC0 is associated with
   those video captures, meaning the audio should be rendered along
   with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio / video streams, or might be a
   presentation supplied by a laptop, perhaps with an accompanying
   audio commentary).  Spatial ordering of media captures is described
   through the use of attributes.

   A media consumer could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   consumer could choose the first video row plus the audio row, while
   a single stream consumer could choose the second or third video row
   plus the audio row.  An MCU consumer might choose to receive
   multiple rows.

   The Simultaneous Transmission Sets and Encoding Groups discussed in
   the next section apply to the media captures listed in capture
   sets.  The Simultaneous Transmission Sets and Encoding Groups MUST
   allow all the Media Captures in a particular row of the capture set
   to be used simultaneously.  But media captures in different rows of
   the capture set might not be able to be used simultaneously.
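
   A consumer's choice of one row per media type can be sketched as
   follows, reusing the hypothetical CaptureSet structure from
   Section 5.1.  The selection policy (the first row whose size the
   consumer can handle) is just one plausible choice, used here for
   illustration.

      def choose_rows(capture_set, max_streams):
          # max_streams: e.g. {"video": 3, "audio": 3} - how many
          # streams of each media type this consumer can receive.
          chosen = {}
          for row in capture_set.rows:
              mtype = row[0].media_type
              if mtype not in chosen and len(row) <= max_streams[mtype]:
                  chosen[mtype] = row
          return chosen  # e.g. {"video": [VC0, VC1, VC2], "audio": [AC0]}

   For the example above, a consumer with max_streams={"video": 1,
   "audio": 1} would skip the (VC0, VC1, VC2) row and select the (VC3)
   row instead.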

5.5.  Attributes for Capture Sets

   These are attributes that can be applied to a capture set:

   o  (Area of Scene): A set of 'Ranges' describing the area of the
      entire capture scene

   o  (Area scale): true, false - indicating if area numbers are in
      millimeters

5.5.1.  Area of Scene

   The area of scene attribute for a capture set has the same format
   as the area of capture attribute for a media capture.  The area of
   scene is for the entire scene, which is captured by the one or more
   media captures in the capture set rows.

5.5.2.  Area Scale Millimeters

   An optional Boolean variable indicating if the numbers used for
   area of scene, area of capture and point of capture are in terms of
   millimeters.  If this attribute is true, then the x, y, z numbers
   represent millimeters.  If this attribute is false, then there is
   no physical scale.  The default value is true.

   This attribute applies to all the MCs that are part of the capture
   set.

6.  Choosing Streams

   This section describes the process of choosing which streams the
   provider sends to the consumer.  In order for appropriate streams
   to be sent from providers to consumers, certain characteristics of
   the multiple streams must be understood by both providers and
   consumers.  Two separate aspects of streams suffice to describe the
   necessary information to be shared by providers and consumers.  The
   first aspect we call "physical simultaneity" and the other aspect
   we refer to as "encoding group".  These are described in the
   following sections, after the message flow is discussed.

6.1.  Message Flow

   The following diagram shows the flow of messages between a media
   provider and a media consumer.  The provider sends information
   about its capabilities (as specified in this section), then the
   consumer chooses which streams it wants, which we refer to as
   "configure".  The consumer sends its own capability message to the
   provider, which may contain information about its own capabilities
   or restrictions, in which case the provider might tailor its
   announcements to the consumer.

   Diagram for Message Flow

      Media Consumer                           Media Provider
      --------------                           --------------
            |                                        |
            |----- Consumer Capability ------------->|
            |                                        |
            |                                        |
            |<---- Capabilities (announce) ----------|
            |                                        |
            |                                        |
            |------ Configure (request) ------------>|
            |                                        |

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A provider can advertise a new
   list of captures at any time.  Both the media provider and media
   consumer can send "their messages" (i.e., capture set
   announcements, stream configurations) any number of times during a
   call, and the other end is always required to act on any new
   information received (e.g., stopping streams it had previously
   configured that are no longer valid).

   These messages do not always have to occur as a complete three-
   message exchange.  A provider can send a new capabilities announce
   message at any time, without first receiving a new consumer
   capability message.  Similarly, a consumer can send a new configure
   request at any time, to change what it wants to receive.  The new
   configure request must be compatible with the most recently
   received capabilities announce message.
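
   The information carried by the three messages in the flow above
   (and detailed in the following subsections) might be modeled as
   follows.  This is a sketch of message contents only, assuming the
   hypothetical MediaCapture and CaptureSet classes from Section 5.1;
   it says nothing about the actual wire format, which this framework
   leaves to the protocol specification.

      from dataclasses import dataclass
      from typing import Dict, List

      @dataclass
      class ConsumerCapability:
          # What the consumer can understand (contents TBD; see 6.1.1).
          understood_attributes: List[str]

      @dataclass
      class ProviderAnnouncement:
          # The four lists named in Section 6.1.2.
          captures: List["MediaCapture"]
          capture_sets: List["CaptureSet"]
          simultaneous_sets: List[List[str]]  # sets of capture names
          encoding_groups: Dict[str, dict]

      @dataclass
      class ConfigureRequest:
          # Per-stream choices; see Section 6.1.3 for expected fields.
          streams: List[dict]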

6.1.1.  Consumer Capability Message

   In order for a maximally-capable provider to be able to advertise a
   manageable number of video captures to a consumer, there is a
   potential use for the consumer being able, at the start of a CLUE
   session, to inform the provider of its capabilities.  One example
   here would be the video capture attribute set - a consumer could
   tell the provider the complete set of video capture attributes it
   is able to understand, and so the provider would be able to tailor
   the capture set it advertises to the consumer.

   TBD - the content of this message needs to be better defined.  The
   authors believe there is a need for this message, but have not
   worked out the details yet.

6.1.2.  Provider Capabilities Announcement

   The provider capabilities announce message includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of Simultaneous Transmission Sets

   o  the list of the encoding groups

6.1.3.  Consumer Configure Request

   After receiving a set of video capture information from a provider
   and making its choice of what media streams to receive, based on
   the consumer's own capabilities and any provider-side simultaneity
   restrictions, the consumer needs to essentially configure the
   provider to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups and the potential encoders within those groups that
   the consumer wishes to be active (this may well be a subset of the
   complete set available).  For each such encoder within an encoding
   group, the consumer would specify the video capture (i.e., VC as
   described above) along with the specifics of the video encoding
   required, i.e., width, height, frame rate and bit rate.  At this
   stage, the consumer would also provide RTP demultiplexing
   information as required to distinguish each stream from the others
   being configured by the same mechanism.
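
   As a concrete illustration, a configure request choosing a single
   720p30 stream of a capture VC0 from encoder ENC0 of group EG0 might
   carry the following information.  The field names and the RTP
   demultiplexing details are assumptions of this sketch.

      configure_request = {
          "streams": [
              {
                  "encoding_group": "EG0",
                  "encoder": "ENC0",
                  "capture": "VC0",        # the chosen video capture
                  "width": 1280,
                  "height": 720,
                  "frame_rate": 30,
                  "bandwidth": 1500000,    # bits per second
                  # Demultiplexing information to tell this stream
                  # apart from the others configured the same way:
                  "rtp": {"payload_type": 96, "ssrc": 0x1234ABCD},
              },
          ],
      }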

6.2.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to the fact that a device
   may not be able to be used in different ways at the same time.
   This shapes the way that offers are made from the provider.  The
   offers are made so that the consumer will choose one of several
   possible usages of the device.  This type of constraint is
   expressed in Simultaneous Transmission Sets.  This is easier to
   show in an example.

   Consider the example of a room system where there are 3 cameras,
   each of which can send a separate capture covering 2 persons each -
   VC0, VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes
   at the same time - it has to either show the space where 2
   participants sit or the whole 6 seats.  We refer to this as a
   physical device simultaneity constraint.

   The following illustration shows 3 cameras with 4 video streams.
   The middle camera can be used as main video zoomed in on 2 people,
   or it could be used in zoomed out mode to capture the whole
   endpoint.  The idea here is that the middle camera cannot be used
   for both zoomed in and zoomed out captures simultaneously.  This is
   a constraint imposed by the physical limitations of the devices.

   Diagram for Simultaneity

      `-.  +--------+   VC2
      .-'  |Camera 3|---------->
           +--------+
                          VC3
                      ---------->
      `-.  +--------+  /
      .-'  |Camera 2|<
           +--------+  \    VC1
                      ---------->

      `-.  +--------+   VC0
      .-'  |Camera 1|---------->
           +--------+

   VC0 - video zoomed in on 2 people   VC2 - video zoomed in on 2 people
   VC1 - video zoomed in on 2 people   VC3 - video zoomed out on 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not make sense to do so.

   In this example the two simultaneous sets are:

      {VC0, VC1, VC2}

      {VC0, VC3, VC2}

   In this example, either VC0, VC1 and VC2 can be sent, or VC0, VC3
   and VC2.  Only one set can be transmitted at a time.  These are
   physical capabilities describing what can physically be sent at the
   same time, not what might make sense to send.  For example, in the
   second set both VC0 and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send a list of its Simultaneous
   Transmission Sets to the consumer, along with the Capture Sets and
   Encoding Groups.
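
   A consumer-side check against this constraint can be sketched in a
   few lines.  Representing each simultaneous transmission set as a
   set of capture names is an assumption of this example.

      def allowed_simultaneously(wanted, simultaneous_sets):
          # wanted: the capture names the consumer wishes to receive.
          # The choice is valid if some simultaneous transmission set
          # contains all of them.
          return any(wanted <= s for s in simultaneous_sets)

      sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
      assert allowed_simultaneously({"VC0", "VC3"}, sets)
      # VC1 and VC3 both need the middle camera, so this is rejected:
      assert not allowed_simultaneously({"VC1", "VC3"}, sets)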

6.3.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   providers and consumers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible audio and video
   streams which can be sent.  Just as constraints are imposed on the
   multiple streams by physical limitations, there are also
   constraints due to encoding limitations.  These are described by
   four variables that make up an Encoding Group, as shown in the
   following table:

   Table: Encoding Group

   +----------------+------------------------------------------------+
   | Name           | Description                                    |
   +----------------+------------------------------------------------+
   | maxBandwidth   | Maximum number of bits per second relating to  |
   |                | all encodes combined                           |
   | maxVideoMbps   | Maximum number of macroblocks per second       |
   |                | relating to all video encodes combined:        |
   |                | ((width + 15) / 16) * ((height + 15) / 16) *   |
   |                | framesPerSecond                                |
   | videoEncodes[] | Set of potential video encodes that can be     |
   |                | generated                                      |
   | audioEncodes[] | Set of potential audio encodes that can be     |
   |                | generated                                      |
   +----------------+------------------------------------------------+

   An encoding group is the basic concept for describing encoding
   capability.  As shown in the table, it has overall maxMbps and
   bandwidth limits, as well as comprising sets of individual encodes,
   which are described in more detail below.

   Each media stream provider includes one or more encoding groups.
   There may be multiple encoding groups per endpoint.  For example,
   each video capture device might have an associated encoding group
   that describes the video streams that can result from that capture.

   A remote receiver (i.e., stream consumer) configures some or all of
   the specific encodings within one or more groups in order to
   provide it with media streams to decode.

6.3.1.  Encoding Group Structure

   This section shows more detail on the media stream provider's
   encoding group structure.  The encoding group includes several
   individual encodes, each of which can have different encoding
   values.  For example, one may be high definition video 1080p60,
   another 720p30, and a third CIF.  While a typical 3 codec/display
   system would have one encoding group per "box", there are many
   possibilities for the number of encoding groups a provider may be
   able to offer and for what encoding values there are in each
   encoding group.

   Diagram for Encoding Group Structure

   ,-------------------------------------------------.
   |                 Media Provider                  |
   |                                                 |
   |  ,--------------------------------------.      |
   |  | ,--------------------------------------.    |
   |  | | ,--------------------------------------.  |
   |  | | |            Encoding Group            |  |
   |  | | | ,-----------.                        |  |
   |  | | | |           | ,---------.            |  |
   |  | | | |           | |         | ,---------.|  |
   |  | | | |  Encode1  | | Encode2 | | Encode3 ||  |
   |  `.| | |           | |         | `---------'|  |
   |    `.| `-----------' `---------'            |  |
   |      `--------------------------------------'  |
   `-------------------------------------------------'

   As shown in the diagram, each encoding group has multiple potential
   individual encodes within it.  Not all encodes are equally capable;
   the stream consumer chooses the encodes it wants by configuring the
   provider to send it what it wants to receive.

   Some encoding endpoints are fixed, others are flexible, e.g., a
   single box with multiple DSPs where the resources are shared.

6.3.2.  Individual Encodes

   An encoding group is associated with a media capture through the
   individual encodes; that is, an audio or video capture is encoded
   in one or more individual encodes, as described by the
   videoEncodes[] and audioEncodes[] variables.

   The following table shows the variables for a Video Encode.  (There
   is a similar table for audio.)

   Table: Individual Video Encode

   +--------------+--------------------------------------------------+
   | Name         | Description                                      |
   +--------------+--------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a  |
   |              | single video encoding                            |
   | maxMbps      | Maximum number of macroblocks per second         |
   |              | relating to a single video encoding:             |
   |              | ((width + 15) / 16) * ((height + 15) / 16) *     |
   |              | framesPerSecond                                  |
   | maxWidth     | Video resolution's maximum supported width,      |
   |              | expressed in pixels                              |
   | maxHeight    | Video resolution's maximum supported height,     |
   |              | expressed in pixels                              |
   | maxFrameRate | Maximum supported frame rate                     |
   +--------------+--------------------------------------------------+

   A remote receiver configures (i.e., instantiates) some or all of
   the specific encodes such that:

   o  The configuration of each active ENC does not exceed that
      individual encode's maxWidth, maxHeight, or maxFrameRate.

   o  The total bandwidth of the configured ENCs does not exceed the
      maxBandwidth of the encoding group.

   o  The sum of the macroblocks per second of each configured encode
      does not exceed the maxMbps attribute of the encoding group.

   An equivalent set of attributes holds for audio encodes within an
   audio encoding group.
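
   The three rules above can be expressed as a short validation
   sketch.  The dictionary layout for groups and encodes, and the
   representation of a requested configuration as (width, height,
   frames per second, bandwidth) tuples, are assumptions of this
   example; the macroblocks-per-second formula is the one given in the
   tables above.

      def mbps(width, height, fps):
          # ((width + 15) / 16) * ((height + 15) / 16) * framesPerSecond
          return ((width + 15) // 16) * ((height + 15) // 16) * fps

      def valid_configuration(group, configs):
          # group:   {"maxMbps": ..., "maxBandwidth": ..., "encodes": [...]}
          # configs: one (width, height, fps, bandwidth) tuple per
          #          active encode, paired with the group's encodes.
          for enc, (w, h, fps, bw) in zip(group["encodes"], configs):
              if (w > enc["maxWidth"] or h > enc["maxHeight"]
                      or fps > enc["maxFrameRate"]
                      or mbps(w, h, fps) > enc["maxMbps"]
                      or bw > enc["maxBandwidth"]):
                  return False
          total_bw = sum(bw for (_, _, _, bw) in configs)
          total_mbps = sum(mbps(w, h, fps) for (w, h, fps, _) in configs)
          return (total_bw <= group["maxBandwidth"]
                  and total_mbps <= group["maxMbps"])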

6.3.3.  More on Encoding Groups

   An encoding group EG comprises one or more potential encodings ENC.
   For example:

      EG0: maxMbps=489600, maxBandwidth=6000000
          VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          AUDIO_ENC0: maxBandwidth=96000
          AUDIO_ENC1: maxBandwidth=96000
          AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the maxMbps value for 1080p30 is 244800: a 1920 x 1088
   image contains 120 x 68 = 8160 macroblocks, and 8160 * 30 frames
   per second = 244800), but it is capable of transmitting a
   maxFrameRate of 60 frames per second (fps).  To achieve the maximum
   resolution (1920 x 1088) the frame rate is limited to 30 fps.
   However, 60 fps can be achieved at a lower resolution if required
   by the consumer.  Although the encoding group is capable of
   transmitting up to 6 Mbit/s, no individual video encoding can
   exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDIO_ENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the consumer.  A system
   that does not wish or need to combine bandwidth limitations in this
   way should instead use separate encoding groups for audio and
   video, so that the bandwidth limitations on audio and video do not
   interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration:

      VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
          VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
          VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxMbps=244800, maxBandwidth=4000000
      AUDIO_EG0: maxBandwidth=500000
          AUDIO_ENC0: maxBandwidth=96000
          AUDIO_ENC1: maxBandwidth=96000
          AUDIO_ENC2: maxBandwidth=96000
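
   Using the hypothetical valid_configuration sketch from
   Section 6.3.2, the EG0 example can be checked numerically: two
   1080p30 video encodings fit within the group, while two 1080p60
   encodings would exceed both the per-encode and group-wide maxMbps
   limits.

      eg0 = {"maxMbps": 489600, "maxBandwidth": 6000000,
             "encodes": [{"maxWidth": 1920, "maxHeight": 1088,
                          "maxFrameRate": 60, "maxMbps": 244800,
                          "maxBandwidth": 4000000}] * 2}

      # 2 x 244800 = 489600 macroblocks/s, exactly the group limit:
      assert valid_configuration(eg0, [(1920, 1088, 30, 3000000)] * 2)
      # 1080p60 needs 489600 macroblocks/s per encode - rejected:
      assert not valid_configuration(eg0, [(1920, 1088, 60, 3000000)] * 2)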

6.3.4.  Examples of Encoding Groups

   This section illustrates further examples of encoding groups.  In
   the first example, the capability parameters are the same across
   ENCs.  In the second example, they vary.

   An endpoint that has 3 similar video capture devices would
   advertise 3 encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

      EG0: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000
          ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=244800, maxBandwidth=4000000

   A remote consumer configures some or all of the specific encodings
   such that:

   o  The configured parameter values of each active ENC do not exceed
      that encoding's maxWidth, maxHeight, or maxFrameRate

   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of each
      configured encoding does not exceed the maxMbps of the encoding
      group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the consumer.

   Depending on the provider's encoding methods, the consumer may be
   able to request fixed encode values, or choose encode values in a
   range less than the maximum offered.  We will discuss consumer
   behavior in more detail in a section below.

6.3.4.1.  Sample video encoding group specification #2

   This example specification expresses a system whose encoding groups
   can each transmit up to 3 encodings, but with each potential
   encoding having a progressively lower specification.  In this
   example, 1080p60 transmission is possible (as ENC0 has a maxMbps
   value compatible with that) as long as it is the only active
   encoding (as maxMbps for the entire encoding group is also 489600).
   Significantly, as up to 3 encodings are available per group, some
   sets of captures which couldn't be transmitted simultaneously in
   example #1 above now become possible, for instance VC1, VC3 and VC6
   together (these captures are defined in the endpoint example of
   Section 9, where all three use the same encoding group).  In common
   with example #1, all encoding groups have an identical
   specification.

      EG0: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
          ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                maxMbps=489600, maxBandwidth=4000000
          ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                maxMbps=108000, maxBandwidth=4000000
          ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                maxMbps=61200, maxBandwidth=4000000

7.  Extensibility

   One of the most important characteristics of the Framework is its
   extensibility.  Telepresence is a relatively new industry and,
   while we can foresee certain directions, we do not know everything
   about how it will develop.  The standard for interoperability and
   handling multiple streams must be future-proof.

   The framework itself is inherently extensible through expanding the
   data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by
      defining additional types of captures in addition to audio and
      video.

   o  Adding new functionality, such as 3-D, will require additional
      attributes describing the captures, such as x, y, z coordinates.

   o  Adding new codecs, such as H.265, can be accomplished by
      defining new encoding variables.

   The infrastructure is designed to be extended rather than requiring
   new infrastructure elements.  Extension comes through adding to
   defined types.

   Assuming the implementation is in something like XML, adding data
   elements and attributes makes extensibility easy.
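
   For instance, if the eventual encoding were XML, a new attribute
   could be carried as one more child element without disturbing
   existing ones.  The element and attribute names below are purely
   illustrative assumptions of this sketch, not a proposed schema.

      <mediaCapture name="VC1" type="video" encodingGroup="EG1">
        <purpose>main</purpose>
        <areaOfCapture xBegin="33" xEnd="66"/>
        <!-- a future extension could simply add a new element: -->
        <pointOfCapture x="50" y="0" z="0"/>
      </mediaCapture>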

8.  Other aspects of the framework

   A few other aspects of the framework are separate from the provider
   capture set model.  These include:

   o  Voice activity detection

   o  Indications about stream switching/composing, and information
      about the source media captures

   o  Associating captures/streams with a conference roster

   o  Mapping the model to specific protocol messages

   [Edt. much of this is work in progress and will need to be
   updated.]

9.  Using the Framework

   This section shows in more detail how to use the framework to
   represent a typical case for telepresence rooms.  First an endpoint
   is illustrated, then an MCU case is shown.

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The Encoding Group specifications can be found above in
   Section 6.3.4.1, Sample video encoding group specification #2.

   Video Captures:

   o  VC0 - (the camera-left camera stream), encoding group: EG0,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=33}

   o  VC1 - (the center camera stream), encoding group: EG1,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=33, xEnd=66}

   o  VC2 - (the camera-right camera stream), encoding group: EG2,
      attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=66, xEnd=99}

   o  VC3 - (the loudest panel stream), encoding group: EG1,
      attributes: purpose=main; auto-switched:yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC4 - (the loudest panel stream with PiPs), encoding group: EG1,
      attributes: purpose=main; composed=true; auto-switched:yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC5 - (the zoomed out view of all people in the room), encoding
      group: EG1, attributes: purpose=main; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC6 - (presentation stream), encoding group: EG1, attributes:
      purpose=presentation; auto-switched:no;
      area_of_capture={xBegin=0, xEnd=99}

   Summary of video captures: 3 codecs; the center one is used for the
   center camera stream, the presentation stream, the auto-switched
   streams, and the zoomed views.

   Note that the text in parentheses (e.g. "the camera-left camera
   stream") is not explicitly part of the model; it is just
   explanatory text for this example, and is not included in the model
   with the media captures and attributes.

   [Edt. It is arbitrary that for this example the alternative views
   are on EG1 - they could have been spread out; it was not a
   necessary choice.]

   Audio Captures:

   o  AC0 (camera-left), attributes: purpose=main; channel
      format=mono; area_of_capture={xBegin=0, xEnd=33}

   o  AC1 (camera-right), attributes: purpose=main; channel
      format=mono; area_of_capture={xBegin=66, xEnd=99}

   o  AC2 (center), attributes: purpose=main; channel format=mono;
      area_of_capture={xBegin=33, xEnd=66}

   o  AC3 (a simple pre-mixed audio stream from the room, mono),
      attributes: purpose=main; channel format=mono; mixed=true;
      area_of_capture={xBegin=0, xEnd=99}

   o  AC4 (audio stream associated with the presentation video, mono),
      attributes: purpose=presentation; channel format=mono;
      area_of_capture={xBegin=0, xEnd=99}

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   It is possible to select any or all of the captures in a
   simultaneous transmission set.  This is strictly what is possible
   from the devices.  However, using every member in the set
   simultaneously may not make sense - for example VC3 (loudest) and
   VC4 (loudest with PiP).  (In addition, there are encoding
   constraints that make choosing all of the VCs in a set impossible:
   VC1, VC3, VC4, VC5 and VC6 all use EG1, and EG1 has only 3 ENCs.
   This constraint shows up in the capture list and encoding groups,
   not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following table represents the capture sets for this provider.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.
   The following tables represent the capture sets for this provider.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

         +----------------+
         | Capture Set #1 |
         +----------------+
         | VC0, VC1, VC2  |
         | VC3            |
         | VC4            |
         | VC5            |
         | AC0, AC1, AC2  |
         | AC3            |
         +----------------+

         +----------------+
         | Capture Set #2 |
         +----------------+
         | VC6            |
         | AC4            |
         +----------------+

   Different capture sets are distinct from each other and
   non-overlapping.  A consumer chooses a capture row from each
   capture set.  In this case the three captures VC0, VC1 and VC2 are
   one way of representing the video from the endpoint; these three
   captures should appear adjacent to each other.  Alternatively,
   another way of representing the capture scene is with the capture
   VC3, which automatically shows the person who is talking.
   Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the 3 linear-position audio captures (AC0, AC1, AC2), and
   another way is with the single-channel monaural capture AC3.  The
   media consumer would choose the one audio capture row it is capable
   of receiving.

   The spatial ordering is conveyed by the media capture attributes
   area of capture and point of capture.

   The consumer finds a row in each capture set that it wants, and
   configures the streams according to the encoding group for that
   row.

   A media consumer would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   consumer that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A consumer that can receive only
   one people stream would probably choose one of the other rows.

   If the consumer can also receive a presentation stream, it would in
   addition choose to receive the only row from Capture Set #2 (VC6).

9.1. The MCU Case

   This section shows how an MCU might express its capture sets,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single- and multi-screen configurations, and it
   can be associated (e.g. lip-synced) with any combination of video
   captures at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer   |
   | VC1, VC2           | video capture for 2 screen consumer        |
   | VC3, VC4, VC5      | video capture for 3 screen consumer        |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

      +----------------+--------------------------------------+
      | Capture Set #2 | note                                 |
      +----------------+--------------------------------------+
      | VC10           | video capture for presentation       |
      | AC1            | presentation audio to accompany VC10 |
      +----------------+--------------------------------------+

9.2. Media Consumer Behavior

   [Edt. Should this be moved to an appendix?]

   The receive side of a call needs to balance its own requirements
   (based on its number of screens and speakers), its decoding
   capabilities, the available bandwidth, and the provider's
   capabilities in order to configure the provider's streams
   optimally.  Typically it would want to receive and decode media
   from each capture set advertised by the provider.

   A sane, basic algorithm might be for the consumer to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encode
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.  A non-normative sketch of such
   an algorithm follows.
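   As an illustration only, such an algorithm might look roughly like
   the following; the data structures and function names are
   hypothetical and not defined by this framework.

      def choose_rows(capture_sets, num_screens, prefer=None):
          # Pick one row of video captures from each capture set,
          # favoring the largest row that still fits the number of
          # screens.  'prefer' is an optional callable (hard-coded
          # preference or user choice) used to break ties between
          # rows of equal size.
          chosen = []
          for cset in capture_sets:
              fitting = sorted(
                  (row for row in cset if len(row) <= num_screens),
                  key=len, reverse=True)
              if not fitting:
                  # Nothing fits; fall back to the smallest row.
                  fitting = sorted(cset, key=len)
              best = [r for r in fitting
                      if len(r) == len(fitting[0])]
              chosen.append(prefer(best) if prefer else best[0])
          return chosen

      # Video rows of Capture Sets #1 and #2 from the endpoint
      # example:
      set1 = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]
      set2 = [["VC6"]]
      print(choose_rows([set1, set2], num_screens=3))
      # -> [['VC0', 'VC1', 'VC2'], ['VC6']]
      print(choose_rows([set1, set2], num_screens=1))
      # -> [['VC3'], ['VC6']]  (VC4 or VC5 via 'prefer')

   Having chosen its rows, the consumer would still need to configure
   encodes within the relevant encoding groups, as the following
   subsections discuss.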
9.2.1. One screen consumer

   VC3, VC4 and VC5 are each on a row by themselves, not in a group,
   so the receiving device should choose one of them.  The choice
   comes down to whether to see the greatest number of participants
   simultaneously at roughly equal precedence (VC5), a switched view
   of just the loudest region (VC3), or a switched view with PiPs
   (VC4).  An endpoint device with a small amount of knowledge of
   these differences could offer a dynamic choice of these options,
   in-call, to the user.

9.2.2. Two screen consumer configuring the example

   Mixing systems with an even number of screens ("2n") and systems
   with "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5,
       as per the 1 screen consumer case above), and either leave one
       screen blank or use it for presentation if / when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens: either with each capture scaled to 2/3 of a screen and
       the centre capture split across the 2 screens, or, as would be
       necessary if there were large bezels on the screens, with each
       stream scaled to 1/2 the screen width and height and a 4th
       "blank" panel.  This 4th panel could potentially be used for
       any presentation that became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, it might again be appropriate to offer the user the choice
   of display mode.

9.2.3. Three screen consumer configuring the example

   This is the most straightforward case: the consumer would look to
   identify a set of streams to receive that best matches its
   available screens, so VC0 plus VC1 plus VC2 should match optimally.
   The spatial ordering gives sufficient information for the correct
   video capture to be shown on the correct screen.  The consumer
   would then either divide a single encode group's capability by 3 to
   determine what resolution and frame rate to request from the
   provider, or configure the individual video captures' encode groups
   with whatever makes most sense (taking into account the receive-
   side decode capabilities, the overall call bandwidth, the
   resolution of the screens, plus any user preferences such as motion
   vs. sharpness).
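   As a non-normative worked example of the "divide by 3" arithmetic,
   relevant when all three captures must come from a single encoding
   group, and assuming (as in H.264) that maxMbps counts macroblocks
   per second:

      MB = 16  # macroblock dimension in pixels (H.264-style
               # assumption for this sketch)

      def macroblocks_per_second(width, height, fps):
          # Width and height are rounded up to whole macroblocks.
          return -(-width // MB) * -(-height // MB) * fps

      group_max_mbps = 489600        # e.g. EG1 in the sample above
      budget = group_max_mbps // 3   # per-capture share: 163200

      assert macroblocks_per_second(1280, 720, 30) == 108000   # fits
      assert macroblocks_per_second(1920, 1088, 30) == 244800  # over

   Under that budget the consumer might configure each of the three
   captures at 1280x720, 30 frames per second, rather than requesting
   full 1920x1088.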
10. Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

11. IANA Considerations

   TBD

12. Security Considerations

   TBD

13. Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies",
              RFC 5117, January 2008.

Appendix A. Open Issues

A.1. Video layout arrangements and centralized composition

   In the context of a conference with a central MCU, there has been
   discussion about a consumer requesting the provider to use a
   certain type of layout arrangement or perform a certain composition
   algorithm, such as combining some number of most recent talkers, or
   producing a video layout using a 2x2 grid or 1 large cell with 5
   smaller cells around it.  The current framework does not address
   this.  It is not yet clear whether this topic belongs in this
   framework, in a different part of CLUE, or outside of CLUE
   altogether.

A.2. Source is selectable

   A Boolean variable.  True indicates the media consumer can request
   that a particular media source be mapped to a media capture.
   Default is false.

   TBD - how does the consumer make the request for a particular
   source?  How does the consumer know what is available?  We need to
   explain better how multiple media captures differ from a single
   media capture with a choice of sources, and when each concept
   should be used.

A.3. Media Source Selection

   The use cases include a case where the person at a receiving
   endpoint can request to receive media from a particular other
   endpoint; for example, in a multipoint call, to request the video
   from a certain section of a certain room, whether or not people
   there are talking.

   TBD - this framework should address this case.  It may need a
   roster list of rooms or people in the conference, with a mechanism
   to select from the roster and associate the selection with media
   captures.  This is different from selecting a particular media
   capture from a capture set, and the mechanism to do it will
   probably need to be different from selecting media captures based
   on capture sets and attributes.

A.4. Endpoint requesting many streams from MCU

   TBD - how to do VC selection for a system where the endpoint media
   consumers want to receive many streams and do their own
   composition, rather than having the MCU do transcoding and
   composing.  An example is a 3 screen consumer that wants 3 large
   loudest-speaker streams, plus a number of small ones to render as
   PiPs.  How the small ones are chosen is open; they could
   potentially be selected by either the endpoint or the MCU.  There
   are other, more complicated examples as well.  Is the current
   framework adequate to support this?

A.5. VAD (voice activity detection) tagging of audio streams

   TBD - do we want VAD to be mandatory?  All audio streams
   originating from a media provider would then be tagged with VAD
   information.  This tagging would include an overall energy value
   for the stream plus information on which sections of the capture
   scene are "active".

   Each audio stream which forms a constituent of a row within a
   capture set should include this tagging, with the energy value
   within it calculated using a fixed, consistent algorithm.

   When a system determines the most active area of a capture scene
   (either "loudest", or determined by other means such as a button
   press), it should convey that information to the corresponding
   media stream consumer via any audio streams being sent within that
   capture set.  Specifically, there should be a list of active linear
   positions and their VAD characteristics within the audio stream, in
   addition to the overall VAD information for the capture set.  This
   ensures that all media stream consumers receive the same,
   consistent audio energy information whichever audio capture or
   captures they choose to receive for a capture set.  Additionally,
   linear position information can be mapped to video captures by a
   media stream consumer so that it can perform "panel switching" if
   required.
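   Purely as an illustration of the shape such tagging might take
   (nothing here is defined by the framework, and all field names are
   hypothetical):

      from dataclasses import dataclass
      from typing import List

      @dataclass
      class PositionActivity:
          # Activity for one linear position of the capture scene.
          x_begin: int
          x_end: int
          energy: int   # computed with a fixed, common algorithm
          active: bool

      @dataclass
      class VadTag:
          # Per-stream tag: overall energy plus per-position detail,
          # so every consumer receives the same energy information
          # whichever audio capture row it chose.
          overall_energy: int
          positions: List[PositionActivity]

      tag = VadTag(overall_energy=40,
                   positions=[PositionActivity(0, 33, 40, True),
                              PositionActivity(33, 66, 7, False),
                              PositionActivity(66, 99, 3, False)])

      # A consumer could map the most active position to the video
      # capture covering the same area (e.g. xBegin 0..33 -> VC0)
      # to drive "panel switching".
      most_active = max(tag.positions, key=lambda p: p.energy)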
A.6. Private Information

   Do we want a way to include private information?

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Cisco Systems
   Langley, England
   UK

   Email: apeppere@cisco.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com