CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                         M. Duckworth, Ed.
Expires: April 21, 2012                                          Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                        October 19, 2011

                Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-00.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 21, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Framework Features
   5.  Stream Information
       5.1.  Media capture -- Audio and Video
       5.2.  Attributes
             5.2.1.  Purpose
             5.2.2.  Audio mixed
             5.2.3.  Audio Channel Format
             5.2.4.  Area of capture
             5.2.5.  Point of capture
             5.2.6.  Area Scale Millimeters
             5.2.7.  Video composed
             5.2.8.  Auto-switched
       5.3.  Capture Set
   6.  Choosing Streams
       6.1.  Message Flow
             6.1.1.  Consumer Capability Message
             6.1.2.  Provider Capabilities Announcement
             6.1.3.  Consumer Configure Request
       6.2.  Physical Simultaneity
       6.3.  Encoding Groups
             6.3.1.  Encoding Group Structure
             6.3.2.  Individual Encodes
             6.3.3.  More on Encoding Groups
             6.3.4.  Examples of Encoding Groups
   7.  Using the Framework
       7.1.  The MCU Case
       7.2.  Media Consumer Behavior
             7.2.1.  One screen consumer
             7.2.2.  Two screen consumer configuring the example
             7.2.3.  Three screen consumer configuring the example
   8.  Acknowledgements
   9.  IANA Considerations
   10. Security Considerations
   11. Informative References
   Appendix A.  Open Issues
       A.1.  Video layout arrangements and centralized composition
       A.2.  Source is selectable
       A.3.  Media Source Selection
       A.4.  Endpoint requesting many streams from MCU
       A.5.  VAD (voice activity detection) tagging of audio streams
       A.6.  Private Information
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases-00 and to meet
   the requirements in draft-romanow-clue-requirements-xx.
   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.  At the
   same time, the highest priority has been given to creating an
   extensible framework to make it easy to accommodate future
   conferencing functionality as it evolves.

   The purpose of this effort is to make it possible to handle multiple
   streams of media in such a way that a satisfactory user experience
   is possible even when participants are on different vendor equipment
   and when they are using devices with different types of
   communication capabilities.  Information about the relationship of
   media streams must be communicated so that audio/video rendering can
   be done in the best possible manner.  In addition, it is necessary
   to choose which media streams are sent.

   There is no attempt here to dictate to the renderer what it should
   do.  What the renderer does is up to the renderer.

   After the following Definitions, a short section introduces key
   concepts.  The body of the text comprises three sections that deal,
   in turn, with stream content, choosing streams, and an
   implementation example.  The media provider and media consumer
   behavior are described in separate sections as well.  Several
   appendices describe topics that are under discussion for adding to
   the document.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   A Capture Set includes Media Captures that all represent some
   aspect of the same Capture Scene.  The items (rows) in a Capture
   Set represent different alternatives for representing the same
   Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encode: A variable with a set of attributes that
   describes the maximum values of a single audio or video capture
   encoding.  The attributes include: maximum bandwidth, and for video
   also maximum macroblocks, maximum width, maximum height, and
   maximum frame rate.  [Edt.  These are based on H.264.]

   *Encoding Group: A set of encoding parameters representing a
   device's complete encoding capabilities or a subdivision of them.
   Media stream providers formed of multiple physical units, in each
   of which resides some encoding capability, would typically
   advertise themselves to the remote media stream consumer as being
   formed of multiple encoding groups.  Within each encoding group,
   multiple potential actual encodings are possible, with the sum of
   those encodings' characteristics constrained to being less than or
   equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, spatial location and mixing parameters of microphones.
   Endpoint characteristics are not specific to individual media
   streams sent by the endpoint.

   Left: For media captures, left and right is from the point of view
   of a person observing the rendered media.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt.  RFC4353 is
   tardy in requiring that media from the mixer be sent to EACH
   participant.  I think we have practical use cases where this is not
   the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture may be the source of one or more Media
   streams.  A Media Capture may also be constructed from other Media
   streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media
   streams

   *Media Provider: an Endpoint or middle box that sends Media streams

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Right: For media captures, left and right is from the point of view
   of a person observing the rendered media.

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also Left
   and Right.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as media codec, bit
   rate, resolution, profile/level, etc.)
   as well as CLUE specific
   attributes (which could include, for example, and depending on the
   solution found, the ID or spatial location of the capture device a
   stream originates from).

   Telepresence: an environment that gives non co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining
   visual elements from separate sources.

4.  Framework Features

   Two key functions must be accomplished so that multiple media
   streams can be handled in a telepresence conference.  These are:

   o  How to choose which streams the provider should send to the
      consumer

   o  What information needs to be added to the streams to allow a
      rendering of the capture scene

   The framework/model we present here can be understood as specifying
   these two functions.

   Media stream providers and consumers are central to the framework.
   The provider's job is to advertise its capabilities (as described
   here) to the consumer, whose job it is to configure the provider's
   encoding capabilities as described below.  Both providers and
   consumers can send and receive information; that is, we do not have
   one party exclusively as the provider and one as the consumer, but
   all parties have both sending and receiving parts to them.  Most
   devices function as both a media provider and as a media consumer.

   For two devices to communicate bidirectionally, with media flowing
   in both directions, both devices act as both a media provider and a
   media consumer.  The protocol exchange shown later in the "Choosing
   Streams" section happens twice independently between the 2
   bidirectional devices.

   Both endpoints and MCUs, or more generally "middleboxes", can be
   media providers and consumers.

   Generally, the provider is capable of sending alternate captures of
   a capture scene.  These are described by the provider as
   capabilities and chosen by the consumer.

5.  Stream Information

   This section describes the structure for communicating information
   between providers and consumers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct
   illustrated in the diagram is discussed in the sections below.

   Diagram for Stream Content

                      +---------------+
                      |               |
                      |  Capture Set  |
                      |               |
                      +-------+-------+
                  _..-'       |       ``-._
              _.-'            |            ``-._
           _.-'               |                 ``-._
   +----------------+ +----------------+ +----------------+
   |  Media Capture | |  Media Capture | |  Media Capture |
   | Audio or Video | | Audio or Video | | Audio or Video |
   +----------------+ +----------------+ +----------------+
           .'                 `.      `-..__
         .'                     `.          ``-..__
      ,-----.              ,---------.           ``,----------.
    ,' Encode`.          ,'           `.         ,'Simultaneous`.
   (   Group   )        (  Attributes  )       (  Transmission  )
    `.       ,'          `.           ,'        `.    Sets    ,'
      `-----'              `---------'            `----------'

5.1.  Media capture -- Audio and Video

   A media capture, as defined in the Definitions section, is a
   fundamental concept of the model.  Media can be captured in
   different ways, for example by various arrangements of cameras and
   microphones.
   The model uses the terms "video capture" (VC) and
   "audio capture" (AC) to refer to sources of media streams.  To
   distinguish between multiple instances, they are numbered; for
   example, VC1, VC2, and VC3 could refer to three different video
   captures which can be used simultaneously.

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic, dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture is described by Attributes and associated with an
   Encoding Group and a Simultaneous Transmission Set.

   Audio and video captures are aggregated into Capture Sets as
   described below.

5.2.  Attributes

   Audio and video capture attributes describe information about
   streams and their relationships.  [Edt: We do not mean to duplicate
   SDP; if an SDP description can be used, great.]  The attributes of
   media captures refer to static aspects of those captures that can
   be used by the consumer for selecting the captures offered by the
   provider.

   The mechanism of Attributes makes the framework extensible.
   Although we are defining some attributes now based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.  In general, the way to extend the solution to handle
   new features is by adding attributes and/or values.

   We describe attributes by variables and their values.  The current
   attributes are listed below and then described.  The variable is
   shown in parentheses, and the values follow after the colon:

   o  (Purpose): main, presentation

   o  (Audio mixed): true, false

   o  (Audio Channel Format): mono, stereo, tbd

   o  (Area of Capture): A set of 'Ranges' describing the relevant
      area being captured by a capture device

   o  (Point of Capture): A 'Point' describing the location of the
      capture device or pseudo-device

   o  (Area scale): true, false, indicating if area numbers are in
      millimeters

   o  (Video composed): true, false

   o  (Auto-switched): true, false

5.2.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It could be applied to any media type.
   Possible values: main, presentation, others TBD.

   Main:

   The audio or video capture is of one or more people participating
   in a conference (or where they would be if they were there).  It is
   of part or all of the Capture Scene.

   Presentation:

   The capture provides a presentation, e.g., from a connected laptop
   or other input device.

5.2.2.  Audio mixed

   A Boolean variable to indicate whether the AC is a mix of other ACs
   or Streams.

5.2.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how
   the meaning of the channels is determined.  It is an enumerated
   variable describing the type of audio channel or channels in the
   Audio Capture.  The possible values of the "channel format"
   attribute are:

   o  mono

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value
   of the "channel format" attribute.

   There can be multiple ACs of a particular type, or even different
   types.
   These multiple ACs could each have an area of capture
   attribute to indicate they represent different areas of the capture
   scene.

   If there are multiple audio streams, they might be correlated (that
   is, someone talking might be heard in multiple captures from the
   same room).  Echo cancellation and stream synchronization in
   consumers should take this into account.

   Mono:

   An AC with channel format="mono" has one audio channel.

   Stereo:

   An AC with channel format="stereo" has exactly two audio channels,
   left and right, as part of the same AC.  [Edt: should we mention
   RFC 3551 here?  The channel format may be related to how Audio
   Captures are mapped to RTP streams.  This stereo is not the same as
   the effect produced from two mono ACs, one from the left and one
   from the right.]

5.2.4.  Area of capture

   The area_of_capture attribute is used to describe the relevant area
   that a media capture is "capturing".  By comparing the area of
   capture for different media captures, a consumer can determine the
   spatial relationships of the captures on the provider so that they
   can be rendered correctly.  The attribute consists of a set of
   'Ranges', one range for each spatial dimension, where each range
   has a Begin and End coordinate.  It is not necessary to fill out
   all of the dimensions if they are not relevant (i.e., if an
   endpoint's captures only span a single dimension, only the 'x'
   coordinate need be used).  There is no need to pre-define a
   possible range for this coordinate system; a device may choose what
   is most appropriate for describing its captures.  However, it is
   specified that as numbers move from lower to higher, the location
   is going from left to right (in the case of the 'x' dimension),
   front to back (in the case of the 'y' dimension), or low to high
   (in the case of the 'z' dimension).

5.2.5.  Point of capture

   The point_of_capture attribute can be used to describe the location
   of a capture device or pseudo-device.  If there are multiple
   captures which share the same 'area_of_capture' value, then it is
   useful to know the location from which they are capturing that area
   (e.g., a device which has multiview).  Point of capture is
   expressed as a single {x, y, z} coordinate where, as with
   area_of_capture, only the necessary dimensions need be expressed.

5.2.6.  Area Scale Millimeters

   An optional Boolean variable indicating if the numbers used for
   area of capture and point of capture are in terms of millimeters.
   If this attribute is true, then the x,y,z numbers represent
   millimeters.  If this attribute is false, then there is no physical
   scale.  The default value is false.
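   To make the coordinate conventions of the three attributes above
   concrete, the following sketch (in Python, which is used for
   examples throughout this document; the type names Range and
   SpatialInfo are invented for this sketch and are not part of the
   framework) shows one way a consumer might represent them and
   recover left-to-right ordering:

      # Illustrative only: hypothetical containers for the spatial
      # attributes of Sections 5.2.4 - 5.2.6.  Coordinates increase
      # left to right (x), front to back (y) and low to high (z);
      # irrelevant dimensions are simply omitted.
      from dataclasses import dataclass
      from typing import Dict, Optional

      @dataclass
      class Range:                # one dimension of an area_of_capture
          begin: float
          end: float

      @dataclass
      class SpatialInfo:
          area: Dict[str, Range]  # area_of_capture, keyed 'x'/'y'/'z'
          point: Optional[Dict[str, float]] = None  # point_of_capture
          scale_mm: bool = False  # True: coordinates are millimeters

      # Three cameras covering a room in thirds, x dimension only:
      left   = SpatialInfo(area={'x': Range(0, 33)})
      center = SpatialInfo(area={'x': Range(33, 66)})
      right  = SpatialInfo(area={'x': Range(66, 99)})

      # A consumer recovers render order by sorting on the x range:
      ordered = sorted([right, left, center],
                       key=lambda s: s.area['x'].begin)
      assert ordered == [left, center, right]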
5.2.7.  Video composed

   An optional Boolean variable indicating if the VC is constructed by
   composing multiple other video captures together.  (This could
   indicate for example a continuous presence view of multiple images
   in a grid, or a large image with smaller picture-in-picture images
   in it.)

   Note: this attribute is not intended to differentiate between
   different ways of composing images.  For possible extension of the
   framework, additional attributes could be defined to distinguish
   between different ways of composing images, with different video
   layout arrangements of composing multiple images into one.

5.2.8.  Auto-switched

   A Boolean variable that may be used for audio and/or video streams.
   When true, the offered AC or VC varies depending on some rule; it
   is auto-switched between possible VCs, or between possible ACs.
   The most common example of this is sending the video capture
   associated with the "loudest" speaker according to an audio
   detection algorithm.

5.3.  Capture Set

   A capture set describes the alternative media streams that the
   provider offers to send to the consumer.  As shown in the content
   diagram above, the capture set is an aggregation of all audio and
   video captures for a particular scene that a provider is willing to
   send.

   A provider describes its ability to send alternative media streams
   in the capture set, which lists the media captures in rows, as
   shown below.  Each row of the capture set consists of either a
   single capture or a group of captures.  A group means the
   individual captures in the group are spatially related, with the
   specific ordering of the captures described through the use of
   attributes.

   Here is an example of a simple capture set with three video
   captures and three audio channels:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three VCs together in a row indicate those captures are
   spatially related to each other, and similarly for the three ACs in
   the second row.  The ACs and VCs in the same capture set are
   spatially related to each other.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically multiple Video Captures should be
   rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other, for example a group of Video Captures
   can be associated with a group of Audio Captures meaning they
   should be rendered together.

   Media Captures of the same media type are associated with each
   other by grouping them together in a single row of a Capture Set.
   Media Captures of different media types are associated with each
   other by putting them in different rows of the same Capture Set.

   Since all captures have an area_of_capture associated with them, a
   consumer can determine the spatial relationships of captures by
   comparing the locations of their areas of capture with one another.

   Association between audio and video can be made by finding audio
   and video captures which share overlapping areas of capture.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.  Each row of the Capture Set
   contains either a single media capture or one group of media
   captures.

   The following example shows a capture set for an endpoint media
   provider where:

   o  (VC0, VC1, VC2) - left camera capture, center camera capture,
      right camera capture

   o  (VC3) - capture associated with the loudest speaker

   o  (VC4) - zoomed out view of all people in the room

   o  (AC0) - main audio

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other: VC0, VC1, and
   VC2.  VC3 and VC4 are additional alternatives for capturing the
   same room in different ways.  The audio capture is included in the
   same capture set to indicate that AC0 is associated with those
   video captures, meaning the audio should be rendered along with the
   video in the same set.
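   As a small illustration of associating audio with video by
   overlapping areas of capture (Section 5.2.4), here is a sketch
   (illustrative only; the overlap rule and the dict layout are
   invented for this example, reusing the left/center/right thirds
   from above):

      # Illustrative only: pair ACs with VCs whose x ranges overlap.
      def x_overlap(a, b):
          (a0, a1), (b0, b1) = a['x'], b['x']
          return a0 < b1 and b0 < a1

      videos = {'VC0': {'x': (0, 33)}, 'VC1': {'x': (33, 66)},
                'VC2': {'x': (66, 99)}}
      audios = {'AC0': {'x': (0, 33)}, 'AC1': {'x': (66, 99)},
                'AC2': {'x': (33, 66)}}

      pairs = {vc: [ac for ac, a in audios.items() if x_overlap(v, a)]
               for vc, v in videos.items()}
      assert pairs == {'VC0': ['AC0'], 'VC1': ['AC2'], 'VC2': ['AC1']}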
   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio / video streams, or might be a
   presentation supplied by a laptop, perhaps with an accompanying
   audio commentary).  Spatial ordering of media captures is described
   through the use of attributes.

   A media consumer could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   consumer could choose the first video row plus the audio row, while
   a single stream consumer could choose the second or third video row
   plus the audio row.  An MCU consumer might choose to receive
   multiple rows.

   The Simultaneous Transmission Sets and Encoding Groups discussed in
   the next section apply to media captures listed in capture sets.
   The Simultaneous Transmission Sets and Encoding Groups MUST allow
   all the Media Captures in a particular row of the capture set to be
   used simultaneously.  But media captures in different rows of the
   capture set might not be able to be used simultaneously.

6.  Choosing Streams

   This section describes the process of choosing which streams the
   provider sends to the consumer.  In order for appropriate streams
   to be sent from providers to consumers, certain characteristics of
   the multiple streams must be understood by both providers and
   consumers.  Two separate aspects of streams suffice to describe the
   necessary information to be shared by providers and consumers.  The
   first aspect we call "physical simultaneity" and the other aspect
   we refer to as "encoding group".  These are described in the
   following sections, after the message flow is discussed.

6.1.  Message Flow

   The following diagram shows the flow of messages between a media
   provider and a media consumer.  The provider sends information
   about its capabilities (as specified in this section), then the
   consumer chooses which streams it wants, which we refer to as
   "configure".  The consumer sends its own capability message to the
   provider, which may contain information about its own capabilities
   or restrictions, in which case the provider might tailor its
   announcements to the consumer.

   Diagram for Message Flow

      Media Consumer                           Media Provider
      --------------                           --------------
            |                                        |
            |----- Consumer Capability ------------->|
            |                                        |
            |                                        |
            |<---- Capabilities (announce) ----------|
            |                                        |
            |                                        |
            |------ Configure (request) ------------>|
            |                                        |

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A provider can advertise a new
   list of captures at any time.  Both the media provider and media
   consumer can send their messages (i.e., capture set announcements,
   stream configurations) any number of times during a call, and the
   other end is always required to act on any new information received
   (e.g., stopping streams it had previously configured that are no
   longer valid).
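   Before detailing the individual messages, here is a minimal sketch
   of the exchange as data structures (all class and field names are
   invented for this example; the concrete contents and encoding of
   the messages are not yet defined by this framework):

      # Illustrative only: the three messages in the diagram above.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class ConsumerCapability:     # consumer -> provider (6.1.1)
          understood_attributes: List[str]   # e.g. ['purpose',
                                             #       'area_of_capture']

      @dataclass
      class ProviderAnnounce:       # provider -> consumer (6.1.2)
          captures: Dict[str, dict]            # captures + attributes
          capture_sets: List[List[List[str]]]  # rows of capture names
          simultaneous_sets: List[set]         # see Section 6.2
          encoding_groups: Dict[str, dict]     # see Section 6.3

      @dataclass
      class ConsumerConfigure:      # consumer -> provider (6.1.3)
          # encoding group name -> (capture, encode parameters) pairs
          # to activate, plus RTP demultiplexing information.
          encodes: Dict[str, List[tuple]] = field(default_factory=dict)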
6.1.1.  Consumer Capability Message

   In order for a maximally-capable provider to advertise a manageable
   number of video captures to a consumer, it is potentially useful
   for the consumer to be able, at the start of CLUE, to inform the
   provider of its capabilities.  One example here would be the video
   capture attribute set - a consumer could tell the provider the
   complete set of video capture attributes it is able to understand,
   and so the provider would be able to reduce the capture set it
   advertises to be tailored to the consumer.

   TBD - the content of this message needs to be better defined.  The
   authors believe there is a need for this message, but have not
   worked out the details yet.

6.1.2.  Provider Capabilities Announcement

   The provider capabilities announce message includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of Simultaneous Transmission Sets

   o  the list of the encoding groups

6.1.3.  Consumer Configure Request

   After receiving a set of video capture information from a provider
   and making its choice of what media streams to receive based on the
   consumer's own capabilities and any provider-side simultaneity
   restrictions, the consumer needs, in essence, to configure the
   provider to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups and potential encoders within those groups that the
   consumer wishes to be active (this may well be a subset of the
   complete set available).  For each such encoder within an encoding
   group, the consumer would specify the video capture (i.e., VC as
   described above) along with the specifics of the video encoding
   required, i.e., width, height, frame rate and bit rate.  At this
   stage, the consumer would also provide RTP demultiplexing
   information as required to distinguish each stream from the others
   being configured by the same mechanism.

6.2.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to the fact that a device
   may not be able to be used in different ways at the same time.
   This shapes the way that offers are made from the provider.  The
   offers are made so that the consumer will choose one of several
   possible usages of the device.  This type of constraint is
   expressed in Simultaneous Transmission Sets.  This is easier to
   show in an example.

   Consider the example of a room system where there are 3 cameras,
   each of which can send a separate capture covering 2 persons each:
   VC0, VC1, VC2.  The middle camera can also zoom out and show all 6
   persons, VC3.  But the middle camera cannot be used in both modes
   at the same time - it has to show either the space where 2
   participants sit or the whole 6 seats.  We refer to this as a
   physical device simultaneity constraint.

   The following illustration shows 3 cameras with 4 video streams.
   The middle camera can be used as main video zoomed in on 2 people,
   or it could be used in zoomed out mode and capture the whole
   endpoint.  The idea here is that the middle camera cannot be used
   for both zoomed in and zoomed out captures simultaneously.  This is
   a constraint imposed by the physical limitations of the devices.

   Diagram for Simultaneity

      `-. +--------+   VC2
      .-'|Camera 3|---------->
      .-' +--------+
                            VC3
                        -------->
      `-. +--------+ /
      .-'|Camera 2|<
      .-' +--------+ \      VC1
                        -------->

      `-. +--------+   VC0
      .-'|Camera 1|---------->
      .-' +--------+

      VC0- video zoomed in on 2 people  VC2- video zoomed in on 2 people
      VC1- video zoomed in on 2 people  VC3- video zoomed out on 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not make sense to do so.

   In this example the two simultaneous sets are:

      {VC0, VC1, VC2}

      {VC0, VC3, VC2}

   In this example either VC0, VC1 and VC2 can be sent, or VC0, VC3
   and VC2.  Only one set can be transmitted at a time.  These are
   physical capabilities describing what can physically be sent at the
   same time, not what might make sense to send.  For example, in the
   second set both VC0 and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send a list of its Simultaneous
   Transmission Sets to the consumer, along with the Capture Sets and
   Encoding Groups.
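   A consumer-side check against these sets can be sketched as follows
   (illustrative only; the helper name is invented): a chosen set of
   captures is transmittable if it fits entirely within at least one
   Simultaneous Transmission Set.

      # Illustrative only: validating a choice of captures against
      # the provider's simultaneous transmission sets from above.
      simultaneous_sets = [{'VC0', 'VC1', 'VC2'},
                           {'VC0', 'VC3', 'VC2'}]

      def transmittable(chosen, sets):
          # True if some simultaneous set contains every chosen capture
          return any(chosen <= s for s in sets)

      assert transmittable({'VC0', 'VC2'}, simultaneous_sets)
      assert transmittable({'VC0', 'VC3'}, simultaneous_sets)
      # VC1 and VC3 both need the middle camera, so this must fail:
      assert not transmittable({'VC1', 'VC3'}, simultaneous_sets)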
6.3.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   providers and consumers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible audio and video
   streams which can be sent.  Just as a constraint is imposed on the
   multiple streams by the physical limitations, there are also
   constraints due to encoding limitations.  These are described by
   four variables that make up an Encoding Group, as shown in the
   following table:

   Table: Encoding Group

   +----------------+--------------------------------------------------+
   | Name           | Description                                      |
   +----------------+--------------------------------------------------+
   | maxBandwidth   | Maximum number of bits per second relating to    |
   |                | all encodes combined                             |
   | maxVideoMbps   | Maximum number of macroblocks per second         |
   |                | relating to all video encodes combined: ((width  |
   |                | + 15) / 16) * ((height + 15) / 16) *             |
   |                | framesPerSecond                                  |
   | videoEncodes[] | Set of potential video encodes that can be       |
   |                | generated                                        |
   | audioEncodes[] | Set of potential audio encodes that can be       |
   |                | generated                                        |
   +----------------+--------------------------------------------------+

   An encoding group is the basic concept for describing encoding
   capability.  As shown in the Table, it has overall maxMbps and
   bandwidth limits, and is comprised of sets of individual encodes,
   which are described in more detail below.

   Each media stream provider includes one or more encoding groups.
   There may be multiple encoding groups per endpoint.  For example,
   each video capture device might have an associated encoding group
   that describes the video streams that can result from that capture.

   A remote receiver (i.e., stream consumer) configures some or all of
   the specific encodings within one or more groups in order to
   provide it with media streams to decode.

6.3.1.  Encoding Group Structure

   This section shows more detail on the media stream provider's
   encoding group structure.  The encoding group includes several
   individual encodes, each of which may have different encoding
   values.
   For example, one may be high definition video 1080p60,
   another 720p30, with a third being CIF.  While a typical 3 codec/
   display system would have one encoding group per "box", there are
   many possibilities for the number of encoding groups a provider may
   be able to offer and for what encoding values there are in each
   encoding group.

   Diagram for Encoding Group Structure

   ,-------------------------------------------------.
   |          Media Provider                         |
   |                                                 |
   |  ,--------------------------------------.       |
   |  | ,--------------------------------------.     |
   |  | | ,--------------------------------------.   |
   |  | | |          Encoding Group              |   |
   |  | | |  ,-----------.                       |   |
   |  | | |  |           | ,---------.           |   |
   |  | | |  |           | |         | ,---------.|  |
   |  | | |  |  Encode1  | | Encode2 | | Encode3 ||  |
   |  `.| |  |           | |         | `---------'|  |
   |    `.|  `-----------' `---------'            |  |
   |      `--------------------------------------'   |
   `-------------------------------------------------'

   As shown in the diagram, each encoding group has multiple potential
   individual encodes within it.  Not all encodes are equally capable;
   the stream consumer chooses the encodes it wants by configuring the
   provider to send it what it wants to receive.

   Some encoding endpoints are fixed, others are flexible, e.g., a
   single box with multiple DSPs where the resources are shared.

6.3.2.  Individual Encodes

   An encoding group is associated with a media capture through the
   individual encodes; that is, an audio or video capture is encoded
   in one or more individual encodes, as described by the
   videoEncodes[] and audioEncodes[] variables.

   The following table shows the variables for a Video Encode.  (There
   is a similar table for audio.)

   Table: Individual Video Encode

   +--------------+----------------------------------------------------+
   | Name         | Description                                        |
   +--------------+----------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a    |
   |              | single video encoding                              |
   | maxMbps      | Maximum number of macroblocks per second relating  |
   |              | to a single video encoding: ((width + 15) / 16) *  |
   |              | ((height + 15) / 16) * framesPerSecond             |
   | maxWidth     | Video resolution's maximum supported width,        |
   |              | expressed in pixels                                |
   | maxHeight    | Video resolution's maximum supported height,       |
   |              | expressed in pixels                                |
   | maxFrameRate | Maximum supported frame rate                       |
   +--------------+----------------------------------------------------+

   A remote receiver configures (i.e., instantiates) some or all of
   the specific encodes such that:

   o  The configuration of each active ENC does not exceed that
      individual encode's maxWidth, maxHeight, or maxFrameRate.

   o  The total bandwidth of the configured ENCs does not exceed the
      maxBandwidth of the encoding group.

   o  The sum of the macroblocks per second of each configured encode
      does not exceed the maxMbps attribute of the encoding group.

   An equivalent set of attributes holds for audio encodes within an
   audio encoding group.
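   These constraints lend themselves to a mechanical check.  The
   following sketch (illustrative only; the function names and dict
   layout are invented, and the per-encode maxMbps and maxBandwidth
   checks are omitted for brevity) implements the three bullets above,
   including the macroblocks-per-second formula from the tables:

      # Illustrative only: validating a consumer's configuration
      # against individual and group encode limits (Section 6.3.2).
      def mbps(width, height, fps):
          return ((width + 15) // 16) * ((height + 15) // 16) * fps

      assert mbps(1920, 1088, 30) == 244800    # 1080p30
      assert mbps(1920, 1088, 60) == 489600    # 1080p60

      def valid_config(group, configs):
          """configs: one dict per active encode, paired in order with
          that encode's limits in group['encodes']."""
          total_bw = total_mbps = 0
          for cfg, enc in zip(configs, group['encodes']):
              if (cfg['width'] > enc['maxWidth']
                      or cfg['height'] > enc['maxHeight']
                      or cfg['fps'] > enc['maxFrameRate']):
                  return False
              total_bw += cfg['bandwidth']
              total_mbps += mbps(cfg['width'], cfg['height'],
                                 cfg['fps'])
          return (total_bw <= group['maxBandwidth']
                  and total_mbps <= group['maxVideoMbps'])

      eg0 = {'maxBandwidth': 6000000, 'maxVideoMbps': 489600,
             'encodes': [{'maxWidth': 1920, 'maxHeight': 1088,
                          'maxFrameRate': 60}] * 2}
      # Two 1080p30 encodings at 3 Mbit/s each fit within the group:
      assert valid_config(eg0, [{'width': 1920, 'height': 1088,
                                 'fps': 30, 'bandwidth': 3000000}] * 2)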
6.3.3.  More on Encoding Groups

   An encoding group EG comprises one or more potential encodings ENC.
   For example:

      EG0: maxMbps=489600, maxBandwidth=6000000
           VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                       maxMbps=244800, maxBandwidth=4000000
           VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                       maxMbps=244800, maxBandwidth=4000000
           AUDIO_ENC0: maxBandwidth=96000
           AUDIO_ENC1: maxBandwidth=96000
           AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the macroblock rate for 1080p30 is 244800), but it is
   capable of transmitting a maxFrameRate of 60 frames per second
   (fps).  To achieve the maximum resolution (1920 x 1088) the frame
   rate is limited to 30 fps.  However, 60 fps can be achieved at a
   lower resolution if required by the consumer.  Although the
   encoding group is capable of transmitting up to 6 Mbit/s, no
   individual video encoding can exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDIO_ENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the consumer.  A system
   that does not wish or need to combine bandwidth limitations in this
   way should instead use separate encoding groups for audio and
   video, so that the bandwidth limitations on audio and video do not
   interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration:

      VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
           VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                       maxMbps=244800, maxBandwidth=4000000
           VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                       maxMbps=244800, maxBandwidth=4000000
      AUDIO_EG0: maxBandwidth=500000
           AUDIO_ENC0: maxBandwidth=96000
           AUDIO_ENC1: maxBandwidth=96000
           AUDIO_ENC2: maxBandwidth=96000
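   The practical difference can be shown with plain arithmetic
   (illustrative only; the per-encoding bit rates chosen here are
   invented for the example):

      # Illustrative only: combined vs. separate group budgets.
      video_bw = 2 * 2800000    # two video encodings, bits per second
      audio_bw = 3 * 96000      # three audio encodings

      # Combined group (EG0): one 6 Mbit/s budget for everything.
      assert video_bw + audio_bw <= 6000000

      # Separate groups (VIDEO_EG0 / AUDIO_EG0): independent budgets,
      # so audio never reduces the bandwidth available to video.
      assert video_bw <= 6000000
      assert audio_bw <= 500000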
6.3.4.  Examples of Encoding Groups

   This section illustrates further examples of encoding groups.  In
   the first example, the capability parameters are the same across
   ENCs.  In the second example, they vary.

   An endpoint that has 3 similar video capture devices would
   advertise 3 encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

      EG0: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000
           ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000
           ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000
           ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=244800, maxBandwidth=4000000

   A remote consumer configures some or all of the specific encodings
   such that:

   o  The parameter values configured for each active ENC do not cause
      that encoding's maxWidth, maxHeight, or maxFrameRate to be
      exceeded

   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of each
      configured encoding does not exceed the maxMbps of the encoding
      group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the consumer.

   Depending on the provider's encoding methods, the consumer may be
   able to request fixed encode values or choose encode values in the
   range less than the maximum offered.  We will discuss consumer
   behavior in more detail in a section below.

6.3.4.1.  Sample video encoding group specification #2

   This example specification expresses a system whose encoding groups
   can each transmit up to 3 encodings, but with each potential
   encoding having a progressively lower specification.  In this
   example, 1080p60 transmission is possible (as ENC0 has a maxMbps
   value compatible with that) as long as it is the only active
   encoding (as maxMbps for the entire encoding group is also 489600).
   Significantly, as up to 3 encodings are available per group, some
   sets of captures which could not be transmitted simultaneously in
   example #1 above now become possible, for instance VC1, VC3 and VC6
   together.  In common with example #1, all encoding groups have an
   identical specification.
      EG0: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=489600, maxBandwidth=4000000
           ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                 maxMbps=108000, maxBandwidth=4000000
           ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                 maxMbps=61200, maxBandwidth=4000000
      EG1: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=489600, maxBandwidth=4000000
           ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                 maxMbps=108000, maxBandwidth=4000000
           ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                 maxMbps=61200, maxBandwidth=4000000
      EG2: maxMbps=489600, maxBandwidth=6000000
           ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                 maxMbps=489600, maxBandwidth=4000000
           ENC1: maxWidth=1280, maxHeight=720, maxFrameRate=30,
                 maxMbps=108000, maxBandwidth=4000000
           ENC2: maxWidth=960, maxHeight=544, maxFrameRate=30,
                 maxMbps=61200, maxBandwidth=4000000

7.  Using the Framework

   This section shows in more detail how to use the framework to
   represent a typical case for telepresence rooms.  First an endpoint
   is illustrated, then an MCU case is shown.

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The Encoding Group specifications can be found above in
   Section 6.3.4.1, Sample video encoding group specification #2.

   Video Captures:

   o  VC0 - (the left camera stream), encoding group: EG0, attributes:
      purpose=main; auto-switched=no; area_of_capture={xBegin=0,
      xEnd=33}

   o  VC1 - (the center camera stream), encoding group: EG1,
      attributes: purpose=main; auto-switched=no;
      area_of_capture={xBegin=33, xEnd=66}

   o  VC2 - (the right camera stream), encoding group: EG2,
      attributes: purpose=main; auto-switched=no;
      area_of_capture={xBegin=66, xEnd=99}

   o  VC3 - (the loudest panel stream), encoding group: EG1,
      attributes: purpose=main; auto-switched=yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC4 - (the loudest panel stream with PiPs), encoding group: EG1,
      attributes: purpose=main; composed=true; auto-switched=yes;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC5 - (the zoomed out view of all people in the room), encoding
      group: EG1, attributes: purpose=main; auto-switched=no;
      area_of_capture={xBegin=0, xEnd=99}

   o  VC6 - (presentation stream), encoding group: EG1, attributes:
      purpose=presentation; auto-switched=no;
      area_of_capture={xBegin=0, xEnd=99}

   Summary of video captures: 3 codecs; the center one (EG1) is used
   for the center camera stream, the presentation stream, the auto-
   switched streams, and the zoomed out view.

   Note the text in parentheses (e.g., "the left camera stream") is
   not explicitly part of the model; it is just explanatory text for
   this example, and is not included in the model with the media
   captures and attributes.
   [Edt.  It is arbitrary that for this example the alternative views
   are on EG1 - they could have been spread out; it was not a
   necessary choice.]

   Audio Captures:

   o  AC0 (left), attributes: purpose=main; channel format=mono;
      area_of_capture={xBegin=0, xEnd=33}

   o  AC1 (right), attributes: purpose=main; channel format=mono;
      area_of_capture={xBegin=66, xEnd=99}

   o  AC2 (center), attributes: purpose=main; channel format=mono;
      area_of_capture={xBegin=33, xEnd=66}

   o  AC3, a simple pre-mixed audio stream from the room (mono),
      attributes: purpose=main; channel format=mono; mixed=true;
      area_of_capture={xBegin=0, xEnd=99}

   o  AC4, the audio stream associated with the presentation video
      (mono), attributes: purpose=presentation; channel format=mono;
      area_of_capture={xBegin=0, xEnd=99}

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   It is possible to select any or all of the rows in a capture set.
   This is strictly what is possible from the devices.  However, using
   every member in the set simultaneously may not make sense - for
   example VC3 (loudest) and VC4 (loudest with PiP).  (In addition,
   there are encoding constraints that make choosing all of the VCs in
   a set impossible.  VC1, VC3, VC4, VC5 and VC6 all use EG1, and EG1
   has only 3 ENCs.  This constraint shows up in the capture list and
   encoding groups, not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following table represents the capture sets for this provider.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

      +----------------+
      | Capture Set #1 |
      +----------------+
      | VC0, VC1, VC2  |
      | VC3            |
      | VC4            |
      | VC5            |
      | AC0, AC1, AC2  |
      | AC3            |
      +----------------+

      +----------------+
      | Capture Set #2 |
      +----------------+
      | VC6            |
      | AC4            |
      +----------------+

   Different capture sets are unique to each other and non-
   overlapping.  A consumer chooses a capture row from each capture
   set.  In this case the three captures VC0, VC1, and VC2 are one way
   of representing the video from the endpoint.  These three captures
   should appear adjacent to each other.  Alternatively, another way
   of representing the Capture Scene is with the capture VC3, which
   automatically shows the person who is talking.  Similarly for the
   VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the 3 linear position audio captures (AC0, AC1, AC2), and
   another way is with the single channel monaural format AC3.  The
   Media Consumer would choose the one audio capture row it is capable
   of receiving.

   The spatial ordering is understood from the media capture
   attributes area of capture and point of capture.

   The consumer finds a row in each capture set that it wants, and
   configures the streams according to the encoding group for that
   row.

   A Media Consumer would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   consumer that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A consumer that can receive only
   one people stream would probably choose one of the other rows.

   If the consumer can receive a presentation stream too, it would
   also choose to receive the only row from Capture Set #2 (VC6).
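   That row choice can be sketched as a simple selection rule
   (illustrative only; the selection function and its tie-breaking
   behavior are invented for this example):

      # Illustrative only: pick the video row of Capture Set #1 that
      # best matches the consumer's screen count.
      capture_set_1_video = [['VC0', 'VC1', 'VC2'],  # spatial triple
                             ['VC3'],                # loudest, switched
                             ['VC4'],                # loudest with PiPs
                             ['VC5']]                # zoomed out room

      def choose_row(rows, screens):
          """Prefer the largest row that still fits on the screens."""
          fitting = [row for row in rows if len(row) <= screens]
          return max(fitting, key=len) if fitting else None

      assert choose_row(capture_set_1_video, 3) == ['VC0', 'VC1', 'VC2']
      # With one screen the single-capture rows tie; this sketch just
      # takes the first, where a real consumer would apply preferences
      # or user choice (see Section 7.2):
      assert choose_row(capture_set_1_video, 1) == ['VC3']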
7.1.  The MCU Case

   This section shows how an MCU might express its Capture Sets,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations, and it can
   be associated (e.g., lip-synced) with any combination of video
   captures at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer    |
   | VC1, VC2           | video capture for 2 screen consumer         |
   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If / when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

      +----------------+--------------------------------------+
      | Capture Set #2 | note                                 |
      +----------------+--------------------------------------+
      | VC10           | video capture for presentation       |
      | AC1            | presentation audio to accompany VC10 |
      +----------------+--------------------------------------+

7.2.  Media Consumer Behavior

   [Edt.  Should this be moved to an appendix?]

   The receive side of a call needs to balance its requirements, based
   on number of screens and speakers, its decoding capabilities and
   available bandwidth, and the provider's capabilities in order to
   optimally configure the provider's streams.  Typically it would
   want to receive and decode media from each capture set advertised
   by the provider.

   A sane, basic algorithm might be for the consumer to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video) and then decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encode
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.

7.2.1.  One screen consumer

   VC3, VC4 and VC5 are all on different rows by themselves, not in a
   group, so the receiving device should choose between one of those.
   The choice would come down to whether to see the greatest number of
   participants simultaneously at roughly equal precedence (VC5), a
   switched view of just the loudest region (VC3) or a switched view
   with PiPs (VC4).  An endpoint device with a small amount of
   knowledge of these differences could offer a dynamic choice of
   these options, in-call, to the user.
7.2.2.  Two screen consumer configuring the example

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen consumer case above) and either leave one
       screen blank or use it for presentation if / when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens, either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across 2 screens, or,
       as would be necessary if there were large bezels on the
       screens, with each stream being scaled to 1/2 the screen width
       and height and there being a 4th "blank" panel.  This 4th panel
       could potentially be used for any presentation that became
       active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, again it might be appropriate to offer the user the choice
   of display mode.

7.2.3.  Three screen consumer configuring the example

   This is the most straightforward case - the consumer would look to
   identify a set of streams to receive that best matched its
   available screens, and so VC0 plus VC1 plus VC2 should match
   optimally.  The spatial ordering would give sufficient information
   for the correct video capture to be shown on the correct screen,
   and the consumer would either need to divide a single encode
   group's capability by 3 to determine what resolution and frame rate
   to configure the provider with, or to configure the individual
   video captures' encode groups with what makes most sense (taking
   into account the receive side decode capabilities, overall call
   bandwidth, the resolution of the screens, plus any user preferences
   such as motion vs. sharpness).

8.  Acknowledgements

   Mark Gorzyinski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

9.  IANA Considerations

   TBD

10.  Security Considerations

   TBD

11.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies",
              RFC 5117, January 2008.
Wenger, "RTP Topologies", RFC 5117, 1314 January 2008. 1316 Appendix A. Open Issues 1318 A.1. Video layout arrangements and centralized composition 1320 In the context of a conference with a central MCU, there has been 1321 discussion about a consumer requesting the provider to provide a 1322 certain type of layout arrangement or perform a certain composition 1323 algorithm, such as combining some number of most recent talkers, or 1324 producing a video layout using a 2x2 grid or 1 large cell with 5 1325 smaller cells around it. The current framework does not address 1326 this. It isn't clear if this topic should be included in this 1327 framework, or maybe a different part of CLUE, or maybe outside of 1328 CLUE altogether. 1330 A.2. Source is selectable 1332 A Boolean variable. True indicates the media consumer can request a 1333 particular media source be mapped to a media capture. Default is 1334 false. 1336 TBD - how does the consumer make the request for a particular source? 1337 How does the consumer know what is available? Need to explain better 1338 how multiple media captures are different from a single media capture 1339 with choices for the source, and when each concept should be used. 1341 A.3. Media Source Selection 1343 The use cases include a case where the person at a receiving endpoint 1344 can request to receive media from a particular other endpoint, for 1345 example in a multipoint call to request to receive the video from a 1346 certain section of a certain room, whether or not people there are 1347 talking. 1349 TBD - this framework should address this case. Maybe need a roster 1350 list of rooms or people in the conference, with a mechanism to select 1351 from the roster and associate it with media captures. This is 1352 different from selecting a particular media capture from a capture 1353 set. The mechanism to do this will probably need to be different 1354 than selecting media captures based on capture sets and attributes. 1356 A.4. Endpoint requesting many streams from MCU 1358 TBD - how to do VC selection for a system where the endpoint media 1359 consumers want to receive lots of streams and do their own 1360 composition, rather than MCU doing transcoding and composing. 1361 Example is 3 screen consumer that wants 3 large loudest speaker 1362 streams, and a bunch of small ones to render as PiP. How the small 1363 ones are chosen, which could potentially be chosen by either the 1364 endpoint or MCU. There are other more complicated examples also. Is 1365 the current framework adequate to support this? 1367 A.5. VAD (voice activity detection) tagging of audio streams 1369 TBD - do we want to have VAD be mandatory? All audio streams 1370 originating from a media provider must be tagged with VAD 1371 information. This tagging would include an overall energy value for 1372 the stream plus information on which sections of the capture scene 1373 are "active". 1375 Each audio stream which forms a constituent of a row within a capture 1376 set should include this tagging, and the energy value within it 1377 calculated using a fixed, consistent algorithm. 1379 When a system determines the most active area of a capture scene 1380 (either "loudest", or determined by other means such as a button 1381 press) it should convey that information to the corresponding media 1382 stream consumer via any audio streams being sent within that capture 1383 set. 
When a system determines the most active area of a capture scene (either "loudest", or determined by other means such as a button press), it should convey that information to the corresponding media stream consumer via any audio streams being sent within that capture set. Specifically, there should be a list of active linear positions and their VAD characteristics within the audio stream, in addition to the overall VAD information for the capture set. This is to ensure that all media stream consumers receive the same consistent audio energy information, whichever audio capture or captures they choose to receive for a capture set. Additionally, linear position information can be mapped to video captures by a media stream consumer so that it can perform "panel switching" if required.

A.6. Private Information

Do we want a way to include private information?

Authors' Addresses

Allyn Romanow
Cisco Systems
San Jose, CA 95134
USA

Email: allyn@cisco.com

Mark Duckworth (editor)
Polycom
Andover, MA 01810
US

Email: mark.duckworth@polycom.com

Andrew Pepperell
Cisco Systems
Langley, England
UK

Email: apeppere@cisco.com

Brian Baldino
Cisco Systems
San Jose, CA 95134
US

Email: bbaldino@cisco.com