CLUE WG                                                        A. Romanow
Internet-Draft                                              Cisco Systems
Intended status: Informational                          M. Duckworth, Ed.
Expires: August 7, 2012                                            Polycom
                                                              A. Pepperell
                                                                B. Baldino
                                                             Cisco Systems
                                                          February 4, 2012

              Framework for Telepresence Multi-Streams
                  draft-ietf-clue-framework-03.txt

Abstract

This memo offers a framework for a protocol that enables devices in a telepresence conference to interoperate by specifying the relationships between multiple media streams.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 7, 2012.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . 
. . . . . 3 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 6 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 8 60 6. Media Captures and Capture Sets . . . . . . . . . . . . . . . 8 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 9 63 6.2. Capture Set . . . . . . . . . . . . . . . . . . . . . . . 11 64 6.2.1. Capture set attributes . . . . . . . . . . . . . . . . 12 65 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 13 66 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 14 67 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 14 68 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 15 69 8. Associating Media Captures with Encoding Groups . . . . . . . 16 70 9. Consumer's Choice of Streams to Receive from the Provider . . 17 71 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 17 72 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 18 73 9.3. Encoding and encoding group limits . . . . . . . . . . . . 18 74 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 18 75 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 19 76 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 20 77 11.1. Three screen endpoint media provider . . . . . . . . . . . 20 78 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 26 79 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 27 80 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 27 81 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 28 82 11.4.2. Two screen consumer configuring the example . . . . . 28 83 11.4.3. Three screen consumer configuring the example . . . . 29 84 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 29 85 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 86 14. Security Considerations . . . . . . . . . . . . . . . . . . . 29 87 15. Informative References . . . . . . . . . . . . . . . . . . . . 29 88 Appendix A. Open Issues . . . . . . . . . . . . . . . . . . . . . 30 89 A.1. Video layout arrangements and centralized composition . . 30 90 A.2. Source is selectable . . . . . . . . . . . . . . . . . . . 30 91 A.3. Media Source Selection . . . . . . . . . . . . . . . . . . 30 92 A.4. Endpoint requesting many streams from MCU . . . . . . . . 31 93 A.5. VAD (voice activity detection) tagging of audio streams . 31 94 A.6. Private Information . . . . . . . . . . . . . . . . . . . 31 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 31 97 1. Introduction 99 Current telepresence systems, though based on open standards such as 100 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 101 other. A major factor limiting the interoperability of telepresence 102 systems is the lack of a standardized way to describe and negotiate 103 the use of the multiple streams of audio and video comprising the 104 media flows. This draft provides a framework for a protocol to 105 enable interoperability by handling multiple streams in a 106 standardized way. It is intended to support the use cases described 107 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 108 requirements in draft-ietf-clue-telepresence-requirements-01. 
The solution described here is strongly focused on what is being done today, rather than on a vision of future conferencing. At the same time, the highest priority has been given to creating an extensible framework to make it easy to accommodate future conferencing functionality as it evolves.

The purpose of this effort is to make it possible to handle multiple streams of media in such a way that a satisfactory user experience is possible even when participants are using different vendor equipment, and also when they are using devices with different types of communication capabilities. Information about the relationship of media streams at the provider's end must be communicated so that streams can be chosen and audio/video rendering can be done in the best possible manner.

There is no attempt here to dictate to the renderer what it should do. What the renderer does is up to the renderer.

After the following Definitions, a short section introduces key concepts. The body of the text comprises several sections about the key elements of the framework, how a consumer chooses streams to receive, and some examples. The appendix describes topics that are under discussion for adding to the document.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

The definitions marked with an "*" are new; all the others are from

*Audio Capture: Media Capture for audio. Denoted as ACn.

Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media. They are the opposite of stage-left and stage-right.

Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder. Cameras and microphones are examples of capture devices.

*Capture Scene: the scene that is captured by a collection of Capture Devices. A Capture Scene may be represented by more than one type of Media. A Capture Scene may include more than one Media Capture of the same type. An example of a Capture Scene is the video image of a group of people seated next to each other, along with the sound of their voices, which could be represented by some number of VCs and ACs. A middle box may also express Capture Scenes that it constructs from Media streams it receives.

*Capture Set: A Capture Set includes media captures that are arranged by the provider to help the consumer choose which captures it wants. The entries in a Capture Set represent different alternatives for representing the same Capture Scene.

Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

*Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding. The attributes include: maximum bandwidth and, for video, maximum macroblocks (for H.264), maximum width, maximum height, and maximum frame rate.

*Encoding Group: A set of encoding parameters representing a media provider's encoding capabilities.
Media stream providers formed of 178 multiple physical units, in each of which resides some encoding 179 capability, would typically advertise themselves to the remote media 180 stream consumer using multiple encoding groups. Within each encoding 181 group, multiple potential encodings are possible, with the sum of the 182 chosen encodings' characteristics constrained to being less than or 183 equal to the group-wide constraints. 185 Endpoint: The logical point of final termination through receiving, 186 decoding and rendering, and/or initiation through capturing, 187 encoding, and sending of media streams. An endpoint consists of one 188 or more physical devices which source and sink media streams, and 189 exactly one [RFC4353] Participant (which, in turn, includes exactly 190 one SIP User Agent). In contrast to an endpoint, an MCU may also 191 send and receive media streams, but it is not the initiator nor the 192 final terminator in the sense that Media is Captured or Rendered. 193 Endpoints can be anything from multiscreen/multicamera rooms to 194 handheld devices. 196 Front: the portion of the room closest to the cameras. In going 197 towards back you move away from the cameras. 199 MCU: Multipoint Control Unit (MCU) - a device that connects two or 200 more endpoints together into one single multimedia conference 201 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 202 tardy in requiring that media from the mixer be sent to EACH 203 participant. I think we have practical use cases where this is not 204 the case. But the bug (if it is one) is in 4353 and not herein.] 206 Media: Any data that, after suitable encoding, can be conveyed over 207 RTP, including audio, video or timed text. 209 *Media Capture: a source of Media, such as from one or more Capture 210 Devices. A Media Capture (MC) may be the source of one or more Media 211 streams. A Media Capture may also be constructed from other Media 212 streams. A middle box can express Media Captures that it constructs 213 from Media streams it receives. 215 *Media Consumer: an Endpoint or middle box that receives media 216 streams 218 *Media Provider: an Endpoint or middle box that sends Media streams 220 Model: a set of assumptions a telepresence system of a given vendor 221 adheres to and expects the remote telepresence system(s) also to 222 adhere to. 224 *Plane of Interest: The spatial plane containing the most relevant 225 subject matter. 227 Render: the process of generating a representation from a media, such 228 as displayed motion video or sound emitted from loudspeakers. 230 *Simultaneous Transmission Set: a set of media captures that can be 231 transmitted simultaneously from a Media Provider. 233 Spatial Relation: The arrangement in space of two objects, in 234 contrast to relation in time or other relationships. See also 235 Camera-Left and Right. 237 Stage-Left and Right: For media captures, stage-left and stage-right 238 are the opposite of camera-left and camera-right. For the case of a 239 person facing (and captured by) a camera, stage-left and stage-right 240 are from the point of view of that person. 242 *Stream: RTP stream as in [RFC3550]. 244 Stream Characteristics: the media stream attributes commonly used in 245 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 246 resolution, profile/level etc.) as well as CLUE specific attributes, 247 such as the ID of a capture or a spatial location. 
249 Telepresence: an environment that gives non co-located users or user 250 groups a feeling of (co-located) presence - the feeling that a Local 251 user is in the same room with other Local users and the Remote 252 parties. The inclusion of Remote parties is achieved through 253 multimedia communication including at least audio and video signals 254 of high fidelity. 256 *Video Capture: Media Capture for video. Denoted as VCn. 258 Video composite: A single image that is formed from combining visual 259 elements from separate sources. 261 4. Overview of the Framework/Model 263 The CLUE framework specifies how multiple media streams are to be 264 handled in a telepresence conference. 266 The main goals include: 268 o Interoperability 270 o Extensibility 272 o Flexibility 274 Interoperability is achieved by the media provider describing the 275 relationships between media streams in constructs that are understood 276 by the consumer, who can then render the media. Extensibility is 277 achieved through abstractions and the generality of the model, making 278 it easy to add new parameters. Flexibility is achieved largely by 279 having the consumer choose what content and format it wants to 280 receive from what the provider is capable of sending. This 281 constitutes a significant change from previous video conferencing 282 systems in which transmission of content was determined primarily by 283 the sender. 285 A transmitting endpoint or MCU describes specific aspects of the 286 content of the media and the formatting of the media streams it can 287 send (advertisement); and the receiving end responds to the provider 288 by specifying which content and media streams it wants to receive 289 (configuration). The provider then transmits the asked for content 290 in the specified streams. 292 This advertisement and configuration occurs at call initiation but 293 may also happen at any time throughout the conference, whenever there 294 is a change in what the consumer wants or the provider can send. 296 An endpoint or MCU typically acts as both provider and consumer at 297 the same time, sending advertisements and sending configurations in 298 response to receiving advertisements. (It is possible to be just one 299 or the other.) 301 The data model is based around two main concepts: a capture and an 302 encoding. A media capture (MC), such as audio or video, describes 303 the content a provider can send. Media captures are described in 304 terms of CLUE-defined attributes, such as spatial relationships and 305 purpose of the capture. Providers tell consumers which media 306 captures they can provide, described in terms of the media capture 307 attributes. 309 A provider organizes its media captures that represent the same scene 310 into capture sets. A consumer chooses which media captures it wants 311 to receive according to the capture sets sent by the provider. 313 In addition, the provider sends the consumer a description of the 314 streams it can send in terms of the media attributes of the stream, 315 in particular, well-known audio and video parameters such as 316 bandwidth, frame rate, macroblocks per second. 318 The provider also specifies constraints on its ability to provide 319 media, and the consumer must take these into account in choosing the 320 content and streams it wants. Some constraints are due to the 321 physical limitations of devices - for example, a camera may not be 322 able to provide zoom and non-zoom views simultaneously. 
Other 323 constraints are system based constraints, such as maximum bandwidth 324 and maximum macroblocks/second. 326 The following sections discuss these constructs and processes in 327 detail, followed by use cases showing how the framework specification 328 can be used. 330 5. Spatial Relationships 332 In order for a consumer to perform a proper rendering, it is often 333 necessary to provide spatial information about the streams it is 334 receiving. CLUE defines a coordinate system that allows producers to 335 describe the spatial relationships of their Media Captures to enable 336 proper scaling and spatial rendering of their streams. The 337 coordinate system is based on a few principles: 339 o Simple systems which do not have multiple Media Captures to 340 associate spatially need not use the coordinate model. 342 o Coordinates can either be in real, physical units (millimeters), 343 have an unknown scale or have no physical scale. Systems which 344 know their physical dimensions should always provide those real- 345 world measurements. Systems which don't know specific physical 346 dimensions but still know relative distances should use 'unknown 347 scale'. 'No scale' is intended to be used where Media Captures 348 from different devices (with potentially different scales) will be 349 forwarded alongside one another (e.g. in the case of a middle 350 box). 352 * "millimeters" means the scale is in millimeters 354 * "Unknown" means the scale is not necessarily millimeters, but 355 the scale is the same for every capture in the capture set. 357 * "No Scale" means the scale could be different for each capture 358 - an MCU provider that advertises two adjacent captures and 359 picks sources (which can change quickly) from different 360 endpoints might use this value; the scale could be different 361 and changing for each capture. But the areas of capture still 362 represent a spatial relation between captures. 364 o The coordinate system is Cartesian X, Y, Z with the origin at a 365 spot of the provider's choosing. The provider must use the same 366 origin for all coordinates within the same capture set. 368 The direction of increasing coordinate values is: 369 X increases from camera left to camera right 370 Y increases from front to back 371 Z increases from low to high 373 6. Media Captures and Capture Sets 375 This section describes how media providers can describe the content 376 of media to consumers. 378 6.1. Media Captures 380 Media captures are the fundamental representations of streams that a 381 device can transmit. What a Media Capture actually represents is 382 flexible: 384 o It can represent the immediate output of a physical source (e.g. 385 camera, microphone) or 'synthetic' source (e.g. laptop computer, 386 DVD player). 388 o It can represent the output of an audio mixer or video composer 390 o It can represent a concept such as 'the loudest speaker' 392 o It can represent a conceptual position such as 'the leftmost 393 stream' 395 To distinguish between multiple instances, video and audio captures 396 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 397 two different video captures and AC1 and AC2 refer to two different 398 audio captures. 400 Each Media Capture can be associated with attributes to describe what 401 it represents. 403 6.1.1. Media Capture Attributes 405 Media Capture Attributes describe static information about the 406 captures that can be used by the consumer to help decide which Media 407 Captures should be requested. 
Attributes are defined by a variable and its value. The currently defined attributes and their values are:

Purpose: {main, presentation}

A field with enumerated values which describes the role of the Media Capture and can be applied to any media type.

A value of 'main' describes the primary content of the room (such as participant media).

A value of 'presentation' describes the secondary content of the room (such as media coming from a laptop).

Composed: {true, false}

A field with a Boolean value which indicates whether or not the Media Capture is a mix (audio) or composition (video) of streams.

This attribute is not intended to describe the layout used when compositing video streams.

Audio Channel Format: {mono, stereo}

A field with enumerated values which describes the method of encoding used for audio.

A value of 'mono' means the Audio Capture has one channel.

A value of 'stereo' means the Audio Capture has two audio channels, left and right.

This attribute applies only to Audio Captures.

Switched: {true, false}

A field with a Boolean value which indicates whether or not the Media Capture represents the (dynamic) most appropriate subset of a 'whole'. What is 'most appropriate' is up to the producer and could be the active speaker, a lecturer or a VIP.

Point of Capture: {(X, Y, Z)}

A field with a single Cartesian (X, Y, Z) point value which describes the spatial location, virtual or physical, of the capturing device (such as camera).

When the Point of Capture attribute is specified, it must include X, Y and Z coordinates.

Area of Capture:

{bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, Z3), top right(X4, Y4, Z4)}

A field with a set of four (X, Y, Z) points as a value which describes the spatial location of what is being "captured". By comparing the Area of Capture for different Media Captures within the same capture set a consumer can determine the spatial relationships between them and render them correctly.

The four points should be co-planar. The four points form a quadrilateral, not necessarily a rectangle.

The quadrilateral described by the four (X, Y, Z) points defines the plane of interest for the particular media capture.

If the area of capture attribute is specified, it must include X, Y and Z coordinates for all four points.

For a switched capture that switches between different sections within a larger area, the area of capture should use coordinates for the larger potential area.

EncodingGroup: {encodeGroupID}

A field with a value equal to the encodeGroupID of the encoding group associated with the media capture.

6.2. Capture Set

In order for a provider's individual media captures to be used effectively by a consumer, the provider organizes the media captures into capture sets, with the structure and contents of these sets being sent from the provider to the consumer.

A provider may advertise multiple capture sets or just a single capture set. A capture set can be said to correspond to a provided "scene", and a media provider might typically use one capture set for main participant media and another capture set for a computer generated presentation. Capture sets will commonly include media captures of different types, for instance, audio captures and video captures.
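This framework does not define a concrete syntax for these constructs. Purely as a non-normative illustration of how media captures, their attributes, and capture sets fit together, a provider-side data model could be sketched roughly as follows (the class and field names are hypothetical; capture set "entries" are described in the paragraphs below):

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # (X, Y, Z) in the coordinate system of Section 5

   @dataclass
   class MediaCapture:
       # Illustrative model of a Media Capture and the attributes of Section 6.1.1.
       name: str                                      # e.g. "VC0" or "AC0"
       media_type: str                                # "video" or "audio"
       purpose: str = "main"                          # {main, presentation}
       composed: bool = False                         # mix (audio) or composition (video)
       switched: bool = False                         # dynamic "most appropriate" subset
       audio_channel_format: Optional[str] = None     # {mono, stereo}, audio captures only
       point_of_capture: Optional[Point] = None
       area_of_capture: Optional[List[Point]] = None  # four co-planar (X, Y, Z) points
       encoding_group: Optional[str] = None           # encodeGroupID of the associated group

   @dataclass
   class CaptureSet:
       # Each entry is one alternative representation of the same scene; all
       # captures within an entry are of the same media type.
       entries: List[List[MediaCapture]] = field(default_factory=list)

For instance, the three-screen endpoint of Section 11.1 would populate one such capture set for its people media and a second capture set for presentation media.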
498 A provider can express spatial relationships between media captures 499 that are included in the same capture set. But there is no spatial 500 relationship between media captures that are in different capture 501 sets. 503 A capture set is most usefully thought of as being a collection of 504 entries, with each entry being a list of media captures. In grouping 505 multiple media captures together within a capture set entry, the 506 provider is signaling that those captures together form a 507 representation of that capture set's scene. Media captures within 508 the same capture set entry must be of the same media type - it is not 509 possible to mix audio and video captures in the same capture set 510 entry, for instance. The provider must be capable of encoding and 511 sending all media captures in a single entry simultaneously. 513 When a provider advertises a capture set with multiple entries, it is 514 essentially signaling that there are multiple representations of the 515 same scene available. In some cases, these multiple representations 516 would typically be used simultaneously (for instance a "video entry" 517 and an "audio entry"). In some cases the entries would conceptually 518 be alternatives (for instance an entry consisting of 3 video captures 519 versus an entry consisting of just a single video capture). In this 520 latter example, the provider would in the simple case end up 521 providing to the consumer the entry containing the number of video 522 captures that most closely matched the media consumer's number of 523 display devices. 525 The following is an example of 4 potential capture set entries for an 526 endpoint-style media provider: 528 1. (VC0, VC1, VC2) - left, center and right camera video captures 530 2. (VC3) - video capture associated with loudest room segment 532 3. (VC4) - video capture zoomed out view of all people in the room 534 4. (AC0) - main audio 536 The first entry in this capture set example is a list of video 537 captures with a spatial relationship to each other. Determination of 538 the order of these captures (VC0, VC1 and VC2) for rendering purposes 539 is accomplished through use of their Area of Capture attributes. The 540 second entry (VC3) and the third entry (VC4) are additional 541 alternatives of how to capture the same room in different ways. The 542 inclusion of the audio capture in the same capture set indicates that 543 AC0 is associated with those video captures, meaning it comes from 544 the same scene. The audio should be rendered in conjunction with any 545 rendered video captures from the same capture set (for instance, the 546 consumer should attempt to perform lip sync between all audio and 547 video captures from the same capture set). 549 6.2.1. Capture set attributes 551 Attributes can be applied to capture sets as well as to individual 552 media captures. Attributes specified at this level apply to all 553 constituent media captures. 555 Area of Scene attribute 557 The area of scene attribute for a capture set has the same format as 558 the area of capture attribute for a media capture. The area of scene 559 is for the entire scene, which is captured by the one or more media 560 captures in the capture set entries. 562 Scale attribute 564 An optional attribute indicating if the numbers used for area of 565 scene, area of capture and point of capture are in terms of 566 millimeters, unknown scale factor, or not any scale, as described in 567 Section 5. 
If any media captures have an area of capture attribute or point of capture attribute, then this scale attribute must also be defined. The possible values for this attribute are:

"millimeters"
"unknown"
"no scale"

6.3. Simultaneous Transmission Set Constraints

The provider may have constraints or limitations on its ability to send media captures. One type is caused by the physical limitations of capture mechanisms; these constraints are represented by a simultaneous transmission set. The second type of limitation reflects the encoding resources available - bandwidth and macroblocks/second. This type of constraint is captured by encoding groups, discussed below.

An endpoint or MCU can send multiple captures simultaneously, however sometimes there are constraints that limit which captures can be sent simultaneously with other captures. A device may not be able to be used in different ways at the same time. Provider advertisements are made so that the consumer will choose one of several possible mutually exclusive usages of the device. This type of constraint is expressed in a Simultaneous Transmission Set, which lists all the media captures that can be sent at the same time. This is easier to show in an example.

Consider the example of a room system where there are 3 cameras each of which can send a separate capture covering 2 persons each - VC0, VC1, VC2. The middle camera can also zoom out and show all 6 persons, VC3. But the middle camera cannot be used in both modes at the same time - it has to either show the space where 2 participants sit or the whole 6 seats, but not both at the same time.

Simultaneous transmission sets are expressed as sets of the MCs that could physically be transmitted at the same time (though it may not make sense to do so). In this example the two simultaneous sets are shown in Table 1. The consumer must make sure that it chooses one and not more of the mutually exclusive sets.

   +-------------------+
   | Simultaneous Sets |
   +-------------------+
   | {VC0, VC1, VC2}   |
   | {VC0, VC3, VC2}   |
   +-------------------+

   Table 1: Two Simultaneous Transmission Sets

The Simultaneous Transmission Sets MUST allow all the Media Captures in a particular capture set entry to be used simultaneously.

7. Encodings

We have considered how providers can describe the content of media to consumers. We will now consider how the providers communicate information about their abilities to send streams. We introduce two constructs - individual encodings and encoding groups. Consumers will then map the media captures they want onto the encodings with encoding parameters they want. This process is described below.

7.1. Individual Encodings

An individual encoding represents a way to encode a media capture to become an encoded media stream sent from the media provider to the media consumer. An individual encoding has a set of parameters characterizing how the media is encoded. Different media types have different parameters, and different encoding algorithms may have different parameters. An individual encoding can be used for only one actual encoded media stream at a time.

The parameters of an individual encoding represent the maximum values for certain aspects of the encoding.
A particular 641 instantiation into an encoded stream might use lower values than 642 these maximums. 644 The following tables show the variables for audio and video encoding. 646 +--------------+----------------------------------------------------+ 647 | Name | Description | 648 +--------------+----------------------------------------------------+ 649 | encodeID | A unique identifier for the individual encoding | 650 | maxBandwidth | Maximum number of bits per second | 651 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 652 | | + 15) / 16) * ((height + 15) / 16) * | 653 | | framesPerSecond | 654 | maxWidth | Video resolution's maximum supported width, | 655 | | expressed in pixels | 656 | maxHeight | Video resolution's maximum supported height, | 657 | | expressed in pixels | 658 | maxFrameRate | Maximum supported frame rate | 659 +--------------+----------------------------------------------------+ 661 Table 2: Individual Video Encoding Parameters 663 +--------------+-----------------------------------+ 664 | Name | Description | 665 +--------------+-----------------------------------+ 666 | maxBandwidth | Maximum number of bits per second | 667 +--------------+-----------------------------------+ 669 Table 3: Individual Audio Encoding Parameters 671 7.2. Encoding Group 673 An encoding group includes a set of one or more individual encodings, 674 plus some parameters that apply to the group as a whole. By grouping 675 multiple individual encodings together, an encoding group describes 676 additional constraints on bandwidth and other parameters for the 677 group. Table 4 shows the parameters and individual encoding sets 678 that are part of an encoding group. 680 +-------------------+-----------------------------------------------+ 681 | Name | Description | 682 +-------------------+-----------------------------------------------+ 683 | encodeGroupID | A unique identifier for the encoding group | 684 | maxGroupBandwidth | Maximum number of bits per second relating to | 685 | | all encodings combined | 686 | maxGroupH264Mbps | Maximum number of macroblocks per second | 687 | | relating to all video encodings combined | 688 | videoEncodings[] | Set of potential encodings (list of | 689 | | encodeIDs) | 690 | audioEncodings[] | Set of potential encodings (list of | 691 | | encodeIDs) | 692 +-------------------+-----------------------------------------------+ 694 Table 4: Encoding Group 696 When the individual encodings in a group are instantiated into actual 697 encoded media streams, each stream has a bandwidth that must be less 698 than or equal to the maxBandwidth for the particular individual 699 encoding. The maxGroupBandwidth parameter gives the additional 700 restriction that the sum of all the individual instantiated 701 bandwidths must be less than or equal to the maxGroupBandwidth value. 703 Likewise, the sum of the macroblocks per second of each instantiated 704 encoding in the group must not exceed the maxGroupH264Mbps value. 706 The following diagram illustrates the structure of a media provider's 707 Encoding Groups and their contents. 709 ,-------------------------------------------------. 710 | Media Provider | 711 | | 712 | ,--------------------------------------. | 713 | | ,--------------------------------------. | 714 | | | ,--------------------------------------. | 715 | | | | Encoding Group | | 716 | | | | ,-----------. | | 717 | | | | | | ,---------. 
| | 718 | | | | | | | | ,---------.| | 719 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 720 | `.| | | | | | `---------'| | 721 | `.| `-----------' `---------' | | 722 | `--------------------------------------' | 723 `-------------------------------------------------' 725 Figure 1: Encoding Group Structure 727 A media provider advertises one or more encoding groups. Each 728 encoding group includes one or more individual encodings. Each 729 individual encoding can represent a different way of encoding media. 730 For example one individual encoding may be 1080p60 video, another 731 could be 720p30, with a third being CIF. 733 While a typical 3 codec/display system might have one encoding group 734 per "codec box", there are many possibilities for the number of 735 encoding groups a provider may be able to offer and for the encoding 736 values in each encoding group. 738 There is no requirement for all encodings within an encoding group to 739 be instantiated at once. 741 8. Associating Media Captures with Encoding Groups 743 Every media capture is associated with an encoding group, which is 744 used to instantiate that media capture into one or more encoded 745 streams. Each media capture has an encoding group attribute. The 746 value of this attribute is the encodeGroupID for the encoding group 747 with which it is associated. More than one media capture may use the 748 same encoding group. 750 The maximum number of streams that can result from a particular 751 encoding group constraint is equal to the number of individual 752 encodings in the group. The actual number of streams used at any 753 time may be less than this maximum. Any of the media captures that 754 use a particular encoding group can be encoded according to any of 755 the individual encodings in the group. If there are multiple 756 individual encodings in the group, then a single media capture can be 757 encoded into multiple different streams at the same time, with each 758 stream following the constraints of a different individual encoding. 760 The Encoding Groups MUST allow all the media captures in a particular 761 capture set entry to be used simultaneously. 763 9. Consumer's Choice of Streams to Receive from the Provider 765 After receiving the provider's advertised media captures and 766 associated constraints, the consumer must choose which media captures 767 it wishes to receive, and which individual encodings from the 768 provider it wants to use to encode the capture. Each media capture 769 has an encoding group ID attribute which specifies which individual 770 encodings are available to be used for that media capture. 772 For each media capture the consumer wants to receive, it configures 773 one or more of the encodings in that capture's encoding group. The 774 consumer does this by telling the provider the resolution, frame 775 rate, bandwidth, etc. when asking for streams for its chosen 776 captures. Upon receipt of this configuration command from the 777 consumer, the provider generates streams for each such configured 778 encoding and sends those streams to the consumer. 780 The consumer must have received at least one capture advertisement 781 from the provider to be able to configure the provider's generation 782 of media streams. 784 The consumer is able to change its configuration of the provider's 785 encodings any number of times during the call, either in response to 786 a new capture advertisement from the provider or autonomously. 
The consumer need not send a new configure message to the provider when it receives a new capture advertisement from the provider unless the contents of the new capture advertisement cause the consumer's current configure message to become invalid.

When choosing which streams to receive from the provider, and the encoding characteristics of those streams, the consumer needs to take several things into account: its local preference, simultaneity restrictions, and encoding limits.

9.1. Local preference

A variety of local factors will influence the consumer's choice of streams to be received from the provider:

o if the consumer is an endpoint, it is likely that it would choose, where possible, to receive video and audio captures that match the number of display devices and audio system it has

o if the consumer is a middle box such as an MCU, it may choose to receive loudest speaker streams (in order to perform its own media composition) and avoid pre-composed video captures

o user choice (for instance, selection of a new layout) may result in a different set of media captures, or different encoding characteristics, being required by the consumer

9.2. Physical simultaneity restrictions

There may be physical simultaneity constraints imposed by the provider that affect the provider's ability to simultaneously send all of the captures the consumer would wish to receive. For instance, a middle box such as an MCU, when connected to a multi-camera room system, might prefer to receive both individual camera streams of the people present in the room and an overall view of the room from a single camera. Some endpoint systems might be able to provide both of these sets of streams simultaneously, whereas others may not (if the overall room view were produced by changing the zoom level on the center camera, for instance).

9.3. Encoding and encoding group limits

Each of the provider's encoding groups has limits on bandwidth and macroblocks per second, and the constituent potential encodings have limits on the bandwidth, macroblocks per second, video frame rate, and resolution that can be provided. When choosing the media captures to be received from a provider, a consumer device must ensure that the encoding characteristics requested for each individual media capture fit within the capability of the encoding it is being configured to use, as well as ensuring that the combined encoding characteristics for media captures fit within the capabilities of their associated encoding groups. In some cases, this could cause an otherwise "preferred" choice of streams to be passed over in favor of different streams - for instance, if a set of 3 media captures could only be provided at a low resolution then a 3 screen device could switch to favoring a single, higher quality, stream.

9.4. Message Flow

The following diagram shows the basic flow of messages between a media provider and a media consumer. The usage of the "capture advertisement" and "configure encodings" messages is described above.

The consumer also sends its own capability message to the provider which may contain information about its own capabilities or restrictions.
Diagram for Message Flow

   Media Consumer                        Media Provider
   --------------                        --------------
         |                                     |
         |----- Consumer Capability ---------->|
         |                                     |
         |                                     |
         |<---- Capture advertisement ---------|
         |                                     |
         |                                     |
         |------ Configure encodings --------->|
         |                                     |

In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, there is a potential use for the consumer, at the start of CLUE, to be able to inform the provider of its capabilities. One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand and so the provider would be able to reduce the capture set it advertises to be tailored to the consumer.

TBD - the content of this message needs to be better defined. The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

One of the most important characteristics of the Framework is its extensibility. Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop. The standard for interoperability and handling multiple streams must be future-proof.

The framework itself is inherently extensible through expanding the data model types. For example:

o Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

o Adding new functionality, such as 3-D, will require additional attributes describing the captures.

o Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

The infrastructure is designed to be extended rather than requiring new infrastructure elements. Extension comes through adding to defined types.

Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

This section shows in more detail some examples of how to use the framework to represent a typical case for telepresence rooms. First an endpoint is illustrated, then an MCU case is shown.

11.1. Three screen endpoint media provider

Consider an endpoint with the following description:

o 3 cameras, 3 displays, a 6 person table

o Each video device can provide one capture for each 1/3 section of the table

o A single capture representing the active speaker can be provided

o A single capture representing the active speaker with the other 2 captures shown picture in picture within the stream can be provided

o A capture showing a zoomed out view of all 6 seats in the room can be provided

The audio and video captures for this endpoint can be described as follows.
937 Video Captures: 939 o VC0- (the camera-left camera stream), encoding group=EG0, 940 purpose=main;auto-switched:no 942 o VC1- (the center camera stream), encoding group=EG1, purpose=main; 943 auto-switched:no 945 o VC2- (the camera-right camera stream), encoding group=EG2, 946 purpose=main;auto-switched:no 948 o VC3- (the loudest panel stream), encoding group=EG1, 949 purpose=main;auto-switched:yes 951 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 952 purpose=main; composed=true; auto-switched:yes 954 o VC5- (the zoomed out view of all people in the room), encoding 955 group=EG1, purpose=main; composed=no; auto-switched:no 957 o VC6- (presentation stream), encoding group=EG1, 958 purpose=presentation;auto-switched:no 960 The following diagram is a top view of the room with 3 cameras, 3 961 displays, and 6 seats. Each camera is capturing 2 people. The six 962 seats are not all in a straight line. 964 ,-. d 965 ( )`--.__ +---+ 966 `-' / `--.__ | | 967 ,-. | `-.._ |_-+Camera 2 (VC2) 968 ( ).' ___..-+-''`+-+ 969 `-' |_...---'' | | 970 ,-.c+-..__ +---+ 971 ( )| ``--..__ | | 972 `-' | ``+-..|_-+Camera 1 (VC1) 973 ,-. | __..--'|+-+ 974 ( )| __..--' | | 975 `-'b|..--' +---+ 976 ,-. |``---..___ | | 977 ( )\ ```--..._|_-+Camera 0 (VC0) 978 `-' \ _..-''`-+ 979 ,-. \ __.--'' | | 980 ( ) |..-'' +---+ 981 `-' a 983 The two points labeled b and c are intended to be at the midpoint 984 between the seating positions, and where the fields of view of the 985 cameras intersect. 986 The plane of interest for VC0 is a vertical plane that intersects 987 points 'a' and 'b'. 988 The plane of interest for VC1 intersects points 'b' and 'c'. 989 The plane of interest for VC2 intersects points 'c' and 'd'. 990 This example uses an area scale of millimeters. 992 Areas of capture: 993 bottom left bottom right top left top right 994 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 995 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 996 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 997 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 998 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 999 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1000 VC6 none 1002 Points of capture: 1003 VC0 (-1678,0,800) 1004 VC1 (0,0,800) 1005 VC2 (1678,0,800) 1006 VC3 none 1007 VC4 none 1008 VC5 (0,0,800) 1009 VC6 none 1011 In this example, the right edge of the VC0 area lines up with the 1012 left edge of the VC1 area. It doesn't have to be this way. There 1013 could be a gap or an overlap. One additional thing to note for this 1014 example is the distance from a to b is equal to the distance from b 1015 to c and the distance from c to d. All these distances are 1346 mm. 1016 This is the planar width of each area of capture for VC0, VC1, and 1017 VC2. 1019 Note the text in parentheses (e.g. "the camera-left camera stream") 1020 is not explicitly part of the model, it is just explanatory text for 1021 this example, and is not included in the model with the media 1022 captures and attributes. 
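The area of capture values above are what let a consumer recover the camera-left to camera-right ordering of VC0, VC1 and VC2. Purely as a non-normative illustration (the framework defines no concrete syntax, so the data layout here is an assumption), a consumer might derive that ordering as follows:

   # Area of Capture corners from the example above, scale = millimeters.
   # Each capture maps to (bottom left, bottom right, top left, top right).
   areas_of_capture = {
       "VC0": ((-2011, 2850, 0), (-673, 3000, 0), (-2011, 2850, 757), (-673, 3000, 757)),
       "VC1": ((-673, 3000, 0), (673, 3000, 0), (-673, 3000, 757), (673, 3000, 757)),
       "VC2": ((673, 3000, 0), (2011, 2850, 0), (673, 3000, 757), (2011, 3000, 757)),
   }

   # X increases from camera-left to camera-right (Section 5), so sorting by the
   # X coordinate of the bottom left corner gives the left-to-right render order.
   ordered = sorted(areas_of_capture, key=lambda vc: areas_of_capture[vc][0][0])
   print(ordered)   # ['VC0', 'VC1', 'VC2']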
Audio Captures:

o AC0 (camera-left), encoding group=EG3, purpose=main, channel format=mono

o AC1 (camera-right), encoding group=EG3, purpose=main, channel format=mono

o AC2 (center), encoding group=EG3, purpose=main, channel format=mono

o AC3 being a simple pre-mixed audio stream from the room (mono), encoding group=EG3, purpose=main, channel format=mono

o AC4 audio stream associated with the presentation video (mono), encoding group=EG3, purpose=presentation, channel format=mono

Areas of capture:
        bottom left       bottom right      top left            top right
   AC0  (-2011,2850,0)    (-673,3000,0)     (-2011,2850,757)    (-673,3000,757)
   AC1  ( 673,3000,0)     (2011,2850,0)     ( 673,3000,757)     (2011,3000,757)
   AC2  ( -673,3000,0)    ( 673,3000,0)     ( -673,3000,757)    ( 673,3000,757)
   AC3  (-2011,2850,0)    (2011,2850,0)     (-2011,2850,757)    (2011,3000,757)
   AC4  none

The physical simultaneity information is:

{VC0, VC1, VC2, VC3, VC4, VC6}

{VC0, VC2, VC5, VC6}

This constraint indicates it is not possible to use all the VCs at the same time. VC5 cannot be used at the same time as VC1 or VC3 or VC4. Also, using every member in the set simultaneously may not make sense - for example VC3 (loudest) and VC4 (loudest with PIP). (In addition, there are encoding constraints that make choosing all of the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1 has only 3 ENCs. This constraint shows up in the encoding groups, not in the simultaneous transmission sets.)

In this example there are no restrictions on which audio captures can be sent simultaneously.

Encoding Groups:

This example has three encoding groups associated with the video captures. Each group can have 3 encodings, but with each potential encoding having a progressively lower specification. In this example, 1080p60 transmission is possible (as ENC0 has a maxH264Mbps value compatible with that: by the formula in Table 2, 1920x1088 at 60 fps gives ((1920+15)/16) * ((1088+15)/16) * 60 = 120 * 68 * 60 = 489600 macroblocks per second) as long as it is the only active encoding in the group (as maxGroupH264Mbps for the entire encoding group is also 489600). Significantly, as up to 3 encodings are available per group, it is possible to transmit some video captures simultaneously that are not in the same entry in the capture set. For example VC1 and VC3 at the same time.

It is also possible to transmit multiple encodings of a single video capture. For example VC0 can be encoded using ENC0 and ENC1 at the same time, as long as the encoding parameters satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30.
encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
   encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, maxH264Mbps=489600, maxBandwidth=4000000
   encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, maxH264Mbps=108000, maxBandwidth=4000000
   encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, maxH264Mbps=61200, maxBandwidth=4000000

Figure 2: Example Encoding Groups for Video

For audio, there are five potential encodings available, so all five audio captures can be encoded at the same time.

encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
   encodeID=ENC9, maxBandwidth=64000
   encodeID=ENC10, maxBandwidth=64000
   encodeID=ENC11, maxBandwidth=64000
   encodeID=ENC12, maxBandwidth=64000
   encodeID=ENC13, maxBandwidth=64000

Figure 3: Example Encoding Group for Audio

Capture Sets:

The following table represents the capture sets for this provider. Recall that a capture set is composed of alternative captures covering the same scene. Capture Set #1 is for the main people captures, and Capture Set #2 is for presentation.

Each row in the table is a separate entry in the capture set.

   +----------------+
   | Capture Set #1 |
   +----------------+
   | VC0, VC1, VC2  |
   | VC3            |
   | VC4            |
   | VC5            |
   | AC0, AC1, AC2  |
   | AC3            |
   +----------------+

   +----------------+
   | Capture Set #2 |
   +----------------+
   | VC6            |
   | AC4            |
   +----------------+

Different capture sets are distinct from each other and non-overlapping. A consumer can choose an entry from each capture set. In this case the three captures VC0, VC1, and VC2 are one way of representing the video from the endpoint. These three captures should appear adjacent to each other. Alternatively, another way of representing the Capture Scene is with the capture VC3, which automatically shows the person who is talking. Similarly for the VC4 and VC5 alternatives.

As in the video case, the different entries of audio in Capture Set #1 represent the "same thing", in that one way to receive the audio is with the 3 audio captures (AC0, AC1, AC2), and another way is with the mixed AC3. The Media Consumer can choose an audio capture entry it is capable of receiving.

The spatial ordering is conveyed by the area of capture and point of capture media capture attributes.

A Media Consumer would likely want to choose a capture set entry to receive based in part on how many streams it can simultaneously receive.
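As an illustration only (the selection policy and all names below are assumptions, not part of the framework), a consumer given this advertisement might pick a video entry to match its screen count and then check its chosen encodings against the group-wide limits of Section 7.2:

   # Entries of Capture Set #1 that contain video captures.
   video_entries = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]

   def pick_entry(entries, num_screens):
       # Prefer the largest entry that does not exceed the number of screens;
       # Section 11.4 discusses more elaborate consumer policies.
       candidates = [e for e in entries if len(e) <= num_screens]
       return max(candidates, key=len) if candidates else min(entries, key=len)

   def fits_group(chosen, max_group_bw, max_group_mbps):
       # chosen: (bandwidth, macroblocks per second) for each instantiated encoding.
       # This checks only the group-wide limits; per-encoding limits such as
       # maxH264Mbps and maxBandwidth must also be respected individually.
       return (sum(bw for bw, _ in chosen) <= max_group_bw and
               sum(mbps for _, mbps in chosen) <= max_group_mbps)

   print(pick_entry(video_entries, 3))   # ['VC0', 'VC1', 'VC2']
   print(pick_entry(video_entries, 1))   # ['VC3']
   # One 1080p30 stream (244800 macroblocks/s) plus one 720p30 stream (108000)
   # checked against the EG0/EG1/EG2 group limits of Figure 2:
   print(fits_group([(4000000, 244800), (2000000, 108000)], 6000000, 489600))   # True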
A consumer that can receive three people streams would 1169 probably prefer to receive the first entry of Capture Set #1 (VC0, 1170 VC1, VC2) and not receive the other entries. A consumer that can 1171 receive only one people stream would probably choose one of the other 1172 entries. 1174 If the consumer can receive a presentation stream too, it would also 1175 choose to receive the only entry from Capture Set #2 (VC6). 1177 11.2. Encoding Group Example 1179 This is an example of an encoding group to illustrate how it can 1180 express dependencies between encodings. 1182 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1183 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1184 maxH264Mbps=244800, maxBandwidth=4000000 1185 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1186 maxH264Mbps=244800, maxBandwidth=4000000 1187 encodeID=AUDENC0, maxBandwidth=96000 1188 encodeID=AUDENC1, maxBandwidth=96000 1189 encodeID=AUDENC2, maxBandwidth=96000 1191 Here, the encoding group is EG0. It can transmit up to two 1080p30 1192 encodings (Mbps for 1080p = 244800), but it is capable of 1193 transmitting a maxFrameRate of 60 frames per second (fps). To 1194 achieve the maximum resolution (1920 x 1088) the frame rate is 1195 limited to 30 fps. However 60 fps can be achieved at a lower 1196 resolution if required by the consumer. Although the encoding group 1197 is capable of transmitting up to 6Mbit/s, no individual video 1198 encoding can exceed 4Mbit/s. 1200 This encoding group also allows up to 3 audio encodings, AUDENC<0-2>. 1201 It is not required that audio and video encodings reside within the 1202 same encoding group, but if so then the group's overall maxBandwidth 1203 value is a limit on the sum of all audio and video encodings 1204 configured by the consumer. A system that does not wish or need to 1205 combine bandwidth limitations in this way should instead use separate 1206 encoding groups for audio and video in order for the bandwidth 1207 limitations on audio and video to not interact. 1209 Audio and video can be expressed in separate encoding groups, as in 1210 this illustration. 1212 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1213 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1214 maxH264Mbps=244800, maxBandwidth=4000000 1215 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1216 maxH264Mbps=244800, maxBandwidth=4000000 1218 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000 1219 encodeID=AUDENC0, maxBandwidth=96000 1220 encodeID=AUDENC1, maxBandwidth=96000 1221 encodeID=AUDENC2, maxBandwidth=96000 1223 11.3. The MCU Case 1225 This section shows how an MCU might express its Capture Sets, 1226 intending to offer different choices for consumers that can handle 1227 different numbers of streams. A single audio capture stream is 1228 provided for all single and multi-screen configurations that can be 1229 associated (e.g. lip-synced) with any combination of video captures 1230 at the consumer. 

11.3. The MCU Case

This section shows how an MCU might express its Capture Sets,
intending to offer different choices for consumers that can handle
different numbers of streams.  A single audio capture stream is
provided for all single- and multi-screen configurations; it can be
associated (e.g., lip-synced) with any combination of the video
captures at the consumer.

+--------------------+---------------------------------------------+
| Capture Set #1     | note                                        |
+--------------------+---------------------------------------------+
| VC0                | video capture for single screen consumer   |
| VC1, VC2           | video capture for 2 screen consumer        |
| VC3, VC4, VC5      | video capture for 3 screen consumer        |
| VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
| AC0                | audio capture representing all participants|
+--------------------+---------------------------------------------+

If/when a presentation stream becomes active within the conference,
the MCU might re-advertise the available media as:

+----------------+--------------------------------------+
| Capture Set #2 | note                                 |
+----------------+--------------------------------------+
| VC10           | video capture for presentation       |
| AC1            | presentation audio to accompany VC10 |
+----------------+--------------------------------------+

11.4. Media Consumer Behavior

This section gives an example of how a media consumer might behave
when deciding how to request streams from the three screen endpoint
described above.

The receive side of a call needs to balance its requirements (based
on its number of screens and speakers, its decoding capabilities,
and the available bandwidth) against the provider's capabilities in
order to optimally configure the provider's streams.  Typically it
would want to receive and decode media from each capture set
advertised by the provider.

A sane, basic algorithm might be for the consumer to go through each
capture set in turn and find the collection of video captures that
best matches the number of screens it has (this might include
consideration of screens dedicated to presentation video display
rather than "people" video) and then decide between alternative
entries in the video capture sets based either on hard-coded
preferences or user choice.  Once this choice has been made, the
consumer would then decide how to configure the provider's encoding
groups in order to make best use of the available network bandwidth
and its own decoding capabilities.
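
As a rough, non-normative illustration of that last step, the
following Python sketch divides one encoding group's capability
evenly across the captures the consumer has chosen.  The names, the
even split, and the derived figures are assumptions made purely for
illustration and are not part of the framework.

# Non-normative sketch: split an encoding group's capability evenly
# across the N captures chosen from a capture set entry.
def per_stream_budget(group_mbps, group_bandwidth, n_streams, call_bandwidth):
    """Return (macroblocks per second, bits per second) per stream."""
    usable_bandwidth = min(group_bandwidth, call_bandwidth)
    return group_mbps // n_streams, usable_bandwidth // n_streams

# A three-screen consumer on a 6 Mbit/s call, using the example group
# with maxGroupH264Mbps=489600 and maxGroupBandwidth=6000000:
mbps, bw = per_stream_budget(489600, 6000000, 3, 6000000)
print(mbps, bw)   # 163200 macroblocks/s and 2000000 bit/s per stream
# 163200 macroblocks/s is enough for, e.g., 1280x720 at 30 fps
# (80 x 45 = 3600 macroblocks per frame; 3600 * 30 = 108000 <= 163200).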

11.4.1. One screen consumer

VC3, VC4 and VC5 are all different entries by themselves, not
grouped together in a single entry, so the receiving device should
choose between one of those.  The choice would come down to whether
to see the greatest number of participants simultaneously at roughly
equal precedence (VC5), a switched view of just the loudest region
(VC3), or a switched view with PiPs (VC4).  An endpoint device with
a small amount of knowledge of these differences could offer a
dynamic choice of these options, in-call, to the user.

11.4.2. Two screen consumer configuring the example

Combining systems that have an even number of screens ("2n") with
systems that have an odd number of cameras ("2n+1"), and vice versa,
is always likely to be the problematic case.  In this instance, the
behavior is likely to be determined by whether a "2 screen" system
is really a "2 decoder" system, i.e., whether only one received
stream can be displayed per screen or whether more than 2 streams
can be received and spread across the available screen area.  To
enumerate 3 possible behaviors here for the 2 screen system when it
learns that the far end is "ideally" expressed via 3 capture
streams:

1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
    per the 1 screen consumer case above) and either leave one
    screen blank or use it for presentation if/when a presentation
    becomes active.

2.  Receive 3 streams (VC0, VC1 and VC2) and display them across the
    2 screens, either with each capture scaled to 2/3 of a screen
    and the centre capture split across the 2 screens, or, as would
    be necessary if the screens had large bezels, with each stream
    scaled to 1/2 the screen width and height and a 4th "blank"
    panel left over.  This 4th panel could potentially be used for
    any presentation that became active during the call.

3.  Receive 3 streams, decode all 3, and use control information
    indicating which was the most active to switch between showing
    the left and centre streams (one per screen) and the centre and
    right streams.

For an endpoint capable of all 3 methods of working described above,
it might again be appropriate to offer the user the choice of
display mode.

11.4.3. Three screen consumer configuring the example

This is the most straightforward case: the consumer would look to
identify a set of streams to receive that best matches its available
screens, so the entry VC0, VC1, VC2 should match optimally.  The
spatial ordering would give sufficient information for the correct
video capture to be shown on the correct screen.  The consumer would
then either divide a single encoding group's capability by 3 to
determine what resolution and frame rate to configure the provider
with, or configure the individual video captures' encoding groups
with whatever makes most sense, taking into account the receive side
decode capabilities, the overall call bandwidth, the resolution of
the screens, and any user preferences such as motion versus
sharpness.

12. Acknowledgements

Mark Gorzynski contributed much to the approach.  We want to thank
Stephen Botzko for helpful discussions on audio.

13. IANA Considerations

TBD

14. Security Considerations

TBD

15. Informative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
           A., Peterson, J., Sparks, R., Handley, M., and E.
           Schooler, "SIP: Session Initiation Protocol", RFC 3261,
           June 2002.

[RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
           Jacobson, "RTP: A Transport Protocol for Real-Time
           Applications", STD 64, RFC 3550, July 2003.

[RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
           Session Initiation Protocol (SIP)", RFC 4353,
           February 2006.

[RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
           January 2008.

Appendix A. Open Issues

A.1. Video layout arrangements and centralized composition

In the context of a conference with a central MCU, there has been
discussion about a consumer requesting the provider to provide a
certain type of layout arrangement or to perform a certain
composition algorithm, such as combining some number of most recent
talkers, or producing a video layout using a 2x2 grid or 1 large
cell with 5 smaller cells around it.  The current framework does not
address this.
It isn't clear whether this topic should be addressed in this
framework, in a different part of CLUE, or outside of CLUE
altogether.

A.2. Source is selectable

A Boolean variable.  True indicates that the media consumer can
request that a particular media source be mapped to a media capture.
The default is false.

TBD - how does the consumer make the request for a particular
source?  How does the consumer know what is available?  It needs to
be explained better how multiple media captures differ from a single
media capture with a choice of sources, and when each concept should
be used.

A.3. Media Source Selection

The use cases include a case where the person at a receiving
endpoint can request to receive media from a particular other
endpoint, for example in a multipoint call to request to receive the
video from a certain section of a certain room, whether or not
people there are talking.

TBD - this framework should address this case.  Maybe a roster list
of rooms or people in the conference is needed, with a mechanism to
select from the roster and associate the selection with media
captures.  This is different from selecting a particular media
capture from a capture set.  The mechanism to do this will probably
need to be different from selecting media captures based on capture
sets and attributes.

A.4. Endpoint requesting many streams from MCU

TBD - how to do VC (video capture) selection for a system where the
endpoint media consumers want to receive many streams and do their
own composition, rather than having the MCU do the transcoding and
composing.  An example is a 3 screen consumer that wants 3 large
loudest-speaker streams plus a number of small ones to render as
PiP.  It is unclear how the small ones are chosen; the choice could
potentially be made by either the endpoint or the MCU.  There are
other, more complicated examples as well.  Is the current framework
adequate to support this?

A.5. VAD (voice activity detection) tagging of audio streams

TBD - do we want VAD to be mandatory?  All audio streams originating
from a media provider must be tagged with VAD information.  This
tagging would include an overall energy value for the stream plus
information on which sections of the capture scene are "active".

Each audio stream which forms a constituent of an entry within a
capture set should include this tagging, and the energy value within
it should be calculated using a fixed, consistent algorithm.

When a system determines the most active area of a capture scene
(either "loudest", or determined by other means such as a button
press), it should convey that information to the corresponding media
stream consumer via any audio streams being sent within that capture
set.  Specifically, there should be a list of active coordinates and
their VAD characteristics within the audio stream, in addition to
the overall VAD information for the capture set.  This is to ensure
that all media stream consumers receive the same, consistent audio
energy information whichever audio capture or captures they choose
to receive for a capture set.  Additionally, coordinate information
can be mapped to video captures by a media stream consumer so that
it can perform "panel switching" if required.
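
Purely as an illustration of the kind of tagging being discussed
here, and not as a proposal for concrete syntax, per-stream VAD
information of this sort might be modelled along the following lines
(a hypothetical Python structure; all field names and units are
invented for this sketch):

# Illustration only: one possible shape for the VAD tagging described
# above.  Field names and units are hypothetical.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActiveRegion:
    point: Tuple[float, float, float]  # coordinates within the capture scene
    energy: int                        # energy value for that region

@dataclass
class VadTag:
    overall_energy: int                # overall energy for the stream
    active_regions: List[ActiveRegion] = field(default_factory=list)

# Example: a single region of the scene is currently active.
tag = VadTag(overall_energy=42,
             active_regions=[ActiveRegion(point=(0.0, 1.0, 2.0), energy=42)])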

A.6. Private Information

Do we want a way to include private information?

Authors' Addresses

Allyn Romanow
Cisco Systems
San Jose, CA 95134
USA

Email: allyn@cisco.com


Mark Duckworth (editor)
Polycom
Andover, MA 01810
US

Email: mark.duckworth@polycom.com


Andrew Pepperell
Langley, England
UK

Email: apeppere@gmail.com


Brian Baldino
Cisco Systems
San Jose, CA 95134
US

Email: bbaldino@cisco.com