idnits 2.17.1 draft-ietf-clue-framework-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1084 has weird spacing: '...om left bot...' == Line 1135 has weird spacing: '...om left bot...' -- The document date (May 25, 2012) is 4354 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 CLUE WG A. Romanow 3 Internet-Draft Cisco Systems 4 Intended status: Informational M. Duckworth, Ed. 5 Expires: November 26, 2012 Polycom 6 A. Pepperell 8 B. Baldino 9 Cisco Systems 10 May 25, 2012 12 Framework for Telepresence Multi-Streams 13 draft-ietf-clue-framework-05.txt 15 Abstract 17 This memo offers a framework for a protocol that enables devices in a 18 telepresence conference to interoperate by specifying the 19 relationships between multiple media streams. 21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on November 26, 2012. 38 Copyright Notice 40 Copyright (c) 2012 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . 
. . . . . 4 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 7 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 8 60 6. Media Captures and Capture Scenes . . . . . . . . . . . . . . 9 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 10 63 6.2. Capture Scene . . . . . . . . . . . . . . . . . . . . . . 12 64 6.2.1. Capture scene attributes . . . . . . . . . . . . . . . 14 65 6.2.2. Capture scene entry attributes . . . . . . . . . . . . 14 66 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 15 67 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 16 68 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 17 69 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 18 70 8. Associating Media Captures with Encoding Groups . . . . . . . 19 71 9. Consumer's Choice of Streams to Receive from the Provider . . 20 72 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 20 73 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 21 74 9.3. Encoding and encoding group limits . . . . . . . . . . . . 21 75 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 21 76 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 22 77 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 23 78 11.1. Three screen endpoint media provider . . . . . . . . . . . 23 79 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 29 80 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 30 81 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 30 82 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 31 83 11.4.2. Two screen consumer configuring the example . . . . . 31 84 11.4.3. Three screen consumer configuring the example . . . . 32 85 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 86 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 87 14. Security Considerations . . . . . . . . . . . . . . . . . . . 32 88 15. Changes Since Last Version . . . . . . . . . . . . . . . . . . 32 89 16. Informative References . . . . . . . . . . . . . . . . . . . . 33 90 Appendix A. Open Issues . . . . . . . . . . . . . . . . . . . . . 34 91 A.1. Video layout arrangements and centralized composition . . 34 92 A.2. Source is selectable . . . . . . . . . . . . . . . . . . . 34 93 A.3. Media Source Selection . . . . . . . . . . . . . . . . . . 35 94 A.4. Endpoint requesting many streams from MCU . . . . . . . . 35 95 A.5. VAD (voice activity detection) tagging of audio streams . 35 96 A.6. Private Information . . . . . . . . . . . . . . . . . . . 36 97 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 36 99 1. Introduction 101 Current telepresence systems, though based on open standards such as 102 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 103 other. A major factor limiting the interoperability of telepresence 104 systems is the lack of a standardized way to describe and negotiate 105 the use of the multiple streams of audio and video comprising the 106 media flows. This draft provides a framework for a protocol to 107 enable interoperability by handling multiple streams in a 108 standardized way. 
It is intended to support the use cases described 109 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 110 requirements in draft-ietf-clue-telepresence-requirements-01. 112 The solution described here is strongly focused on what is being done 113 today, rather than on a vision of future conferencing. At the same 114 time, the highest priority has been given to creating an extensible 115 framework to make it easy to accommodate future conferencing 116 functionality as it evolves. 118 The purpose of this effort is to make it possible to handle multiple 119 streams of media in such a way that a satisfactory user experience is 120 possible even when participants are using different vendor equipment, 121 and also when they are using devices with different types of 122 communication capabilities. Information about the relationship of 123 media streams at the provider's end must be communicated so that 124 streams can be chosen and audio/video rendering can be done in the 125 best possible manner. 127 There is no attempt here to dictate to the renderer what it should 128 do. What the renderer does is up to the renderer. 130 After the following Definitions, a short section introduces key 131 concepts. The body of the text comprises several sections about the 132 key elements of the framework, how a consumer chooses streams to 133 receive, and some examples. The appendix describe topics that are 134 under discussion for adding to the document. 136 2. Terminology 138 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 139 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 140 document are to be interpreted as described in RFC 2119 [RFC2119]. 142 3. Definitions 144 The definitions marked with an "*" are new; all the others are from 145 *Audio Capture: Media Capture for audio. Denoted as ACn. 147 Camera-Left and Right: For media captures, camera-left and camera- 148 right are from the point of view of a person observing the rendered 149 media. They are the opposite of stage-left and stage-right. 151 Capture Device: A device that converts audio and video input into an 152 electrical signal, in most cases to be fed into a media encoder. 153 Cameras and microphones are examples for capture devices. 155 *Capture Scene: a structure representing the scene that is captured 156 by a collection of capture devices. A capture scene includes 157 attributes and one or more capture scene entries, with each entry 158 including one or more media captures. 160 *Capture Scene Entry: a list of media captures of the same media type 161 that together form one way to represent the capture scene. 163 Conference: used as defined in [RFC4353], A Framework for 164 Conferencing within the Session Initiation Protocol (SIP). 166 *Individual Encoding: A variable with a set of attributes that 167 describes the maximum values of a single audio or video capture 168 encoding. The attributes include: maximum bandwidth- and for video 169 maximum macroblocks (for H.264), maximum width, maximum height, 170 maximum frame rate. 172 *Encoding Group: A set of encoding parameters representing a media 173 provider's encoding capabilities. Media stream providers formed of 174 multiple physical units, in each of which resides some encoding 175 capability, would typically advertise themselves to the remote media 176 stream consumer using multiple encoding groups. 
Within each encoding 177 group, multiple potential encodings are possible, with the sum of the 178 chosen encodings' characteristics constrained to being less than or 179 equal to the group-wide constraints. 181 Endpoint: The logical point of final termination through receiving, 182 decoding and rendering, and/or initiation through capturing, 183 encoding, and sending of media streams. An endpoint consists of one 184 or more physical devices which source and sink media streams, and 185 exactly one [RFC4353] Participant (which, in turn, includes exactly 186 one SIP User Agent). In contrast to an endpoint, an MCU may also 187 send and receive media streams, but it is not the initiator nor the 188 final terminator in the sense that Media is Captured or Rendered. 189 Endpoints can be anything from multiscreen/multicamera rooms to 190 handheld devices. 192 Front: the portion of the room closest to the cameras. In going 193 towards back you move away from the cameras. 195 MCU: Multipoint Control Unit (MCU) - a device that connects two or 196 more endpoints together into one single multimedia conference 197 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 198 tardy in requiring that media from the mixer be sent to EACH 199 participant. I think we have practical use cases where this is not 200 the case. But the bug (if it is one) is in 4353 and not herein.] 202 Media: Any data that, after suitable encoding, can be conveyed over 203 RTP, including audio, video or timed text. 205 *Media Capture: a source of Media, such as from one or more Capture 206 Devices. A Media Capture (MC) may be the source of one or more Media 207 streams. A Media Capture may also be constructed from other Media 208 streams. A middle box can express Media Captures that it constructs 209 from Media streams it receives. 211 *Media Consumer: an Endpoint or middle box that receives media 212 streams 214 *Media Provider: an Endpoint or middle box that sends Media streams 216 Model: a set of assumptions a telepresence system of a given vendor 217 adheres to and expects the remote telepresence system(s) also to 218 adhere to. 220 *Plane of Interest: The spatial plane containing the most relevant 221 subject matter. 223 Render: the process of generating a representation from a media, such 224 as displayed motion video or sound emitted from loudspeakers. 226 *Simultaneous Transmission Set: a set of media captures that can be 227 transmitted simultaneously from a Media Provider. 229 Spatial Relation: The arrangement in space of two objects, in 230 contrast to relation in time or other relationships. See also 231 Camera-Left and Right. 233 Stage-Left and Right: For media captures, stage-left and stage-right 234 are the opposite of camera-left and camera-right. For the case of a 235 person facing (and captured by) a camera, stage-left and stage-right 236 are from the point of view of that person. 238 *Stream: RTP stream as in [RFC3550]. 240 Stream Characteristics: the media stream attributes commonly used in 241 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 242 resolution, profile/level etc.) as well as CLUE specific attributes, 243 such as the ID of a capture or a spatial location. 245 Telepresence: an environment that gives non co-located users or user 246 groups a feeling of (co-located) presence - the feeling that a Local 247 user is in the same room with other Local users and the Remote 248 parties. 
The inclusion of Remote parties is achieved through 249 multimedia communication including at least audio and video signals 250 of high fidelity. 252 *Video Capture: Media Capture for video. Denoted as VCn. 254 Video composite: A single image that is formed from combining visual 255 elements from separate sources. 257 4. Overview of the Framework/Model 259 The CLUE framework specifies how multiple media streams are to be 260 handled in a telepresence conference. 262 The main goals include: 264 o Interoperability 266 o Extensibility 268 o Flexibility 270 Interoperability is achieved by the media provider describing the 271 relationships between media streams in constructs that are understood 272 by the consumer, who can then render the media. Extensibility is 273 achieved through abstractions and the generality of the model, making 274 it easy to add new parameters. Flexibility is achieved largely by 275 having the consumer choose what content and format it wants to 276 receive from what the provider is capable of sending. 278 A transmitting endpoint or MCU describes specific aspects of the 279 content of the media and the formatting of the media streams it can 280 send (advertisement); and the receiving end responds to the provider 281 by specifying which content and media streams it wants to receive 282 (configuration). The provider then transmits the asked for content 283 in the specified streams. 285 This advertisement and configuration occurs at call initiation but 286 may also happen at any time throughout the conference, whenever there 287 is a change in what the consumer wants or the provider can send. 289 An endpoint or MCU typically acts as both provider and consumer at 290 the same time, sending advertisements and sending configurations in 291 response to receiving advertisements. (It is possible to be just one 292 or the other.) 294 The data model is based around two main concepts: a capture and an 295 encoding. A media capture (MC), such as audio or video, describes 296 the content a provider can send. Media captures are described in 297 terms of CLUE-defined attributes, such as spatial relationships and 298 purpose of the capture. Providers tell consumers which media 299 captures they can provide, described in terms of the media capture 300 attributes. 302 A provider organizes its media captures that represent the same scene 303 into capture scenes. A consumer chooses which media captures it 304 wants to receive according to the capture scenes sent by the 305 provider. 307 In addition, the provider sends the consumer a description of the 308 streams it can send in terms of the media attributes of the stream, 309 in particular, well-known audio and video parameters such as 310 bandwidth, frame rate, macroblocks per second. 312 The provider also specifies constraints on its ability to provide 313 media, and the consumer must take these into account in choosing the 314 content and streams it wants. Some constraints are due to the 315 physical limitations of devices - for example, a camera may not be 316 able to provide zoom and non-zoom views simultaneously. Other 317 constraints are system based constraints, such as maximum bandwidth 318 and maximum macroblocks/second. 320 The following sections discuss these constructs and processes in 321 detail, followed by use cases showing how the framework specification 322 can be used. 324 5. 
Spatial Relationships 326 In order for a consumer to perform a proper rendering, it is often 327 necessary to provide spatial information about the streams it is 328 receiving. CLUE defines a coordinate system that allows media 329 providers to describe the spatial relationships of their media 330 captures to enable proper scaling and spatial rendering of their 331 streams. The coordinate system is based on a few principles: 333 o Simple systems which do not have multiple Media Captures to 334 associate spatially need not use the coordinate model. 336 o Coordinates can either be in real, physical units (millimeters), 337 have an unknown scale or have no physical scale. Systems which 338 know their physical dimensions should always provide those real- 339 world measurements. Systems which don't know specific physical 340 dimensions but still know relative distances should use 'unknown 341 scale'. 'No scale' is intended to be used where Media Captures 342 from different devices (with potentially different scales) will be 343 forwarded alongside one another (e.g. in the case of a middle 344 box). 346 * "millimeters" means the scale is in millimeters 348 * "Unknown" means the scale is not necessarily millimeters, but 349 the scale is the same for every capture in the capture scene. 351 * "No Scale" means the scale could be different for each capture 352 - an MCU provider that advertises two adjacent captures and 353 picks sources (which can change quickly) from different 354 endpoints might use this value; the scale could be different 355 and changing for each capture. But the areas of capture still 356 represent a spatial relation between captures. 358 o The coordinate system is Cartesian X, Y, Z with the origin at a 359 spot of the provider's choosing. The provider must use the same 360 origin for all coordinates within the same capture scene. 362 The direction of increasing coordinate values is: 363 X increases from camera left to camera right 364 Y increases from front to back 365 Z increases from low to high 367 6. Media Captures and Capture Scenes 369 This section describes how media providers can describe the content 370 of media to consumers. 372 6.1. Media Captures 374 Media captures are the fundamental representations of streams that a 375 device can transmit. What a Media Capture actually represents is 376 flexible: 378 o It can represent the immediate output of a physical source (e.g. 379 camera, microphone) or 'synthetic' source (e.g. laptop computer, 380 DVD player). 382 o It can represent the output of an audio mixer or video composer 384 o It can represent a concept such as 'the loudest speaker' 386 o It can represent a conceptual position such as 'the leftmost 387 stream' 389 To distinguish between multiple instances, video and audio captures 390 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 391 two different video captures and AC1 and AC2 refer to two different 392 audio captures. 394 Each Media Capture can be associated with attributes to describe what 395 it represents. 397 6.1.1. Media Capture Attributes 399 Media Capture Attributes describe static information about the 400 captures. A provider uses the media capture attributes to describe 401 the media captures to the consumer. The consumer will select the 402 captures it wants to receive. Attributes are defined by a variable 403 and its value. 
The currently defined attributes and their values 404 are: 406 Content: {slides, speaker, sl, main, alt} 408 A field with enumerated values which describes the role of the media 409 capture and can be applied to any media type. The enumerated values 410 are defined by [RFC4796]. The values for this attribute are the same 411 as the mediacnt values for the content attribute in [RFC4796]. This 412 attribute can have multiple values, for example content={main, 413 speaker}. 415 Composed: {true, false} 417 A field with a Boolean value which indicates whether or not the Media 418 Capture is a mix (audio) or composition (video) of streams. 420 This attribute is useful for a media consumer to avoid nesting a 421 composed video capture into another composed capture or rendering. 422 This attribute is not intended to describe the layout a media 423 provider uses when composing video streams. 425 Audio Channel Format: {mono, stereo} A field with enumerated values 426 which describes the method of encoding used for audio. 428 A value of 'mono' means the Audio Capture has one channel. 430 A value of 'stereo' means the Audio Capture has two audio channels, 431 left and right. 433 This attribute applies only to Audio Captures. A single stereo 434 capture is different from two mono captures that have a left-right 435 spatial relationship. A stereo capture maps to a single RTP stream, 436 while each mono audio capture maps to a separate RTP stream. 438 Switched: {true, false} 440 A field with a Boolean value which indicates whether or not the Media 441 Capture represents the (dynamic) most appropriate subset of a 442 'whole'. What is 'most appropriate' is up to the provider and could 443 be the active speaker, a lecturer or a VIP. 445 Point of Capture: {(X, Y, Z)} A field with a single Cartesian (X, Y, 446 Z) point value which describes the spatial location, virtual or 447 physical, of the capturing device (such as camera). 449 When the Point of Capture attribute is specified, it must include X, 450 Y and Z coordinates. If the point of capture is not specified, it 451 means the consumer should not assume anything about the spatial 452 location of the capturing device. Even if the provider specifies an 453 area of capture attribute, it does not need to specify the point of 454 capture. 456 Area of Capture: 458 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 459 Z3), top right(X4, Y4, Z4)} 461 A field with a set of four (X, Y, Z) points as a value which describe 462 the spatial location of what is being "captured". By comparing the 463 Area of Capture for different Media Captures within the same capture 464 scene a consumer can determine the spatial relationships between them 465 and render them correctly. 467 The four points should be co-planar. The four points form a 468 quadrilateral, not necessarily a rectangle. 470 The quadrilateral described by the four (X, Y, Z) points defines the 471 plane of interest for the particular media capture. 473 If the area of capture attribute is specified, it must include X, Y 474 and Z coordinates for all four points. If the area of capture is not 475 specified, it means the media capture is not spatially related to any 476 other media capture (but this can change in a subsequent provider 477 advertisement). 479 For a switched capture that switches between different sections 480 within a larger area, the area of capture should use coordinates for 481 the larger potential area. 
483 EncodingGroup: {} 485 A field with a value equal to the encodeGroupID of the encoding group 486 associated with the media capture. 488 6.2. Capture Scene 490 In order for a provider's individual media captures to be used 491 effectively by a consumer, the provider organizes the media captures 492 into capture scenes, with the structure and contents of these capture 493 scenes being sent from the provider to the consumer. 495 A capture scene is a structure representing the scene that is 496 captured by a collection of capture devices. A capture scene 497 includes one or more capture scene entries, with each entry including 498 one or more media captures. A capture scene represents, for example, 499 the video image of a group of people seated next to each other, along 500 with the sound of their voices, which could be represented by some 501 number of VCs and ACs in the capture scene entries. A middle box may 502 also express capture scenes that it constructs from media streams it 503 receives. 505 A provider may advertise multiple capture scenes or just a single 506 capture scene. A media provider might typically use one capture 507 scene for main participant media and another capture scene for a 508 computer generated presentation. A capture scene may include more 509 than one type of media. For example, a capture scene can include 510 several capture scene entries for video captures, and several capture 511 scene entries for audio captures. 513 A provider can express spatial relationships between media captures 514 that are included in the same capture scene. But there is no spatial 515 relationship between media captures that are in different capture 516 scenes. 518 A media provider arranges media captures in a capture scene to help 519 the media consumer choose which captures it wants. The capture scene 520 entries in a capture scene are different alternatives the provider is 521 suggesting for representing the capture scene. The media consumer 522 can choose to receive all media captures from one capture scene entry 523 for each media type (e.g. audio and video), or it can pick and choose 524 media captures regardless of how the provider arranges them in 525 capture scene entries. 527 Media captures within the same capture scene entry must be of the 528 same media type - it is not possible to mix audio and video captures 529 in the same capture scene entry, for instance. The provider must be 530 capable of encoding and sending all media captures in a single entry 531 simultaneously. A consumer may decide to receive all the media 532 captures in a single capture scene entry, but a consumer could also 533 decide to receive just a subset of those captures. A consumer can 534 also decide to receive media captures from different capture scene 535 entries. 537 When a provider advertises a capture scene with multiple entries, it 538 is essentially signaling that there are multiple representations of 539 the same scene available. In some cases, these multiple 540 representations would typically be used simultaneously (for instance 541 a "video entry" and an "audio entry"). In some cases the entries 542 would conceptually be alternatives (for instance an entry consisting 543 of 3 video captures versus an entry consisting of just a single video 544 capture). In this latter example, the provider would in the simple 545 case end up providing to the consumer the entry containing the number 546 of video captures that most closely matched the media consumer's 547 number of display devices. 
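To make the relationship between capture scenes, capture scene entries and media captures concrete, the following non-normative sketch shows one hypothetical way an implementation might represent these structures internally, written here in Python purely for illustration. The names (MediaCapture, CaptureSceneEntry, CaptureScene) are invented for this sketch and are not defined by this framework.

   # Hypothetical internal representation of the advertisement data
   # model described above; not a normative part of the framework.
   from dataclasses import dataclass, field
   from typing import Dict, List

   @dataclass
   class MediaCapture:
       capture_id: str                  # e.g. "VC0" or "AC0"
       media_type: str                  # "video" or "audio"
       encoding_group_id: str           # encodeGroupID of the associated group
       attributes: Dict[str, object] = field(default_factory=dict)

   @dataclass
   class CaptureSceneEntry:
       captures: List[MediaCapture]

       def media_type(self) -> str:
           # Captures within one entry must all be of the same media type.
           types = {c.media_type for c in self.captures}
           assert len(types) == 1, "mixed media types in one entry"
           return types.pop()

   @dataclass
   class CaptureScene:
       description: str
       entries: List[CaptureSceneEntry] = field(default_factory=list)

   # A scene offering three spatially related video captures, a single
   # switched alternative, and one audio capture.
   scene = CaptureScene("main room", [
       CaptureSceneEntry([MediaCapture("VC0", "video", "EG0"),
                          MediaCapture("VC1", "video", "EG1"),
                          MediaCapture("VC2", "video", "EG2")]),
       CaptureSceneEntry([MediaCapture("VC3", "video", "EG1",
                                       {"switched": True})]),
       CaptureSceneEntry([MediaCapture("AC0", "audio", "EG3")]),
   ])
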
549 The following is an example of 4 potential capture scene entries for 550 an endpoint-style media provider: 552 1. (VC0, VC1, VC2) - left, center and right camera video captures 554 2. (VC3) - video capture associated with loudest room segment 556 3. (VC4) - video capture zoomed out view of all people in the room 558 4. (AC0) - main audio 560 The first entry in this capture scene example is a list of video 561 captures with a spatial relationship to each other. Determination of 562 the order of these captures (VC0, VC1 and VC2) for rendering purposes 563 is accomplished through use of their Area of Capture attributes. The 564 second entry (VC3) and the third entry (VC4) are additional 565 alternatives of how to capture the same room in different ways. The 566 inclusion of the audio capture in the same capture scene indicates 567 that AC0 is associated with those video captures, meaning it comes 568 from the same scene. The audio should be rendered in conjunction 569 with any rendered video captures from the same capture scene. 571 6.2.1. Capture scene attributes 573 Attributes can be applied to capture scenes as well as to individual 574 media captures. Attributes specified at this level apply to all 575 constituent media captures. 577 Description attribute 579 The description attribute is a human readable text string which 580 describes the capture scene. A provider that advertises multiple 581 capture scenes may use different descriptions to differentiate 582 between them. This attribute can contain text in any language. 584 Description Language attribute 586 This attribute contains only one language, which is the language of 587 the text in the description attribute. The possible values of this 588 element are the values of the 'Subtag' column of the "Language Subtag 589 Registry" at [IANA-Lan] originally defined in [RFC5646]. 591 Area of Scene attribute 593 The area of scene attribute for a capture scene has the same format 594 as the area of capture attribute for a media capture. The area of 595 scene is for the entire scene, which is captured by the one or more 596 media captures in the capture scene entries. If the provider does 597 not specify the area of scene, but does specify areas of capture, 598 then the consumer may assume the area of scene is greater than or 599 equal to the outer extents of the individual areas of capture. 601 Scale attribute 603 An optional attribute indicating if the numbers used for area of 604 scene, area of capture and point of capture are in terms of 605 millimeters, unknown scale factor, or not any scale, as described in 606 Section 5. If any media captures have an area of capture attribute 607 or point of capture attribute, then this scale attribute must also be 608 defined. The possible values for this attribute are: 610 "millimeters" 611 "unknown" 612 "no scale" 614 6.2.2. Capture scene entry attributes 616 Attributes can be applied to capture scene entries. Attributes 617 specified at this level apply to the capture scene entry as a whole. 619 Scene-switch-policy: {site-switch, segment-switch} 621 A media provider uses this scene-switch-policy attribute to indicate 622 its support for different switching policies. In the provider's 623 advertisement, this attribute can have multiple values, which means 624 the provider supports each of the indicated policies. 
The consumer, 625 when it requests media captures from this capture scene entry, should 626 also include this attribute but with only the single value (from 627 among the values indicated by the provider) indicating the consumer's 628 choice for which policy it wants the provider to use. If the 629 provider does not support any of these policies, it should omit this 630 attribute.

632 The "site-switch" policy means all captures are switched at the same 633 time to keep captures from the same endpoint site together. Let's 634 say the speaker is at site A and everyone else is at a "remote" site. 635 When the room at site A is shown, all the camera images from site A are 636 forwarded to the remote sites. Therefore at each receiving remote 637 site, all the screens display camera images from site A. This can be 638 used to preserve full size image display, and also provide full 639 visual context of the displayed far end, site A. In site switching, 640 there is a fixed relation between the cameras in each room and the 641 displays in remote rooms. The room or participants being shown is 642 switched from time to time based on who is speaking or by manual 643 control.

645 The "segment-switch" policy means different captures can switch at 646 different times, and can be coming from different endpoints. Still 647 using site A as where the speaker is, and "remote" to refer to all 648 the other sites, in segment switching, rather than sending all the 649 images from site A, only the image containing the speaker at site A 650 is shown. The camera images of the current speaker and previous 651 speakers (if any) are forwarded to the other sites in the conference. 652 Therefore the screens in each site are usually displaying images from 653 different remote sites - the current speaker at site A and the 654 previous ones. This strategy can be used to preserve full size image 655 display, and also capture the non-verbal communication between the 656 speakers. In segment switching, the display depends on the activity 657 in the remote rooms - generally, but not necessarily, based on audio / 658 speech detection.

660 6.3. Simultaneous Transmission Set Constraints

662 The provider may have constraints or limitations on its ability to 663 send media captures. One type is caused by the physical limitations 664 of capture mechanisms; these constraints are represented by a 665 simultaneous transmission set. The second type of limitation 666 reflects the encoding resources available - bandwidth and 667 macroblocks/second. This type of constraint is captured by encoding 668 groups, discussed below.

670 An endpoint or MCU can send multiple captures simultaneously; however, 671 sometimes there are constraints that limit which captures can be sent 672 simultaneously with other captures. A device may not be able to be 673 used in different ways at the same time. Provider advertisements are 674 made so that the consumer will choose one of several possible 675 mutually exclusive usages of the device. This type of constraint is 676 expressed in a Simultaneous Transmission Set, which lists all the 677 media captures that can be sent at the same time. This is easier to 678 show in an example.

680 Consider the example of a room system where there are 3 cameras, each 681 of which can send a separate capture covering 2 persons each - VC0, 682 VC1, VC2. The middle camera can also zoom out and show all 6 683 persons, VC3.
But the middle camera cannot be used in both modes at 684 the same time - it has to either show the space where 2 participants 685 sit or the whole 6 seats, but not both at the same time.

687 Simultaneous transmission sets are expressed as sets of the MCs that 688 could physically be transmitted at the same time (though it may not 689 make sense to do so). In this example the two simultaneous sets are 690 shown in Table 1. The consumer must make sure that it chooses one 691 and not more of the mutually exclusive sets. A consumer may choose 692 any subset of the media captures in a simultaneous set; it does not 693 have to choose all the captures in a simultaneous set if it does not 694 want to receive all of them.

696   +-------------------+
697   | Simultaneous Sets |
698   +-------------------+
699   | {VC0, VC1, VC2}   |
700   | {VC0, VC3, VC2}   |
701   +-------------------+

703   Table 1: Two Simultaneous Transmission Sets

705 A media provider includes the simultaneous sets in its provider 706 advertisement. These simultaneous set constraints apply across all 707 the capture scenes in the advertisement. The simultaneous 708 transmission sets MUST allow all the media captures in a particular 709 capture scene entry to be used simultaneously.

711 7. Encodings

713 We have considered how providers can describe the content of media to 714 consumers. We will now consider how the providers communicate 715 information about their abilities to send streams. We introduce two 716 constructs - individual encodings and encoding groups. Consumers 717 will then map the media captures they want onto the encodings with 718 encoding parameters they want. This process is then described.

720 7.1. Individual Encodings

722 An individual encoding represents a way to encode a media capture to 723 become an encoded media stream sent from the media provider to the 724 media consumer. An individual encoding has a set of parameters 725 characterizing how the media is encoded. Different media types have 726 different parameters, and different encoding algorithms may have 727 different parameters. An individual encoding can be used for only 728 one actual encoded media stream at a time.

730 The parameters of an individual encoding represent the maximum 731 values for certain aspects of the encoding. A particular 732 instantiation into an encoded stream might use lower values than 733 these maximums.

735 The following tables show the variables for audio and video encoding.
737 +--------------+----------------------------------------------------+ 738 | Name | Description | 739 +--------------+----------------------------------------------------+ 740 | encodeID | A unique identifier for the individual encoding | 741 | maxBandwidth | Maximum number of bits per second | 742 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 743 | | + 15) / 16) * ((height + 15) / 16) * | 744 | | framesPerSecond | 745 | maxWidth | Video resolution's maximum supported width, | 746 | | expressed in pixels | 747 | maxHeight | Video resolution's maximum supported height, | 748 | | expressed in pixels | 749 | maxFrameRate | Maximum supported frame rate | 750 +--------------+----------------------------------------------------+ 752 Table 2: Individual Video Encoding Parameters 754 +--------------+-----------------------------------+ 755 | Name | Description | 756 +--------------+-----------------------------------+ 757 | maxBandwidth | Maximum number of bits per second | 758 +--------------+-----------------------------------+ 760 Table 3: Individual Audio Encoding Parameters 762 7.2. Encoding Group 764 An encoding group includes a set of one or more individual encodings, 765 plus some parameters that apply to the group as a whole. By grouping 766 multiple individual encodings together, an encoding group describes 767 additional constraints on bandwidth and other parameters for the 768 group. Table 4 shows the parameters and individual encoding sets 769 that are part of an encoding group. 771 +-------------------+-----------------------------------------------+ 772 | Name | Description | 773 +-------------------+-----------------------------------------------+ 774 | encodeGroupID | A unique identifier for the encoding group | 775 | maxGroupBandwidth | Maximum number of bits per second relating to | 776 | | all encodings combined | 777 | maxGroupH264Mbps | Maximum number of macroblocks per second | 778 | | relating to all video encodings combined | 779 | videoEncodings[] | Set of potential encodings (list of | 780 | | encodeIDs) | 781 | audioEncodings[] | Set of potential encodings (list of | 782 | | encodeIDs) | 783 +-------------------+-----------------------------------------------+ 785 Table 4: Encoding Group 787 When the individual encodings in a group are instantiated into actual 788 encoded media streams, each stream has a bandwidth that must be less 789 than or equal to the maxBandwidth for the particular individual 790 encoding. The maxGroupBandwidth parameter gives the additional 791 restriction that the sum of all the individual instantiated 792 bandwidths must be less than or equal to the maxGroupBandwidth value. 794 Likewise, the sum of the macroblocks per second of each instantiated 795 encoding in the group must not exceed the maxGroupH264Mbps value. 797 The following diagram illustrates the structure of a media provider's 798 Encoding Groups and their contents. 800 ,-------------------------------------------------. 801 | Media Provider | 802 | | 803 | ,--------------------------------------. | 804 | | ,--------------------------------------. | 805 | | | ,--------------------------------------. | 806 | | | | Encoding Group | | 807 | | | | ,-----------. | | 808 | | | | | | ,---------. 
| | 809 | | | | | | | | ,---------.| | 810 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 811 | `.| | | | | | `---------'| | 812 | `.| `-----------' `---------' | | 813 | `--------------------------------------' | 814 `-------------------------------------------------' 816 Figure 1: Encoding Group Structure 818 A media provider advertises one or more encoding groups. Each 819 encoding group includes one or more individual encodings. Each 820 individual encoding can represent a different way of encoding media. 821 For example one individual encoding may be 1080p60 video, another 822 could be 720p30, with a third being CIF. 824 While a typical 3 codec/display system might have one encoding group 825 per "codec box", there are many possibilities for the number of 826 encoding groups a provider may be able to offer and for the encoding 827 values in each encoding group. 829 There is no requirement for all encodings within an encoding group to 830 be instantiated at once. 832 8. Associating Media Captures with Encoding Groups 834 Every media capture is associated with an encoding group, which is 835 used to instantiate that media capture into one or more encoded 836 streams. Each media capture has an encoding group attribute. The 837 value of this attribute is the encodeGroupID for the encoding group 838 with which it is associated. More than one media capture may use the 839 same encoding group. 841 The maximum number of streams that can result from a particular 842 encoding group constraint is equal to the number of individual 843 encodings in the group. The actual number of streams used at any 844 time may be less than this maximum. Any of the media captures that 845 use a particular encoding group can be encoded according to any of 846 the individual encodings in the group. If there are multiple 847 individual encodings in the group, then a single media capture can be 848 encoded into multiple different streams at the same time, with each 849 stream following the constraints of a different individual encoding. 851 The Encoding Groups MUST allow all the media captures in a particular 852 capture scene entry to be used simultaneously. 854 9. Consumer's Choice of Streams to Receive from the Provider 856 After receiving the provider's advertised media captures and 857 associated constraints, the consumer must choose which media captures 858 it wishes to receive, and which individual encodings from the 859 provider it wants to use to encode the capture. Each media capture 860 has an encoding group ID attribute which specifies which individual 861 encodings are available to be used for that media capture. 863 For each media capture the consumer wants to receive, it configures 864 one or more of the encodings in that capture's encoding group. The 865 consumer does this by telling the provider the resolution, frame 866 rate, bandwidth, etc. when asking for streams for its chosen 867 captures. Upon receipt of this configuration command from the 868 consumer, the provider generates streams for each such configured 869 encoding and sends those streams to the consumer. 871 The consumer must have received at least one capture advertisement 872 from the provider to be able to configure the provider's generation 873 of media streams. 875 The consumer is able to change its configuration of the provider's 876 encodings any number of times during the call, either in response to 877 a new capture advertisement from the provider or autonomously. 
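A consumer's configuration is only usable by the provider if each requested stream fits within the individual encoding it is mapped to, and if the per-group totals fit within the encoding group limits (see Section 9.3). The following non-normative Python sketch illustrates such a check; the function and field names are hypothetical and are not defined by this framework.

   # Hypothetical validity check for a consumer's chosen configuration;
   # illustrative only, not a normative algorithm.
   def configuration_is_valid(streams, encodings, groups):
       # streams:   list of {"encodeID", "bandwidth", "h264Mbps"} per stream
       # encodings: encodeID -> {"maxBandwidth", "maxH264Mbps", "groupID"}
       # groups:    encodeGroupID -> {"maxGroupBandwidth", "maxGroupH264Mbps"}
       group_bw = {}
       group_mbps = {}
       for s in streams:
           enc = encodings[s["encodeID"]]
           # Each stream must respect its individual encoding limits.
           if s["bandwidth"] > enc["maxBandwidth"]:
               return False
           if s.get("h264Mbps", 0) > enc.get("maxH264Mbps", 0):
               return False
           gid = enc["groupID"]
           group_bw[gid] = group_bw.get(gid, 0) + s["bandwidth"]
           group_mbps[gid] = group_mbps.get(gid, 0) + s.get("h264Mbps", 0)
       # The sums per encoding group must respect the group-wide limits.
       for gid in group_bw:
           if group_bw[gid] > groups[gid]["maxGroupBandwidth"]:
               return False
           if group_mbps[gid] > groups[gid]["maxGroupH264Mbps"]:
               return False
       return True
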
The 878 consumer need not send a new configure message to the provider when 879 it receives a new capture advertisement from the provider unless the 880 contents of the new capture advertisement cause the consumer's 881 current configure message to become invalid.

883 When choosing which streams to receive from the provider, and the 884 encoding characteristics of those streams, the consumer needs to take 885 several things into account: its local preference, simultaneity 886 restrictions, and encoding limits.

888 9.1. Local preference

890 A variety of local factors will influence the consumer's choice of 891 streams to be received from the provider:

893 o if the consumer is an endpoint, it is likely that it would choose, 894 where possible, to receive video and audio captures that match the 895 number of display devices and audio system it has

897 o if the consumer is a middle box such as an MCU, it may choose to 898 receive loudest speaker streams (in order to perform its own media 899 composition) and avoid pre-composed video captures

901 o user choice (for instance, selection of a new layout) may result 902 in a different set of media captures, or different encoding 903 characteristics, being required by the consumer

905 9.2. Physical simultaneity restrictions

907 There may be physical simultaneity constraints imposed by the 908 provider that affect the provider's ability to simultaneously send 909 all of the captures the consumer would wish to receive. For 910 instance, a middle box such as an MCU, when connected to a multi- 911 camera room system, might prefer to receive both individual camera 912 streams of the people present in the room and an overall view of the 913 room from a single camera. Some endpoint systems might be able to 914 provide both of these sets of streams simultaneously, whereas others 915 may not (if the overall room view were produced by changing the zoom 916 level on the center camera, for instance).

918 9.3. Encoding and encoding group limits

920 Each of the provider's encoding groups has limits on bandwidth and 921 macroblocks per second, and the constituent potential encodings have 922 limits on the bandwidth, macroblocks per second, video frame rate, 923 and resolution that can be provided. When choosing the media 924 captures to be received from a provider, a consumer device must 925 ensure that the encoding characteristics requested for each 926 individual media capture fit within the capability of the encoding 927 it is being configured to use, as well as ensuring that the combined 928 encoding characteristics for media captures fit within the 929 capabilities of their associated encoding groups. In some cases, 930 this could cause an otherwise "preferred" choice of streams to be 931 passed over in favour of different streams - for instance, if a set 932 of 3 media captures could only be provided at a low resolution then a 933 3 screen device could switch to favoring a single, higher quality, 934 stream.

936 9.4. Message Flow

938 The following diagram shows the basic flow of messages between a 939 media provider and a media consumer. The usage of the "capture 940 advertisement" and "configure encodings" messages is described above.

942 The consumer also sends its own capability message to the provider 943 which may contain information about its own capabilities or 944 restrictions.
946 Diagram for Message Flow

948      Media Consumer                        Media Provider
949      --------------                        --------------
950            |                                     |
951            |----- Consumer Capability ---------->|
952            |                                     |
953            |                                     |
954            |<---- Capture advertisement ---------|
955            |                                     |
956            |                                     |
957            |------ Configure encodings --------->|
958            |                                     |

960 In order for a maximally-capable provider to be able to advertise a 961 manageable number of video captures to a consumer, there is a 962 potential use for the consumer, at the start of CLUE, to be able to 963 inform the provider of its capabilities. One example here would be 964 the video capture attribute set - a consumer could tell the provider 965 the complete set of video capture attributes it is able to understand 966 and so the provider would be able to reduce the capture scene it 967 advertises so that it is tailored to the consumer.

969 TBD - the content of the consumer capability message needs to be 970 better defined. The authors believe there is a need for this 971 message, but have not worked out the details yet.

973 10. Extensibility

975 One of the most important characteristics of the Framework is its 976 extensibility. Telepresence is a relatively new industry and while 977 we can foresee certain directions, we also do not know everything 978 about how it will develop. The standard for interoperability and 979 handling multiple streams must be future-proof.

981 The framework itself is inherently extensible through expanding the 982 data model types. For example:

984 o Adding more types of media, such as telemetry, can be done by 985 defining additional types of captures in addition to audio and 986 video.

988 o Adding new functionality, such as 3-D, will require 989 additional attributes describing the captures.

991 o Adding new codecs, such as H.265, can be accomplished by 992 defining new encoding variables.

994 The infrastructure is designed to be extended rather than requiring 995 new infrastructure elements. Extension comes through adding to 996 defined types.

998 Assuming the implementation is in something like XML, adding data 999 elements and attributes makes extensibility easy.

1001 11. Examples - Using the Framework

1003 This section shows in more detail some examples of how to use the 1004 framework to represent typical cases for telepresence rooms. First 1005 an endpoint is illustrated, then an MCU case is shown.

1007 11.1. Three screen endpoint media provider

1009 Consider an endpoint with the following description:

1011 o 3 cameras, 3 displays, a 6 person table

1013 o Each video device can provide one capture for each 1/3 section of 1014 the table

1016 o A single capture representing the active speaker can be provided

1018 o A single capture representing the active speaker with the other 2 1019 captures shown picture in picture within the stream can be 1020 provided

1022 o A capture showing a zoomed out view of all 6 seats in the room can 1023 be provided

1025 The audio and video captures for this endpoint can be described as 1026 follows.
1028 Video Captures: 1030 o VC0- (the camera-left camera stream), encoding group=EG0, 1031 content=main, switched=false 1033 o VC1- (the center camera stream), encoding group=EG1, content=main, 1034 switched=false 1036 o VC2- (the camera-right camera stream), encoding group=EG2, 1037 content=main, switched=false 1039 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1040 switched=true 1042 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1043 content=main, composed=true, switched=true 1045 o VC5- (the zoomed out view of all people in the room), encoding 1046 group=EG1, content=main, composed=false, switched=false 1048 o VC6- (presentation stream), encoding group=EG1, content=slides, 1049 switched=false 1051 The following diagram is a top view of the room with 3 cameras, 3 1052 displays, and 6 seats. Each camera is capturing 2 people. The six 1053 seats are not all in a straight line. 1055 ,-. d 1056 ( )`--.__ +---+ 1057 `-' / `--.__ | | 1058 ,-. | `-.._ |_-+Camera 2 (VC2) 1059 ( ).' ___..-+-''`+-+ 1060 `-' |_...---'' | | 1061 ,-.c+-..__ +---+ 1062 ( )| ``--..__ | | 1063 `-' | ``+-..|_-+Camera 1 (VC1) 1064 ,-. | __..--'|+-+ 1065 ( )| __..--' | | 1066 `-'b|..--' +---+ 1067 ,-. |``---..___ | | 1068 ( )\ ```--..._|_-+Camera 0 (VC0) 1069 `-' \ _..-''`-+ 1070 ,-. \ __.--'' | | 1071 ( ) |..-'' +---+ 1072 `-' a 1074 The two points labeled b and c are intended to be at the midpoint 1075 between the seating positions, and where the fields of view of the 1076 cameras intersect. 1077 The plane of interest for VC0 is a vertical plane that intersects 1078 points 'a' and 'b'. 1079 The plane of interest for VC1 intersects points 'b' and 'c'. 1080 The plane of interest for VC2 intersects points 'c' and 'd'. 1081 This example uses an area scale of millimeters. 1083 Areas of capture: 1084 bottom left bottom right top left top right 1085 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1086 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1087 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1088 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1089 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1090 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1091 VC6 none 1093 Points of capture: 1094 VC0 (-1678,0,800) 1095 VC1 (0,0,800) 1096 VC2 (1678,0,800) 1097 VC3 none 1098 VC4 none 1099 VC5 (0,0,800) 1100 VC6 none 1102 In this example, the right edge of the VC0 area lines up with the 1103 left edge of the VC1 area. It doesn't have to be this way. There 1104 could be a gap or an overlap. One additional thing to note for this 1105 example is the distance from a to b is equal to the distance from b 1106 to c and the distance from c to d. All these distances are 1346 mm. 1107 This is the planar width of each area of capture for VC0, VC1, and 1108 VC2. 1110 Note the text in parentheses (e.g. "the camera-left camera stream") 1111 is not explicitly part of the model, it is just explanatory text for 1112 this example, and is not included in the model with the media 1113 captures and attributes. Also, the "composed" boolean attribute 1114 doesn't say anything about how a capture is composed, so the media 1115 consumer can't tell based on this attribute that VC4 is composed of a 1116 "loudest panel with PiPs". 
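The area of capture values above are enough for a consumer to recover the camera-left to camera-right arrangement of VC0, VC1 and VC2, because X increases from camera left to camera right (Section 5). The following non-normative Python sketch, using the example coordinates, is one hypothetical way a consumer might derive a rendering order; it is purely illustrative and not part of the framework.

   # Non-normative sketch: derive a camera-left to camera-right ordering
   # from the example area of capture values (bottom left corner X, in mm).
   areas = {
       # capture: (bottom left, bottom right, top left, top right)
       "VC0": ((-2011, 2850, 0), (-673, 3000, 0),
               (-2011, 2850, 757), (-673, 3000, 757)),
       "VC1": ((-673, 3000, 0), (673, 3000, 0),
               (-673, 3000, 757), (673, 3000, 757)),
       "VC2": ((673, 3000, 0), (2011, 2850, 0),
               (673, 3000, 757), (2011, 3000, 757)),
   }

   def left_to_right(entry):
       # Sort captures by the X coordinate of the bottom left corner.
       return sorted(entry, key=lambda cap: areas[cap][0][0])

   print(left_to_right(["VC2", "VC0", "VC1"]))   # ['VC0', 'VC1', 'VC2']
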
1118 Audio Captures: 1120 o AC0 (camera-left), encoding group=EG3, content=main, channel 1121 format=mono 1123 o AC1 (camera-right), encoding group=EG3, content=main, channel 1124 format=mono 1126 o AC2 (center) encoding group=EG3, content=main, channel format=mono 1128 o AC3 being a simple pre-mixed audio stream from the room (mono), 1129 encoding group=EG3, content=main, channel format=mono 1131 o AC4 audio stream associated with the presentation video (mono) 1132 encoding group=EG3, content=slides, channel format=mono 1134 Areas of capture: 1135 bottom left bottom right top left top right 1136 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1137 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1138 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1139 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1140 AC4 none 1142 The physical simultaneity information is: 1144 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1146 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1148 This constraint indicates it is not possible to use all the VCs at 1149 the same time. VC5 can not be used at the same time as VC1 or VC3 or 1150 VC4. Also, using every member in the set simultaneously may not make 1151 sense - for example VC3(loudest) and VC4 (loudest with PIP). (In 1152 addition, there are encoding constraints that make choosing all of 1153 the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and 1154 EG1 has only 3 ENCs. This constraint shows up in the encoding 1155 groups, not in the simultaneous transmission sets.) 1157 In this example there are no restrictions on which audio captures can 1158 be sent simultaneously. 1160 Encoding Groups: 1162 This example has three encoding groups associated with the video 1163 captures. Each group can have 3 encodings, but with each potential 1164 encoding having a progressively lower specification. In this 1165 example, 1080p60 transmission is possible (as ENC0 has a maxMbps 1166 value compatible with that) as long as it is the only active encoding 1167 in the group(as maxMbps for the entire encoding group is also 1168 489600). Significantly, as up to 3 encodings are available per 1169 group, it is possible to transmit some video captures simultaneously 1170 that are not in the same entry in the capture scene. For example VC1 1171 and VC3 at the same time. 1173 It is also possible to transmit multiple encodings of a single video 1174 capture. For example VC0 can be encoded using ENC0 and ENC1 at the 1175 same time, as long as the encoding parameters satisfy the constraints 1176 of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30. 
1178 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000 1179 encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1180 maxH264Mbps=489600, maxBandwidth=4000000 1181 encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1182 maxH264Mbps=108000, maxBandwidth=4000000 1183 encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, 1184 maxH264Mbps=61200, maxBandwidth=4000000 1186 encodeGroupID=EG1 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000 1187 encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1188 maxH264Mbps=489600, maxBandwidth=4000000 1189 encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1190 maxH264Mbps=108000, maxBandwidth=4000000 1191 encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, 1192 maxH264Mbps=61200, maxBandwidth=4000000 1194 encodeGroupID=EG2 maxGroupH264Mbps=489600 maxGroupBandwidth=6000000 1195 encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1196 maxH264Mbps=489600, maxBandwidth=4000000 1197 encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1198 maxH264Mbps=108000, maxBandwidth=4000000 1199 encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, 1200 maxH264Mbps=61200, maxBandwidth=4000000 1202 Figure 2: Example Encoding Groups for Video 1204 For audio, there are five potential encodings available, so all five 1205 audio captures can be encoded at the same time. 1207 encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000 1208 encodeID=ENC9, maxBandwidth=64000 1209 encodeID=ENC10, maxBandwidth=64000 1210 encodeID=ENC11, maxBandwidth=64000 1211 encodeID=ENC12, maxBandwidth=64000 1212 encodeID=ENC13, maxBandwidth=64000 1214 Figure 3: Example Encoding Group for Audio 1216 Capture Scenes: 1218 The following table represents the capture scenes for this provider. 1219 Recall that a capture scene is composed of alternative capture scene 1220 entries covering the same scene. Capture Scene #1 is for the main 1221 people captures, and Capture Scene #2 is for presentation. 1223 Each row in the table is a separate entry in the capture scene 1225 +------------------+ 1226 | Capture Scene #1 | 1227 +------------------+ 1228 | VC0, VC1, VC2 | 1229 | VC3 | 1230 | VC4 | 1231 | VC5 | 1232 | AC0, AC1, AC2 | 1233 | AC3 | 1234 +------------------+ 1236 +------------------+ 1237 | Capture Scene #2 | 1238 +------------------+ 1239 | VC6 | 1240 | AC4 | 1241 +------------------+ 1243 Different capture scenes are unique to each other, non-overlapping. 1244 A consumer can choose an entry from each capture scene. In this case 1245 the three captures VC0, VC1, and VC2 are one way of representing the 1246 video from the endpoint. These three captures should appear adjacent 1247 next to each other. Alternatively, another way of representing the 1248 Capture Scene is with the capture VC3, which automatically shows the 1249 person who is talking. Similarly for the VC4 and VC5 alternatives. 1251 As in the video case, the different entries of audio in Capture Scene 1252 #1 represent the "same thing", in that one way to receive the audio 1253 is with the 3 audio captures (AC0, AC1, AC2), and another way is with 1254 the mixed AC3. The Media Consumer can choose an audio capture entry 1255 it is capable of receiving. 1257 The spatial ordering is understood by the media capture attributes 1258 area and point of capture. 1260 A Media Consumer would likely want to choose a capture scene entry to 1261 receive based in part on how many streams it can simultaneously 1262 receive. 
   For example, a consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1 (VC0,
   VC1, VC2) and not receive the other entries.  A consumer that can
   receive only one people stream would probably choose one of the
   other entries.

   If the consumer can receive a presentation stream too, it would also
   choose to receive the presentation video entry from Capture Scene #2
   (VC6).

11.2.  Encoding Group Example

   This example illustrates how an encoding group can express
   dependencies between its encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (1080p30 corresponds to 244800 in maxH264Mbps terms), but
   its encodings are capable of a maxFrameRate of 60 frames per second
   (fps).  To achieve the maximum resolution (1920 x 1088) the frame
   rate is limited to 30 fps.  However, 60 fps can be achieved at a
   lower resolution if required by the consumer.  Although the encoding
   group is capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do then the
   group's overall maxBandwidth value is a limit on the sum of all
   audio and video encodings configured by the consumer.  A system that
   does not wish or need to combine bandwidth limitations in this way
   should instead use separate encoding groups for audio and video, so
   that the bandwidth limitations on audio and video do not interact.

   Audio and video can be expressed in separate encoding groups, as in
   the following illustration.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
      maxH264Mbps=244800, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

11.3.  The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations; it can be
   associated (e.g., lip-synced) with any combination of video captures
   at the consumer.
   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer   |
   | VC1, VC2           | video capture for 2 screen consumer        |
   | VC3, VC4, VC5      | video capture for 3 screen consumer        |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
   | AC0                | audio capture representing all participants|
   +--------------------+---------------------------------------------+

   If or when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   +------------------+--------------------------------------+

11.4.  Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three screen endpoint
   described above.

   The receive side of a call needs to balance its requirements (based
   on its number of screens and speakers, its decoding capabilities,
   and its available bandwidth) against the provider's capabilities in
   order to optimally configure the provider's streams.  Typically it
   would want to receive and decode media from each capture scene
   advertised by the provider.

   A sane, basic algorithm might be for the consumer to go through each
   capture scene in turn and find the collection of video captures that
   best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   entries in the video capture scenes based either on hard-coded
   preferences or on user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
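   The following non-normative Python sketch illustrates this basic
   algorithm.  The scene structures and the per-scene "content" label
   used to separate people video from presentation video are
   assumptions made purely for illustration; after this selection step,
   a real consumer would go on to configure the relevant encoding
   groups within its bandwidth and decoding limits.

      # Illustrative sketch only; not part of the CLUE framework.
      def best_fit(entries, screens):
          # Largest entry that does not exceed the available screens,
          # otherwise the smallest entry offered.
          ok = [e for e in entries if len(e) <= screens]
          return max(ok, key=len) if ok else min(entries, key=len)

      def choose_captures(scenes, people_screens, presentation_screens):
          choice = {}
          for scene in scenes:
              screens = (presentation_screens
                         if scene["content"] == "slides" else people_screens)
              choice[scene["name"]] = best_fit(scene["video_entries"],
                                               screens)
          return choice

      # The three-camera endpoint example from earlier in this section.
      scenes = [
          {"name": "Capture Scene #1", "content": "main",
           "video_entries": [["VC0", "VC1", "VC2"],
                             ["VC3"], ["VC4"], ["VC5"]]},
          {"name": "Capture Scene #2", "content": "slides",
           "video_entries": [["VC6"]]},
      ]
      print(choose_captures(scenes, people_screens=3,
                            presentation_screens=1))
      # {'Capture Scene #1': ['VC0', 'VC1', 'VC2'],
      #  'Capture Scene #2': ['VC6']}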
11.4.1.  One screen consumer

   VC3, VC4 and VC5 are all separate entries by themselves, not grouped
   together in a single entry, so the receiving device should choose
   one of them.  The choice would come down to whether to see the
   greatest number of participants simultaneously at roughly equal
   precedence (VC5), a switched view of just the loudest region (VC3),
   or a switched view with PiPs (VC4).  An endpoint device with some
   knowledge of these differences could offer a dynamic choice of these
   options, in call, to the user.

11.4.2.  Two screen consumer configuring the example

   Mixing systems with an even number ("2n") of screens and systems
   with an odd number ("2n+1") of cameras, and vice versa, is always
   likely to be the problematic case.  In this instance, the behavior
   is likely to be determined by whether a "2 screen" system is really
   a "2 decoder" system, i.e., whether only one received stream can be
   displayed per screen or whether more than 2 streams can be received
   and spread across the available screen area.  To enumerate 3
   possible behaviors here for the 2 screen system when it learns that
   the far end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen consumer case above) and either leave one
       screen blank or use it for presentation if or when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens (either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across 2 screens, or,
       as would be necessary if there were large bezels on the screens,
       with each stream being scaled to 1/2 the screen width and height
       and there being a 4th "blank" panel).  This 4th panel could
       potentially be used for any presentation that became active
       during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   it might again be appropriate to offer the user the choice of
   display mode.

11.4.3.  Three screen consumer configuring the example

   This is the most straightforward case - the consumer would look to
   identify a set of streams to receive that best matches its available
   screens, and so VC0 plus VC1 plus VC2 would be the optimal match.
   The spatial ordering would give sufficient information for the
   correct video capture to be shown on the correct screen.  The
   consumer would either need to divide a single encoding group's
   capability by 3 to determine what resolution and frame rate to
   configure the provider with (for example, splitting a group's
   maxGroupH264Mbps of 489600 three ways leaves 163200 per capture,
   enough for 720p30 at 108000 but not for 1080p30 at 244800), or to
   configure the individual video captures' encoding groups with what
   makes most sense (taking into account the receive side decode
   capabilities, overall call bandwidth, the resolution of the screens,
   plus any user preferences such as motion vs. sharpness).

12.  Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

13.  IANA Considerations

   TBD

14.  Security Considerations

   TBD

15.  Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 04 to 05:

   1.  Clarify limitations of "composed" attribute.

   2.  Add new section "capture scene entry attributes" and add the
       attribute "scene-switch-policy".

   3.  Add capture scene description attribute and description language
       attribute.

   4.  Editorial changes to examples section for consistency with the
       rest of the document.

   Changes from 03 to 04:

   1.  Remove sentence from overview - "This constitutes a significant
       change ..."

   2.  Clarify a consumer can choose a subset of captures from a
       capture scene entry or a simultaneous set (in section "capture
       scene" and "consumer's choice...").

   3.  Reword first paragraph of Media Capture Attributes section.

   4.  Clarify a stereo audio capture is different from two mono audio
       captures (description of audio channel format attribute).

   5.  Clarify what it means when coordinate information is not
       specified for area of capture, point of capture, area of scene.

   6.  Change the term "producer" to "provider" to be consistent (it
       was just in two places).

   7.  Change name of "purpose" attribute to "content" and refer to
       RFC 4796 for values.
   8.  Clarify simultaneous sets are part of a provider advertisement,
       and apply across all capture scenes in the advertisement.

   9.  Remove sentence about lip-sync between all media captures in a
       capture scene.

   10. Combine the concepts of "capture scene" and "capture set" into a
       single concept, using the term "capture scene" to replace the
       previous term "capture set", and eliminating the original
       separate capture scene concept.

16.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [IANA-Lan] IANA, "Language Subtag Registry", .

Appendix A.  Open Issues

A.1.  Video layout arrangements and centralized composition

   In the context of a conference with a central MCU, there has been
   discussion about a consumer requesting the provider to provide a
   certain type of layout arrangement or to perform a certain
   composition algorithm, such as combining some number of most recent
   talkers, or producing a video layout using a 2x2 grid or 1 large
   cell with 5 smaller cells around it.  The current framework does not
   address this.  It is not clear whether this topic should be
   addressed in this framework, in a different part of CLUE, or outside
   of CLUE altogether.

A.2.  Source is selectable

   A Boolean variable.  True indicates the media consumer can request
   that a particular media source be mapped to a media capture.  The
   default is false.

   TBD - how does the consumer make the request for a particular
   source?  How does the consumer know what is available?  Need to
   explain better how multiple media captures are different from a
   single media capture with choices for the source, and when each
   concept should be used.

A.3.  Media Source Selection

   The use cases include a case where the person at a receiving
   endpoint can request to receive media from a particular other
   endpoint, for example in a multipoint call to request to receive the
   video from a certain section of a certain room, whether or not
   people there are talking.

   TBD - this framework should address this case.  Maybe a roster list
   of rooms or people in the conference is needed, with a mechanism to
   select from the roster and associate the selection with media
   captures.  This is different from selecting a particular media
   capture from a capture scene.  The mechanism to do this will
   probably need to be different from selecting media captures based on
   capture scenes and attributes.
A.4.  Endpoint requesting many streams from MCU

   TBD - how to do VC selection for a system where the endpoint media
   consumers want to receive many streams and do their own composition,
   rather than the MCU doing transcoding and composing.  An example is
   a 3 screen consumer that wants 3 large loudest-speaker streams, plus
   a number of small ones to render as PiPs.  It is an open question
   how the small ones are chosen; they could potentially be chosen by
   either the endpoint or the MCU.  There are other, more complicated
   examples as well.  Is the current framework adequate to support
   this?

A.5.  VAD (voice activity detection) tagging of audio streams

   TBD - do we want VAD to be mandatory?  All audio streams originating
   from a media provider must be tagged with VAD information.  This
   tagging would include an overall energy value for the stream plus
   information on which sections of the capture scene are "active".

   Each audio stream which forms a constituent of an entry within a
   capture scene should include this tagging, with the energy value
   within it calculated using a fixed, consistent algorithm.

   When a system determines the most active area of a capture scene
   (either "loudest", or determined by other means such as a button
   press), it should convey that information to the corresponding media
   stream consumer via any audio streams being sent within that capture
   scene.  Specifically, there should be a list of active coordinates
   and their VAD characteristics within the audio stream, in addition
   to the overall VAD information for the capture scene.  This is to
   ensure all media stream consumers receive the same, consistent audio
   energy information whichever audio capture or captures they choose
   to receive for a capture scene.  Additionally, coordinate
   information can be mapped to video captures by a media stream
   consumer so that it can perform "panel switching" if required.

A.6.  Private Information

   Do we want a way to include private information?

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Langley, England
   UK

   Email: apeppere@gmail.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com