idnits 2.17.1 draft-ietf-clue-framework-07.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1112 has weird spacing: '...om left bot...' == Line 1163 has weird spacing: '...om left bot...' -- The document date (October 22, 2012) is 4205 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 CLUE WG A. Romanow 3 Internet-Draft Cisco Systems 4 Intended status: Informational M. Duckworth, Ed. 5 Expires: April 25, 2013 Polycom 6 A. Pepperell 7 Silverflare 8 B. Baldino 9 Cisco Systems 10 October 22, 2012 12 Framework for Telepresence Multi-Streams 13 draft-ietf-clue-framework-07.txt 15 Abstract 17 This memo offers a framework for a protocol that enables devices in a 18 telepresence conference to interoperate by specifying the 19 relationships between multiple media streams. 21 Status of this Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on April 25, 2013. 38 Copyright Notice 40 Copyright (c) 2012 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . 
. . . . . . . . . . . . . 3 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 4. Overview of the Framework/Model . . . . . . . . . . . . . . . 6 59 5. Spatial Relationships . . . . . . . . . . . . . . . . . . . . 7 60 6. Media Captures and Capture Scenes . . . . . . . . . . . . . . 8 61 6.1. Media Captures . . . . . . . . . . . . . . . . . . . . . . 9 62 6.1.1. Media Capture Attributes . . . . . . . . . . . . . . . 9 63 6.2. Capture Scene . . . . . . . . . . . . . . . . . . . . . . 12 64 6.2.1. Capture scene attributes . . . . . . . . . . . . . . . 13 65 6.2.2. Capture scene entry attributes . . . . . . . . . . . . 14 66 6.3. Simultaneous Transmission Set Constraints . . . . . . . . 15 67 7. Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . 16 68 7.1. Individual Encodings . . . . . . . . . . . . . . . . . . . 16 69 7.2. Encoding Group . . . . . . . . . . . . . . . . . . . . . . 17 70 8. Associating Media Captures with Encoding Groups . . . . . . . 19 71 9. Consumer's Choice of Streams to Receive from the Provider . . 19 72 9.1. Local preference . . . . . . . . . . . . . . . . . . . . . 20 73 9.2. Physical simultaneity restrictions . . . . . . . . . . . . 20 74 9.3. Encoding and encoding group limits . . . . . . . . . . . . 21 75 9.4. Message Flow . . . . . . . . . . . . . . . . . . . . . . . 21 76 10. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 22 77 11. Examples - Using the Framework . . . . . . . . . . . . . . . . 22 78 11.1. Three screen endpoint media provider . . . . . . . . . . . 23 79 11.2. Encoding Group Example . . . . . . . . . . . . . . . . . . 29 80 11.3. The MCU Case . . . . . . . . . . . . . . . . . . . . . . . 30 81 11.4. Media Consumer Behavior . . . . . . . . . . . . . . . . . 30 82 11.4.1. One screen consumer . . . . . . . . . . . . . . . . . 31 83 11.4.2. Two screen consumer configuring the example . . . . . 31 84 11.4.3. Three screen consumer configuring the example . . . . 32 85 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 32 86 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 87 14. Security Considerations . . . . . . . . . . . . . . . . . . . 32 88 15. Changes Since Last Version . . . . . . . . . . . . . . . . . . 32 89 16. Informative References . . . . . . . . . . . . . . . . . . . . 34 90 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35 92 1. Introduction 94 Current telepresence systems, though based on open standards such as 95 RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each 96 other. A major factor limiting the interoperability of telepresence 97 systems is the lack of a standardized way to describe and negotiate 98 the use of the multiple streams of audio and video comprising the 99 media flows. This draft provides a framework for a protocol to 100 enable interoperability by handling multiple streams in a 101 standardized way. It is intended to support the use cases described 102 in draft-ietf-clue-telepresence-use-cases-02 and to meet the 103 requirements in draft-ietf-clue-telepresence-requirements-01. 105 The solution described here is strongly focused on what is being done 106 today, rather than on a vision of future conferencing. At the same 107 time, the highest priority has been given to creating an extensible 108 framework to make it easy to accommodate future conferencing 109 functionality as it evolves. 
111 The purpose of this effort is to make it possible to handle multiple
112 streams of media in such a way that a satisfactory user experience is
113 possible even when participants are using different vendor equipment,
114 and also when they are using devices with different types of
115 communication capabilities. Information about the relationship of
116 media streams at the provider's end must be communicated so that
117 streams can be chosen and audio/video rendering can be done in the
118 best possible manner.
120 There is no attempt here to dictate to the renderer what it should
121 do. What the renderer does is up to the renderer.
123 After the following Definitions, a short section introduces key
124 concepts. The body of the text comprises several sections about the
125 key elements of the framework, how a consumer chooses streams to
126 receive, and some examples. The appendix describes topics that are
127 under discussion for adding to the document.
129 2. Terminology
131 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
132 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
133 document are to be interpreted as described in RFC 2119 [RFC2119].
135 3. Definitions
137 The definitions marked with an "*" are new; all the others are from
138 *Audio Capture: Media Capture for audio. Denoted as ACn.
140 Camera-Left and Right: For media captures, camera-left and camera-
141 right are from the point of view of a person observing the rendered
142 media. They are the opposite of stage-left and stage-right.
144 Capture Device: A device that converts audio and video input into an
145 electrical signal, in most cases to be fed into a media encoder.
146 Cameras and microphones are examples of capture devices.
148 *Capture Encoding: A specific encoding of a media capture, to be sent
149 by a media provider to a media consumer via RTP.
151 *Capture Scene: a structure representing the scene that is captured
152 by a collection of capture devices. A capture scene includes
153 attributes and one or more capture scene entries, with each entry
154 including one or more media captures.
156 *Capture Scene Entry: a list of media captures of the same media type
157 that together form one way to represent the capture scene.
159 Conference: used as defined in [RFC4353], A Framework for
160 Conferencing within the Session Initiation Protocol (SIP).
162 *Individual Encoding: A variable with a set of attributes that
163 describes the maximum values of a single audio or video capture
164 encoding. The attributes include maximum bandwidth and, for video,
165 maximum macroblocks (for H.264), maximum width, maximum height, and
166 maximum frame rate.
168 *Encoding Group: A set of encoding parameters representing a media
169 provider's encoding capabilities. Media stream providers formed of
170 multiple physical units, in each of which resides some encoding
171 capability, would typically advertise themselves to the remote media
172 stream consumer using multiple encoding groups. Within each encoding
173 group, multiple potential encodings are possible, with the sum of the
174 chosen encodings' characteristics constrained to be less than or
175 equal to the group-wide constraints.
177 Endpoint: The logical point of final termination through receiving,
178 decoding and rendering, and/or initiation through capturing,
179 encoding, and sending of media streams.
An endpoint consists of one 180 or more physical devices which source and sink media streams, and 181 exactly one [RFC4353] Participant (which, in turn, includes exactly 182 one SIP User Agent). In contrast to an endpoint, an MCU may also 183 send and receive media streams, but it is not the initiator nor the 184 final terminator in the sense that Media is Captured or Rendered. 185 Endpoints can be anything from multiscreen/multicamera rooms to 186 handheld devices. 188 Front: the portion of the room closest to the cameras. In going 189 towards back you move away from the cameras. 191 MCU: Multipoint Control Unit (MCU) - a device that connects two or 192 more endpoints together into one single multimedia conference 193 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 194 tardy in requiring that media from the mixer be sent to EACH 195 participant. I think we have practical use cases where this is not 196 the case. But the bug (if it is one) is in 4353 and not herein.] 198 Media: Any data that, after suitable encoding, can be conveyed over 199 RTP, including audio, video or timed text. 201 *Media Capture: a source of Media, such as from one or more Capture 202 Devices. A Media Capture (MC) may be the source of one or more 203 capture encodings. A Media Capture may also be constructed from 204 other Media streams. A middle box can express Media Captures that it 205 constructs from Media streams it receives. 207 *Media Consumer: an Endpoint or middle box that receives media 208 streams 210 *Media Provider: an Endpoint or middle box that sends Media streams 212 Model: a set of assumptions a telepresence system of a given vendor 213 adheres to and expects the remote telepresence system(s) also to 214 adhere to. 216 *Plane of Interest: The spatial plane containing the most relevant 217 subject matter. 219 Render: the process of generating a representation from a media, such 220 as displayed motion video or sound emitted from loudspeakers. 222 *Simultaneous Transmission Set: a set of media captures that can be 223 transmitted simultaneously from a Media Provider. 225 Spatial Relation: The arrangement in space of two objects, in 226 contrast to relation in time or other relationships. See also 227 Camera-Left and Right. 229 Stage-Left and Right: For media captures, stage-left and stage-right 230 are the opposite of camera-left and camera-right. For the case of a 231 person facing (and captured by) a camera, stage-left and stage-right 232 are from the point of view of that person. 234 *Stream: a capture encoding sent from a media provider to a media 235 consumer via RTP [RFC3550]. 237 Stream Characteristics: the media stream attributes commonly used in 238 non-CLUE SIP/SDP environments (such as: media codec, bit rate, 239 resolution, profile/level etc.) as well as CLUE specific attributes, 240 such as the ID of a capture or a spatial location. 242 Telepresence: an environment that gives non co-located users or user 243 groups a feeling of (co-located) presence - the feeling that a Local 244 user is in the same room with other Local users and the Remote 245 parties. The inclusion of Remote parties is achieved through 246 multimedia communication including at least audio and video signals 247 of high fidelity. 249 *Video Capture: Media Capture for video. Denoted as VCn. 251 Video composite: A single image that is formed from combining visual 252 elements from separate sources. 254 4. 
Overview of the Framework/Model 256 The CLUE framework specifies how multiple media streams are to be 257 handled in a telepresence conference. 259 The main goals include: 261 o Interoperability 263 o Extensibility 265 o Flexibility 267 Interoperability is achieved by the media provider describing the 268 relationships between media streams in constructs that are understood 269 by the consumer, who can then render the media. Extensibility is 270 achieved through abstractions and the generality of the model, making 271 it easy to add new parameters. Flexibility is achieved largely by 272 having the consumer choose what content and format it wants to 273 receive from what the provider is capable of sending. 275 A transmitting endpoint or MCU describes specific aspects of the 276 content of the media and the formatting of the media streams it can 277 send (advertisement); and the receiving end responds to the provider 278 by specifying which content and media streams it wants to receive 279 (configuration). The provider then transmits the asked for content 280 in the specified streams. 282 This advertisement and configuration occurs at call initiation but 283 may also happen at any time throughout the conference, whenever there 284 is a change in what the consumer wants or the provider can send. 286 An endpoint or MCU typically acts as both provider and consumer at 287 the same time, sending advertisements and sending configurations in 288 response to receiving advertisements. (It is possible to be just one 289 or the other.) 291 The data model is based around two main concepts: a capture and an 292 encoding. A media capture (MC), such as audio or video, describes 293 the content a provider can send. Media captures are described in 294 terms of CLUE-defined attributes, such as spatial relationships and 295 purpose of the capture. Providers tell consumers which media 296 captures they can provide, described in terms of the media capture 297 attributes. 299 A provider organizes its media captures that represent the same scene 300 into capture scenes. A consumer chooses which media captures it 301 wants to receive according to the capture scenes sent by the 302 provider. 304 In addition, the provider sends the consumer a description of the 305 individual encodings it can send in terms of the media attributes of 306 the encodings, in particular, well-known audio and video parameters 307 such as bandwidth, frame rate, macroblocks per second. 309 The provider also specifies constraints on its ability to provide 310 media, and the consumer must take these into account in choosing the 311 content and capture encodings it wants. Some constraints are due to 312 the physical limitations of devices - for example, a camera may not 313 be able to provide zoom and non-zoom views simultaneously. Other 314 constraints are system based constraints, such as maximum bandwidth 315 and maximum macroblocks/second. 317 The following sections discuss these constructs and processes in 318 detail, followed by use cases showing how the framework specification 319 can be used. 321 5. Spatial Relationships 323 In order for a consumer to perform a proper rendering, it is often 324 necessary to provide spatial information about the streams it is 325 receiving. CLUE defines a coordinate system that allows media 326 providers to describe the spatial relationships of their media 327 captures to enable proper scaling and spatial rendering of their 328 streams. 
The coordinate system is based on a few principles: 330 o Simple systems which do not have multiple Media Captures to 331 associate spatially need not use the coordinate model. 333 o Coordinates can either be in real, physical units (millimeters), 334 have an unknown scale or have no physical scale. Systems which 335 know their physical dimensions should always provide those real- 336 world measurements. Systems which don't know specific physical 337 dimensions but still know relative distances should use 'unknown 338 scale'. 'No scale' is intended to be used where Media Captures 339 from different devices (with potentially different scales) will be 340 forwarded alongside one another (e.g. in the case of a middle 341 box). 343 * "millimeters" means the scale is in millimeters 345 * "Unknown" means the scale is not necessarily millimeters, but 346 the scale is the same for every capture in the capture scene. 348 * "No Scale" means the scale could be different for each capture 349 - an MCU provider that advertises two adjacent captures and 350 picks sources (which can change quickly) from different 351 endpoints might use this value; the scale could be different 352 and changing for each capture. But the areas of capture still 353 represent a spatial relation between captures. 355 o The coordinate system is Cartesian X, Y, Z with the origin at a 356 spot of the provider's choosing. The provider must use the same 357 coordinate system with same scale and origin for all coordinates 358 within the same capture scene. 360 The direction of increasing coordinate values is: 361 X increases from camera left to camera right 362 Y increases from front to back 363 Z increases from low to high 365 6. Media Captures and Capture Scenes 367 This section describes how media providers can describe the content 368 of media to consumers. 370 6.1. Media Captures 372 Media captures are the fundamental representations of streams that a 373 device can transmit. What a Media Capture actually represents is 374 flexible: 376 o It can represent the immediate output of a physical source (e.g. 377 camera, microphone) or 'synthetic' source (e.g. laptop computer, 378 DVD player). 380 o It can represent the output of an audio mixer or video composer 382 o It can represent a concept such as 'the loudest speaker' 384 o It can represent a conceptual position such as 'the leftmost 385 stream' 387 To distinguish between multiple instances, video and audio captures 388 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 389 two different video captures and AC1 and AC2 refer to two different 390 audio captures. 392 Each Media Capture can be associated with attributes to describe what 393 it represents. 395 6.1.1. Media Capture Attributes 397 Media Capture Attributes describe static information about the 398 captures. A provider uses the media capture attributes to describe 399 the media captures to the consumer. The consumer will select the 400 captures it wants to receive. Attributes are defined by a variable 401 and its value. The currently defined attributes and their values 402 are: 404 Content: {slides, speaker, sl, main, alt} 406 A field with enumerated values which describes the role of the media 407 capture and can be applied to any media type. The enumerated values 408 are defined by [RFC4796]. The values for this attribute are the same 409 as the mediacnt values for the content attribute in [RFC4796]. This 410 attribute can have multiple values, for example content={main, 411 speaker}. 
413 Composed: {true, false} 415 A field with a Boolean value which indicates whether or not the Media 416 Capture is a mix (audio) or composition (video) of streams. 418 This attribute is useful for a media consumer to avoid nesting a 419 composed video capture into another composed capture or rendering. 420 This attribute is not intended to describe the layout a media 421 provider uses when composing video streams. 423 Audio Channel Format: {mono, stereo} A field with enumerated values 424 which describes the method of encoding used for audio. 426 A value of 'mono' means the Audio Capture has one channel. 428 A value of 'stereo' means the Audio Capture has two audio channels, 429 left and right. 431 This attribute applies only to Audio Captures. A single stereo 432 capture is different from two mono captures that have a left-right 433 spatial relationship. A stereo capture maps to a single RTP stream, 434 while each mono audio capture maps to a separate RTP stream. 436 Switched: {true, false} 438 A field with a Boolean value which indicates whether or not the Media 439 Capture represents the (dynamic) most appropriate subset of a 440 'whole'. What is 'most appropriate' is up to the provider and could 441 be the active speaker, a lecturer or a VIP. 443 Point of Capture: {(X, Y, Z)} 445 A field with a single Cartesian (X, Y, Z) point value which describes 446 the spatial location, virtual or physical, of the capturing device 447 (such as camera). 449 When the Point of Capture attribute is specified, it must include X, 450 Y and Z coordinates. If the point of capture is not specified, it 451 means the consumer should not assume anything about the spatial 452 location of the capturing device. Even if the provider specifies an 453 area of capture attribute, it does not need to specify the point of 454 capture. 456 Point on Line of Capture: {(X,Y,Z)} 458 A field with a single Cartesian (X, Y, Z) point value (virtual or 459 physical) which describes a position in space of a second point on 460 the axis of the capturing device; the first point being the Point of 461 Capture (see above). This point MUST lie between the Point of 462 Capture and the Area of Capture. 464 The Point on Line of Capture MUST be ignored if the Point of Capture 465 is not present for this capture device. When the Point on Line of 466 Capture attribute is specified, it must include X, Y and Z 467 coordinates. These coordinates MUST NOT be identical to the Point of 468 Capture coordinates. If the Point on Line of Capture is not 469 specified, no assumptions are made about the axis of the capturing 470 device. 472 Area of Capture: 474 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 475 Z3), top right(X4, Y4, Z4)} 477 A field with a set of four (X, Y, Z) points as a value which describe 478 the spatial location of what is being "captured". By comparing the 479 Area of Capture for different Media Captures within the same capture 480 scene a consumer can determine the spatial relationships between them 481 and render them correctly. 483 The four points should be co-planar. The four points form a 484 quadrilateral, not necessarily a rectangle. 486 The quadrilateral described by the four (X, Y, Z) points defines the 487 plane of interest for the particular media capture. 489 If the area of capture attribute is specified, it must include X, Y 490 and Z coordinates for all four points. 
If the area of capture is not 491 specified, it means the media capture is not spatially related to any 492 other media capture (but this can change in a subsequent provider 493 advertisement). 495 For a switched capture that switches between different sections 496 within a larger area, the area of capture should use coordinates for 497 the larger potential area. 499 EncodingGroup: {} 501 A field with a value equal to the encodeGroupID of the encoding group 502 associated with the media capture. 504 Max Capture Encodings: {unsigned integer} 506 An optional attribute indicating the maximum number of capture 507 encodings that can be simultaneously active for the media capture. 508 If absent, this parameter defaults to 1. The minimum value for this 509 attribute is 1. The number of simultaneous capture encodings is also 510 limited by the restrictions of the encoding group for the media 511 capture. 513 6.2. Capture Scene 515 In order for a provider's individual media captures to be used 516 effectively by a consumer, the provider organizes the media captures 517 into capture scenes, with the structure and contents of these capture 518 scenes being sent from the provider to the consumer. 520 A capture scene is a structure representing the scene that is 521 captured by a collection of capture devices. A capture scene 522 includes one or more capture scene entries, with each entry including 523 one or more media captures. A capture scene represents, for example, 524 the video image of a group of people seated next to each other, along 525 with the sound of their voices, which could be represented by some 526 number of VCs and ACs in the capture scene entries. A middle box may 527 also express capture scenes that it constructs from media streams it 528 receives. 530 A provider may advertise multiple capture scenes or just a single 531 capture scene. A media provider might typically use one capture 532 scene for main participant media and another capture scene for a 533 computer generated presentation. A capture scene may include more 534 than one type of media. For example, a capture scene can include 535 several capture scene entries for video captures, and several capture 536 scene entries for audio captures. 538 A provider can express spatial relationships between media captures 539 that are included in the same capture scene. But there is no spatial 540 relationship between media captures that are in different capture 541 scenes. 543 A media provider arranges media captures in a capture scene to help 544 the media consumer choose which captures it wants. The capture scene 545 entries in a capture scene are different alternatives the provider is 546 suggesting for representing the capture scene. The media consumer 547 can choose to receive all media captures from one capture scene entry 548 for each media type (e.g. audio and video), or it can pick and choose 549 media captures regardless of how the provider arranges them in 550 capture scene entries. Different capture scene entries of the same 551 media type are not necessarily mutually exclusive alternatives. 553 Media captures within the same capture scene entry must be of the 554 same media type - it is not possible to mix audio and video captures 555 in the same capture scene entry, for instance. The provider must be 556 capable of encoding and sending all media captures in a single entry 557 simultaneously. 
A consumer may decide to receive all the media 558 captures in a single capture scene entry, but a consumer could also 559 decide to receive just a subset of those captures. A consumer can 560 also decide to receive media captures from different capture scene 561 entries. 563 When a provider advertises a capture scene with multiple entries, it 564 is essentially signaling that there are multiple representations of 565 the same scene available. In some cases, these multiple 566 representations would typically be used simultaneously (for instance 567 a "video entry" and an "audio entry"). In some cases the entries 568 would conceptually be alternatives (for instance an entry consisting 569 of 3 video captures versus an entry consisting of just a single video 570 capture). In this latter example, the provider would in the simple 571 case end up providing to the consumer the entry containing the number 572 of video captures that most closely matched the media consumer's 573 number of display devices. 575 The following is an example of 4 potential capture scene entries for 576 an endpoint-style media provider: 578 1. (VC0, VC1, VC2) - left, center and right camera video captures 580 2. (VC3) - video capture associated with loudest room segment 582 3. (VC4) - video capture zoomed out view of all people in the room 584 4. (AC0) - main audio 586 The first entry in this capture scene example is a list of video 587 captures with a spatial relationship to each other. Determination of 588 the order of these captures (VC0, VC1 and VC2) for rendering purposes 589 is accomplished through use of their Area of Capture attributes. The 590 second entry (VC3) and the third entry (VC4) are additional 591 alternatives of how to capture the same room in different ways. The 592 inclusion of the audio capture in the same capture scene indicates 593 that AC0 is associated with those video captures, meaning it comes 594 from the same scene. The audio should be rendered in conjunction 595 with any rendered video captures from the same capture scene. 597 6.2.1. Capture scene attributes 599 Attributes can be applied to capture scenes as well as to individual 600 media captures. Attributes specified at this level apply to all 601 constituent media captures. 603 Description attribute - list of {, } 605 The optional description attribute is a list of human readable text 606 strings which describe the capture scene. If there is more than one 607 string in the list, then each string in the list should contain the 608 same description, but in a different language. A provider that 609 advertises multiple capture scenes can provide descriptions for each 610 of them. This attribute can contain text in any number of languages. 612 The language tag identifies the language of the corresponding 613 description text. The possible values for a language tag are the 614 values of the 'Subtag' column for the "Type: language" entries in the 615 "Language Subtag Registry" at [IANA-Lan] originally defined in 616 [RFC5646]. A particular language tag value MUST NOT be used more 617 than once in the description attribute list. 619 Area of Scene attribute 621 The area of scene attribute for a capture scene has the same format 622 as the area of capture attribute for a media capture. The area of 623 scene is for the entire scene, which is captured by the one or more 624 media captures in the capture scene entries. 
If the provider does
625 not specify the area of scene, but does specify areas of capture,
626 then the consumer may assume the area of scene is greater than or
627 equal to the outer extents of the individual areas of capture.
629 Scale attribute
631 An optional attribute indicating if the numbers used for area of
632 scene, area of capture and point of capture are in terms of
633 millimeters, unknown scale factor, or not any scale, as described in
634 Section 5. If any media captures have an area of capture attribute
635 or point of capture attribute, then this scale attribute must also be
636 defined. The possible values for this attribute are:
638 "millimeters"
639 "unknown"
640 "no scale"
642 6.2.2. Capture scene entry attributes
644 Attributes can be applied to capture scene entries. Attributes
645 specified at this level apply to the capture scene entry as a whole.
647 Scene-switch-policy: {site-switch, segment-switch}
649 A media provider uses this scene-switch-policy attribute to indicate
650 its support for different switching policies. In the provider's
651 advertisement, this attribute can have multiple values, which means
652 the provider supports each of the indicated policies. The consumer,
653 when it requests media captures from this capture scene entry, should
654 also include this attribute but with only the single value (from
655 among the values indicated by the provider) indicating the consumer's
656 choice for which policy it wants the provider to use. If the
657 provider does not support any of these policies, it should omit this
658 attribute.
660 The "site-switch" policy means all captures are switched at the same
661 time to keep captures from the same endpoint site together. Let's
662 say the speaker is at site A and everyone else is at a "remote" site.
663 When the room at site A is shown, all the camera images from site A
664 are forwarded to the remote sites. Therefore, at each receiving remote
665 site, all the screens display camera images from site A. This can be
666 used to preserve full size image display, and also provide full
667 visual context of the displayed far end, site A. In site switching,
668 there is a fixed relation between the cameras in each room and the
669 displays in remote rooms. The room or participants being shown is
670 switched from time to time based on who is speaking or by manual
671 control.
673 The "segment-switch" policy means different captures can switch at
674 different times, and can be coming from different endpoints. Still
675 using site A as where the speaker is, and "remote" to refer to all
676 the other sites, in segment switching, rather than sending all the
677 images from site A, only the image containing the speaker at site A
678 is shown. The camera images of the current speaker and previous
679 speakers (if any) are forwarded to the other sites in the conference.
680 Therefore the screens in each site are usually displaying images from
681 different remote sites - the current speaker at site A and the
682 previous ones. This strategy can be used to preserve full size image
683 display, and also capture the non-verbal communication between the
684 speakers. In segment switching, the display depends on the activity
685 in the remote rooms - generally, but not necessarily, based on audio /
686 speech detection.
688 6.3. Simultaneous Transmission Set Constraints
690 The provider may have constraints or limitations on its ability to
691 send media captures.
One type is caused by the physical limitations
692 of capture mechanisms; these constraints are represented by a
693 simultaneous transmission set. The second type of limitation
694 reflects the encoding resources available - bandwidth and
695 macroblocks/second. This type of constraint is captured by encoding
696 groups, discussed below.
698 An endpoint or MCU can send multiple captures simultaneously; however,
699 sometimes there are constraints that limit which captures can be sent
700 simultaneously with other captures. A device may not be able to be
701 used in different ways at the same time. Provider advertisements are
702 made so that the consumer will choose one of several possible
703 mutually exclusive usages of the device. This type of constraint is
704 expressed in a Simultaneous Transmission Set, which lists all the
705 media captures that can be sent at the same time. This is easier to
706 show in an example.
708 Consider the example of a room system where there are 3 cameras, each
709 of which can send a separate capture covering 2 persons each - VC0,
710 VC1, VC2. The middle camera can also zoom out and show all 6
711 persons, VC3. But the middle camera cannot be used in both modes at
712 the same time - it has to either show the space where 2 participants
713 sit or the whole 6 seats, but not both at the same time.
715 Simultaneous transmission sets are expressed as sets of the MCs that
716 could physically be transmitted at the same time (though it may not
717 make sense to do so). In this example the two simultaneous sets are
718 shown in Table 1. The consumer must make sure that it chooses one
719 and not more of the mutually exclusive sets. A consumer may choose
720 any subset of the media captures in a simultaneous set; it does not
721 have to choose all the captures in a simultaneous set if it does not
722 want to receive all of them.
724 +-------------------+
725 | Simultaneous Sets |
726 +-------------------+
727 | {VC0, VC1, VC2} |
728 | {VC0, VC3, VC2} |
729 +-------------------+
731 Table 1: Two Simultaneous Transmission Sets
733 A media provider includes the simultaneous sets in its provider
734 advertisement. These simultaneous set constraints apply across all
735 the capture scenes in the advertisement. The simultaneous
736 transmission sets MUST allow all the media captures in a particular
737 capture scene entry to be used simultaneously.
739 7. Encodings
741 We have considered how providers can describe the content of media to
742 consumers. We will now consider how the providers communicate
743 information about their abilities to send streams. We introduce two
744 constructs - individual encodings and encoding groups. Consumers
745 will then map the media captures they want onto the encodings with
746 encoding parameters they want. This process is described below.
748 7.1. Individual Encodings
750 An individual encoding represents a way to encode a media capture to
751 become a capture encoding, to be sent as an encoded media stream from
752 the media provider to the media consumer. An individual encoding has
753 a set of parameters characterizing how the media is encoded.
754 Different media types have different parameters, and different
755 encoding algorithms may have different parameters. An individual
756 encoding can be assigned to only one capture encoding at a time.
758 The parameters of an individual encoding represent the maximum values
759 for certain aspects of the encoding.
A particular instantiation into 760 a capture encoding might use lower values than these maximums. 762 The following tables show the variables for audio and video encoding. 764 +--------------+----------------------------------------------------+ 765 | Name | Description | 766 +--------------+----------------------------------------------------+ 767 | encodeID | A unique identifier for the individual encoding | 768 | maxBandwidth | Maximum number of bits per second | 769 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 770 | | + 15) / 16) * ((height + 15) / 16) * | 771 | | framesPerSecond | 772 | maxWidth | Video resolution's maximum supported width, | 773 | | expressed in pixels | 774 | maxHeight | Video resolution's maximum supported height, | 775 | | expressed in pixels | 776 | maxFrameRate | Maximum supported frame rate | 777 +--------------+----------------------------------------------------+ 779 Table 2: Individual Video Encoding Parameters 781 +--------------+-----------------------------------+ 782 | Name | Description | 783 +--------------+-----------------------------------+ 784 | maxBandwidth | Maximum number of bits per second | 785 +--------------+-----------------------------------+ 787 Table 3: Individual Audio Encoding Parameters 789 7.2. Encoding Group 791 An encoding group includes a set of one or more individual encodings, 792 plus some parameters that apply to the group as a whole. By grouping 793 multiple individual encodings together, an encoding group describes 794 additional constraints on bandwidth and other parameters for the 795 group. Table 4 shows the parameters and individual encoding sets 796 that are part of an encoding group. 798 +-------------------+-----------------------------------------------+ 799 | Name | Description | 800 +-------------------+-----------------------------------------------+ 801 | encodeGroupID | A unique identifier for the encoding group | 802 | maxGroupBandwidth | Maximum number of bits per second relating to | 803 | | all encodings combined | 804 | maxGroupH264Mbps | Maximum number of macroblocks per second | 805 | | relating to all video encodings combined | 806 | videoEncodings[] | Set of potential encodings (list of | 807 | | encodeIDs) | 808 | audioEncodings[] | Set of potential encodings (list of | 809 | | encodeIDs) | 810 +-------------------+-----------------------------------------------+ 812 Table 4: Encoding Group 814 When the individual encodings in a group are instantiated into 815 capture encodings, each capture encoding has a bandwidth that must be 816 less than or equal to the maxBandwidth for the particular individual 817 encoding. The maxGroupBandwidth parameter gives the additional 818 restriction that the sum of all the individual capture encoding 819 bandwidths must be less than or equal to the maxGroupBandwidth value. 821 Likewise, the sum of the macroblocks per second of each instantiated 822 encoding in the group must not exceed the maxGroupH264Mbps value. 824 The following diagram illustrates the structure of a media provider's 825 Encoding Groups and their contents. 827 ,-------------------------------------------------. 828 | Media Provider | 829 | | 830 | ,--------------------------------------. | 831 | | ,--------------------------------------. | 832 | | | ,--------------------------------------. | 833 | | | | Encoding Group | | 834 | | | | ,-----------. | | 835 | | | | | | ,---------. 
| | 836 | | | | | | | | ,---------.| | 837 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 838 | `.| | | | | | `---------'| | 839 | `.| `-----------' `---------' | | 840 | `--------------------------------------' | 841 `-------------------------------------------------' 843 Figure 1: Encoding Group Structure 845 A media provider advertises one or more encoding groups. Each 846 encoding group includes one or more individual encodings. Each 847 individual encoding can represent a different way of encoding media. 848 For example one individual encoding may be 1080p60 video, another 849 could be 720p30, with a third being CIF. 851 While a typical 3 codec/display system might have one encoding group 852 per "codec box", there are many possibilities for the number of 853 encoding groups a provider may be able to offer and for the encoding 854 values in each encoding group. 856 There is no requirement for all encodings within an encoding group to 857 be instantiated at once. 859 8. Associating Media Captures with Encoding Groups 861 Every media capture is associated with an encoding group, which is 862 used to instantiate that media capture into one or more capture 863 encodings. Each media capture has an encoding group attribute. The 864 value of this attribute is the encodeGroupID for the encoding group 865 with which it is associated. More than one media capture may use the 866 same encoding group. 868 The maximum number of streams that can result from a particular 869 encoding group constraint is equal to the number of individual 870 encodings in the group. The actual number of capture encodings used 871 at any time may be less than this maximum. Any of the media captures 872 that use a particular encoding group can be encoded according to any 873 of the individual encodings in the group. If there are multiple 874 individual encodings in the group, then the media consumer can 875 configure the media provider to encode a single media capture into 876 multiple different capture encodings at the same time, subject to the 877 Max Capture Encodings constraint, with each capture encoding 878 following the constraints of a different individual encoding. 880 The Encoding Groups MUST allow all the media captures in a particular 881 capture scene entry to be used simultaneously. 883 9. Consumer's Choice of Streams to Receive from the Provider 885 After receiving the provider's advertised media captures and 886 associated constraints, the consumer must choose which media captures 887 it wishes to receive, and which individual encodings from the 888 provider it wants to use to encode the captures. Each media capture 889 has an encoding group ID attribute which specifies which individual 890 encodings are available to be used for that media capture. 892 For each media capture the consumer wants to receive, it configures 893 one or more of the encodings in that capture's encoding group. The 894 consumer does this by telling the provider the resolution, frame 895 rate, bandwidth, etc. when asking for capture encodings for its 896 chosen captures. Upon receipt of this configuration command from the 897 consumer, the provider generates a stream for each such configured 898 capture encoding and sends those streams to the consumer. 900 The consumer must have received at least one capture advertisement 901 from the provider to be able to configure the provider's generation 902 of media streams. 
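The syntax of the CLUE messages is not defined by this framework document (see the TBD note in Section 9.4). Purely as a non-normative illustration of the bookkeeping described above, the Python sketch below shows one way a consumer might represent its chosen capture encodings and check them against an advertised encoding group before sending its configuration; all type and field names here are invented for this example and are not part of the model.

   # Non-normative sketch; structures and names are invented for
   # illustration and are not defined by CLUE.
   from dataclasses import dataclass

   @dataclass
   class IndividualEncoding:        # parameters from Table 2
       encode_id: str
       max_bandwidth: int           # bits per second
       max_h264_mbps: int           # macroblocks per second

   @dataclass
   class EncodingGroup:             # parameters from Table 4
       encode_group_id: str
       max_group_bandwidth: int
       max_group_h264_mbps: int
       encodings: dict              # encodeID -> IndividualEncoding

   @dataclass
   class ConfiguredEncoding:        # one requested capture encoding
       capture_id: str              # e.g. "VC0"
       encode_id: str               # chosen from the capture's encoding group
       bandwidth: int               # requested values, each no greater than
       h264_mbps: int               # the chosen individual encoding's maxima

   def fits_group(group, configured):
       # Per-encoding limits first, then the group-wide sums (Section 7.2).
       for c in configured:
           enc = group.encodings[c.encode_id]
           if c.bandwidth > enc.max_bandwidth or c.h264_mbps > enc.max_h264_mbps:
               return False
       total_bw = sum(c.bandwidth for c in configured)
       total_mbps = sum(c.h264_mbps for c in configured)
       return (total_bw <= group.max_group_bandwidth and
               total_mbps <= group.max_group_h264_mbps)

A consumer would apply a check of this kind to each encoding group it draws on, in addition to respecting the simultaneous transmission sets of Section 6.3.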
904 The consumer is able to change its configuration of the provider's 905 encodings any number of times during the call, either in response to 906 a new capture advertisement from the provider or autonomously. The 907 consumer need not send a new configure message to the provider when 908 it receives a new capture advertisement from the provider unless the 909 contents of the new capture advertisement cause the consumer's 910 current configure message to become invalid. 912 When choosing which streams to receive from the provider, and the 913 encoding characteristics of those streams, the consumer needs to take 914 several things into account: its local preference, simultaneity 915 restrictions, and encoding limits. 917 9.1. Local preference 919 A variety of local factors will influence the consumer's choice of 920 streams to be received from the provider: 922 o if the consumer is an endpoint, it is likely that it would choose, 923 where possible, to receive video and audio captures that match the 924 number of display devices and audio system it has 926 o if the consumer is a middle box such as an MCU, it may choose to 927 receive loudest speaker streams (in order to perform its own media 928 composition) and avoid pre-composed video captures 930 o user choice (for instance, selection of a new layout) may result 931 in a different set of media captures, or different encoding 932 characteristics, being required by the consumer 934 9.2. Physical simultaneity restrictions 936 There may be physical simultaneity constraints imposed by the 937 provider that affect the provider's ability to simultaneously send 938 all of the captures the consumer would wish to receive. For 939 instance, a middle box such as an MCU, when connected to a multi- 940 camera room system, might prefer to receive both individual camera 941 streams of the people present in the room and an overall view of the 942 room from a single camera. Some endpoint systems might be able to 943 provide both of these sets of streams simultaneously, whereas others 944 may not (if the overall room view were produced by changing the zoom 945 level on the center camera, for instance). 947 9.3. Encoding and encoding group limits 949 Each of the provider's encoding groups has limits on bandwidth and 950 macroblocks per second, and the constituent potential encodings have 951 limits on the bandwidth, macroblocks per second, video frame rate, 952 and resolution that can be provided. When choosing the media 953 captures to be received from a provider, a consumer device must 954 ensure that the encoding characteristics requested for each 955 individual media capture fits within the capability of the encoding 956 it is being configured to use, as well as ensuring that the combined 957 encoding characteristics for media captures fit within the 958 capabilities of their associated encoding groups. In some cases, 959 this could cause an otherwise "preferred" choice of capture encodings 960 to be passed over in favour of different capture encodings - for 961 instance, if a set of 3 media captures could only be provided at a 962 low resolution then a 3 screen device could switch to favoring a 963 single, higher quality, capture encoding. 965 9.4. Message Flow 967 The following diagram shows the basic flow of messages between a 968 media provider and a media consumer. The usage of the "capture 969 advertisement" and "configure encodings" message is described above. 
970 The consumer also sends its own capability message to the provider
971 which may contain information about its own capabilities or
972 restrictions.
974 Diagram for Message Flow
976 Media Consumer Media Provider
977 -------------- ------------
978 | |
979 |----- Consumer Capability ---------->|
980 | |
981 | |
982 |<---- Capture advertisement ---------|
983 | |
984 | |
985 |------ Configure encodings --------->|
986 | |
988 In order for a maximally-capable provider to be able to advertise a
989 manageable number of video captures to a consumer, there is a
990 potential use for the consumer, at the start of CLUE, to be able to
991 inform the provider of its capabilities. One example here would be
992 the video capture attribute set - a consumer could tell the provider
993 the complete set of video capture attributes it is able to understand
994 and so the provider would be able to reduce the capture scene it
995 advertises to be tailored to the consumer.
997 TBD - the content of the consumer capability message needs to be
998 better defined. The authors believe there is a need for this
999 message, but have not worked out the details yet.
1001 10. Extensibility
1003 One of the most important characteristics of the Framework is its
1004 extensibility. Telepresence is a relatively new industry and while
1005 we can foresee certain directions, we also do not know everything
1006 about how it will develop. The standard for interoperability and
1007 handling multiple streams must be future-proof.
1009 The framework itself is inherently extensible through expanding the
1010 data model types. For example:
1012 o Adding more types of media, such as telemetry, can be done by
1013 defining additional types of captures in addition to audio and
1014 video.
1016 o Adding new functionalities, such as 3-D, will require
1017 additional attributes describing the captures.
1019 o Adding new codecs, such as H.265, can be accomplished by
1020 defining new encoding variables.
1022 The infrastructure is designed to be extended rather than requiring
1023 new infrastructure elements. Extension comes through adding to
1024 defined types.
1026 Assuming the implementation is in something like XML, adding data
1027 elements and attributes makes extensibility easy.
1029 11. Examples - Using the Framework
1031 This section shows in more detail some examples of how to use the
1032 framework to represent a typical case for telepresence rooms. First
1033 an endpoint is illustrated; then an MCU case is shown.
1035 11.1. Three screen endpoint media provider
1037 Consider an endpoint with the following description:
1039 o 3 cameras, 3 displays, a 6 person table
1041 o Each video device can provide one capture for each 1/3 section of
1042 the table
1044 o A single capture representing the active speaker can be provided
1046 o A single capture representing the active speaker with the other 2
1047 captures shown picture in picture within the stream can be
1048 provided
1050 o A capture showing a zoomed out view of all 6 seats in the room can
1051 be provided
1053 The audio and video captures for this endpoint can be described as
1054 follows.
1056 Video Captures: 1058 o VC0- (the camera-left camera stream), encoding group=EG0, 1059 content=main, switched=false 1061 o VC1- (the center camera stream), encoding group=EG1, content=main, 1062 switched=false 1064 o VC2- (the camera-right camera stream), encoding group=EG2, 1065 content=main, switched=false 1067 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1068 switched=true 1070 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1071 content=main, composed=true, switched=true 1073 o VC5- (the zoomed out view of all people in the room), encoding 1074 group=EG1, content=main, composed=false, switched=false 1076 o VC6- (presentation stream), encoding group=EG1, content=slides, 1077 switched=false 1079 The following diagram is a top view of the room with 3 cameras, 3 1080 displays, and 6 seats. Each camera is capturing 2 people. The six 1081 seats are not all in a straight line. 1083 ,-. d 1084 ( )`--.__ +---+ 1085 `-' / `--.__ | | 1086 ,-. | `-.._ |_-+Camera 2 (VC2) 1087 ( ).' ___..-+-''`+-+ 1088 `-' |_...---'' | | 1089 ,-.c+-..__ +---+ 1090 ( )| ``--..__ | | 1091 `-' | ``+-..|_-+Camera 1 (VC1) 1092 ,-. | __..--'|+-+ 1093 ( )| __..--' | | 1094 `-'b|..--' +---+ 1095 ,-. |``---..___ | | 1096 ( )\ ```--..._|_-+Camera 0 (VC0) 1097 `-' \ _..-''`-+ 1098 ,-. \ __.--'' | | 1099 ( ) |..-'' +---+ 1100 `-' a 1102 The two points labeled b and c are intended to be at the midpoint 1103 between the seating positions, and where the fields of view of the 1104 cameras intersect. 1105 The plane of interest for VC0 is a vertical plane that intersects 1106 points 'a' and 'b'. 1107 The plane of interest for VC1 intersects points 'b' and 'c'. 1108 The plane of interest for VC2 intersects points 'c' and 'd'. 1109 This example uses an area scale of millimeters. 1111 Areas of capture: 1112 bottom left bottom right top left top right 1113 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1114 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1115 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1116 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1117 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1118 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1119 VC6 none 1121 Points of capture: 1122 VC0 (-1678,0,800) 1123 VC1 (0,0,800) 1124 VC2 (1678,0,800) 1125 VC3 none 1126 VC4 none 1127 VC5 (0,0,800) 1128 VC6 none 1130 In this example, the right edge of the VC0 area lines up with the 1131 left edge of the VC1 area. It doesn't have to be this way. There 1132 could be a gap or an overlap. One additional thing to note for this 1133 example is the distance from a to b is equal to the distance from b 1134 to c and the distance from c to d. All these distances are 1346 mm. 1135 This is the planar width of each area of capture for VC0, VC1, and 1136 VC2. 1138 Note the text in parentheses (e.g. "the camera-left camera stream") 1139 is not explicitly part of the model, it is just explanatory text for 1140 this example, and is not included in the model with the media 1141 captures and attributes. Also, the "composed" boolean attribute 1142 doesn't say anything about how a capture is composed, so the media 1143 consumer can't tell based on this attribute that VC4 is composed of a 1144 "loudest panel with PiPs". 
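As a non-normative illustration of how a consumer might use these coordinates, the Python sketch below derives the camera-left to camera-right rendering order of VC0, VC1 and VC2 from the bottom left X values of their areas of capture listed above, using the Section 5 rule that X increases from camera left to camera right; the variable names are invented for this example.

   # Non-normative sketch; bottom left / bottom right X values are taken
   # from the "Areas of capture" table above.
   areas = {
       "VC0": (-2011, -673),    # (bottom left X, bottom right X)
       "VC1": ( -673,  673),
       "VC2": (  673, 2011),
   }

   # X increases from camera left to camera right, so sorting on the
   # bottom left X coordinate gives the rendering order.
   render_order = sorted(areas, key=lambda vc: areas[vc][0])
   print(render_order)          # ['VC0', 'VC1', 'VC2']

The same values also show the adjacency described above: the right edge of each area meets the left edge of the next.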
1146 Audio Captures: 1148 o AC0 (camera-left), encoding group=EG3, content=main, channel 1149 format=mono 1151 o AC1 (camera-right), encoding group=EG3, content=main, channel 1152 format=mono 1154 o AC2 (center) encoding group=EG3, content=main, channel format=mono 1156 o AC3 being a simple pre-mixed audio stream from the room (mono), 1157 encoding group=EG3, content=main, channel format=mono 1159 o AC4 audio stream associated with the presentation video (mono) 1160 encoding group=EG3, content=slides, channel format=mono 1162 Areas of capture: 1163 bottom left bottom right top left top right 1164 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1165 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1166 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1167 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1168 AC4 none 1170 The physical simultaneity information is: 1172 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1174 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1176 This constraint indicates it is not possible to use all the VCs at 1177 the same time. VC5 can not be used at the same time as VC1 or VC3 or 1178 VC4. Also, using every member in the set simultaneously may not make 1179 sense - for example VC3(loudest) and VC4 (loudest with PIP). (In 1180 addition, there are encoding constraints that make choosing all of 1181 the VCs in a set impossible. VC1, VC3, VC4, VC5, VC6 all use EG1 and 1182 EG1 has only 3 ENCs. This constraint shows up in the encoding 1183 groups, not in the simultaneous transmission sets.) 1185 In this example there are no restrictions on which audio captures can 1186 be sent simultaneously. 1188 Encoding Groups: 1190 This example has three encoding groups associated with the video 1191 captures. Each group can have 3 encodings, but with each potential 1192 encoding having a progressively lower specification. In this 1193 example, 1080p60 transmission is possible (as ENC0 has a maxMbps 1194 value compatible with that) as long as it is the only active encoding 1195 in the group(as maxMbps for the entire encoding group is also 1196 489600). Significantly, as up to 3 encodings are available per 1197 group, it is possible to transmit some video captures simultaneously 1198 that are not in the same entry in the capture scene. For example VC1 1199 and VC3 at the same time. 1201 It is also possible to transmit multiple capture encodings of a 1202 single video capture. For example VC0 can be encoded using ENC0 and 1203 ENC1 at the same time, as long as the encoding parameters satisfy the 1204 constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 1205 720p30. 
1207 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1208    encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1209    maxH264Mbps=489600, maxBandwidth=4000000
1210    encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1211    maxH264Mbps=108000, maxBandwidth=4000000
1212    encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1213    maxH264Mbps=61200, maxBandwidth=4000000

1215 encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1216    encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1217    maxH264Mbps=489600, maxBandwidth=4000000
1218    encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1219    maxH264Mbps=108000, maxBandwidth=4000000
1220    encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1221    maxH264Mbps=61200, maxBandwidth=4000000

1223 encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1224    encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1225    maxH264Mbps=489600, maxBandwidth=4000000
1226    encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1227    maxH264Mbps=108000, maxBandwidth=4000000
1228    encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1229    maxH264Mbps=61200, maxBandwidth=4000000

1231            Figure 2: Example Encoding Groups for Video

1233 For audio, there are five potential encodings available, so all five
1234 audio captures can be encoded at the same time.

1236 encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
1237    encodeID=ENC9, maxBandwidth=64000
1238    encodeID=ENC10, maxBandwidth=64000
1239    encodeID=ENC11, maxBandwidth=64000
1240    encodeID=ENC12, maxBandwidth=64000
1241    encodeID=ENC13, maxBandwidth=64000

1243            Figure 3: Example Encoding Group for Audio

1245 Capture Scenes:

1247 The following table represents the capture scenes for this provider.
1248 Recall that a capture scene is composed of alternative capture scene
1249 entries covering the same scene.  Capture Scene #1 is for the main
1250 people captures, and Capture Scene #2 is for presentation.

1252 Each row in the table is a separate entry in the capture scene.

1254         +------------------+
1255         | Capture Scene #1 |
1256         +------------------+
1257         | VC0, VC1, VC2    |
1258         | VC3              |
1259         | VC4              |
1260         | VC5              |
1261         | AC0, AC1, AC2    |
1262         | AC3              |
1263         +------------------+

1265         +------------------+
1266         | Capture Scene #2 |
1267         +------------------+
1268         | VC6              |
1269         | AC4              |
1270         +------------------+

1272 Different capture scenes are unique to each other and non-overlapping.
1273 A consumer can choose an entry from each capture scene.  In this case
1274 the three captures VC0, VC1, and VC2 are one way of representing the
1275 video from the endpoint.  These three captures should appear adjacent
1276 to each other.  Alternatively, the capture scene can be represented
1277 with the single capture VC3, which automatically shows the person who
1278 is talking.  Similarly for the VC4 and VC5 alternatives.

1280 As in the video case, the different entries of audio in Capture Scene
1281 #1 represent the "same thing", in that one way to receive the audio
1282 is with the 3 audio captures (AC0, AC1, AC2), and another way is with
1283 the mixed AC3.  The Media Consumer can choose an audio capture entry
1284 it is capable of receiving.

1286 The spatial ordering is conveyed by the media capture attributes area
1287 of capture and point of capture.

1289 A Media Consumer would likely want to choose a capture scene entry to
1290 receive based in part on how many streams it can simultaneously
1291 receive.  A consumer that can receive three people streams would
1292 probably prefer to receive the first entry of Capture Scene #1 (VC0,
1293 VC1, VC2) and not receive the other entries.  A consumer that can
1294 receive only one people stream would probably choose one of the other
1295 entries.
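One simplistic way to make this choice, sketched below purely for
illustration (the Python entry lists and the max_streams parameter
are our own framing and are not part of the model), is to take the
entry with the largest number of captures that still fits within the
consumer's receive budget.

   # Illustrative only; not part of the CLUE model.  Video entries of
   # Capture Scene #1, as advertised above.
   scene1_video_entries = [
       ["VC0", "VC1", "VC2"],   # three spatially related camera captures
       ["VC3"],                 # switched view of the loudest panel
       ["VC4"],                 # switched view with PiPs
       ["VC5"],                 # zoomed out view of the whole room
   ]

   def choose_entry(entries, max_streams):
       """Pick the largest entry the consumer can receive in full."""
       usable = [e for e in entries if len(e) <= max_streams]
       return max(usable, key=len) if usable else None

   print(choose_entry(scene1_video_entries, 3))   # ['VC0', 'VC1', 'VC2']
   print(choose_entry(scene1_video_entries, 1))   # ['VC3'] (first 1-capture entry)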
1297 If the consumer can receive a presentation stream too, it would also
1298 choose to receive the only entry from Capture Scene #2 (VC6).

1300 11.2.  Encoding Group Example

1302 This is an example of an encoding group to illustrate how it can
1303 express dependencies between encodings.

1305 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1306    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1307    maxH264Mbps=244800, maxBandwidth=4000000
1308    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1309    maxH264Mbps=244800, maxBandwidth=4000000
1310    encodeID=AUDENC0, maxBandwidth=96000
1311    encodeID=AUDENC1, maxBandwidth=96000
1312    encodeID=AUDENC2, maxBandwidth=96000

1314 Here, the encoding group is EG0.  It can transmit up to two 1080p30
1315 capture encodings (the maxH264Mbps value for 1080p30 is 244800), but
1316 it is capable of transmitting a maxFrameRate of 60 frames per second
1317 (fps).  To achieve the maximum resolution (1920 x 1088) the frame
1318 rate is limited to 30 fps.  However, 60 fps can be achieved at a
1319 lower resolution if required by the consumer.  Although the encoding
1320 group is capable of transmitting up to 6 Mbit/s, no individual video
1321 encoding can exceed 4 Mbit/s.

1323 This encoding group also allows up to 3 audio encodings, AUDENC<0-2>.
1324 It is not required that audio and video encodings reside within the
1325 same encoding group, but if they do then the group's overall
1326 maxGroupBandwidth value is a limit on the sum of all audio and video
1327 encodings configured by the consumer.  A system that does not wish or
1328 need to combine bandwidth limitations in this way should instead use
1329 separate encoding groups for audio and video, so that the bandwidth
1330 limitations on audio and video do not interact.

1332 Audio and video can be expressed in separate encoding groups, as in
1333 this illustration:

1335 encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
1336    encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1337    maxH264Mbps=244800, maxBandwidth=4000000
1338    encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1339    maxH264Mbps=244800, maxBandwidth=4000000

1341 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
1342    encodeID=AUDENC0, maxBandwidth=96000
1343    encodeID=AUDENC1, maxBandwidth=96000
1344    encodeID=AUDENC2, maxBandwidth=96000
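To show how these limits interact when a consumer configures such a
group, the fragment below (illustrative only; the dictionary layout
and helper name are ours, not protocol elements) checks a proposed
pair of video capture encodings against the group-wide limits of EG0
from the first listing above.

   # Illustrative only; not part of the CLUE model.  Group-wide limits
   # of EG0 from the first listing in this section.
   MAX_GROUP_H264_MBPS = 489600
   MAX_GROUP_BANDWIDTH = 6000000

   def fits_group(encodings):
       """Check configured encodings against the group-wide limits."""
       total_mbps = sum(e["h264_mbps"] for e in encodings)
       total_bw = sum(e["bandwidth"] for e in encodings)
       return (total_mbps <= MAX_GROUP_H264_MBPS
               and total_bw <= MAX_GROUP_BANDWIDTH)

   # Two 1080p30 encodings at the per-encoding bandwidth ceiling exceed
   # the group bandwidth (8 Mbit/s > 6 Mbit/s) ...
   print(fits_group([{"h264_mbps": 244800, "bandwidth": 4000000},
                     {"h264_mbps": 244800, "bandwidth": 4000000}]))   # False

   # ... so the consumer would configure them lower, e.g. 3 Mbit/s each.
   print(fits_group([{"h264_mbps": 244800, "bandwidth": 3000000},
                     {"h264_mbps": 244800, "bandwidth": 3000000}]))   # True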
1346 11.3.  The MCU Case

1348 This section shows how an MCU might express its Capture Scenes,
1349 intending to offer different choices for consumers that can handle
1350 different numbers of streams.  A single audio capture stream is
1351 provided for all single and multi-screen configurations that can be
1352 associated (e.g., lip-synced) with any combination of video captures
1353 at the consumer.

1355   +--------------------+---------------------------------------------+
1356   | Capture Scene #1   | note                                        |
1357   +--------------------+---------------------------------------------+
1358   | VC0                | video capture for single screen consumer    |
1359   | VC1, VC2           | video capture for 2 screen consumer         |
1360   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1361   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1362   | AC0                | audio capture representing all participants |
1363   +--------------------+---------------------------------------------+

1365 If/when a presentation stream becomes active within the conference,
1366 the MCU might re-advertise the available media as:

1368      +------------------+--------------------------------------+
1369      | Capture Scene #2 | note                                 |
1370      +------------------+--------------------------------------+
1371      | VC10             | video capture for presentation       |
1372      | AC1              | presentation audio to accompany VC10 |
1373      +------------------+--------------------------------------+

1375 11.4.  Media Consumer Behavior

1377 This section gives an example of how a media consumer might behave
1378 when deciding how to request streams from the three screen endpoint
1379 described in Section 11.1.

1381 The receive side of a call needs to balance its requirements, based
1382 on its number of screens and speakers, its decoding capabilities, and
1383 its available bandwidth, against the provider's capabilities in order
1384 to configure the provider's streams optimally.  Typically, it would
1385 want to receive and decode media from each capture scene advertised
1386 by the provider.

1388 A sane, basic algorithm might be for the consumer to go through each
1389 capture scene in turn and find the collection of video captures that
1390 best matches the number of screens it has (this might include
1391 consideration of screens dedicated to presentation video display
1392 rather than "people" video) and then decide between alternative
1393 entries in the video capture scenes based either on hard-coded
1394 preferences or user choice.  Once this choice has been made, the
1395 consumer would then decide how to configure the provider's encoding
1396 groups in order to make best use of the available network bandwidth
1397 and its own decoding capabilities.

1399 11.4.1.  One screen consumer

1401 VC3, VC4 and VC5 are all different entries by themselves, not grouped
1402 together in a single entry, so the receiving device should choose one
1403 of them.  The choice would come down to whether to see the greatest
1404 number of participants simultaneously at roughly equal precedence
1405 (VC5), a switched view of just the loudest region (VC3), or a
1406 switched view with PiPs (VC4).  An endpoint device with even a small
1407 amount of knowledge of these differences could offer a dynamic,
1408 in-call choice of these options to the user.

1410 11.4.2.  Two screen consumer configuring the example

1412 Mixing systems with an even number of screens ("2n") and systems with
1413 an odd number of cameras ("2n+1"), and vice versa, is always likely
1414 to be the problematic case.  In this instance, the behavior is likely
1415 to be determined by whether a "2 screen" system is really a "2
1416 decoder" system, i.e., whether only one received stream can be
1417 displayed per screen or whether more than 2 streams can be received
1418 and spread across the available screen area.  To enumerate 3 possible
1419 behaviors for the 2 screen system when it learns that the far end is
1420 "ideally" expressed via 3 capture streams:
1422 1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
1423     per the 1 screen consumer case above) and either leave one screen
1424     blank or use it for presentation if/when a presentation becomes
1425     active.

1427 2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
1428     screens, either with each capture being scaled to 2/3 of a screen
1429     and the center capture being split across the 2 screens, or, as
1430     would be necessary if there were large bezels on the screens,
1431     with each stream being scaled to 1/2 the screen width and height,
1432     leaving a 4th "blank" panel.  This 4th panel could potentially be
1433     used for any presentation that became active during the call.

1435 3.  Receive 3 streams, decode all 3, and use control information
1436     indicating which is the most active to switch between showing
1437     the left and center streams (one per screen) and the center and
1438     right streams.

1440 For an endpoint capable of all 3 methods of working described above,
1441 it might again be appropriate to offer the user the choice of display
1442 mode.

1444 11.4.3.  Three screen consumer configuring the example

1446 This is the most straightforward case - the consumer would look to
1447 identify a set of streams to receive that best matches its available
1448 screens, and so VC0 plus VC1 plus VC2 should match optimally.  The
1449 spatial ordering would give sufficient information for the correct
1450 video capture to be shown on the correct screen, and the consumer
1451 would need either to divide a single encoding group's capability by 3
1452 to determine what resolution and frame rate to configure the provider
1453 with, or to configure the individual video captures' encoding groups
1454 with whatever makes most sense (taking into account the receive side
1455 decode capabilities, overall call bandwidth, and the resolution of
1456 the screens, plus any user preferences such as motion vs. sharpness).

1458 12.  Acknowledgements

1460 Mark Gorzyinski contributed much to the approach.  We want to thank
1461 Stephen Botzko for helpful discussions on audio.

1463 13.  IANA Considerations

1465 TBD

1467 14.  Security Considerations

1469 TBD

1471 15.  Changes Since Last Version

1473 NOTE TO THE RFC-Editor: Please remove this section prior to
1474 publication as an RFC.

1476 Changes from 06 to 07:

1478 1.  Ticket #9.  Rename Axis of Capture Point attribute to Point on
1479     Line of Capture.  Clarify the description of this attribute.

1481 2.  Ticket #17.  Add "capture encoding" definition.  Use this new
1482     term throughout the document as appropriate, replacing some usage
1483     of the terms "stream" and "encoding".

1485 3.  Ticket #18.  Add Max Capture Encodings media capture attribute.

1487 4.  Add clarification that different capture scene entries are not
1488     necessarily mutually exclusive.

1490 Changes from 05 to 06:

1492 1.  Capture scene description attribute is a list of text strings,
1493     each in a different language, rather than just a single string.

1495 2.  Add new Axis of Capture Point attribute.

1497 3.  Remove appendices A.1 through A.6.

1499 4.  Clarify that the provider must use the same coordinate system
1500     with the same scale and origin for all coordinates within the
1501     same capture scene.

1503 Changes from 04 to 05:

1505 1.  Clarify limitations of "composed" attribute.

1507 2.  Add new section "capture scene entry attributes" and add the
1508     attribute "scene-switch-policy".

1510 3.  Add capture scene description attribute and description language
1511     attribute.
1513 4.  Editorial changes to examples section for consistency with the
1514     rest of the document.

1516 Changes from 03 to 04:

1518 1.  Remove sentence from overview - "This constitutes a significant
1519     change ..."

1521 2.  Clarify a consumer can choose a subset of captures from a
1522     capture scene entry or a simultaneous set (in section "capture
1523     scene" and "consumer's choice...").

1525 3.  Reword first paragraph of Media Capture Attributes section.

1527 4.  Clarify a stereo audio capture is different from two mono audio
1528     captures (description of audio channel format attribute).

1530 5.  Clarify what it means when coordinate information is not
1531     specified for area of capture, point of capture, area of scene.

1533 6.  Change the term "producer" to "provider" to be consistent (it
1534     was just in two places).

1536 7.  Change name of "purpose" attribute to "content" and refer to
1537     RFC4796 for values.

1539 8.  Clarify simultaneous sets are part of a provider advertisement,
1540     and apply across all capture scenes in the advertisement.

1542 9.  Remove sentence about lip-sync between all media captures in a
1543     capture scene.

1545 10. Combine the concepts of "capture scene" and "capture set" into a
1546     single concept, using the term "capture scene" to replace the
1547     previous term "capture set", and eliminating the original
1548     separate capture scene concept.

1550 16.  Informative References

1552 [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1553            Requirement Levels", BCP 14, RFC 2119, March 1997.

1555 [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
1556            A., Peterson, J., Sparks, R., Handley, M., and E.
1557            Schooler, "SIP: Session Initiation Protocol", RFC 3261,
1558            June 2002.

1560 [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1561            Jacobson, "RTP: A Transport Protocol for Real-Time
1562            Applications", STD 64, RFC 3550, July 2003.

1564 [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
1565            Session Initiation Protocol (SIP)", RFC 4353,
1566            February 2006.

1568 [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
1569            Protocol (SDP) Content Attribute", RFC 4796,
1570            February 2007.

1572 [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
1573            January 2008.

1575 [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
1576            Languages", BCP 47, RFC 5646, September 2009.

1578 [IANA-Lan]
1579            IANA, "Language Subtag Registry",
1580            .

1583 Authors' Addresses

1585 Allyn Romanow
1586 Cisco Systems
1587 San Jose, CA 95134
1588 USA

1590 Email: allyn@cisco.com

1592 Mark Duckworth (editor)
1593 Polycom
1594 Andover, MA 01810
1595 USA

1597 Email: mark.duckworth@polycom.com

1599 Andrew Pepperell
1600 Silverflare
1601 Uxbridge, England
1602 UK

1604 Email: apeppere@gmail.com

1606 Brian Baldino
1607 Cisco Systems
1608 San Jose, CA 95134
1609 USA

1611 Email: bbaldino@cisco.com