CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                         M. Duckworth, Ed.
Expires: January 7, 2013                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            July 6, 2012

                Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-06.txt

Abstract

   This memo offers a framework for a protocol that enables devices in a telepresence conference to interoperate by specifying the relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Overview of the Framework/Model
   5.  Spatial Relationships
   6.  Media Captures and Capture Scenes
     6.1.  Media Captures
       6.1.1.  Media Capture Attributes
     6.2.  Capture Scene
       6.2.1.  Capture scene attributes
       6.2.2.  Capture scene entry attributes
     6.3.  Simultaneous Transmission Set Constraints
   7.  Encodings
     7.1.  Individual Encodings
     7.2.  Encoding Group
   8.  Associating Media Captures with Encoding Groups
   9.  Consumer's Choice of Streams to Receive from the Provider
     9.1.  Local preference
     9.2.  Physical simultaneity restrictions
     9.3.  Encoding and encoding group limits
     9.4.  Message Flow
   10.  Extensibility
   11.  Examples - Using the Framework
     11.1.  Three screen endpoint media provider
     11.2.  Encoding Group Example
     11.3.  The MCU Case
     11.4.  Media Consumer Behavior
       11.4.1.  One screen consumer
       11.4.2.  Two screen consumer configuring the example
       11.4.3.  Three screen consumer configuring the example
   12.  Acknowledgements
   13.  IANA Considerations
   14.  Security Considerations
   15.  Changes Since Last Version
   16.  Informative References
   Authors' Addresses

1. Introduction

   Current telepresence systems, though based on open standards such as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with each other.  A major factor limiting the interoperability of telepresence systems is the lack of a standardized way to describe and negotiate the use of the multiple streams of audio and video comprising the media flows.  This draft provides a framework for a protocol to enable interoperability by handling multiple streams in a standardized way.  It is intended to support the use cases described in draft-ietf-clue-telepresence-use-cases-02 and to meet the requirements in draft-ietf-clue-telepresence-requirements-01.

   The solution described here is strongly focused on what is being done today, rather than on a vision of future conferencing.  At the same time, the highest priority has been given to creating an extensible framework to make it easy to accommodate future conferencing functionality as it evolves.
   The purpose of this effort is to make it possible to handle multiple streams of media in such a way that a satisfactory user experience is possible even when participants are using different vendor equipment, and also when they are using devices with different types of communication capabilities.  Information about the relationship of media streams at the provider's end must be communicated so that streams can be chosen and audio/video rendering can be done in the best possible manner.

   There is no attempt here to dictate to the renderer what it should do.  What the renderer does is up to the renderer.

   After the following Definitions, a short section introduces key concepts.  The body of the text comprises several sections about the key elements of the framework, how a consumer chooses streams to receive, and some examples.  The appendix describes topics that are under discussion for adding to the document.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

   The definitions marked with an "*" are new; all the others are from

   *Audio Capture: Media Capture for audio.  Denoted as ACn.

   Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media.  They are the opposite of stage-left and stage-right.

   Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder.  Cameras and microphones are examples of capture devices.

   *Capture Scene: a structure representing the scene that is captured by a collection of capture devices.  A capture scene includes attributes and one or more capture scene entries, with each entry including one or more media captures.

   *Capture Scene Entry: a list of media captures of the same media type that together form one way to represent the capture scene.

   Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

   *Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding.  The attributes include maximum bandwidth and, for video, maximum macroblocks per second (for H.264), maximum width, maximum height, and maximum frame rate.

   *Encoding Group: A set of encoding parameters representing a media provider's encoding capabilities.  Media stream providers formed of multiple physical units, in each of which resides some encoding capability, would typically advertise themselves to the remote media stream consumer using multiple encoding groups.  Within each encoding group, multiple potential encodings are possible, with the sum of the chosen encodings' characteristics constrained to being less than or equal to the group-wide constraints.

   Endpoint: The logical point of final termination through receiving, decoding and rendering, and/or initiation through capturing, encoding, and sending of media streams.  An endpoint consists of one or more physical devices which source and sink media streams, and exactly one [RFC4353] Participant (which, in turn, includes exactly one SIP User Agent).
   In contrast to an endpoint, an MCU may also send and receive media streams, but it is neither the initiator nor the final terminator in the sense that Media is Captured or Rendered.  Endpoints can be anything from multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras.  In going towards the back you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or more endpoints together into one single multimedia conference [RFC5117].  An MCU includes an [RFC4353] Mixer.  [Edt.  RFC4353 is tardy in requiring that media from the mixer be sent to EACH participant.  I think we have practical use cases where this is not the case.  But the bug (if it is one) is in 4353 and not herein.]

   Media: Any data that, after suitable encoding, can be conveyed over RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture Devices.  A Media Capture (MC) may be the source of one or more Media streams.  A Media Capture may also be constructed from other Media streams.  A middle box can express Media Captures that it constructs from Media streams it receives.

   *Media Consumer: an Endpoint or middle box that receives Media streams.

   *Media Provider: an Endpoint or middle box that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor adheres to and expects the remote telepresence system(s) also to adhere to.

   *Plane of Interest: The spatial plane containing the most relevant subject matter.

   Render: the process of generating a representation from media, such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of media captures that can be transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in contrast to relation in time or other relationships.  See also Camera-Left and Right.

   Stage-Left and Right: For media captures, stage-left and stage-right are the opposite of camera-left and camera-right.  For the case of a person facing (and captured by) a camera, stage-left and stage-right are from the point of view of that person.

   *Stream: RTP stream as in [RFC3550].

   Stream Characteristics: the media stream attributes commonly used in non-CLUE SIP/SDP environments (such as media codec, bit rate, resolution, profile/level, etc.) as well as CLUE-specific attributes, such as the ID of a capture or a spatial location.

   Telepresence: an environment that gives non-co-located users or user groups a feeling of (co-located) presence - the feeling that a Local user is in the same room with other Local users and the Remote parties.  The inclusion of Remote parties is achieved through multimedia communication including at least audio and video signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: A single image that is formed from combining visual elements from separate sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be handled in a telepresence conference.
255 The main goals include: 257 o Interoperability 259 o Extensibility 261 o Flexibility 263 Interoperability is achieved by the media provider describing the 264 relationships between media streams in constructs that are understood 265 by the consumer, who can then render the media. Extensibility is 266 achieved through abstractions and the generality of the model, making 267 it easy to add new parameters. Flexibility is achieved largely by 268 having the consumer choose what content and format it wants to 269 receive from what the provider is capable of sending. 271 A transmitting endpoint or MCU describes specific aspects of the 272 content of the media and the formatting of the media streams it can 273 send (advertisement); and the receiving end responds to the provider 274 by specifying which content and media streams it wants to receive 275 (configuration). The provider then transmits the asked for content 276 in the specified streams. 278 This advertisement and configuration occurs at call initiation but 279 may also happen at any time throughout the conference, whenever there 280 is a change in what the consumer wants or the provider can send. 282 An endpoint or MCU typically acts as both provider and consumer at 283 the same time, sending advertisements and sending configurations in 284 response to receiving advertisements. (It is possible to be just one 285 or the other.) 287 The data model is based around two main concepts: a capture and an 288 encoding. A media capture (MC), such as audio or video, describes 289 the content a provider can send. Media captures are described in 290 terms of CLUE-defined attributes, such as spatial relationships and 291 purpose of the capture. Providers tell consumers which media 292 captures they can provide, described in terms of the media capture 293 attributes. 295 A provider organizes its media captures that represent the same scene 296 into capture scenes. A consumer chooses which media captures it 297 wants to receive according to the capture scenes sent by the 298 provider. 300 In addition, the provider sends the consumer a description of the 301 streams it can send in terms of the media attributes of the stream, 302 in particular, well-known audio and video parameters such as 303 bandwidth, frame rate, macroblocks per second. 305 The provider also specifies constraints on its ability to provide 306 media, and the consumer must take these into account in choosing the 307 content and streams it wants. Some constraints are due to the 308 physical limitations of devices - for example, a camera may not be 309 able to provide zoom and non-zoom views simultaneously. Other 310 constraints are system based constraints, such as maximum bandwidth 311 and maximum macroblocks/second. 313 The following sections discuss these constructs and processes in 314 detail, followed by use cases showing how the framework specification 315 can be used. 317 5. Spatial Relationships 319 In order for a consumer to perform a proper rendering, it is often 320 necessary to provide spatial information about the streams it is 321 receiving. CLUE defines a coordinate system that allows media 322 providers to describe the spatial relationships of their media 323 captures to enable proper scaling and spatial rendering of their 324 streams. The coordinate system is based on a few principles: 326 o Simple systems which do not have multiple Media Captures to 327 associate spatially need not use the coordinate model. 
329 o Coordinates can either be in real, physical units (millimeters), 330 have an unknown scale or have no physical scale. Systems which 331 know their physical dimensions should always provide those real- 332 world measurements. Systems which don't know specific physical 333 dimensions but still know relative distances should use 'unknown 334 scale'. 'No scale' is intended to be used where Media Captures 335 from different devices (with potentially different scales) will be 336 forwarded alongside one another (e.g. in the case of a middle 337 box). 339 * "millimeters" means the scale is in millimeters 341 * "Unknown" means the scale is not necessarily millimeters, but 342 the scale is the same for every capture in the capture scene. 344 * "No Scale" means the scale could be different for each capture 345 - an MCU provider that advertises two adjacent captures and 346 picks sources (which can change quickly) from different 347 endpoints might use this value; the scale could be different 348 and changing for each capture. But the areas of capture still 349 represent a spatial relation between captures. 351 o The coordinate system is Cartesian X, Y, Z with the origin at a 352 spot of the provider's choosing. The provider must use the same 353 coordinate system with same scale and origin for all coordinates 354 within the same capture scene. 356 The direction of increasing coordinate values is: 357 X increases from camera left to camera right 358 Y increases from front to back 359 Z increases from low to high 361 6. Media Captures and Capture Scenes 363 This section describes how media providers can describe the content 364 of media to consumers. 366 6.1. Media Captures 368 Media captures are the fundamental representations of streams that a 369 device can transmit. What a Media Capture actually represents is 370 flexible: 372 o It can represent the immediate output of a physical source (e.g. 373 camera, microphone) or 'synthetic' source (e.g. laptop computer, 374 DVD player). 376 o It can represent the output of an audio mixer or video composer 378 o It can represent a concept such as 'the loudest speaker' 380 o It can represent a conceptual position such as 'the leftmost 381 stream' 383 To distinguish between multiple instances, video and audio captures 384 are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 refer to 385 two different video captures and AC1 and AC2 refer to two different 386 audio captures. 388 Each Media Capture can be associated with attributes to describe what 389 it represents. 391 6.1.1. Media Capture Attributes 393 Media Capture Attributes describe static information about the 394 captures. A provider uses the media capture attributes to describe 395 the media captures to the consumer. The consumer will select the 396 captures it wants to receive. Attributes are defined by a variable 397 and its value. The currently defined attributes and their values 398 are: 400 Content: {slides, speaker, sl, main, alt} 402 A field with enumerated values which describes the role of the media 403 capture and can be applied to any media type. The enumerated values 404 are defined by [RFC4796]. The values for this attribute are the same 405 as the mediacnt values for the content attribute in [RFC4796]. This 406 attribute can have multiple values, for example content={main, 407 speaker}. 409 Composed: {true, false} 411 A field with a Boolean value which indicates whether or not the Media 412 Capture is a mix (audio) or composition (video) of streams. 
414 This attribute is useful for a media consumer to avoid nesting a 415 composed video capture into another composed capture or rendering. 416 This attribute is not intended to describe the layout a media 417 provider uses when composing video streams. 419 Audio Channel Format: {mono, stereo} A field with enumerated values 420 which describes the method of encoding used for audio. 422 A value of 'mono' means the Audio Capture has one channel. 424 A value of 'stereo' means the Audio Capture has two audio channels, 425 left and right. 427 This attribute applies only to Audio Captures. A single stereo 428 capture is different from two mono captures that have a left-right 429 spatial relationship. A stereo capture maps to a single RTP stream, 430 while each mono audio capture maps to a separate RTP stream. 432 Switched: {true, false} 434 A field with a Boolean value which indicates whether or not the Media 435 Capture represents the (dynamic) most appropriate subset of a 436 'whole'. What is 'most appropriate' is up to the provider and could 437 be the active speaker, a lecturer or a VIP. 439 Point of Capture: {(X, Y, Z)} 441 A field with a single Cartesian (X, Y, Z) point value which describes 442 the spatial location, virtual or physical, of the capturing device 443 (such as camera). 445 When the Point of Capture attribute is specified, it must include X, 446 Y and Z coordinates. If the point of capture is not specified, it 447 means the consumer should not assume anything about the spatial 448 location of the capturing device. Even if the provider specifies an 449 area of capture attribute, it does not need to specify the point of 450 capture. 452 Axis of Capture Point: {(X, Y, Z)} 454 A field with a single Cartesian (X, Y, Z) point value (virtual or 455 physical) which describes a position in space of a second point on 456 the axis of capture of the capturing device; the first point being 457 the Point of Capture (see above). 459 The axis of capture point MUST NOT be specified if the Point of 460 Capture is not present for this capture device. When the Axis of 461 Capture Point attribute is specified, it must include X, Y and Z 462 coordinates. These coordinates MUST NOT be identical to the Point of 463 Capture coordinates. If the Axis of Capture point is not specified, 464 it means the consumer should not assume anything about the axis of 465 Capture of the capturing device. 467 Area of Capture: 469 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, Y3, 470 Z3), top right(X4, Y4, Z4)} 472 A field with a set of four (X, Y, Z) points as a value which describe 473 the spatial location of what is being "captured". By comparing the 474 Area of Capture for different Media Captures within the same capture 475 scene a consumer can determine the spatial relationships between them 476 and render them correctly. 478 The four points should be co-planar. The four points form a 479 quadrilateral, not necessarily a rectangle. 481 The quadrilateral described by the four (X, Y, Z) points defines the 482 plane of interest for the particular media capture. 484 If the area of capture attribute is specified, it must include X, Y 485 and Z coordinates for all four points. If the area of capture is not 486 specified, it means the media capture is not spatially related to any 487 other media capture (but this can change in a subsequent provider 488 advertisement). 
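   To illustrate how a consumer might make use of these spatial attributes, the following sketch (in Python, purely illustrative; the class, field, and function names are hypothetical and CLUE does not define any concrete syntax) records the point of capture and area of capture for a media capture and orders spatially related captures from camera-left to camera-right by comparing the X coordinates of their areas of capture, since X increases from camera-left to camera-right in the coordinate system of Section 5.

   # Illustrative only: CLUE does not define a concrete syntax; the
   # class, field, and function names below are hypothetical.
   from dataclasses import dataclass
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # (X, Y, Z); X grows camera-left to camera-right

   @dataclass
   class MediaCapture:
       name: str                                    # e.g. "VC0"
       point_of_capture: Optional[Point] = None     # position of the capturing device
       area_of_capture: Optional[Tuple[Point, Point, Point, Point]] = None
       # corner order: bottom left, bottom right, top left, top right

   def camera_left_to_right(captures: List[MediaCapture]) -> List[MediaCapture]:
       """Order spatially related captures by the X midpoint of their
       areas of capture (same capture scene and same scale assumed)."""
       def mid_x(capture: MediaCapture) -> float:
           return sum(corner[0] for corner in capture.area_of_capture) / 4.0
       return sorted((c for c in captures if c.area_of_capture is not None), key=mid_x)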
490 For a switched capture that switches between different sections 491 within a larger area, the area of capture should use coordinates for 492 the larger potential area. 494 EncodingGroup: {} 496 A field with a value equal to the encodeGroupID of the encoding group 497 associated with the media capture. 499 6.2. Capture Scene 501 In order for a provider's individual media captures to be used 502 effectively by a consumer, the provider organizes the media captures 503 into capture scenes, with the structure and contents of these capture 504 scenes being sent from the provider to the consumer. 506 A capture scene is a structure representing the scene that is 507 captured by a collection of capture devices. A capture scene 508 includes one or more capture scene entries, with each entry including 509 one or more media captures. A capture scene represents, for example, 510 the video image of a group of people seated next to each other, along 511 with the sound of their voices, which could be represented by some 512 number of VCs and ACs in the capture scene entries. A middle box may 513 also express capture scenes that it constructs from media streams it 514 receives. 516 A provider may advertise multiple capture scenes or just a single 517 capture scene. A media provider might typically use one capture 518 scene for main participant media and another capture scene for a 519 computer generated presentation. A capture scene may include more 520 than one type of media. For example, a capture scene can include 521 several capture scene entries for video captures, and several capture 522 scene entries for audio captures. 524 A provider can express spatial relationships between media captures 525 that are included in the same capture scene. But there is no spatial 526 relationship between media captures that are in different capture 527 scenes. 529 A media provider arranges media captures in a capture scene to help 530 the media consumer choose which captures it wants. The capture scene 531 entries in a capture scene are different alternatives the provider is 532 suggesting for representing the capture scene. The media consumer 533 can choose to receive all media captures from one capture scene entry 534 for each media type (e.g. audio and video), or it can pick and choose 535 media captures regardless of how the provider arranges them in 536 capture scene entries. 538 Media captures within the same capture scene entry must be of the 539 same media type - it is not possible to mix audio and video captures 540 in the same capture scene entry, for instance. The provider must be 541 capable of encoding and sending all media captures in a single entry 542 simultaneously. A consumer may decide to receive all the media 543 captures in a single capture scene entry, but a consumer could also 544 decide to receive just a subset of those captures. A consumer can 545 also decide to receive media captures from different capture scene 546 entries. 548 When a provider advertises a capture scene with multiple entries, it 549 is essentially signaling that there are multiple representations of 550 the same scene available. In some cases, these multiple 551 representations would typically be used simultaneously (for instance 552 a "video entry" and an "audio entry"). In some cases the entries 553 would conceptually be alternatives (for instance an entry consisting 554 of 3 video captures versus an entry consisting of just a single video 555 capture). 
In this latter example, the provider would in the simple 556 case end up providing to the consumer the entry containing the number 557 of video captures that most closely matched the media consumer's 558 number of display devices. 560 The following is an example of 4 potential capture scene entries for 561 an endpoint-style media provider: 563 1. (VC0, VC1, VC2) - left, center and right camera video captures 565 2. (VC3) - video capture associated with loudest room segment 567 3. (VC4) - video capture zoomed out view of all people in the room 569 4. (AC0) - main audio 571 The first entry in this capture scene example is a list of video 572 captures with a spatial relationship to each other. Determination of 573 the order of these captures (VC0, VC1 and VC2) for rendering purposes 574 is accomplished through use of their Area of Capture attributes. The 575 second entry (VC3) and the third entry (VC4) are additional 576 alternatives of how to capture the same room in different ways. The 577 inclusion of the audio capture in the same capture scene indicates 578 that AC0 is associated with those video captures, meaning it comes 579 from the same scene. The audio should be rendered in conjunction 580 with any rendered video captures from the same capture scene. 582 6.2.1. Capture scene attributes 584 Attributes can be applied to capture scenes as well as to individual 585 media captures. Attributes specified at this level apply to all 586 constituent media captures. 588 Description attribute - list of {, } 590 The optional description attribute is a list of human readable text 591 strings which describe the capture scene. If there is more than one 592 string in the list, then each string in the list should contain the 593 same description, but in a different language. A provider that 594 advertises multiple capture scenes can provide descriptions for each 595 of them. This attribute can contain text in any number of languages. 597 The language tag identifies the language of the corresponding 598 description text. The possible values for a language tag are the 599 values of the 'Subtag' column for the "Type: language" entries in the 600 "Language Subtag Registry" at [IANA-Lan] originally defined in 601 [RFC5646]. A particular language tag value MUST NOT be used more 602 than once in the description attribute list. 604 Area of Scene attribute 606 The area of scene attribute for a capture scene has the same format 607 as the area of capture attribute for a media capture. The area of 608 scene is for the entire scene, which is captured by the one or more 609 media captures in the capture scene entries. If the provider does 610 not specify the area of scene, but does specify areas of capture, 611 then the consumer may assume the area of scene is greater than or 612 equal to the outer extents of the individual areas of capture. 614 Scale attribute 616 An optional attribute indicating if the numbers used for area of 617 scene, area of capture and point of capture are in terms of 618 millimeters, unknown scale factor, or not any scale, as described in 619 Section 5. If any media captures have an area of capture attribute 620 or point of capture attribute, then this scale attribute must also be 621 defined. The possible values for this attribute are: 623 "millimeters" 624 "unknown" 625 "no scale" 627 6.2.2. Capture scene entry attributes 629 Attributes can be applied to capture scene entries. Attributes 630 specified at this level apply to the capture scene entry as a whole. 
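   Pulling the pieces of Sections 6.2 through 6.2.2 together, the following sketch (Python, purely illustrative; the structure and field names are hypothetical and do not imply any particular CLUE message encoding) shows one possible in-memory representation of a capture scene with scene-level attributes, plus entries that group captures of a single media type and can carry their own entry-level attributes.  It reuses a few of the entries from the endpoint example above.

   # Illustrative only: the structure and names are hypothetical and do
   # not imply any particular CLUE message encoding.
   from dataclasses import dataclass, field
   from typing import Dict, List

   @dataclass
   class CaptureSceneEntry:
       media_type: str                       # "audio" or "video"
       captures: List[str]                   # e.g. ["VC0", "VC1", "VC2"]
       attributes: Dict[str, object] = field(default_factory=dict)  # entry-level attributes

   @dataclass
   class CaptureScene:
       attributes: Dict[str, object]         # scene-level: description, area of scene, scale
       entries: List[CaptureSceneEntry]

   main_scene = CaptureScene(
       attributes={"description": [("en", "main participant media")],
                   "scale": "millimeters"},
       entries=[
           CaptureSceneEntry("video", ["VC0", "VC1", "VC2"]),   # left, center, right cameras
           CaptureSceneEntry("video", ["VC3"]),                 # loudest room segment
           CaptureSceneEntry("audio", ["AC0"]),                 # main audio
       ],
   )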
632 Scene-switch-policy: {site-switch, segment-switch} 634 A media provider uses this scene-switch-policy attribute to indicate 635 its support for different switching policies. In the provider's 636 advertisement, this attribute can have multiple values, which means 637 the provider supports each of the indicated policies. The consumer, 638 when it requests media captures from this capture scene entry, should 639 also include this attribute but with only the single value (from 640 among the values indicated by the provider) indicating the consumer's 641 choice for which policy it wants the provider to use. If the 642 provider does not support any of these policies, it should omit this 643 attribute. 645 The "site-switch" policy means all captures are switched at the same 646 time to keep captures from the same endpoint site together. Let's 647 say the speaker is at site A and everyone else is at a "remote" site. 648 When the room at site A shown, all the camera images from site A are 649 forwarded to the remote sites. Therefore at each receiving remote 650 site, all the screens display camera images from site A. This can be 651 used to preserve full size image display, and also provide full 652 visual context of the displayed far end, site A. In site switching, 653 there is a fixed relation between the cameras in each room and the 654 displays in remote rooms. The room or participants being shown is 655 switched from time to time based on who is speaking or by manual 656 control. 658 The "segment-switch" policy means different captures can switch at 659 different times, and can be coming from different endpoints. Still 660 using site A as where the speaker is, and "remote" to refer to all 661 the other sites, in segment switching, rather than sending all the 662 images from site A, only the image containing the speaker at site A 663 is shown. The camera images of the current speaker and previous 664 speakers (if any) are forwarded to the other sites in the conference. 665 Therefore the screens in each site are usually displaying images from 666 different remote sites - the current speaker at site A and the 667 previous ones. This strategy can be used to preserve full size image 668 display, and also capture the non-verbal communication between the 669 speakers. In segment switching, the display depends on the activity 670 in the remote rooms - generally, but not necessarily based on audio / 671 speech detection. 673 6.3. Simultaneous Transmission Set Constraints 675 The provider may have constraints or limitations on its ability to 676 send media captures. One type is caused by the physical limitations 677 of capture mechanisms; these constraints are represented by a 678 simultaneous transmission set. The second type of limitation 679 reflects the encoding resources available - bandwidth and 680 macroblocks/second. This type of constraint is captured by encoding 681 groups, discussed below. 683 An endpoint or MCU can send multiple captures simultaneously, however 684 sometimes there are constraints that limit which captures can be sent 685 simultaneously with other captures. A device may not be able to be 686 used in different ways at the same time. Provider advertisements are 687 made so that the consumer will choose one of several possible 688 mutually exclusive usages of the device. This type of constraint is 689 expressed in a Simultaneous Transmission Set, which lists all the 690 media captures that can be sent at the same time. This is easier to 691 show in an example. 
   Consider the example of a room system where there are 3 cameras, each of which can send a separate capture covering 2 persons: VC0, VC1, VC2.  The middle camera can also zoom out and show all 6 persons, VC3.  But the middle camera cannot be used in both modes at the same time - it has to either show the space where 2 participants sit or the whole 6 seats, but not both at the same time.

   Simultaneous transmission sets are expressed as sets of the MCs that could physically be transmitted at the same time (though it may not make sense to do so).  In this example the two simultaneous sets are shown in Table 1.  The consumer must make sure that it chooses one and not more of the mutually exclusive sets.  A consumer may choose any subset of the media captures in a simultaneous set; it does not have to choose all the captures in a simultaneous set if it does not want to receive all of them.

                 +-------------------+
                 | Simultaneous Sets |
                 +-------------------+
                 |  {VC0, VC1, VC2}  |
                 |  {VC0, VC3, VC2}  |
                 +-------------------+

             Table 1: Two Simultaneous Transmission Sets

   A media provider includes the simultaneous sets in its provider advertisement.  These simultaneous set constraints apply across all the capture scenes in the advertisement.  The simultaneous transmission sets MUST allow all the media captures in a particular capture scene entry to be used simultaneously.

7. Encodings

   We have considered how providers can describe the content of media to consumers.  We will now consider how the providers communicate information about their abilities to send streams.  We introduce two constructs - individual encodings and encoding groups.  Consumers will then map the media captures they want onto the encodings with the encoding parameters they want.  This process is then described.

7.1. Individual Encodings

   An individual encoding represents a way to encode a media capture to become an encoded media stream sent from the media provider to the media consumer.  An individual encoding has a set of parameters characterizing how the media is encoded.  Different media types have different parameters, and different encoding algorithms may have different parameters.  An individual encoding can be used for only one actual encoded media stream at a time.

   The parameters of an individual encoding represent the maximum values for certain aspects of the encoding.  A particular instantiation into an encoded stream might use lower values than these maximums.

   The following tables show the variables for audio and video encoding.
750 +--------------+----------------------------------------------------+ 751 | Name | Description | 752 +--------------+----------------------------------------------------+ 753 | encodeID | A unique identifier for the individual encoding | 754 | maxBandwidth | Maximum number of bits per second | 755 | maxH264Mbps | Maximum number of macroblocks per second: ((width | 756 | | + 15) / 16) * ((height + 15) / 16) * | 757 | | framesPerSecond | 758 | maxWidth | Video resolution's maximum supported width, | 759 | | expressed in pixels | 760 | maxHeight | Video resolution's maximum supported height, | 761 | | expressed in pixels | 762 | maxFrameRate | Maximum supported frame rate | 763 +--------------+----------------------------------------------------+ 765 Table 2: Individual Video Encoding Parameters 767 +--------------+-----------------------------------+ 768 | Name | Description | 769 +--------------+-----------------------------------+ 770 | maxBandwidth | Maximum number of bits per second | 771 +--------------+-----------------------------------+ 773 Table 3: Individual Audio Encoding Parameters 775 7.2. Encoding Group 777 An encoding group includes a set of one or more individual encodings, 778 plus some parameters that apply to the group as a whole. By grouping 779 multiple individual encodings together, an encoding group describes 780 additional constraints on bandwidth and other parameters for the 781 group. Table 4 shows the parameters and individual encoding sets 782 that are part of an encoding group. 784 +-------------------+-----------------------------------------------+ 785 | Name | Description | 786 +-------------------+-----------------------------------------------+ 787 | encodeGroupID | A unique identifier for the encoding group | 788 | maxGroupBandwidth | Maximum number of bits per second relating to | 789 | | all encodings combined | 790 | maxGroupH264Mbps | Maximum number of macroblocks per second | 791 | | relating to all video encodings combined | 792 | videoEncodings[] | Set of potential encodings (list of | 793 | | encodeIDs) | 794 | audioEncodings[] | Set of potential encodings (list of | 795 | | encodeIDs) | 796 +-------------------+-----------------------------------------------+ 797 Table 4: Encoding Group 799 When the individual encodings in a group are instantiated into actual 800 encoded media streams, each stream has a bandwidth that must be less 801 than or equal to the maxBandwidth for the particular individual 802 encoding. The maxGroupBandwidth parameter gives the additional 803 restriction that the sum of all the individual instantiated 804 bandwidths must be less than or equal to the maxGroupBandwidth value. 806 Likewise, the sum of the macroblocks per second of each instantiated 807 encoding in the group must not exceed the maxGroupH264Mbps value. 809 The following diagram illustrates the structure of a media provider's 810 Encoding Groups and their contents. 812 ,-------------------------------------------------. 813 | Media Provider | 814 | | 815 | ,--------------------------------------. | 816 | | ,--------------------------------------. | 817 | | | ,--------------------------------------. | 818 | | | | Encoding Group | | 819 | | | | ,-----------. | | 820 | | | | | | ,---------. 
| | 821 | | | | | | | | ,---------.| | 822 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 823 | `.| | | | | | `---------'| | 824 | `.| `-----------' `---------' | | 825 | `--------------------------------------' | 826 `-------------------------------------------------' 828 Figure 1: Encoding Group Structure 830 A media provider advertises one or more encoding groups. Each 831 encoding group includes one or more individual encodings. Each 832 individual encoding can represent a different way of encoding media. 833 For example one individual encoding may be 1080p60 video, another 834 could be 720p30, with a third being CIF. 836 While a typical 3 codec/display system might have one encoding group 837 per "codec box", there are many possibilities for the number of 838 encoding groups a provider may be able to offer and for the encoding 839 values in each encoding group. 841 There is no requirement for all encodings within an encoding group to 842 be instantiated at once. 844 8. Associating Media Captures with Encoding Groups 846 Every media capture is associated with an encoding group, which is 847 used to instantiate that media capture into one or more encoded 848 streams. Each media capture has an encoding group attribute. The 849 value of this attribute is the encodeGroupID for the encoding group 850 with which it is associated. More than one media capture may use the 851 same encoding group. 853 The maximum number of streams that can result from a particular 854 encoding group constraint is equal to the number of individual 855 encodings in the group. The actual number of streams used at any 856 time may be less than this maximum. Any of the media captures that 857 use a particular encoding group can be encoded according to any of 858 the individual encodings in the group. If there are multiple 859 individual encodings in the group, then a single media capture can be 860 encoded into multiple different streams at the same time, with each 861 stream following the constraints of a different individual encoding. 863 The Encoding Groups MUST allow all the media captures in a particular 864 capture scene entry to be used simultaneously. 866 9. Consumer's Choice of Streams to Receive from the Provider 868 After receiving the provider's advertised media captures and 869 associated constraints, the consumer must choose which media captures 870 it wishes to receive, and which individual encodings from the 871 provider it wants to use to encode the capture. Each media capture 872 has an encoding group ID attribute which specifies which individual 873 encodings are available to be used for that media capture. 875 For each media capture the consumer wants to receive, it configures 876 one or more of the encodings in that capture's encoding group. The 877 consumer does this by telling the provider the resolution, frame 878 rate, bandwidth, etc. when asking for streams for its chosen 879 captures. Upon receipt of this configuration command from the 880 consumer, the provider generates streams for each such configured 881 encoding and sends those streams to the consumer. 883 The consumer must have received at least one capture advertisement 884 from the provider to be able to configure the provider's generation 885 of media streams. 887 The consumer is able to change its configuration of the provider's 888 encodings any number of times during the call, either in response to 889 a new capture advertisement from the provider or autonomously. 
   The consumer need not send a new configure message to the provider when it receives a new capture advertisement from the provider unless the contents of the new capture advertisement cause the consumer's current configure message to become invalid.

   When choosing which streams to receive from the provider, and the encoding characteristics of those streams, the consumer needs to take several things into account: its local preferences, simultaneity restrictions, and encoding limits.

9.1. Local preference

   A variety of local factors will influence the consumer's choice of streams to be received from the provider:

   o  if the consumer is an endpoint, it is likely that it would choose, where possible, to receive video and audio captures that match the number of display devices and audio system it has

   o  if the consumer is a middle box such as an MCU, it may choose to receive loudest speaker streams (in order to perform its own media composition) and avoid pre-composed video captures

   o  user choice (for instance, selection of a new layout) may result in a different set of media captures, or different encoding characteristics, being required by the consumer

9.2. Physical simultaneity restrictions

   There may be physical simultaneity constraints imposed by the provider that affect the provider's ability to simultaneously send all of the captures the consumer would wish to receive.  For instance, a middle box such as an MCU, when connected to a multi-camera room system, might prefer to receive both individual camera streams of the people present in the room and an overall view of the room from a single camera.  Some endpoint systems might be able to provide both of these sets of streams simultaneously, whereas others may not (if the overall room view were produced by changing the zoom level on the center camera, for instance).

9.3. Encoding and encoding group limits

   Each of the provider's encoding groups has limits on bandwidth and macroblocks per second, and the constituent potential encodings have limits on the bandwidth, macroblocks per second, video frame rate, and resolution that can be provided.  When choosing the media captures to be received from a provider, a consumer device must ensure that the encoding characteristics requested for each individual media capture fit within the capability of the encoding it is being configured to use, as well as ensuring that the combined encoding characteristics for media captures fit within the capabilities of their associated encoding groups.  In some cases, this could cause an otherwise "preferred" choice of streams to be passed over in favour of different streams - for instance, if a set of 3 media captures could only be provided at a low resolution then a 3 screen device could switch to favoring a single, higher quality, stream.

9.4. Message Flow

   The following diagram shows the basic flow of messages between a media provider and a media consumer.  The usage of the "capture advertisement" and "configure encodings" messages is described above.  The consumer also sends its own capability message to the provider which may contain information about its own capabilities or restrictions.
   Diagram for Message Flow

      Media Consumer                           Media Provider
      --------------                           --------------
            |                                        |
            |-------- Consumer Capability ---------->|
            |                                        |
            |<------- Capture advertisement ---------|
            |                                        |
            |-------- Configure encodings ---------->|
            |                                        |

   In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, there is a potential use for the consumer, at the start of CLUE, to be able to inform the provider of its capabilities.  One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand and so the provider would be able to reduce the capture scene it advertises to be tailored to the consumer.

   TBD - the content of the consumer capability message needs to be better defined.  The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

   One of the most important characteristics of the Framework is its extensibility.  Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop.  The standard for interoperability and handling multiple streams must be future-proof.

   The framework itself is inherently extensible through expanding the data model types.  For example:

   o  Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

   o  Adding new functionality, such as 3-D, will require additional attributes describing the captures.

   o  Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

   The infrastructure is designed to be extended rather than requiring new infrastructure elements.  Extension comes through adding to defined types.

   Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

   This section shows, in more detail, some examples of how to use the framework to represent a typical case for telepresence rooms.  First an endpoint is illustrated, then an MCU case is shown.

11.1. Three screen endpoint media provider

   Consider an endpoint with the following description:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other 2 captures shown picture in picture within the stream can be provided

   o  A capture showing a zoomed out view of all 6 seats in the room can be provided

   The audio and video captures for this endpoint can be described as follows.
1039 Video Captures: 1041 o VC0- (the camera-left camera stream), encoding group=EG0, 1042 content=main, switched=false 1044 o VC1- (the center camera stream), encoding group=EG1, content=main, 1045 switched=false 1047 o VC2- (the camera-right camera stream), encoding group=EG2, 1048 content=main, switched=false 1050 o VC3- (the loudest panel stream), encoding group=EG1, content=main, 1051 switched=true 1053 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1054 content=main, composed=true, switched=true 1056 o VC5- (the zoomed out view of all people in the room), encoding 1057 group=EG1, content=main, composed=false, switched=false 1059 o VC6- (presentation stream), encoding group=EG1, content=slides, 1060 switched=false 1062 The following diagram is a top view of the room with 3 cameras, 3 1063 displays, and 6 seats. Each camera is capturing 2 people. The six 1064 seats are not all in a straight line. 1066 ,-. d 1067 ( )`--.__ +---+ 1068 `-' / `--.__ | | 1069 ,-. | `-.._ |_-+Camera 2 (VC2) 1070 ( ).' ___..-+-''`+-+ 1071 `-' |_...---'' | | 1072 ,-.c+-..__ +---+ 1073 ( )| ``--..__ | | 1074 `-' | ``+-..|_-+Camera 1 (VC1) 1075 ,-. | __..--'|+-+ 1076 ( )| __..--' | | 1077 `-'b|..--' +---+ 1078 ,-. |``---..___ | | 1079 ( )\ ```--..._|_-+Camera 0 (VC0) 1080 `-' \ _..-''`-+ 1081 ,-. \ __.--'' | | 1082 ( ) |..-'' +---+ 1083 `-' a 1085 The two points labeled b and c are intended to be at the midpoint 1086 between the seating positions, and where the fields of view of the 1087 cameras intersect. 1088 The plane of interest for VC0 is a vertical plane that intersects 1089 points 'a' and 'b'. 1090 The plane of interest for VC1 intersects points 'b' and 'c'. 1091 The plane of interest for VC2 intersects points 'c' and 'd'. 1092 This example uses an area scale of millimeters. 1094 Areas of capture: 1095 bottom left bottom right top left top right 1096 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1097 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1098 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1099 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1100 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1101 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1102 VC6 none 1104 Points of capture: 1105 VC0 (-1678,0,800) 1106 VC1 (0,0,800) 1107 VC2 (1678,0,800) 1108 VC3 none 1109 VC4 none 1110 VC5 (0,0,800) 1111 VC6 none 1113 In this example, the right edge of the VC0 area lines up with the 1114 left edge of the VC1 area. It doesn't have to be this way. There 1115 could be a gap or an overlap. One additional thing to note for this 1116 example is the distance from a to b is equal to the distance from b 1117 to c and the distance from c to d. All these distances are 1346 mm. 1118 This is the planar width of each area of capture for VC0, VC1, and 1119 VC2. 1121 Note the text in parentheses (e.g. "the camera-left camera stream") 1122 is not explicitly part of the model, it is just explanatory text for 1123 this example, and is not included in the model with the media 1124 captures and attributes. Also, the "composed" boolean attribute 1125 doesn't say anything about how a capture is composed, so the media 1126 consumer can't tell based on this attribute that VC4 is composed of a 1127 "loudest panel with PiPs". 
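   The planar widths quoted above can be checked directly from the advertised coordinates.  The following fragment (Python, purely illustrative) computes the width of the bottom edge of the areas of capture for VC0 and VC1 from the example values, confirming the 1346 mm figure.

   # Check the stated 1346 mm planar width from the example coordinates.
   import math

   vc0_width = math.dist((-2011, 2850, 0), (-673, 3000, 0))   # bottom edge of VC0
   vc1_width = math.dist(( -673, 3000, 0), ( 673, 3000, 0))   # bottom edge of VC1

   print(round(vc0_width), round(vc1_width))   # -> 1346 1346 (millimeters)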
   Audio Captures:

   o  AC0 (camera-left), encoding group=EG3, content=main, channel format=mono

   o  AC1 (camera-right), encoding group=EG3, content=main, channel format=mono

   o  AC2 (center) encoding group=EG3, content=main, channel format=mono

   o  AC3 being a simple pre-mixed audio stream from the room (mono), encoding group=EG3, content=main, channel format=mono

   o  AC4 audio stream associated with the presentation video (mono) encoding group=EG3, content=slides, channel format=mono

   Areas of capture:
          bottom left       bottom right      top left          top right
     AC0  (-2011,2850,0)    ( -673,3000,0)    (-2011,2850,757)  ( -673,3000,757)
     AC1  (  673,3000,0)    ( 2011,2850,0)    (  673,3000,757)  ( 2011,3000,757)
     AC2  ( -673,3000,0)    (  673,3000,0)    ( -673,3000,757)  (  673,3000,757)
     AC3  (-2011,2850,0)    ( 2011,2850,0)    (-2011,2850,757)  ( 2011,3000,757)
     AC4  none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates it is not possible to use all the VCs at the same time.  VC5 cannot be used at the same time as VC1 or VC3 or VC4.  Also, using every member in the set simultaneously may not make sense - for example VC3 (loudest) and VC4 (loudest with PiPs).  (In addition, there are encoding constraints that make choosing all of the VCs in a set impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1 and EG1 has only 3 ENCs.  This constraint shows up in the encoding groups, not in the simultaneous transmission sets.)

   In this example there are no restrictions on which audio captures can be sent simultaneously.

   Encoding Groups:

   This example has three encoding groups associated with the video captures.  Each group can have 3 encodings, but with each potential encoding having a progressively lower specification.  In this example, 1080p60 transmission is possible (as ENC0 has a maxH264Mbps value compatible with that) as long as it is the only active encoding in the group (as maxGroupH264Mbps for the entire encoding group is also 489600).  Significantly, as up to 3 encodings are available per group, it is possible to transmit some video captures simultaneously that are not in the same entry in the capture scene.  For example VC1 and VC3 at the same time.

   It is also possible to transmit multiple encodings of a single video capture.  For example VC0 can be encoded using ENC0 and ENC1 at the same time, as long as the encoding parameters satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1080p30 and one at 720p30.
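   To make the arithmetic behind these statements concrete, the following sketch (Python, purely illustrative; not a normative algorithm) applies the macroblocks-per-second formula from Table 2 and the group-level limits from Section 7.2 to check whether encoding VC0 as both 1080p30 and 720p30 would fit within EG0, whose values appear in Figure 2 below.

   # Would encoding VC0 as 1080p30 (via ENC0) plus 720p30 (via ENC1)
   # fit within encoding group EG0?  Limits are taken from Figure 2;
   # the 2 Mbit/s figure for the second stream is an arbitrary choice
   # within ENC1's 4 Mbit/s per-encoding limit.

   def h264_mbps(width: int, height: int, fps: int) -> int:
       # macroblocks per second, per the formula in Table 2
       return ((width + 15) // 16) * ((height + 15) // 16) * fps

   MAX_GROUP_H264_MBPS = 489600      # EG0: maxGroupH264Mbps
   MAX_GROUP_BANDWIDTH = 6000000     # EG0: maxGroupBandwidth, bits per second

   streams = [
       ("ENC0", h264_mbps(1920, 1088, 30), 4000000),   # 244800 macroblocks/s
       ("ENC1", h264_mbps(1280,  720, 30), 2000000),   # 108000 macroblocks/s
   ]

   fits = (sum(mbps for _, mbps, _ in streams) <= MAX_GROUP_H264_MBPS and
           sum(bw for _, _, bw in streams) <= MAX_GROUP_BANDWIDTH)
   print(fits)   # True: 352800 <= 489600 and 6000000 <= 6000000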
   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

   encodeGroupID=EG2, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

              Figure 2: Example Encoding Groups for Video

   For audio, there are five potential encodings available, so all five audio captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
      encodeID=ENC9, maxBandwidth=64000
      encodeID=ENC10, maxBandwidth=64000
      encodeID=ENC11, maxBandwidth=64000
      encodeID=ENC12, maxBandwidth=64000
      encodeID=ENC13, maxBandwidth=64000

              Figure 3: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the capture scenes for this provider.  Recall that a capture scene is composed of alternative capture scene entries covering the same scene.  Capture Scene #1 is for the main people captures, and Capture Scene #2 is for presentation.

   Each row in the table is a separate entry in the capture scene.

           +------------------+
           | Capture Scene #1 |
           +------------------+
           | VC0, VC1, VC2    |
           | VC3              |
           | VC4              |
           | VC5              |
           | AC0, AC1, AC2    |
           | AC3              |
           +------------------+

           +------------------+
           | Capture Scene #2 |
           +------------------+
           | VC6              |
           | AC4              |
           +------------------+

   Different capture scenes are distinct from each other and non-overlapping.  A consumer can choose an entry from each capture scene.  In this case the three captures VC0, VC1, and VC2 are one way of representing the video from the endpoint.  These three captures should appear adjacent to each other.  Alternatively, another way of representing the Capture Scene is with the capture VC3, which automatically shows the person who is talking.  Similarly for the VC4 and VC5 alternatives.

   As in the video case, the different entries of audio in Capture Scene #1 represent the "same thing", in that one way to receive the audio is with the 3 audio captures (AC0, AC1, AC2), and another way is with the mixed AC3.  The Media Consumer can choose an audio capture entry it is capable of receiving.

   The spatial ordering is conveyed by the media capture attributes area of capture and point of capture.

   A Media Consumer would likely want to choose a capture scene entry to receive based in part on how many streams it can simultaneously receive.
   A Media Consumer would likely want to choose a capture scene entry
   to receive based in part on how many streams it can simultaneously
   receive.  A consumer that can receive three people streams would
   probably prefer to receive the first entry of Capture Scene #1 (VC0,
   VC1, VC2) and not receive the other entries.  A consumer that can
   receive only one people stream would probably choose one of the
   other entries.

   If the consumer can also receive a presentation stream, it would
   choose to receive the only entry from Capture Scene #2 (VC6).

11.2. Encoding Group Example

   This is an example of an encoding group to illustrate how it can
   express dependencies between encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the H.264 macroblock rate for 1080p30 is 244800), and it
   is capable of transmitting a maxFrameRate of 60 frames per second
   (fps).  To achieve the maximum resolution (1920 x 1088) the frame
   rate is limited to 30 fps.  However, 60 fps can be achieved at a
   lower resolution if required by the consumer.  Although the encoding
   group is capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDENC<0-2>.  It is not required that audio and video encodings
   reside within the same encoding group, but if they do, the group's
   overall maxBandwidth value is a limit on the sum of all audio and
   video encodings configured by the consumer.  A system that does not
   wish or need to combine bandwidth limitations in this way should
   instead use separate encoding groups for audio and video, so that
   the bandwidth limits on audio and video do not interact.

   Audio and video can be expressed in separate encoding groups, as in
   this illustration.

   encodeGroupID=EG0, maxGroupH264Mbps=489600, maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
         maxH264Mbps=244800, maxBandwidth=4000000

   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000
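   To make the bandwidth interaction concrete, the following sketch
   (illustrative only; the group and per-encoding limits come from the
   two listings above, while the configured bit rates are hypothetical)
   compares the combined group with the split groups.

   <CODE BEGINS>
   # Illustrative sketch only: how the group bandwidth limit behaves
   # when audio and video share one encoding group versus when they
   # are split into separate groups.

   def total_fits(max_group_bandwidth, configured_bandwidths):
       return sum(configured_bandwidths) <= max_group_bandwidth

   video = [2952000, 2952000]      # two configured video encodings, bit/s
   audio = [96000, 96000, 96000]   # three configured audio encodings, bit/s

   # Combined group (first listing): audio and video draw on the same
   # 6,000,000 bit/s budget, so the audio pushes the total over.
   print(total_fits(6000000, video + audio))   # False

   # Split groups (second listing): each medium is checked only against
   # its own group limit, so the same configuration is acceptable.
   print(total_fits(6000000, video))           # True
   print(total_fits(500000, audio))            # True
   <CODE ENDS>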
11.3. The MCU Case

   This section shows how an MCU might express its capture scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single- and multi-screen configurations; it can be
   associated (e.g. lip-synced) with any combination of video captures
   at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer    |
   | VC1, VC2           | video capture for 2 screen consumer         |
   | VC3, VC4, VC5      | video capture for 3 screen consumer         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If and when a presentation stream becomes active within the
   conference, the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   +------------------+--------------------------------------+

11.4. Media Consumer Behavior

   This section gives an example of how a media consumer might behave
   when deciding how to request streams from the three screen endpoint
   described above.

   The receive side of a call needs to balance its requirements (based
   on its number of screens and speakers, its decoding capabilities,
   and the available bandwidth) against the provider's capabilities in
   order to optimally configure the provider's streams.  Typically it
   would want to receive and decode media from each capture scene
   advertised by the provider.

   A sensible, basic algorithm might be for the consumer to go through
   each capture scene in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   entries in the video capture scenes based either on hard-coded
   preferences or on user choice.  Once this choice has been made, the
   consumer would then decide how to configure the provider's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
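   A minimal sketch of this basic algorithm is shown below.  It is
   illustrative only and assumes the hypothetical scene representation
   used in the earlier sketch (repeated here so the example is self-
   contained); preference handling between equally sized entries is
   deliberately omitted.

   <CODE BEGINS>
   # Illustrative sketch only: pick, for each scene, the video entry
   # that best matches the consumer's screen count without exceeding it.

   capture_scenes = [
       {"entries": [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"],
                    ["AC0", "AC1", "AC2"], ["AC3"]]},  # Capture Scene #1
       {"entries": [["VC6"], ["AC4"]]},                # Capture Scene #2
   ]

   def choose_video_entries(scenes, num_screens):
       choices = []
       for scene in scenes:
           video_entries = [e for e in scene["entries"]
                            if all(c.startswith("VC") for c in e)]
           fitting = [e for e in video_entries if len(e) <= num_screens]
           if fitting:
               choices.append(max(fitting, key=len))       # largest that fits
           elif video_entries:
               choices.append(min(video_entries, key=len))  # fall back
       return choices

   print(choose_video_entries(capture_scenes, 3))
   # [['VC0', 'VC1', 'VC2'], ['VC6']]
   print(choose_video_entries(capture_scenes, 1))
   # [['VC3'], ['VC6']]  (choosing among VC3/VC4/VC5 would in practice
   #                      be a preference or user choice)
   <CODE ENDS>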
11.4.1. One screen consumer

   VC3, VC4 and VC5 are all in separate entries, not grouped together
   in a single entry, so the receiving device should choose one of
   them.  The choice comes down to whether to see the greatest number
   of participants simultaneously at roughly equal precedence (VC5), a
   switched view of just the loudest region (VC3), or a switched view
   with PIPs (VC4).  An endpoint device with some knowledge of these
   differences could offer a dynamic, in-call choice of these options
   to the user.

11.4.2. Two screen consumer configuring the example

   Mixing systems with an even number of screens ("2n") and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  Three possible behaviors for the
   2 screen system, when it learns that the far end is "ideally"
   expressed via 3 capture streams, are:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5, as
       per the 1 screen consumer case above) and either leave one
       screen blank or use it for presentation if and when a
       presentation becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens, either with each capture scaled to 2/3 of a screen and
       the center capture split across the 2 screens, or, as would be
       necessary if there were large bezels on the screens, with each
       stream scaled to 1/2 the screen width and height and a 4th
       "blank" panel.  This 4th panel could potentially be used for any
       presentation that became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and center streams (one per screen) and the center and
       right streams.

   For an endpoint capable of all 3 methods of working described above,
   it might again be appropriate to offer the user the choice of
   display mode.

11.4.3. Three screen consumer configuring the example

   This is the most straightforward case: the consumer would look to
   identify a set of streams to receive that best matches its available
   screens, and so VC0 plus VC1 plus VC2 should match optimally.  The
   spatial ordering gives sufficient information for the correct video
   capture to be shown on the correct screen.  The consumer would then
   either divide a single encoding group's capability by 3 to determine
   what resolution and frame rate to configure the provider with, or
   configure the individual video captures' encoding groups with
   whatever makes most sense (taking into account the receive side
   decode capabilities, the overall call bandwidth, the resolution of
   the screens, and any user preferences such as motion versus
   sharpness).

12. Acknowledgements

   Mark Gorzynski contributed much to the approach.  We want to thank
   Stephen Botzko for helpful discussions on audio.

13. IANA Considerations

   TBD

14. Security Considerations

   TBD

15. Changes Since Last Version

   NOTE TO THE RFC-Editor: Please remove this section prior to
   publication as an RFC.

   Changes from 05 to 06:

   1.  Capture scene description attribute is a list of text strings,
       each in a different language, rather than just a single string.

   2.  Add new Axis of Capture Point attribute.

   3.  Remove appendices A.1 through A.6.

   4.  Clarify that the provider must use the same coordinate system,
       with the same scale and origin, for all coordinates within the
       same capture scene.

   Changes from 04 to 05:

   1.  Clarify limitations of "composed" attribute.

   2.  Add new section "capture scene entry attributes" and add the
       attribute "scene-switch-policy".

   3.  Add capture scene description attribute and description language
       attribute.

   4.  Editorial changes to examples section for consistency with the
       rest of the document.

   Changes from 03 to 04:

   1.  Remove sentence from overview - "This constitutes a significant
       change ..."

   2.  Clarify a consumer can choose a subset of captures from a
       capture scene entry or a simultaneous set (in sections "capture
       scene" and "consumer's choice...").

   3.  Reword first paragraph of Media Capture Attributes section.
   4.  Clarify a stereo audio capture is different from two mono audio
       captures (description of audio channel format attribute).

   5.  Clarify what it means when coordinate information is not
       specified for area of capture, point of capture, area of scene.

   6.  Change the term "producer" to "provider" to be consistent (it
       was just in two places).

   7.  Change name of "purpose" attribute to "content" and refer to
       RFC 4796 for values.

   8.  Clarify simultaneous sets are part of a provider advertisement,
       and apply across all capture scenes in the advertisement.

   9.  Remove sentence about lip-sync between all media captures in a
       capture scene.

   10. Combine the concepts of "capture scene" and "capture set" into a
       single concept, using the term "capture scene" to replace the
       previous term "capture set", and eliminating the original
       separate capture scene concept.

16. Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

   [IANA-Lan] IANA, "Language Subtag Registry",
              <http://www.iana.org/assignments/language-subtag-registry>.

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth (editor)
   Polycom
   Andover, MA 01810
   USA

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Langley, England
   UK

   Email: apeppere@gmail.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: bbaldino@cisco.com