idnits 2.17.1

draft-ietf-clue-framework-21.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document. If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 126 has weird spacing: '...certain compa...'

  == Line 1983 has weird spacing: '...om left bot...'

  -- The document date (March 3, 2015) is 3340 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC4566' is mentioned on line 1605, but not defined ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866) == Missing Reference: 'RFC6351' is mentioned on line 872, but not defined == Missing Reference: 'RFC6350' is mentioned on line 883, but not defined == Missing Reference: 'RFC 6503' is mentioned on line 3003, but not defined == Missing Reference: 'RFC 3261' is mentioned on line 3025, but not defined == Unused Reference: 'RFC4579' is defined on line 3458, but no explicit reference was found in the text == Outdated reference: A later version (-18) exists of draft-ietf-clue-datachannel-05 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-datachannel (ref. 'I-D.ietf-clue-datachannel') == Outdated reference: A later version (-17) exists of draft-ietf-clue-data-model-schema-07 == Outdated reference: A later version (-19) exists of draft-ietf-clue-protocol-02 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-protocol (ref. 'I-D.ietf-clue-protocol') == Outdated reference: A later version (-15) exists of draft-ietf-clue-signaling-04 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-signaling (ref. 'I-D.ietf-clue-signaling') == Outdated reference: A later version (-14) exists of draft-ietf-clue-rtp-mapping-03 -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 4 errors (**), 0 flaws (~~), 15 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 CLUE WG M. Duckworth, Ed. 
2 Internet Draft Polycom 3 Intended status: Standards Track A. Pepperell 4 Expires: September 3, 2015 Acano 5 S. Wenger 6 Vidyo 7 March 3, 2015 9 Framework for Telepresence Multi-Streams 10 draft-ietf-clue-framework-21.txt 12 Abstract 14 This document defines a framework for a protocol to enable devices 15 in a telepresence conference to interoperate. The protocol enables 16 communication of information about multiple media streams so a 17 sending system and receiving system can make reasonable decisions 18 about transmitting, selecting and rendering the media streams. 19 This protocol is used in addition to SIP signaling and SDP 20 negotiation for setting up a telepresence session. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current 30 Internet-Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six 33 months and may be updated, replaced, or obsoleted by other 34 documents at any time. It is inappropriate to use Internet-Drafts 35 as reference material or to cite them other than as "work in 36 progress." 38 This Internet-Draft will expire on September 3, 2015. 40 Copyright Notice 42 Copyright (c) 2013 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with 50 respect to this document. 
Code Components extracted from this
51      document must include Simplified BSD License text as described in
52      Section 4.e of the Trust Legal Provisions and are provided without
53      warranty as described in the Simplified BSD License.

55   Table of Contents

57      1. Introduction...................................................3
58      2. Terminology....................................................4
59      3. Definitions....................................................4
60      4. Overview and Motivation........................................7
61      5. Description of the Framework/Model.............................9
62      6. Spatial Relationships.........................................15
63      7. Media Captures and Capture Scenes.............................17
64         7.1. Media Captures...........................................17
65            7.1.1. Media Capture Attributes............................18
66         7.2. Multiple Content Capture.................................23
67            7.2.1. MCC Attributes......................................24
68         7.3. Capture Scene............................................30
69            7.3.1. Capture Scene attributes............................33
70            7.3.2. Capture Scene View attributes.......................33
71         7.4. Global View List.........................................34
72      8. Simultaneous Transmission Set Constraints.....................35
73      9. Encodings.....................................................37
74         9.1. Individual Encodings.....................................37
75         9.2. Encoding Group...........................................38
76         9.3. Associating Captures with Encoding Groups................39
77      10. Consumer's Choice of Streams to Receive from the Provider....40
78         10.1. Local preference........................................43
79         10.2. Physical simultaneity restrictions......................43
80         10.3. Encoding and encoding group limits......................43
81      11. Extensibility................................................44
82      12. Examples - Using the Framework (Informative).................44
83         12.1. Provider Behavior.......................................44
84            12.1.1. Three screen Endpoint Provider.....................44
85            12.1.2. Encoding Group Example.............................51
86            12.1.3. The MCU Case.......................................52

88         12.2. Media Consumer Behavior.................................53
89            12.2.1. One screen Media Consumer..........................53
90            12.2.2. Two screen Media Consumer configuring the example..54
91            12.2.3. Three screen Media Consumer configuring the example55
92         12.3. Multipoint Conference utilizing Multiple Content Captures55
93            12.3.1. Single Media Captures and MCC in the same
94            Advertisement..............................................55
95            12.3.2. Several MCCs in the same Advertisement.............59
96            12.3.3. Heterogeneous conference with switching and
97            composition................................................60
98            12.3.4. Heterogeneous conference with voice activated
99            switching..................................................67
100     13. Acknowledgements.............................................70
101     14. IANA Considerations..........................................70
102     15. Security Considerations......................................70
103     16. Changes Since Last Version...................................72
104     17. Normative References.........................................80
105     18. Informative References.......................................81
106     19. Authors' Addresses...........................................82

108  1. Introduction

110     Current telepresence systems, though based on open standards such
111     as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
112     each other. A major factor limiting the interoperability of
113     telepresence systems is the lack of a standardized way to describe
114     and negotiate the use of multiple audio and video streams
115     comprising the media flows.
This document provides a framework for 116 protocols to enable interoperability by handling multiple streams 117 in a standardized way. The framework is intended to support the 118 use cases described in Use Cases for Telepresence Multistreams 119 [RFC7205] and to meet the requirements in Requirements for 120 Telepresence Multistreams [RFC7262]. This includes cases using 121 multiple media streams that are not necessarily telepresence. 123 This document occasionally refers to the term "CLUE", in capital 124 letters. CLUE is an acronym for "ControLling mUltiple streams for 125 tElepresence", which is the name of the IETF working group in which 126 this document and certain companion documents have been developed. 127 Often, CLUE-something refers to something that has been designed by 128 the CLUE working group; for example, this document may be called 129 the CLUE-framework. 131 The basic session setup for the use cases is based on SIP [RFC3261] 132 and SDP offer/answer [RFC3264]. In addition to basic SIP & SDP 133 offer/answer, CLUE specific signaling is required to exchange the 134 information describing the multiple media streams. The motivation 135 for this framework, an overview of the signaling, and information 136 required to be exchanged is described in subsequent sections of 137 this document. Companion documents describe the signaling details 138 [I-D.ietf-clue-signaling] and the data model [I-D.ietf-clue-data- 139 model-schema] and protocol [I-D.ietf-clue-protocol]. 141 2. Terminology 143 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 144 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 145 this document are to be interpreted as described in RFC 2119 146 [RFC2119]. 148 3. Definitions 150 The terms defined below are used throughout this document and 151 companion documents. In order to easily identify the use of a 152 defined term, those terms are capitalized. 
154     Advertisement: a CLUE message a Media Provider sends to a Media
155     Consumer describing specific aspects of the content of the media,
156     and any restrictions it has in terms of being able to provide
157     certain Streams simultaneously.

159     Audio Capture: Media Capture for audio. Denoted as ACn in the
160     examples in this document.

162     Capture: Same as Media Capture.

164     Capture Device: A device that converts physical input, such as
165     audio, video or text, into an electrical signal, in most cases to
166     be fed into a media encoder.

168     Capture Encoding: A specific encoding of a Media Capture, to be
169     sent by a Media Provider to a Media Consumer via RTP.

171     Capture Scene: a structure representing a spatial region captured
172     by one or more Capture Devices, each capturing media representing a
173     portion of the region. The spatial region represented by a Capture
174     Scene MAY correspond to a real region in physical space, such as a
175     room. A Capture Scene includes attributes and one or more Capture
176     Scene Views, with each view including one or more Media Captures.

178     Capture Scene View (CSV): a list of Media Captures of the same
179     media type that together form one way to represent the entire
180     Capture Scene.

182     CLUE-capable device: A device that supports the CLUE data channel
183     [I-D.ietf-clue-datachannel], the CLUE protocol
184     [I-D.ietf-clue-protocol] and the principles of CLUE negotiation,
185     and seeks CLUE-enabled calls.

187     CLUE-enabled call: A call in which two CLUE-capable devices have
188     successfully negotiated support for a CLUE data channel in SDP
189     [RFC4566]. A CLUE-enabled call is not necessarily immediately able
190     to send CLUE-controlled media; negotiation of the data channel and
191     of the CLUE protocol must complete first. Calls between two
192     CLUE-capable devices which have not yet successfully completed
193     negotiation of support for the CLUE data channel in SDP are not
194     considered CLUE-enabled.

196 Conference: used as defined in [RFC4353], A Framework for 197 Conferencing within the Session Initiation Protocol (SIP). 199 Configure Message: A CLUE message a Media Consumer sends to a Media 200 Provider specifying which content and Media Streams it wants to 201 receive, based on the information in a corresponding Advertisement 202 message. 204 Consumer: short for Media Consumer. 206 Encoding: short for Individual Encoding. 208 Encoding Group: A set of encoding parameters representing a total 209 media encoding capability to be sub-divided across potentially 210 multiple Individual Encodings. 212 Endpoint: A CLUE-capable device which is the logical point of final 213 termination through receiving, decoding and rendering, and/or 214 initiation through capturing, encoding, and sending of media 215 streams. An endpoint consists of one or more physical devices 216 which source and sink media streams, and exactly one [RFC4353] 217 Participant (which, in turn, includes exactly one SIP User Agent). 218 Endpoints can be anything from multiscreen/multicamera rooms to 219 handheld devices. 221 Global View: A set of references to one or more Capture Scene Views 222 of the same media type that are defined within Scenes of the same 223 advertisement. A Global View is a suggestion from the Provider to 224 the Consumer for one set of CSVs that provide a useful 225 representation of all the scenes in the advertisement. 227 Global View List: A list of Global Views included in an 228 Advertisement. A Global View List may include Global Views of 229 different media types. 231 Individual Encoding: a set of parameters representing a way to 232 encode a Media Capture to become a Capture Encoding. 234 Multipoint Control Unit (MCU): a CLUE-capable device that connects 235 two or more endpoints together into one single multimedia 236 conference [RFC5117]. An MCU includes an [RFC4353] like Mixer, 237 without the [RFC4353] requirement to send media to each 238 participant. 
240 Media: Any data that, after suitable encoding, can be conveyed over 241 RTP, including audio, video or timed text. 243 Media Capture: a source of Media, such as from one or more Capture 244 Devices or constructed from other Media streams. 246 Media Consumer: a CLUE-capable device that intends to receive 247 Capture Encodings. 249 Media Provider: a CLUE-capable device that intends to send Capture 250 Encodings. 252 Multiple Content Capture (MCC): A Capture that mixes and/or 253 switches other Captures of a single type. (E.g. all audio or all 254 video.) Particular Media Captures may or may not be present in the 255 resultant Capture Encoding depending on time or space. Denoted as 256 MCCn in the example cases in this document. 258 Plane of Interest: The spatial plane within a scene containing the 259 most relevant subject matter. 261 Provider: Same as Media Provider. 263 Render: the process of generating a representation from media, such 264 as displayed motion video or sound emitted from loudspeakers. 266 Scene: Same as Capture Scene 268 Simultaneous Transmission Set: a set of Media Captures that can be 269 transmitted simultaneously from a Media Provider. 271 Single Media Capture: A capture which contains media from a single 272 source capture device, e.g. an audio capture from a single 273 microphone, a video capture from a single camera. 275 Spatial Relation: The arrangement in space of two objects, in 276 contrast to relation in time or other relationships. 278 Stream: a Capture Encoding sent from a Media Provider to a Media 279 Consumer via RTP [RFC3550]. 281 Stream Characteristics: the media stream attributes commonly used 282 in non-CLUE SIP/SDP environments (such as: media codec, bit rate, 283 resolution, profile/level etc.) as well as CLUE specific 284 attributes, such as the Capture ID or a spatial location. 286 Video Capture: Media Capture for video. Denoted as VCn in the 287 example cases in this document. 
289 Video Composite: A single image that is formed, normally by an RTP 290 mixer inside an MCU, by combining visual elements from separate 291 sources. 293 4. Overview and Motivation 295 This section provides an overview of the functional elements 296 defined in this document to represent a telepresence or 297 multistream system. The motivations for the framework described 298 in this document are also provided. 300 Two key concepts introduced in this document are the terms "Media 301 Provider" and "Media Consumer". A Media Provider represents the 302 entity that sends the media and a Media Consumer represents the 303 entity that receives the media. A Media Provider provides Media in 304 the form of RTP packets, a Media Consumer consumes those RTP 305 packets. Media Providers and Media Consumers can reside in 306 Endpoints or in Multipoint Control Units (MCUs). A Media Provider 307 in an Endpoint is usually associated with the generation of media 308 for Media Captures; these Media Captures are typically sourced 309 from cameras, microphones, and the like. Similarly, the Media 310 Consumer in an Endpoint is usually associated with renderers, such 311 as screens and loudspeakers. In MCUs, Media Providers and 312 Consumers can have the form of outputs and inputs, respectively, 313 of RTP mixers, RTP translators, and similar devices. Typically, 314 telepresence devices such as Endpoints and MCUs would perform as 315 both Media Providers and Media Consumers, the former being 316 concerned with those devices' transmitted media and the latter 317 with those devices' received media. In a few circumstances, a 318 CLUE-capable device includes only Consumer or Provider 319 functionality, such as recorder-type Consumers or webcam-type 320 Providers. 
322 The motivations for the framework outlined in this document 323 include the following: 325 (1) Endpoints in telepresence systems typically have multiple Media 326 Capture and Media Render devices, e.g., multiple cameras and 327 screens. While previous system designs were able to set up calls 328 that would capture media using all cameras and display media on all 329 screens, for example, there was no mechanism that could associate 330 these Media Captures with each other in space and time, in a cross- 331 vendor interoperable way. 333 (2) The mere fact that there are multiple capturing and rendering 334 devices, each of which may be configurable in aspects such as zoom, 335 leads to the difficulty that a variable number of such devices can 336 be used to capture different aspects of a region. The Capture 337 Scene concept allows for the description of multiple setups for 338 those multiple capture devices that could represent sensible 339 operation points of the physical capture devices in a room, chosen 340 by the operator. A Consumer can pick and choose from those 341 configurations based on its rendering abilities and inform the 342 Provider about its choices. Details are provided in section 7. 344 (3) In some cases, physical limitations or other reasons disallow 345 the concurrent use of a device in more than one setup. For 346 example, the center camera in a typical three-camera conference 347 room can set its zoom objective either to capture only the middle 348 few seats, or all seats of a room, but not both concurrently. The 349 Simultaneous Transmission Set concept allows a Provider to signal 350 such limitations. Simultaneous Transmission Sets are part of the 351 Capture Scene description, and are discussed in section 8. 
353     (4) Often, the devices in a room do not have the computational
354     capacity or connectivity to deal with multiple encoding options
355     simultaneously, even if each of these options is sensible in
356     certain scenarios, and even if the simultaneous transmission is
357     also sensible (i.e. in the case of multicast media distribution to
358     multiple endpoints). Such constraints can be expressed by the
359     Provider using the Encoding Group concept, described in section 9.

361     (5) Due to the potentially large number of RTP streams required for
362     a Multimedia Conference involving potentially many Endpoints, each
363     of which can have many Media Captures and media renderers, it has
364     become common to multiplex multiple RTP streams onto the same
365     transport address, so as to avoid using the port number as a
366     multiplexing point and the associated shortcomings such as
367     NAT/firewall traversal. The large number of possible permutations
368     of sensible options a Media Provider can make available to a Media
369     Consumer makes a mechanism desirable that allows it to narrow down
370     the number of possible options that a SIP offer/answer exchange has
371     to consider. Such information is made available using protocol
372     mechanisms specified in this document and companion documents,
373     although it should be stressed that its use in an implementation is
374     OPTIONAL. Also, there are aspects of the control of both Endpoints
375     and MCUs that dynamically change during the progress of a call,
376     such as audio-level based screen switching, layout changes, and so
377     on, which need to be conveyed. Note that these control aspects are
378     complementary to those specified in traditional SIP based
379     conference management such as BFCP. An exemplary call flow can be
380     found in section 5.

382     Finally, all this information needs to be conveyed, and the notion
383     of support for it needs to be established. This is done by the
384     negotiation of a "CLUE channel", a data channel negotiated early
385     during the initiation of a call. An Endpoint or MCU that rejects
386     the establishment of this data channel, by definition, does not
387     support CLUE based mechanisms, whereas an Endpoint or MCU that
388     accepts it is REQUIRED to use it to the extent specified in this
389     document and its companion documents.

391  5. Description of the Framework/Model

393     The CLUE framework specifies how multiple media streams are to be
394     handled in a telepresence conference.

396     A Media Provider (transmitting Endpoint or MCU) describes specific
397     aspects of the content of the media and the media stream encodings
398     it can send in an Advertisement; and the Media Consumer responds to
399     the Media Provider by specifying which content and media streams it
400     wants to receive in a Configure message. The Provider then
401     transmits the asked-for content in the specified streams.

403     This Advertisement and Configure typically occur during call
404     initiation, after CLUE has been enabled in a call, but MAY also
405     happen at any time throughout the call, whenever there is a change
406     in what the Consumer wants to receive or (perhaps less common) the
407     Provider can send.

409     An Endpoint or MCU typically acts as both Provider and Consumer at
410     the same time, sending Advertisements and sending Configure
411     messages in response to receiving Advertisements. (It is possible
412     to be just one or the other.)

414     The data model [I-D.ietf-clue-data-model-schema] is based around two
415     main concepts: a Capture and an Encoding. A Media Capture (MC),
416     such as of type audio or video, has attributes to describe the
417     content a Provider can send. Media Captures are described in terms
418     of CLUE-defined attributes, such as spatial relationships and
419     purpose of the capture. Providers tell Consumers which Media
420     Captures they can provide, described in terms of the Media Capture
421     attributes.

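The Advertisement/Configure exchange described above can be illustrated with a minimal sketch. It is not the CLUE data model or wire format: the class names, the `attributes` dictionary and the helper functions are hypothetical and chosen only for illustration, and the capture identities (VC1, VC2, AC1) follow the naming used in this document's examples.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MediaCapture:
    """Hypothetical stand-in for a Media Capture entry in an Advertisement."""
    capture_id: str                  # e.g. "VC1" or "AC1"
    media_type: str                  # "audio" or "video"
    attributes: dict = field(default_factory=dict)  # illustrative attribute bag


@dataclass
class Advertisement:
    """What a Provider says it can send (captures only, in this sketch)."""
    captures: list


@dataclass
class Configure:
    """What a Consumer asks to receive, by capture identity."""
    capture_ids: list


def build_configure(advertisement, wanted_media_types):
    """Consumer side: pick every advertised capture of a wanted media type."""
    return Configure([c.capture_id for c in advertisement.captures
                      if c.media_type in wanted_media_types])


def is_valid_configure(advertisement, configure):
    """Provider side: a Configure may only name captures that were advertised."""
    advertised = {c.capture_id for c in advertisement.captures}
    return set(configure.capture_ids) <= advertised


adv = Advertisement([
    MediaCapture("VC1", "video", {"description": "left camera"}),
    MediaCapture("VC2", "video", {"description": "right camera"}),
    MediaCapture("AC1", "audio"),
])
cfg = build_configure(adv, {"video"})
```

In this sketch the Consumer's choice is driven purely by media type; a real Consumer would also weigh the capture attributes, Capture Scenes and encoding constraints described in later sections.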
423     A Provider organizes its Media Captures into one or more Capture
424     Scenes, each representing a spatial region, such as a room. A
425     Consumer chooses which Media Captures it wants to receive from the
426     Capture Scenes.

428     In addition, the Provider can send the Consumer a description of
429     the Individual Encodings it can send in terms of identifiers which
430     relate to items in SDP [RFC4566].

432     The Provider can also specify constraints on its ability to provide
433     Media, and a sensible design choice for a Consumer is to take these
434     into account when choosing the content and Capture Encodings it
435     requests in the later offer/answer exchange. Some constraints are
436     due to the physical limitations of devices--for example, a camera
437     may not be able to provide zoom and non-zoom views simultaneously.
438     Other constraints are system based, such as maximum bandwidth.

440     The following diagram illustrates the information contained in an
441     Advertisement.

443     ...................................................................
444     .  Provider Advertisement                  +--------------------+ .
445     .                                          | Simultaneous Sets  | .
446     .        +------------------------+        +--------------------+ .
447     .        |   Capture Scene N      |        +--------------------+ .
448     .      +-+----------------------+ |        | Global View List   | .
449     .      |   Capture Scene 2      | |        +--------------------+ .
450     .    +-+----------------------+ | |      +----------------------+ .
451     .    | Capture Scene 1        | | |      | Encoding Group N     | .
452     .    |    +---------------+   | | |    +-+--------------------+ | .
453     .    |    | Attributes    |   | | |    | Encoding Group 2     | | .
454     .    |    +---------------+   | | |  +-+--------------------+ | | .
455     .    |                        | | |  | Encoding Group 1     | | | .
456     .    |    +----------------+  | | |  | parameters           | | | .
457     .    |    |    V i e w s   |  | | |  | bandwidth            | | | .
458     .    |    | +---------+    |  | | |  | +-------------------+| | | .
459     .    |    | |Attribute|    |  | | |  | | V i d e o         || | | .
460     .    |    | +---------+    |  | | |  | | E n c o d i n g s || | | .
461     .    |    |                |  | | |  | | Encoding 1        || | | .
462     .    |    | View 1         |  | | |  | |                   || | | .
463     .    |    | (list of MCs)  |  | |-+  | +-------------------+| | | .
464     .    |    +----|-|--|------+  |-+    |                      | | | .
465     .    +---------|-|--|---------+      | +-------------------+| | | .
466     .              | |  |                | | A u d i o         || | | .
467     .              | |  |                | | E n c o d i n g s || | | .
468     .              v |  |                | | Encoding 1        || | | .
469     .      +---------|--|--------+       | |                   || | | .
470     .      | Media Capture N     |------>| +-------------------+| | | .
471     .    +-+---------v--|------+ |       |                      | | | .
472     .    | Media Capture 2     | |       |                      | |-+ .
473     .  +-+--------------v----+ |-------->|                      | |   .
474     .  | Media Capture 1     | | |       |                      |-+   .
475     .  | +----------------+  |---------->|                      |     .
476     .  | | Attributes     |  | |_+       +----------------------+     .
477     .  | +----------------+  |_+                                      .
478     .  +---------------------+                                        .
479     .                                                                 .
480     ...................................................................

482                  Figure 1: Advertisement Structure

484     A very brief outline of the call flow used by a simple system (two
485     Endpoints) in compliance with this document can be described as
486     follows, and as shown in the following figure.

488     +-----------+                     +-----------+
489     | Endpoint1 |                     | Endpoint2 |
490     +----+------+                     +-----+-----+
491          | INVITE (BASIC SDP+CLUECHANNEL)   |
492          |--------------------------------->|
493          |    200 OK (BASIC SDP+CLUECHANNEL)|
494          |<---------------------------------|
495          | ACK                              |
496          |--------------------------------->|
497          |                                  |
498          |<################################>|
499          |     BASIC SDP MEDIA SESSION      |
500          |<################################>|
501          |                                  |
502          |    CONNECT (CLUE CTRL CHANNEL)   |
503          |=================================>|
504          |            ...                   |
505          |<================================>|
506          |   CLUE CTRL CHANNEL ESTABLISHED  |
507          |<================================>|
508          |                                  |
509          | ADVERTISEMENT 1                  |
510          |*********************************>|
511          |                  ADVERTISEMENT 2 |
512          |<*********************************|
513          |                                  |
514          |                      CONFIGURE 1 |
515          |<*********************************|
516          | CONFIGURE 2                      |
517          |*********************************>|
518          |                                  |
519          | REINVITE (UPDATED SDP)           |
520          |--------------------------------->|
521          |              200 OK (UPDATED SDP)|
522          |<---------------------------------|
523          | ACK                              |
524          |--------------------------------->|
525          |                                  |
526          |<################################>|
527          |     UPDATED SDP MEDIA SESSION    |
528          |<################################>|
529          |                                  |
530          v                                  v

532                  Figure 2: Basic Information Flow

534     An initial offer/answer exchange establishes a basic media session,
535     for example audio-only, and a CLUE channel between two Endpoints.
536     With the establishment of that channel, the endpoints have
537     consented to use the CLUE protocol mechanisms and, therefore, MUST
538     adhere to the CLUE protocol suite as outlined herein.

540     Over this CLUE channel, the Provider in each Endpoint conveys its
541     characteristics and capabilities by sending an Advertisement as
542     specified herein. The Advertisement is typically not sufficient to
543     set up all media. The Consumer in the Endpoint receives the
544     information provided by the Provider, and can use it for several
545     purposes. It uses it, along with information from an offer/answer
546     exchange, to construct a CLUE Configure message to tell the
547     Provider what the Consumer wishes to receive. Also, the Consumer
548     MAY use the information provided to tailor the SDP it is going to
549     send during any following SIP offer/answer exchange, and its
550     reaction to SDP it receives in that step. It is often a sensible
551     implementation choice to do so.
Spatial relationships associated 552 with the Media can be included in the Advertisement, and it is 553 often sensible for the Media Consumer to take those spatial 554 relationships into account when tailoring the SDP. The Consumer 555 can also limit the number of encodings it must set up resources to 556 receive, and not waste resources on unwanted encodings, because it 557 has the Provider's Advertisement information ahead of time to 558 determine what it really wants to receive. The Consumer can also 559 use the Advertisement information for local rendering decisions. 561 This initial CLUE exchange is followed by an SDP offer/answer 562 exchange that not only establishes those aspects of the media that 563 have not been "negotiated" over CLUE, but has also the side effect 564 of setting up the media transmission itself, involving potentially 565 security exchanges, ICE, and whatnot. This step is plain vanilla 566 SIP. 568 During the lifetime of a call, further exchanges MAY occur over the 569 CLUE channel. In some cases, those further exchanges lead to a 570 modified system behavior of Provider or Consumer (or both) without 571 any other protocol activity such as further offer/answer exchanges. 572 For example, a Configure Message requesting the Provider to place a 573 different Capture source into a Capture Encoding, signaled over the 574 CLUE channel, ought not to lead to heavy-handed mechanisms like SIP 575 re-invites. However, in other cases, after the CLUE negotiation an 576 additional offer/answer exchange becomes necessary. For example, 577 if both sides decide to upgrade the call from a single screen to a 578 multi-screen call and more bandwidth is required for the additional 579 video channels compared to what was previously negotiated using 580 offer/answer, a new O/A exchange is REQUIRED. 
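The bandwidth example in the preceding paragraph can be sketched as a simple check. This is only an illustration under stated assumptions: `NegotiatedSession`, its `max_video_kbps` field and the kbps figures are hypothetical, and the document does not define an algorithm for deciding when a new offer/answer exchange is needed.

```python
from dataclasses import dataclass


@dataclass
class NegotiatedSession:
    """Hypothetical record of what the last offer/answer exchange agreed."""
    max_video_kbps: int   # assumed: total video bandwidth negotiated in SDP


def needs_new_offer_answer(session, requested_video_kbps):
    """Return True when a CLUE-level change exceeds what SDP already allows.

    A change that fits inside the negotiated limits (for example, placing a
    different Capture into an existing Capture Encoding) stays on the CLUE
    channel; one that needs more bandwidth calls for a new O/A exchange.
    """
    return requested_video_kbps > session.max_video_kbps


session = NegotiatedSession(max_video_kbps=2000)

# Single screen, switching the capture source: same stream, same budget.
switch_source = needs_new_offer_answer(session, requested_video_kbps=1500)

# Upgrade to three screens at 1500 kbps each: exceeds the negotiated budget.
upgrade_to_three = needs_new_offer_answer(session, requested_video_kbps=3 * 1500)
```

Bandwidth is only one of the parameters an implementation might compare; the same pattern would apply to anything else that was fixed by the earlier offer/answer exchange.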
582 One aspect of the protocol outlined herein and specified in more 583 detail in companion documents is that it makes available, to the 584 Consumer, information regarding the Provider's capabilities to 585 deliver Media, and attributes related to that Media such as their 586 spatial relationship. The operation of the renderer inside the 587 Consumer is unspecified in that it can choose to ignore some 588 information provided by the Provider, and/or not render media 589 streams available from the Provider (although it MUST follow the 590 CLUE protocol and, therefore, MUST gracefully receive and respond 591 (through a Configure) to the Provider's information). 593 A CLUE-capable device interoperates with a device that does not 594 support CLUE. The CLUE-capable device can determine, by the result 595 of the initial offer/answer exchange, if the other device supports 596 and wishes to use CLUE. The specific mechanism for this is 597 described in [I-D.ietf-clue-signaling]. If the other device does 598 not use CLUE, then the CLUE-capable device falls back to behavior 599 that does not require CLUE. 601 As for the media, Provider and Consumer have an end-to-end 602 communication relationship with respect to (RTP transported) media; 603 and the mechanisms described herein and in companion documents do 604 not change the aspects of setting up those RTP flows and sessions. 605 In other words, the RTP media sessions conform to the negotiated 606 SDP whether or not CLUE is used. 608 6. Spatial Relationships 610 In order for a Consumer to perform a proper rendering, it is often 611 necessary or at least helpful for the Consumer to have received 612 spatial information about the streams it is receiving. CLUE 613 defines a coordinate system that allows Media Providers to describe 614 the spatial relationships of their Media Captures to enable proper 615 scaling and spatially sensible rendering of their streams. 
The 616 coordinate system is based on a few principles: 618 o Each Capture Scene has a distinct coordinate system, unrelated 619 to the coordinate systems of other scenes. 621 o Simple systems which do not have multiple Media Captures to 622 associate spatially need not use the coordinate model, although 623 it can still be useful to provide an Area of Capture. 625 o Coordinates can be either in real, physical units (millimeters), 626 have an unknown scale or have no physical scale. Systems which 627 know their physical dimensions (for example professionally 628 installed Telepresence room systems) MUST provide those real- 629 world measurements to enable the best user experience for 630 advanced receiving systems that can utilize this information. 631 Systems which don't know specific physical dimensions but still 632 know relative distances MUST use 'unknown scale'. 'No scale' is 633 intended to be used only where Media Captures from different 634 devices (with potentially different scales) will be forwarded 635 alongside one another (e.g. in the case of an MCU). 637 * "Millimeters" means the scale is in millimeters. 639 * "Unknown" means the scale is not necessarily millimeters, but 640 the scale is the same for every Capture in the Capture Scene. 642 * "No Scale" means the scale could be different for each 643 capture- an MCU Provider that advertises two adjacent 644 captures and picks sources (which can change quickly) from 645 different endpoints might use this value; the scale could be 646 different and changing for each capture. But the areas of 647 capture still represent a spatial relation between captures. 649 o The coordinate system is right-handed Cartesian X, Y, Z with the 650 origin at a spatial location of the Provider's choosing. The 651 Provider MUST use the same coordinate system with the same scale 652 and origin for all coordinates within the same Capture Scene. 
654 The direction of increasing coordinate values is: 655 X increases from left to right, from the point of view of an 656 observer at the front of the room looking toward the back 657 Y increases from the front of the room to the back of the room 658 Z increases from low to high (i.e. floor to ceiling) 660 Cameras in a scene typically point in the direction of increasing 661 Y, from front to back. But there could be multiple cameras 662 pointing in different directions. If the physical space does not 663 have a well-defined front and back, the provider chooses any 664 direction for X and Y consistent with right-handed coordinates. 666 7. Media Captures and Capture Scenes 668 This section describes how Providers can describe the content of 669 media to Consumers. 671 7.1. Media Captures 673 Media Captures are the fundamental representations of streams that 674 a device can transmit. What a Media Capture actually represents is 675 flexible: 677 o It can represent the immediate output of a physical source (e.g. 678 camera, microphone) or 'synthetic' source (e.g. laptop computer, 679 DVD player) 681 o It can represent the output of an audio mixer or video composer 683 o It can represent a concept such as 'the loudest speaker' 685 o It can represent a conceptual position such as 'the leftmost 686 stream' 688 To identify and distinguish between multiple Capture instances 689 Captures have a unique identity. For instance: VC1, VC2 and AC1, 690 AC2, where VC1 and VC2 refer to two different video captures and 691 AC1 and AC2 refer to two different audio captures. 693 Some key points about Media Captures: 695 . A Media Capture is of a single media type (e.g. audio or 696 video) 697 . A Media Capture is defined in a Capture Scene and is given an 698 Advertisement unique identity. The identity may be referenced 699 outside the Capture Scene that defines it through a Multiple 700 Content Capture (MCC) 701 . 
A Media Capture may be associated with one or more Capture 702 Scene Views 703 . A Media Capture has exactly one set of spatial information 704 . A Media Capture can be the source of at most one Capture 705 Encoding 707 Each Media Capture can be associated with attributes to describe 708 what it represents. 710 7.1.1. Media Capture Attributes 712 Media Capture Attributes describe information about the Captures. 713 A Provider can use the Media Capture Attributes to describe the 714 Captures for the benefit of the Consumer of the Advertisement 715 message. All these attributes are optional. Media Capture 716 Attributes include: 718 . Spatial information, such as point of capture, point on line 719 of capture, and area of capture, all of which, in combination 720 define the capture field of, for example, a camera 721 . Other descriptive information to help the Consumer choose 722 between captures (e.g. description, presentation, view, 723 priority, language, person information and type) 725 The sub-sections below define the Capture attributes. 727 7.1.1.1. Point of Capture 729 The Point of Capture attribute is a field with a single Cartesian 730 (X, Y, Z) point value which describes the spatial location of the 731 capturing device (such as camera). For an Audio Capture with 732 multiple microphones, the Point of Capture defines the nominal mid- 733 point of the microphones. 735 7.1.1.2. Point on Line of Capture 737 The Point on Line of Capture attribute is a field with a single 738 Cartesian (X, Y, Z) point value which describes a position in space 739 of a second point on the axis of the capturing device, toward the 740 direction it is pointing; the first point being the Point of 741 Capture (see above). 743 Together, the Point of Capture and Point on Line of Capture define 744 the direction and axis of the capturing device, for example the 745 optical axis of a camera or the axis of a microphone. 
The Media
746   Consumer can use this information to adjust how it renders the
747   received media if it so chooses.

749   For an Audio Capture, the Media Consumer can use this information
750   along with the Audio Capture Sensitivity Pattern to define a 3-
751   dimensional volume of capture where sounds can be expected to be
752   picked up by the microphone providing this specific audio capture.
753   If the Consumer wants to associate an Audio Capture with a Video
754   Capture, it can compare this volume with the area of capture for
755   video media to provide a check on whether the audio capture is
756   indeed spatially associated with the video capture. For example, a
757   video area of capture that fails to intersect at all with the audio
758   volume of capture, or is at such a long radial distance from the
759   microphone point of capture that the audio level would be very low,
760   would be inappropriate.

762   7.1.1.3. Area of Capture

764   The Area of Capture is a field with a set of four (X, Y, Z) points
765   as a value which describes the spatial location of what is being
766   "captured". This attribute applies only to video captures, not
767   other types of media. By comparing the Area of Capture for
768   different Video Captures within the same Capture Scene a Consumer
769   can determine the spatial relationships between them and render
770   them correctly.

772   The four points MUST be co-planar, forming a quadrilateral, which
773   defines the Plane of Interest for the particular Media Capture.

775   If the Area of Capture is not specified, it means the Video Capture
776   might be spatially related to other Captures in the same Scene, but
777   there is no detailed information on the relationship. For a switched
778   Capture that switches between different sections within a larger
779   area, the area of capture MUST use coordinates for the larger
780   potential area.

782   7.1.1.4. Mobility of Capture

784   The Mobility of Capture attribute indicates whether or not the
785   point of capture, point on line of capture, and area of capture
786   values stay the same over time, or are expected to change
787   (potentially frequently). Possible values are static, dynamic, and
788   highly dynamic.

790   An example for "dynamic" is a camera mounted on a stand which is
791   occasionally hand-carried and placed at different positions in
792   order to provide the best angle to capture a work task. A camera
793   worn by a person who moves around the room is an example for
794   "highly dynamic". In either case, the effect is that the capture
795   point, capture axis and area of capture change with time.

797   The capture point of a static Capture MUST NOT move for the life of
798   the CLUE session. The capture point of dynamic Captures is
799   characterized by a change in position followed by a reasonable
800   period of stability, on the order of minutes. Highly dynamic
801   Captures are characterized by a capture point that is constantly
802   moving. If the "area of capture", "capture point" and "line of
803   capture" attributes are included with dynamic or highly dynamic
804   Captures they indicate spatial information at the time of the
805   Advertisement.

807   7.1.1.5. Audio Capture Sensitivity Pattern

809   The Audio Capture Sensitivity Pattern attribute applies only to
810   audio captures. This attribute gives information about the nominal
811   sensitivity pattern of the microphone which is the source of the
812   Capture. Possible values include patterns such as omni, shotgun,
813   cardioid, hyper-cardioid.

815   7.1.1.6. Description

817   The Description attribute is a human-readable description (which
818   could be in multiple languages) of the Capture.

820   7.1.1.7.
Presentation 822 The Presentation attribute indicates that the capture originates 823 from a presentation device, that is one that provides supplementary 824 information to a conference through slides, video, still images, 825 data etc. Where more information is known about the capture it MAY 826 be expanded hierarchically to indicate the different types of 827 presentation media, e.g. presentation.slides, presentation.image 828 etc. 830 Note: It is expected that a number of keywords will be defined that 831 provide more detail on the type of presentation. 833 7.1.1.8. View 835 The View attribute is a field with enumerated values, indicating 836 what type of view the Capture relates to. The Consumer can use 837 this information to help choose which Media Captures it wishes to 838 receive. The value MUST be one of: 840 Room - Captures the entire scene 842 Table - Captures the conference table with seated people 844 Individual - Captures an individual person 845 Lectern - Captures the region of the lectern including the 846 presenter, for example in a classroom style conference room 848 Audience - Captures a region showing the audience in a classroom 849 style conference room 851 7.1.1.9. Language 853 The Language attribute indicates one or more languages used in the 854 content of the Media Capture. Captures MAY be offered in different 855 languages in case of multilingual and/or accessible conferences. A 856 Consumer can use this attribute to differentiate between them and 857 pick the appropriate one. 859 Note that the Language attribute is defined and meaningful both for 860 audio and video captures. In case of audio captures, the meaning 861 is obvious. For a video capture, "Language" could, for example, be 862 sign interpretation or text. 864 The Language attribute is coded per [RFC5646]. 866 7.1.1.10. 
Person Information 868 The Person Information attribute allows a Provider to provide 869 specific information regarding the people in a Capture (regardless 870 of whether or not the capture has a Presentation attribute). The 871 Provider may gather the information automatically or manually from 872 a variety of sources however the xCard [RFC6351] format is used to 873 convey the information. This allows various information such as 874 Identification information (section 6.2/[RFC6350]), Communication 875 Information (section 6.4/[RFC6350]) and Organizational information 876 (section 6.6/[RFC6350]) to be communicated. A Consumer may then 877 automatically (i.e. via a policy) or manually select Captures 878 based on information about who is in a Capture. It also allows a 879 Consumer to render information regarding the people participating 880 in the conference or to use it for further processing. 882 The Provider may supply a minimal set of information or a larger 883 set of information. However it MUST be compliant to [RFC6350] and 884 supply a "VERSION" and "FN" property. A Provider may supply 885 multiple xCards per Capture of any KIND (section 6.1.4/[RFC6350]). 887 In order to keep CLUE messages compact the Provider SHOULD use a 888 URI to point to any LOGO, PHOTO or SOUND contained in the xCARD 889 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 890 message. 892 7.1.1.11. Person Type 894 The Person Type attribute indicates the type of people contained in 895 the capture with respect to the meeting agenda (regardless of 896 whether or not the capture has a Presentation attribute). As a 897 capture may include multiple people the attribute may contain 898 multiple values. However values MUST NOT be repeated within the 899 attribute. 901 An Advertiser associates the person type with an individual capture 902 when it knows that a particular type is in the capture. 
If an
903   Advertiser cannot link a particular type with some certainty to a
904   capture then it is not included. A Consumer on reception of a
905   capture with a person type attribute knows with some certainty that
906   the capture contains that person type. The capture may contain
907   other person types but the Advertiser has not been able to
908   determine that this is the case.

910   The types of Captured people include:

912   . Chairman - the person responsible for running the meeting
913     according to the agenda.
914   . Vice-Chairman - the person responsible for assisting the
915     chairman in running the meeting.
916   . Minute Taker - the person responsible for recording the
917     minutes of the meeting.
918   . Attendee - the person has no particular responsibilities with
919     respect to running the meeting.
920   . Observer - an Attendee without the right to influence the
921     discussion.
922   . Presenter - the person is scheduled on the agenda to make a
923     presentation in the meeting. Note: This is not related to any
924     "active speaker" functionality.
925   . Translator - the person is providing some form of translation
926     or commentary in the meeting.
927   . Timekeeper - the person is responsible for maintaining the
928     meeting schedule.

930   Furthermore the person type attribute may contain one or more
931   strings allowing the Provider to indicate custom meeting specific
932   types.

934   7.1.1.12. Priority

936   The Priority attribute indicates a relative priority between
937   different Media Captures. The Provider sets this priority, and the
938   Consumer MAY use the priority to help decide which Captures it
939   wishes to receive.

941   The "priority" attribute is an integer which indicates a relative
942   priority between Captures. For example it is possible to assign a
943   priority between two presentation Captures that would allow a
944   remote Endpoint to determine which presentation is more important.
945   Priority is assigned at the individual Capture level. It represents
946   the Provider's view of the relative priority between Captures with
947   a priority. The same priority number MAY be used across multiple
948   Captures. It indicates they are equally important. If no priority
949   is assigned, no assumptions regarding the relative importance of
950   the Capture can be made.

952   7.1.1.13. Embedded Text

954   The Embedded Text attribute indicates that a Capture provides
955   embedded textual information. For example the video Capture MAY
956   contain speech to text information composed with the video image.

958   7.1.1.14. Related To

960   The Related To attribute indicates the Capture contains additional
961   complementary information related to another Capture. The value
962   indicates the identity of the other Capture to which this Capture
963   is providing additional information.

965   For example, a conference can utilize translators or facilitators
966   that provide an additional audio stream (i.e. a translation or
967   description or commentary of the conference). Where multiple
968   captures are available, it may be advantageous for a Consumer to
969   select a complementary Capture instead of or in addition to a
970   Capture it relates to.

972   7.2. Multiple Content Capture

974   The MCC indicates that one or more Single Media Captures are
975   multiplexed (temporally and/or spatially) or mixed in one Media
976   Capture. Only one Capture type (e.g. audio or video) is
977   allowed in each MCC instance. The MCC may contain a reference to
978   the Single Media Captures (which may have their own attributes) as
979   well as attributes associated with the MCC itself. A MCC may also
980   contain other MCCs. The MCC MAY reference Captures from within the
981   Capture Scene that defines it or from other Capture Scenes. No
982   ordering is implied by the order that Captures appear within a MCC.
983   A MCC MAY contain no references to other Captures to indicate that
984   the MCC contains content from multiple sources but no information
985   regarding those sources is given. MCCs either contain the
986   referenced Captures and no others, or have no referenced captures
987   and therefore may contain any Capture.

989   One or more MCCs may also be specified in a CSV. This allows an
990   Advertiser to indicate that several MCC captures are used to
991   represent a capture scene. Table 14 provides an example of this
992   case.

994   As outlined in section 7.1. each instance of the MCC has its own
995   Capture identity i.e. MCC1. It allows all the individual captures
996   contained in the MCC to be referenced by a single MCC identity.

998   The example below shows the use of a Multiple Content Capture:

1000   +-----------------------+---------------------------------+
1001   | Capture Scene #1      |                                 |
1002   +-----------------------|---------------------------------+
1003   | VC1                   | {MC attributes}                 |
1004   | VC2                   | {MC attributes}                 |
1005   | VC3                   | {MC attributes}                 |
1006   | MCC1(VC1,VC2,VC3)     | {MC and MCC attributes}         |
1007   | CSV(MCC1)             |                                 |
1008   +---------------------------------------------------------+

1010   Table 1: Multiple Content Capture concept

1012   This indicates that MCC1 is a single capture that contains the
1013   Captures VC1, VC2 and VC3 according to any MCC1 attributes.

1015   7.2.1. MCC Attributes

1017   Media Capture Attributes may be associated with the MCC instance
1018   and the Single Media Captures that the MCC references. A Provider
1019   should avoid providing conflicting attribute values between the MCC
1020   and Single Media Captures. Where there is conflict the attributes
1021   of the MCC override any that may be present in the individual
1022   Captures.

1024   A Provider MAY include as much or as little of the original source
1025   Capture information as it requires.

1027   There are MCC specific attributes that MUST only be used with
1028   Multiple Content Captures. These are described in the sections
1029   below. The attributes described in section 7.1.1. MAY also be used
1030   with MCCs.

1032   The spatial related attributes of an MCC indicate its area of
1033   capture and point of capture within the scene, just like any other
1034   media capture. The spatial information does not imply anything
1035   about how other captures are composed within an MCC.

1037   For example: A virtual scene could be constructed for the MCC
1038   capture with two Video Captures with a "MaxCaptures" attribute set
1039   to 2 and an "Area of Capture" attribute provided with an overall
1040   area. Each of the individual Captures could then also include an
1041   "Area of Capture" attribute with a sub-set of the overall area.
1042   The Consumer would then know how each capture is related to others
1043   within the scene, but not the relative position of the individual
1044   captures within the composed capture.

1046   +-----------------------+---------------------------------+
1047   | Capture Scene #1      |                                 |
1048   +-----------------------|---------------------------------+
1049   | VC1                   | AreaofCapture=(0,0,0)(9,0,0)    |
1050   |                       |               (0,0,9)(9,0,9)    |
1051   | VC2                   | AreaofCapture=(10,0,0)(19,0,0)  |
1052   |                       |               (10,0,9)(19,0,9)  |
1053   | MCC1(VC1,VC2)         | MaxCaptures=2                   |
1054   |                       | AreaofCapture=(0,0,0)(19,0,0)   |
1055   |                       |               (0,0,9)(19,0,9)   |
1056   | CSV(MCC1)             |                                 |
1057   +---------------------------------------------------------+

1059   Table 2: Example of MCC and Single Media Capture attributes

1061   The sub-sections below describe the MCC only attributes.

1063   7.2.1.1. Maximum Number of Captures within a MCC

1065   The Maximum Number of Captures MCC attribute indicates the maximum
1066   number of individual Captures that may appear in a Capture Encoding
1067   at a time. The actual number at any given time can be less than or
1068   equal to this maximum. It may be used to derive how the Single
1069   Media Captures within the MCC are composed / switched with regards
1070   to space and time.
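The MaxCaptures cases enumerated in this section can be sketched as a small classifier. This is an illustrative sketch, not part of the CLUE data model: the function name `classify_mcc`, the `exact` flag (standing for "=" versus "<="), the use of `None` for an MCC that lists no source Captures, and the returned labels are all assumptions introduced here. Treating a MaxCaptures value larger than the number of Captures the same as an equal value is also an assumption.

```python
from typing import Optional


def classify_mcc(max_captures: int, exact: bool,
                 num_captures: Optional[int]) -> str:
    """Classify an MCC as 'switched', 'composed' or 'switched and composed'.

    max_captures: the MaxCaptures value N (hypothetical representation).
    exact: True for "= N", False for "<= N".
    num_captures: number of Captures referenced by the MCC, or None when
        the MCC does not list its source Captures.
    """
    if max_captures < 1:
        raise ValueError("MaxCaptures must be at least 1")
    if max_captures == 1 and (num_captures is None or num_captures > 1):
        # At most one source appears at a time (possibly none with "<=").
        return "switched"
    if num_captures is not None and max_captures >= num_captures:
        # Every source fits; "=" means all are always composed together.
        return "composed" if exact else "switched and composed"
    # More sources than slots, or an unknown number of sources.
    return "switched and composed"
```

A Consumer-side implementation might use such a label only as a rendering hint, since the Provider remains free to choose which sources appear at any given time.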
1072   A Provider can indicate that the number of Captures in a MCC
1073   Capture Encoding is equal "=" to the MaxCaptures value or that
1074   there may be any number of Captures up to and including "<=" the
1075   MaxCaptures value. This allows a Provider to distinguish between a
1076   MCC that purely represents a composition of sources versus a MCC
1077   that represents switched or switched and composed sources.

1079   MaxCaptures MAY be set to one so that only content related to one
1080   of the sources is shown in the MCC Capture Encoding at a time or
1081   it may be set to any value up to the total number of Source Media
1082   Captures in the MCC.

1084   The bullets below describe how the setting of MaxCaptures versus
1085   the number of Captures in the MCC affects how sources appear in a
1086   Capture Encoding:

1088   . When MaxCaptures is set to <= 1 and the number of Captures in
1089     the MCC is greater than 1 (or not specified) this
1090     is a switched case. Zero or 1 Captures may be switched into
1091     the Capture Encoding. Note: zero is allowed because of the
1092     "<=".
1093   . When MaxCaptures is set to = 1 and the number of Captures in
1094     the MCC is greater than 1 (or not specified) this
1095     is a switched case. Only one Capture source is contained in a
1096     Capture Encoding at a time.
1097   . When MaxCaptures is set to <= N (with N > 1) and the number of
1098     Captures in the MCC is greater than N (or not specified) this
1099     is a switched and composed case. The Capture Encoding may
1100     contain purely switched sources (e.g. <=2 allows for 1 source
1101     on its own), or may contain composed and switched sources
1102     (e.g. a composition of 2 sources switched between the
1103     sources).
1104   . When MaxCaptures is set to = N (with N > 1) and the number of
1105     Captures in the MCC is greater than N (or not specified) this
1106     is a switched and composed case. The Capture Encoding contains
1107     composed and switched sources (i.e.
a composition of N sources 1108 switched between the sources). It is not possible to have a 1109 single source. 1110 . When MaxCaptures is set to <= to the number of Captures in the 1111 MCC this is a switched and composed case. The Capture Encoding 1112 may contain media switched between any number (up to the 1113 MaxCaptures) of composed sources. 1115 . When MaxCaptures is set to = to the number of Captures in the 1116 MCC this is a composed case. All the sources are composed into 1117 a single Capture Encoding. 1119 If this attribute is not set then as default it is assumed that all 1120 source media capture content can appear concurrently in the Capture 1121 Encoding associated with the MCC. 1123 For example: The use of MaxCaptures equal to 1 on a MCC with three 1124 Video Captures VC1, VC2 and VC3 would indicate that the Advertiser 1125 in the Capture Encoding would switch between VC1, VC2 or VC3 as 1126 there may be only a maximum of one Capture at a time. 1128 7.2.1.2. Policy 1130 The Policy MCC Attribute indicates the criteria that the Provider 1131 uses to determine when and/or where media content appears in the 1132 Capture Encoding related to the MCC. 1134 The attribute is in the form of a token that indicates the policy 1135 and an index representing an instance of the policy. The same 1136 index value can be used for multiple MCCs. 1138 The tokens are: 1140 SoundLevel - This indicates that the content of the MCC is 1141 determined by a sound level detection algorithm. The loudest 1142 (active) speaker (or a previous speaker, depending on the index 1143 value) is contained in the MCC. 1145 RoundRobin - This indicates that the content of the MCC is 1146 determined by a time based algorithm. For example: the Provider 1147 provides content from a particular source for a period of time and 1148 then provides content from another source and so on. 1150 An index is used to represent an instance in the policy setting. 
An
1151   index of 0 represents the most current instance of the policy, i.e.
1152   the active speaker, 1 represents the previous instance, i.e. the
1153   previous active speaker and so on.

1155   The following example shows a case where the Provider provides two
1156   media streams, one showing the active speaker and a second stream
1157   showing the previous speaker.

1159   +-----------------------+---------------------------------+
1160   | Capture Scene #1      |                                 |
1161   +-----------------------|---------------------------------+
1162   | VC1                   |                                 |
1163   | VC2                   |                                 |
1164   | MCC1(VC1,VC2)         | Policy=SoundLevel:0             |
1165   |                       | MaxCaptures=1                   |
1166   | MCC2(VC1,VC2)         | Policy=SoundLevel:1             |
1167   |                       | MaxCaptures=1                   |
1168   | CSV(MCC1,MCC2)        |                                 |
1169   +---------------------------------------------------------+

1171   Table 3: Example Policy MCC attribute usage

1173   7.2.1.3. Synchronisation Identity

1175   The Synchronisation Identity MCC attribute indicates how the
1176   individual Captures in multiple MCC Captures are synchronised. To
1177   indicate that the Capture Encodings associated with MCCs contain
1178   Captures from the same source at the same time a Provider should
1179   set the same Synchronisation Identity on each of the concerned
1180   MCCs. It is the Provider that determines what the source for the
1181   Captures is, so a Provider can choose how to group together Single
1182   Media Captures into a combined "source" for the purpose of
1183   switching them together to keep them synchronized according to the
1184   SynchronisationID attribute. For example when the Provider is in
1185   an MCU it may determine that each separate CLUE Endpoint is a
1186   remote source of media. The Synchronisation Identity may be used
1187   across media types, i.e. to synchronize audio and video related
1188   MCCs.

1190   Without this attribute it is assumed that multiple MCCs may provide
1191   content from different sources at any particular point in time.
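As a hedged illustration of how a Consumer might interpret this attribute, the sketch below groups MCC identities by their SynchronisationID. The function name, the dictionary representation of an Advertisement, and the use of `None` for an MCC without the attribute are assumptions made for this example; they are not defined by CLUE. The sample values mirror the Synchronisation Identity example in this section.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_sync_id(mccs: Dict[str, Optional[int]]) -> Dict[int, List[str]]:
    """Group MCC identities by SynchronisationID.

    MCCs without the attribute (None) are left out: they may provide
    content from different sources at any particular point in time.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for mcc_id, sync_id in mccs.items():
        if sync_id is not None:
            groups[sync_id].append(mcc_id)
    return dict(groups)


# Hypothetical flattened view of the advertised MCCs and their sync IDs.
advertised = {"MCC1": 1, "MCC2": 1, "MCC3": None, "MCC4": 1}
synchronised = group_by_sync_id(advertised)
```

Here MCC1, MCC2 and MCC4 end up in the same group, so their Encodings are expected to switch sources together, while MCC3 is unconstrained.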
1193   For example:

1195   +=======================+=================================+
1196   | Capture Scene #1      |                                 |
1197   +-----------------------|---------------------------------+
1198   | VC1                   | Description=Left                |
1199   | VC2                   | Description=Centre              |
1200   | VC3                   | Description=Right               |
1201   | AC1                   | Description=Room                |
1202   | CSV(VC1,VC2,VC3)      |                                 |
1203   | CSV(AC1)              |                                 |
1204   +=======================+=================================+
1205   | Capture Scene #2      |                                 |
1206   +-----------------------|---------------------------------+
1207   | VC4                   | Description=Left                |
1208   | VC5                   | Description=Centre              |
1209   | VC6                   | Description=Right               |
1210   | AC2                   | Description=Room                |
1211   | CSV(VC4,VC5,VC6)      |                                 |
1212   | CSV(AC2)              |                                 |
1213   +=======================+=================================+
1214   | Capture Scene #3      |                                 |
1215   +-----------------------|---------------------------------+
1216   | VC7                   |                                 |
1217   | AC3                   |                                 |
1218   +=======================+=================================+
1219   | Capture Scene #4      |                                 |
1220   +-----------------------|---------------------------------+
1221   | VC8                   |                                 |
1222   | AC4                   |                                 |
1223   +=======================+=================================+
1224   | Capture Scene #5      |                                 |
1225   +-----------------------|---------------------------------+
1226   | MCC1(VC1,VC4,VC7)     | SynchronisationID=1             |
1227   |                       | MaxCaptures=1                   |
1228   | MCC2(VC2,VC5,VC8)     | SynchronisationID=1             |
1229   |                       | MaxCaptures=1                   |
1230   | MCC3(VC3,VC6)         | MaxCaptures=1                   |
1231   | MCC4(AC1,AC2,AC3,AC4) | SynchronisationID=1             |
1232   |                       | MaxCaptures=1                   |
1233   | CSV(MCC1,MCC2,MCC3)   |                                 |
1234   | CSV(MCC4)             |                                 |
1235   +=======================+=================================+

1237   Table 4: Example Synchronisation Identity MCC attribute usage

1239   The above Advertisement would indicate that MCC1, MCC2, MCC3 and
1240   MCC4 make up a Capture Scene. There would be four Capture
1241   Encodings (one for each MCC).
Because MCC1 and MCC2 have the same 1242 SynchronisationID, each Encoding from MCC1 and MCC2 respectively 1243 would together have content from only Capture Scene 1 or only 1244 Capture Scene 2 or the combination of VC7 and VC8 at a particular 1245 point in time. In this case the Provider has decided the sources 1246 to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 1247 together. The Encoding from MCC3 would not be synchronised with 1248 MCC1 or MCC2. As MCC4 also has the same Synchronisation Identity 1249 as MCC1 and MCC2 the content of the audio Encoding will be 1250 synchronised with the video content. 1252 7.2.1.4. Allow Subset Choice 1254 The Allow Subset Choice MCC attribute is a boolean value, 1255 indicating whether or not the Provider allows the Consumer to 1256 choose a specific subset of the Captures referenced by the MCC. 1257 If this attribute is true, and the MCC references other Captures, 1258 then the Consumer MAY select (in a Configure message) a specific 1259 subset of those Captures to be included in the MCC, and the 1260 Provider MUST then include only that subset. If this attribute is 1261 false, or the MCC does not reference other Captures, then the 1262 Consumer MUST NOT select a subset. 1264 7.3. Capture Scene 1266 In order for a Provider's individual Captures to be used 1267 effectively by a Consumer, the Provider organizes the Captures into 1268 one or more Capture Scenes, with the structure and contents of 1269 these Capture Scenes being sent from the Provider to the Consumer 1270 in the Advertisement. 1272 A Capture Scene is a structure representing a spatial region 1273 containing one or more Capture Devices, each capturing media 1274 representing a portion of the region. A Capture Scene includes one 1275 or more Capture Scene Views (CSV), with each CSV including one or 1276 more Media Captures of the same media type. There can also be 1277 Media Captures that are not included in a Capture Scene View. 
A 1278 Capture Scene represents, for example, the video image of a group 1279 of people seated next to each other, along with the sound of their 1280 voices, which could be represented by some number of VCs and ACs in 1281 the Capture Scene Views. An MCU can also describe in Capture 1282 Scenes what it constructs from media Streams it receives. 1284 A Provider MAY advertise one or more Capture Scenes. What 1285 constitutes an entire Capture Scene is up to the Provider. A 1286 simple Provider might typically use one Capture Scene for 1287 participant media (live video from the room cameras) and another 1288 Capture Scene for a computer generated presentation. In more 1289 complex systems, the use of additional Capture Scenes is also 1290 sensible. For example, a classroom may advertise two Capture 1291 Scenes involving live video, one including only the camera 1292 capturing the instructor (and associated audio), the other 1293 including camera(s) capturing students (and associated audio). 1295 A Capture Scene MAY (and typically will) include more than one type 1296 of media. For example, a Capture Scene can include several Capture 1297 Scene Views for Video Captures, and several Capture Scene Views for 1298 Audio Captures. A particular Capture MAY be included in more than 1299 one Capture Scene View. 1301 A Provider MAY express spatial relationships between Captures that 1302 are included in the same Capture Scene. However, there is no 1303 spatial relationship between Media Captures from different Capture 1304 Scenes. In other words, Capture Scenes each use their own spatial 1305 measurement system as outlined above in section 6. 1307 A Provider arranges Captures in a Capture Scene to help the 1308 Consumer choose which captures it wants to render. The Capture 1309 Scene Views in a Capture Scene are different alternatives the 1310 Provider is suggesting for representing the Capture Scene. Each 1311 Capture Scene View is given an advertisement unique identity. 
The order of Capture Scene Views within a Capture Scene has no
significance. The Media Consumer can choose to receive all Media
Captures from one Capture Scene View for each media type (e.g.
audio and video), or it can pick and choose Media Captures
regardless of how the Provider arranges them in Capture Scene
Views. Different Capture Scene Views of the same media type are
not necessarily mutually exclusive alternatives. Also note that
the presence of multiple Capture Scene Views (with potentially
multiple encoding options in each view) in a given Capture Scene
does not necessarily imply that a Provider is able to serve all the
associated media simultaneously (although the construction of such
an over-rich Capture Scene is probably not sensible in many cases).
What a Provider can send simultaneously is determined through the
Simultaneous Transmission Set mechanism, described in section 8.

Captures within the same Capture Scene View MUST be of the same
media type - it is not possible to mix audio and video captures in
the same Capture Scene View, for instance. The Provider MUST be
capable of encoding and sending all Captures (that have an encoding
group) in a single Capture Scene View simultaneously. The order of
Captures within a Capture Scene View has no significance. A
Consumer can decide to receive all the Captures in a single Capture
Scene View, but a Consumer could also decide to receive just a
subset of those captures. A Consumer can also decide to receive
Captures from different Capture Scene Views, all subject to the
constraints set by Simultaneous Transmission Sets, as discussed in
section 8.

When a Provider advertises a Capture Scene with multiple CSVs, it
is essentially signaling that there are multiple representations of
the same Capture Scene available.
In some cases, these multiple views would be used simultaneously
(for instance a "video view" and an "audio view"). In some cases
the views would conceptually be alternatives (for instance a view
consisting of three Video Captures covering the whole room versus a
view consisting of just a single Video Capture covering only the
center of a room). In this latter example, one sensible choice for
a Consumer would be to indicate (through its Configure and possibly
through an additional offer/answer exchange) the Captures of that
Capture Scene View that most closely match the Consumer's number of
display devices or screen layout.

The following is an example of 4 potential Capture Scene Views for
an endpoint-style Provider:

1. (VC0, VC1, VC2) - left, center and right camera Video Captures

2. (MCC3) - Video Capture associated with loudest room segment

3. (VC4) - Video Capture zoomed out view of all people in the room

4. (AC0) - main audio

The first view in this Capture Scene example is a list of Video
Captures which have a spatial relationship to each other.
Determination of the order of these captures (VC0, VC1 and VC2) for
rendering purposes is accomplished through use of their Area of
Capture attributes. The second view (MCC3) and the third view
(VC4) are alternative representations of the same room's video,
which might be better suited to some Consumers' rendering
capabilities. The inclusion of the Audio Capture in the same
Capture Scene indicates that AC0 is associated with all of those
Video Captures, meaning it comes from the same spatial region.
Therefore, if audio were to be rendered at all, this audio would be
the correct choice irrespective of which Video Captures were
chosen.

7.3.1. Capture Scene attributes

Capture Scene Attributes can be applied to Capture Scenes as well
as to individual media captures. Attributes specified at this
level apply to all constituent Captures. Capture Scene attributes
include:

. Human-readable description of the Capture Scene, which could
  be in multiple languages;
. xCard scene information;
. Scale information (millimeters, unknown, no scale), as
  described in Section 6.

7.3.1.1. Scene Information

The Scene information attribute provides information regarding the
Capture Scene rather than individual participants. The Provider
may gather the information automatically or manually from a
variety of sources. The scene information attribute allows a
Provider to indicate information such as organizational or
geographic information, allowing a Consumer to determine which
Capture Scenes are of interest in order to then perform Capture
selection. It also allows a Consumer to render information
regarding the Scene or to use it for further processing.

As per Section 7.1.1.10, the xCard format is used to convey this
information and the Provider may supply a minimal set of
information or a larger set of information.

In order to keep CLUE messages compact the Provider SHOULD use a
URI to point to any LOGO, PHOTO or SOUND contained in the xCARD
rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE
message.

7.3.2. Capture Scene View attributes

A Capture Scene can include one or more Capture Scene Views in
addition to the Capture Scene wide attributes described above.
Capture Scene View attributes apply to the Capture Scene View as a
whole, i.e. to all Captures that are part of the Capture Scene
View.

Capture Scene View attributes include:

. Human-readable description (which could be in multiple
  languages) of the Capture Scene View

7.4. Global View List

An Advertisement can include an optional Global View list. Each
item in this list is a Global View. The Provider can include
multiple Global Views, to allow a Consumer to choose sets of
captures appropriate to its capabilities or application. How the
Provider composes these suggestions, each of which represents all
the scenes for which the Provider can send media, is up to the
Provider. This is very similar to how each CSV represents a
particular scene.

As an example, suppose an advertisement has three scenes, and each
scene has three CSVs, ranging from one to three video captures in
each CSV. The Provider is advertising a total of nine video
Captures across three scenes. The Provider can use the Global
View list to suggest alternatives for Consumers that can't receive
all nine video Captures as separate media streams. For
accommodating a Consumer that wants to receive three video
Captures, a Provider might suggest a Global View containing just a
single CSV with three Captures and nothing from the other two
scenes. Or a Provider might suggest a Global View containing
three different CSVs, one from each scene, with a single video
Capture in each.

Some additional rules:

. The ordering of Global Views in the Global View list is
  insignificant.
. The ordering of CSVs within each Global View is
  insignificant.
. A particular CSV may be used in multiple Global Views.
. The Provider must be capable of encoding and sending all
  Captures within the CSVs of a given Global View
  simultaneously.

The following figure shows an example of the structure of Global
Views in a Global View List.

........................................................
. Advertisement                                        .
.                                                      .
. +--------------+         +-------------------------+ .
. |Scene 1       |         |Global View List         | .
. |              |         |                         | .
. | CSV1 (v)<----------------- Global View (CSV 1)   | .
. |         <-------.      |                         | .
. |              |  *--------- Global View (CSV 1,5) | .
. | CSV2 (v)     |  |      |                         | .
. |              |  |      |                         | .
. | CSV3 (v)<---------*------- Global View (CSV 3,5) | .
. |              |  | |    |                         | .
. | CSV4 (a)<----------------- Global View (CSV 4)   | .
. |         <-----------.  |                         | .
. +--------------+  | | *----- Global View (CSV 4,6) | .
.                   | | |  |                         | .
. +--------------+  | | |  +-------------------------+ .
. |Scene 2       |  | | |                              .
. |              |  | | |                              .
. | CSV5 (v)<-------' | |                              .
. |         <---------' |                              .
. |              |      |       (v) = video            .
. | CSV6 (a)<-----------'       (a) = audio            .
. |              |                                     .
. +--------------+                                     .
`......................................................'

Figure 3: Global View List Structure

8. Simultaneous Transmission Set Constraints

In many practical cases, a Provider has constraints or limitations
on its ability to send Captures simultaneously. One type of
limitation is caused by the physical limitations of capture
mechanisms; these constraints are represented by a Simultaneous
Transmission Set. The second type of limitation reflects the
encoding resources available, such as bandwidth or video encoding
throughput (macroblocks/second). This type of constraint is
captured by Individual Encodings and Encoding Groups, discussed
below.

Some Endpoints or MCUs can send multiple Captures simultaneously;
however sometimes there are constraints that limit which Captures
can be sent simultaneously with other Captures. A device may not
be able to be used in different ways at the same time. Provider
Advertisements are made so that the Consumer can choose one of
several possible mutually exclusive usages of the device. This
type of constraint is expressed in a Simultaneous Transmission Set,
which lists all the Captures of a particular media type (e.g.
audio, video, text) that can be sent at the same time. There are
different Simultaneous Transmission Sets for each media type in the
Advertisement. This is easier to show in an example.

Consider the example of a room system where there are three cameras
each of which can send a separate Capture covering two persons
each: VC0, VC1, and VC2. The middle camera can also zoom out
(using an optical zoom lens) and show all six persons, VC3. But
the middle camera cannot be used in both modes at the same time -
it has to either show the space where two participants sit or the
whole six seats, but not both at the same time. As a result, VC1
and VC3 cannot be sent simultaneously.

Simultaneous Transmission Sets are expressed as sets of the Media
Captures that the Provider could transmit at the same time (though,
in some cases, it is not intuitive to do so). If a Multiple
Content Capture is included in a Simultaneous Transmission Set it
indicates that the Capture Encoding associated with it could be
transmitted at the same time as the other Captures within the
Simultaneous Transmission Set. It does not imply that the Single
Media Captures contained in the Multiple Content Capture could all
be transmitted at the same time.

In this example the two Simultaneous Transmission Sets are shown in
Table 5. If a Provider advertises one or more mutually exclusive
Simultaneous Transmission Sets, then for each media type the
Consumer MUST ensure that it chooses Media Captures that lie wholly
within one of those Simultaneous Transmission Sets.

+-------------------+
| Simultaneous Sets |
+-------------------+
| {VC0, VC1, VC2}   |
| {VC0, VC3, VC2}   |
+-------------------+

Table 5: Two Simultaneous Transmission Sets

A Provider OPTIONALLY can include the Simultaneous Transmission
Sets in its Advertisement.
These constraints apply across all the Capture Scenes in the
Advertisement. It is a syntax conformance requirement that the
Simultaneous Transmission Sets MUST allow all the media Captures in
any particular Capture Scene View to be used simultaneously.
Similarly, the Simultaneous Transmission Sets MUST reflect the
simultaneity expressed by any Global View.

For shorthand convenience, a Provider MAY describe a Simultaneous
Transmission Set in terms of Capture Scene Views and Capture
Scenes. If a Capture Scene View is included in a Simultaneous
Transmission Set, then all Media Captures in the Capture Scene View
are included in the Simultaneous Transmission Set. If a Capture
Scene is included in a Simultaneous Transmission Set, then all its
Capture Scene Views (of the corresponding media type) are included
in the Simultaneous Transmission Set. The end result reduces to a
set of Media Captures, of a particular media type, in either case.

If an Advertisement does not include Simultaneous Transmission
Sets, then the Provider MUST be able to simultaneously provide all
the Captures from any one CSV of each media type from each Capture
Scene. Likewise, if there are no Simultaneous Transmission Sets
and there is a Global View list, then the Provider MUST be able to
simultaneously provide all the Captures from any particular Global
View (of each media type) from the Global View list.

If an Advertisement includes multiple Capture Scene Views in a
Capture Scene then the Consumer MAY choose one Capture Scene View
for each media type, or MAY choose individual Captures based on the
Simultaneous Transmission Sets.

9. Encodings

Individual encodings and encoding groups are CLUE's mechanisms
allowing a Provider to signal its limitations for sending Captures,
or combinations of Captures, to a Consumer.
Consumers can map the Captures they want to receive onto the
Encodings, with the encoding parameters they want. As for the
relationship between the CLUE-specified mechanisms based on
Encodings and the SIP offer/answer exchange, please refer to
section 5.

9.1. Individual Encodings

An Individual Encoding represents a way to encode a Media Capture
as a Capture Encoding, to be sent as an encoded media stream from
the Provider to the Consumer. An Individual Encoding has a set of
parameters characterizing how the media is encoded.

Different media types have different parameters, and different
encoding algorithms may have different parameters. An Individual
Encoding can be assigned to at most one Capture Encoding at any
given time.

Individual Encoding parameters are represented in SDP [RFC4566],
not in CLUE messages. For example, for a video encoding using
H.26x compression technologies, this can include parameters such
as:

. Maximum bandwidth;
. Maximum picture size in pixels;
. Maximum number of pixels to be processed per second;

The bandwidth parameter is the only one that specifically relates
to a CLUE Advertisement, as it can be further constrained by the
maximum group bandwidth in an Encoding Group.

9.2. Encoding Group

An Encoding Group includes a set of one or more Individual
Encodings, and parameters that apply to the group as a whole. By
grouping multiple individual Encodings together, an Encoding Group
describes additional constraints on bandwidth for the group. A
single Encoding Group MAY refer to Encodings for different media
types.

The Encoding Group data structure contains:

. Maximum bitrate for all encodings in the group combined;
. A list of identifiers for the Individual Encodings belonging
  to the group.

When the Individual Encodings in a group are instantiated into
Capture Encodings, each Capture Encoding has a bitrate that MUST be
less than or equal to the max bitrate for the particular Individual
Encoding. The "maximum bitrate for all encodings in the group"
parameter gives the additional restriction that the sum of all the
individual Capture Encoding bitrates MUST be less than or equal to
this group value.

The following diagram illustrates one example of the structure of a
media Provider's Encoding Groups and their contents.

,-------------------------------------------------.
|             Media Provider                      |
|                                                 |
|  ,--------------------------------------.       |
|  | ,--------------------------------------.     |
|  | | ,--------------------------------------.   |
|  | | |  Encoding Group                      |   |
|  | | | ,-----------.                        |   |
|  | | | |           | ,---------.            |   |
|  | | | |           | |         | ,---------.|   |
|  | | | | Encoding1 | |Encoding2| |Encoding3||   |
|  `.| | |           | |         | `---------'|   |
|    `.| `-----------' `---------'            |   |
|      `--------------------------------------'   |
`-------------------------------------------------'

Figure 4: Encoding Group Structure

A Provider advertises one or more Encoding Groups. Each Encoding
Group includes one or more Individual Encodings. Each Individual
Encoding can represent a different way of encoding media. For
example one Individual Encoding may be 1080p60 video, another could
be 720p30, with a third being CIF, all in, for example, H.264
format.

While a typical three codec/display system might have one Encoding
Group per "codec box" (physical codec, connected to one camera and
one screen), there are many possibilities for the number of
Encoding Groups a Provider may be able to offer and for the
encoding values in each Encoding Group.
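
The two bitrate rules stated in this section can be illustrated
with a minimal Python sketch. It is not part of the CLUE data
model: the class names, field names, and the function
bitrates_are_valid are hypothetical, chosen only for this
illustration, and bitrates are assumed to be expressed in bits per
second.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IndividualEncoding:
    # Hypothetical representation of one Individual Encoding.
    encode_id: str
    max_bandwidth: int  # assumed unit: bits per second


@dataclass(frozen=True)
class EncodingGroup:
    # Hypothetical representation of the Encoding Group data structure.
    group_id: str
    max_group_bandwidth: int  # maximum bitrate for all encodings combined
    encodings: tuple[IndividualEncoding, ...]


def bitrates_are_valid(group: EncodingGroup, requested: dict[str, int]) -> bool:
    """Check requested Capture Encoding bitrates against one Encoding Group.

    `requested` maps an Individual Encoding identifier to the bitrate of
    the Capture Encoding instantiated from it.
    """
    limits = {enc.encode_id: enc.max_bandwidth for enc in group.encodings}
    # Rule 1: each Capture Encoding stays within its Individual Encoding.
    for encode_id, bitrate in requested.items():
        if encode_id not in limits or bitrate > limits[encode_id]:
            return False
    # Rule 2: the sum of all Capture Encoding bitrates stays within the group.
    return sum(requested.values()) <= group.max_group_bandwidth
```

With a group limit of 6000000 and two Individual Encodings of
4000000 each (illustrative numbers), requesting 4000000 and 2000000
passes both rules, whereas requesting 4000000 on both passes the
per-encoding rule but exceeds the group limit.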

There is no requirement for all Encodings within an Encoding Group
to be instantiated at the same time.

9.3. Associating Captures with Encoding Groups

Each Media Capture, including MCCs, MAY be associated with one
Encoding Group. To be eligible for configuration, a Media Capture
MUST be associated with one Encoding Group, which is used to
instantiate that Capture into a Capture Encoding. When an MCC is
configured all the Media Captures referenced by the MCC will appear
in the Capture Encoding according to the attributes of the chosen
encoding of the MCC. This allows an Advertiser to specify encoding
attributes associated with the Media Captures without the need to
provide an individual Capture Encoding for each of the inputs.

If an Encoding Group is assigned to a Media Capture referenced by
the MCC it indicates that this Capture may also have an individual
Capture Encoding.

For example:

+--------------------+------------------------------------+
| Capture Scene #1   |                                    |
+--------------------+------------------------------------+
| VC1                | EncodeGroupID=1                    |
| VC2                |                                    |
| MCC1(VC1,VC2)      | EncodeGroupID=2                    |
| CSV(VC1)           |                                    |
| CSV(MCC1)          |                                    |
+--------------------+------------------------------------+

Table 6: Example usage of Encoding with MCC and source Captures

This would indicate that VC1 may be sent as its own Capture
Encoding from EncodeGroupID=1 or that it may be sent as part of a
Capture Encoding from EncodeGroupID=2 along with VC2.

More than one Capture MAY use the same Encoding Group.

The maximum number of Capture Encodings that can result from a
particular Encoding Group constraint is equal to the number of
individual Encodings in the group. The actual number of Capture
Encodings used at any time MAY be less than this maximum.
Any of the Captures that use a particular Encoding Group can be
encoded according to any of the Individual Encodings in the group.

It is a protocol conformance requirement that the Encoding Groups
MUST allow all the Captures in a particular Capture Scene View to
be used simultaneously.

10. Consumer's Choice of Streams to Receive from the Provider

After receiving the Provider's Advertisement message (that includes
media captures and associated constraints), the Consumer composes
its reply to the Provider in the form of a Configure message. The
Consumer is free to use the information in the Advertisement as it
chooses, but there are a few obviously sensible design choices,
which are outlined below.

If multiple Providers connect to the same Consumer (i.e. in an
MCU-less multiparty call), it is the responsibility of the Consumer
to compose a Configure for each Provider that fulfills both that
Provider's constraints, as expressed in its Advertisement, and the
Consumer's own capabilities.

In an MCU-based multiparty call, the MCU can logically terminate
the Advertisement/Configure negotiation in that it can hide the
characteristics of the receiving endpoint and rely on its own
capabilities (transcoding/transrating/...) to create Media Streams
that can be decoded at the Endpoint Consumers. The timing of an
MCU's sending of Advertisements (for its outgoing ports) and
Configures (for its incoming ports, in response to Advertisements
received there) is up to the MCU and implementation dependent.

As a general outline, a Consumer can choose, based on the
Advertisement it has received, which Captures it wishes to receive,
and which Individual Encodings it wants the Provider to use to
encode the Captures.
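
As an illustration of how the Encoding Group rules from section 9
bear on this choice, the following Python sketch checks a list of
(Capture, Individual Encoding) pairs. It is only a sketch under
stated assumptions: the function name check_encoding_choices and
its data layout are hypothetical and are not CLUE message syntax.
It applies three statements from section 9: a Capture needs an
Encoding Group to be configurable, a Capture can use any Individual
Encoding of its group, and an Individual Encoding can be assigned
to at most one Capture Encoding at a time.

```python
def check_encoding_choices(capture_group, group_encodings, choices):
    """Return a list of problems found in (capture, encoding) choices.

    capture_group:   capture id -> Encoding Group id, or None if the
                     Capture has no Encoding Group (hypothetical layout).
    group_encodings: Encoding Group id -> set of Individual Encoding ids.
    choices:         iterable of (capture id, Individual Encoding id).
    """
    problems = []
    used = set()
    for capture, encoding in choices:
        group = capture_group.get(capture)
        if group is None:
            # Not associated with an Encoding Group: not eligible.
            problems.append(f"{capture}: no Encoding Group, cannot be configured")
        elif encoding not in group_encodings.get(group, ()):
            # The encoding must come from the Capture's own group.
            problems.append(f"{capture}: {encoding} is not in group {group}")
        if encoding in used:
            # One Individual Encoding serves at most one Capture Encoding.
            problems.append(f"{encoding}: already assigned to another Capture Encoding")
        used.add(encoding)
    return problems
```

Using the associations of Table 6, and assuming for illustration
that group 1 contains an encoding "ENC-A" and group 2 an encoding
"ENC-B", choosing VC1 with "ENC-A" and MCC1 with "ENC-B" yields no
problems, while choosing VC2 on its own is reported because VC2 has
no Encoding Group.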

On receipt of an Advertisement with an MCC the Consumer treats the
MCC as per other non-MCC Captures with the following differences:

- The Consumer would understand that the MCC is a Capture that
includes the referenced individual Captures (or any Captures, if
none are referenced) and that these individual Captures are
delivered as part of the MCC's Capture Encoding.

- The Consumer may utilise any of the attributes associated with
the referenced individual Captures and any Capture Scene attributes
from where the individual Captures were defined to choose Captures
and for rendering decisions.

- If the MCC attribute Allow Subset Choice is true, then the
Consumer may or may not choose to receive all the indicated
Captures. It can choose to receive a subset of the Captures
indicated by the MCC.

For example if the Consumer receives:

MCC1(VC1,VC2,VC3){attributes}

A Consumer could choose all the Captures within an MCC; however, if
the Consumer determines that it doesn't want VC3 it can return
MCC1(VC1,VC2). If it wants all the individual Captures then it
returns only the MCC identity (i.e. MCC1). If the MCC in the
advertisement does not reference any individual captures, or the
Allow Subset Choice attribute is false, then the Consumer cannot
choose what is included in the MCC; it is up to the Provider to
decide.

A Configure Message includes a list of Capture Encodings. These
are the Capture Encodings the Consumer wishes to receive from the
Provider. Each Capture Encoding refers to one Media Capture and
one Individual Encoding.

For each Capture the Consumer wants to receive, it configures one
of the Encodings in that Capture's Encoding Group. The Consumer
does this by telling the Provider, in its Configure Message, which
Encoding to use for each chosen Capture.
Upon receipt of this Configure from the Consumer, common knowledge
is established between Provider and Consumer regarding sensible
choices for the media streams. The setup of the actual media
channels, at least in the simplest case, is left to a following
offer/answer exchange. Optimized implementations MAY speed up the
reaction to the offer/answer exchange by reserving the resources at
the time of finalization of the CLUE handshake.

CLUE advertisements and configure messages don't necessarily
require a new SDP offer/answer for every CLUE message
exchange. But the resulting encodings sent via RTP must conform to
the most recent SDP offer/answer result.

In order to meaningfully create and send an initial Configure, the
Consumer needs to have received at least one Advertisement, and an
SDP offer defining the Individual Encodings, from the Provider.

In addition, the Consumer can send a Configure at any time during
the call. The Configure MUST be valid according to the most
recently received Advertisement. The Consumer can send a Configure
either in response to a new Advertisement from the Provider or on
its own, for example because of a local change in conditions
(people leaving the room, connectivity changes, multipoint related
considerations).

When choosing which Media Streams to receive from the Provider, and
the encoding characteristics of those Media Streams, the Consumer
advantageously takes several things into account: its local
preference, simultaneity restrictions, and encoding limits.

10.1. Local preference

A variety of local factors influence the Consumer's choice of
Media Streams to be received from the Provider:

o if the Consumer is an Endpoint, it is likely that it would
  choose, where possible, to receive video and audio Captures that
  match the number of display devices and audio system it has

o if the Consumer is an MCU, it MAY choose to receive loudest
  speaker streams (in order to perform its own media composition)
  and avoid pre-composed video Captures

o user choice (for instance, selection of a new layout) MAY result
  in a different set of Captures, or different encoding
  characteristics, being required by the Consumer

10.2. Physical simultaneity restrictions

Often there are physical simultaneity constraints of the Provider
that affect the Provider's ability to simultaneously send all of
the captures the Consumer would wish to receive. For instance, an
MCU, when connected to a multi-camera room system, might prefer to
receive both individual video streams of the people present in the
room and an overall view of the room from a single camera. Some
Endpoint systems might be able to provide both of these sets of
streams simultaneously, whereas others might not (if the overall
room view were produced by changing the optical zoom level on the
center camera, for instance).

10.3. Encoding and encoding group limits

Each of the Provider's encoding groups has limits on bandwidth,
and the constituent potential encodings have limits on the
bandwidth, computational complexity, video frame rate, and
resolution that can be provided.
When choosing the Captures to be received from a Provider, a
Consumer device MUST ensure that the encoding characteristics
requested for each individual Capture fit within the capability of
the encoding it is being configured to use, as well as ensuring
that the combined encoding characteristics for Captures fit within
the capabilities of their associated encoding groups. In some
cases, this could cause an otherwise "preferred" choice of capture
encodings to be passed over in favor of different Capture
Encodings--for instance, if a set of three Captures could only be
provided at a low resolution then a three screen device could
switch to favoring a single, higher quality, Capture Encoding.

11. Extensibility

One important characteristic of the Framework is its
extensibility. The standard for interoperability and handling
multiple streams must be future-proof. The framework itself is
inherently extensible through expanding the data model types. For
example:

o Adding more types of media, such as telemetry, can be done by
  defining additional types of Captures in addition to audio and
  video.

o Adding new functionalities, such as 3-D video Captures, say, may
  require additional attributes describing the Captures.

The infrastructure is designed to be extended rather than
requiring new infrastructure elements. Extension comes through
adding to defined types.

12. Examples - Using the Framework (Informative)

This section gives some examples, first from the point of view of
the Provider, then the Consumer, then some multipoint scenarios.

12.1. Provider Behavior

This section shows some examples in more detail of how a Provider
can use the framework to represent a typical case for telepresence
rooms. First an endpoint is illustrated, then an MCU case is
shown.

12.1.1. Three screen Endpoint Provider

Consider an Endpoint with the following description:

3 cameras, 3 displays, a 6 person table

o Each camera can provide one Capture for each 1/3 section of the
  table

o A single Capture representing the active speaker can be provided
  (voice activity based camera selection to a given encoder input
  port implemented locally in the Endpoint)

o A single Capture representing the active speaker with the other
  2 Captures shown picture in picture within the stream can be
  provided (again, implemented inside the endpoint)

o A Capture showing a zoomed out view of all 6 seats in the room
  can be provided

The video and audio Captures for this Endpoint can be described as
follows.

Video Captures:

o VC0- (the left camera stream), encoding group=EG0, view=table

o VC1- (the center camera stream), encoding group=EG1, view=table

o VC2- (the right camera stream), encoding group=EG2, view=table

o MCC3- (the loudest panel stream), encoding group=EG1,
  view=table, MaxCaptures=1, policy=SoundLevel

o MCC4- (the loudest panel stream with PiPs), encoding group=EG1,
  view=room, MaxCaptures=3, policy=SoundLevel

o VC5- (the zoomed out view of all people in the room), encoding
  group=EG1, view=room

o VC6- (presentation stream), encoding group=EG1, presentation

The following diagram is a top view of the room with 3 cameras, 3
displays, and 6 seats. Each camera captures 2 people. The six
seats are not all in a straight line.

      ,-. d
     (   )`--.__            +---+
      `-' /     `--.__      |   |
    ,-.  |            `-.._ |_-+Camera 2 (VC2)
   (   ).'      <--(AC1)-+-''`+-+
    `-' |_...---''          |   |
    ,-.c+-..__              +---+
   (   )|     ``--..__      |   |
    `-' |             ``+-..|_-+Camera 1 (VC1)
    ,-. |       <--(AC2)..--'|+-+                   ^
   (   )|     __..--'       |   |                   |
    `-'b|..--'              +---+                   |X
    ,-. |``---..___         |   |                   |
   (   )\          ```--..._|_-+Camera 0 (VC0)      |
    `-'  \      <--(AC0) ..-''`-+                   |
     ,-.  \      __.--''    |   |        <----------+
    (   ) |..-''            +---+             Y
     `-' a                        (0,0,0) origin is under Camera 1

Figure 5: Room Layout Top View

The two points labeled b and c are intended to be at the midpoint
between the seating positions, and where the fields of view of the
cameras intersect.

The plane of interest for VC0 is a vertical plane that intersects
points 'a' and 'b'.

The plane of interest for VC1 intersects points 'b' and 'c'. The
plane of interest for VC2 intersects points 'c' and 'd'.

This example uses an area scale of millimeters.

Areas of capture:

      bottom left     bottom right   top left          top right
VC0   (-2011,2850,0)  (-673,3000,0)  (-2011,2850,757)  (-673,3000,757)
VC1   ( -673,3000,0)  ( 673,3000,0)  ( -673,3000,757)  ( 673,3000,757)
VC2   ( 673,3000,0)   (2011,2850,0)  ( 673,3000,757)   (2011,3000,757)
MCC3  (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,3000,757)
MCC4  (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,3000,757)
VC5   (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,3000,757)
VC6   none

Points of capture:

VC0   (-1678,0,800)
VC1   (0,0,800)
VC2   (1678,0,800)
MCC3  none
MCC4  none
VC5   (0,0,800)
VC6   none

In this example, the right edge of the VC0 area lines up with the
left edge of the VC1 area. It doesn't have to be this way. There
could be a gap or an overlap. One additional thing to note for
this example is the distance from a to b is equal to the distance
from b to c and the distance from c to d. All these distances are
1346 mm. This is the planar width of each area of capture for VC0,
VC1, and VC2.

Note the text in parentheses (e.g. "the left camera stream") is
not explicitly part of the model, it is just explanatory text for
this example, and is not included in the model with the media
captures and attributes.
   Also, MCC4 doesn't say anything about how a capture is composed, so
   the media consumer can't tell based on this capture that MCC4 is
   composed of a "loudest panel with PiPs".

   Audio Captures:

   Three ceiling microphones are located between the cameras and the
   table, at the same height as the cameras. The microphones point
   down at an angle toward the seating positions.

   o  AC0 (left), encoding group=EG3

   o  AC1 (right), encoding group=EG3

   o  AC2 (center), encoding group=EG3

   o  AC3 being a simple pre-mixed audio stream from the room (mono),
      encoding group=EG3

   o  AC4 audio stream associated with the presentation video (mono),
      encoding group=EG3, presentation

        Point of capture:    Point on Line of Capture:

   AC0  (-1342,2000,800)     (-1342,2925,379)
   AC1  ( 1342,2000,800)     ( 1342,2925,379)
   AC2  (    0,2000,800)     (    0,3000,379)
   AC3  (    0,2000,800)     (    0,3000,379)
   AC4  none

   The physical simultaneity information is:

      Simultaneous transmission set #1 {VC0, VC1, VC2, MCC3, MCC4,
      VC6}

      Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

   This constraint indicates it is not possible to use all the VCs at
   the same time. VC5 cannot be used at the same time as VC1 or MCC3
   or MCC4. Also, using every member in the set simultaneously may
   not make sense - for example MCC3 (loudest) and MCC4 (loudest with
   PiP). In addition, there are encoding constraints that make
   choosing all of the VCs in a set impossible. VC1, MCC3, MCC4,
   VC5, VC6 all use EG1 and EG1 has only 3 ENCs. This constraint
   shows up in the encoding groups, not in the simultaneous
   transmission sets.

   In this example there are no restrictions on which Audio Captures
   can be sent simultaneously.

   Encoding Groups:

   This example has three encoding groups associated with the video
   captures. Each group can have 3 encodings, but with each
   potential encoding having a progressively lower specification. In
   this example, 1080p60 transmission is possible (as ENC0 has a
   maxPps value compatible with that). Significantly, as up to 3
   encodings are available per group, it is possible to transmit some
   video Captures simultaneously that are not in the same view in the
   Capture Scene. For example VC1 and MCC3 at the same time. The
   information below about Encodings is a summary of what would be
   conveyed in SDP, not directly in the CLUE Advertisement.

   encodeGroupID=EG0, maxGroupBandwidth=6000000
       encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupBandwidth=6000000
       encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000
   encodeGroupID=EG2, maxGroupBandwidth=6000000
       encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                      maxPps=124416000, maxBandwidth=4000000
       encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                      maxPps=27648000, maxBandwidth=4000000
       encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                      maxPps=15552000, maxBandwidth=4000000

             Figure 6: Example Encoding Groups for Video

   For audio, there are five potential encodings available, so all
   five Audio Captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupBandwidth=320000
       encodeID=ENC9, maxBandwidth=64000
       encodeID=ENC10, maxBandwidth=64000
       encodeID=ENC11, maxBandwidth=64000
       encodeID=ENC12, maxBandwidth=64000
       encodeID=ENC13, maxBandwidth=64000

             Figure 7: Example Encoding Group for Audio

   Capture Scenes:

   The following table represents the Capture Scenes for this
   Provider. Recall that a Capture Scene is composed of alternative
   Capture Scene Views covering the same spatial region. Capture
   Scene #1 is for the main people captures, and Capture Scene #2 is
   for presentation.

   Each row in the table is a separate Capture Scene View.

                        +------------------+
                        | Capture Scene #1 |
                        +------------------+
                        | VC0, VC1, VC2    |
                        | MCC3             |
                        | MCC4             |
                        | VC5              |
                        | AC0, AC1, AC2    |
                        | AC3              |
                        +------------------+

                        +------------------+
                        | Capture Scene #2 |
                        +------------------+
                        | VC6              |
                        | AC4              |
                        +------------------+

                Table 7: Example Capture Scene Views

   Different Capture Scenes are distinct from each other, and are
   non-overlapping. A Consumer can choose a view from each Capture
   Scene. In this case the three Captures VC0, VC1, and VC2 are one
   way of representing the video from the Endpoint. These three
   Captures should appear adjacent to each other. Alternatively,
   another way of representing the Capture Scene is with the capture
   MCC3, which automatically shows the person who is talking. MCC4
   and VC5 are further alternatives of the same kind.

   As in the video case, the different views of audio in Capture
   Scene #1 represent the "same thing", in that one way to receive
   the audio is with the 3 Audio Captures (AC0, AC1, AC2), and
   another way is with the mixed AC3. The Media Consumer can choose
   an audio CSV it is capable of receiving.
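
   The constraints of this example (simultaneous transmission sets
   plus the number of encodings in each encoding group) can be
   checked mechanically. The following Python sketch is illustrative
   only and is not part of CLUE: the names SIMULTANEOUS_SETS,
   ENCODING_GROUP, ENCODINGS_PER_GROUP and can_send_together are
   hypothetical, and it assumes that each chosen Capture consumes
   exactly one encoding from its group.

```python
from collections import Counter

# Values transcribed from the three-screen example; the data layout
# itself is an assumption made for this sketch.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]
ENCODING_GROUP = {
    "VC0": "EG0", "VC1": "EG1", "VC2": "EG2", "MCC3": "EG1",
    "MCC4": "EG1", "VC5": "EG1", "VC6": "EG1",
}
ENCODINGS_PER_GROUP = {"EG0": 3, "EG1": 3, "EG2": 3}


def can_send_together(captures):
    """Return True if the video captures fit one simultaneous
    transmission set and no encoding group runs out of encodings
    (assumption: one encoding per chosen capture)."""
    chosen = set(captures)
    in_one_set = any(chosen <= s for s in SIMULTANEOUS_SETS)
    used = Counter(ENCODING_GROUP[c] for c in chosen)
    enough = all(n <= ENCODINGS_PER_GROUP[g] for g, n in used.items())
    return in_one_set and enough
```

   For instance, can_send_together(["VC1", "VC5"]) is rejected by
   the simultaneous transmission sets, while
   can_send_together(["VC1", "MCC3", "MCC4", "VC6"]) is rejected
   because it would need four encodings from EG1.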

   The spatial ordering is conveyed by the Media Capture attributes
   Area of Capture, Point of Capture and Point on Line of Capture.

   A Media Consumer would likely want to choose a Capture Scene View
   to receive based in part on how many streams it can simultaneously
   receive. A consumer that can receive three video streams would
   probably prefer to receive the first view of Capture Scene #1
   (VC0, VC1, VC2) and not receive the other views. A consumer that
   can receive only one video stream would probably choose one of the
   other views.

   If the consumer can receive a presentation stream too, it would
   also choose to receive the only video view from Capture Scene #2
   (VC6).

12.1.2. Encoding Group Example

   This is an example of an Encoding Group to illustrate how it can
   express dependencies between Encodings. The information below
   about Encodings is a summary of what would be conveyed in SDP, not
   directly in the CLUE Advertisement.

   encodeGroupID=EG0, maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
             maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
             maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

   Here, the Encoding Group is EG0. Although the Encoding Group is
   capable of transmitting up to 6Mbit/s, no individual video
   Encoding can exceed 4Mbit/s.

   This encoding group also allows up to 3 audio encodings,
   AUDENC<0-2>. It is not required that audio and video encodings
   reside within the same encoding group, but if so then the group's
   overall maxGroupBandwidth value is a limit on the sum of all audio
   and video encodings configured by the consumer. A system that
   does not wish or need to combine bandwidth limitations in this way
   should instead use separate encoding groups for audio and video so
   that the bandwidth limitations on audio and video do not interact.

   Audio and video can be expressed in separate encoding groups, as
   in this illustration.

   encodeGroupID=EG0, maxGroupBandwidth=6000000
       encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
             maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
       encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
             maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupBandwidth=500000
       encodeID=AUDENC0, maxBandwidth=96000
       encodeID=AUDENC1, maxBandwidth=96000
       encodeID=AUDENC2, maxBandwidth=96000

12.1.3. The MCU Case

   This section shows how an MCU might express its Capture Scenes,
   intending to offer different choices for consumers that can handle
   different numbers of streams. Each MCC is for video. A single
   Audio Capture is provided for all single and multi-screen
   configurations that can be associated (e.g. lip-synced) with any
   combination of Video Captures (the MCCs) at the consumer.

   +-----------------------+---------------------------------+
   | Capture Scene #1      |                                 |
   +-----------------------+---------------------------------+
   | MCC0                  | for a single screen consumer    |
   | MCC1, MCC2            | for a two screen consumer       |
   | MCC3, MCC4, MCC5      | for a three screen consumer     |
   | MCC6, MCC7, MCC8, MCC9| for a four screen consumer      |
   | AC0                   | AC representing all participants|
   | CSV(MCC0)             |                                 |
   | CSV(MCC1,MCC2)        |                                 |
   | CSV(MCC3,MCC4,MCC5)   |                                 |
   | CSV(MCC6,MCC7,        |                                 |
   |     MCC8,MCC9)        |                                 |
   | CSV(AC0)              |                                 |
   +-----------------------+---------------------------------+

                 Table 8: MCU main Capture Scenes

   If / when a presentation stream becomes active within the
   conference the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   | CSV(VC10)        |                                      |
   | CSV(AC1)         |                                      |
   +------------------+--------------------------------------+

             Table 9: MCU presentation Capture Scene

12.2. Media Consumer Behavior

   This section gives an example of how a Media Consumer might behave
   when deciding how to request streams from the three screen
   endpoint described in the previous section.

   The receive side of a call needs to balance its requirements,
   based on number of screens and speakers, its decoding capabilities
   and available bandwidth, and the provider's capabilities in order
   to optimally configure the provider's streams. Typically it would
   want to receive and decode media from each Capture Scene
   advertised by the Provider.
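
   As a rough illustration of this balancing, the following Python
   sketch picks a video Capture Scene View using only the number of
   streams the consumer can render. It is a simplification under
   stated assumptions: bandwidth and decoder limits are ignored, and
   the names CAPTURE_SCENE_1_VIDEO_VIEWS and pick_view are
   hypothetical, not CLUE protocol elements.

```python
# Video views of Capture Scene #1 from the three-screen example.
CAPTURE_SCENE_1_VIDEO_VIEWS = [
    ["VC0", "VC1", "VC2"],
    ["MCC3"],
    ["MCC4"],
    ["VC5"],
]


def pick_view(views, max_streams, preference=()):
    """Pick the view that uses the most streams without exceeding
    max_streams. Ties are broken by the caller's preference list
    (hard-coded or user supplied), then by advertisement order."""
    fitting = [v for v in views if len(v) <= max_streams]
    if not fitting:
        return None
    most = max(len(v) for v in fitting)
    candidates = [v for v in fitting if len(v) == most]
    for wanted in preference:
        for view in candidates:
            if wanted in view:
                return view
    return candidates[0]
```

   A three-stream consumer obtains (VC0, VC1, VC2) from this sketch,
   while a one-stream consumer that prefers the zoomed out view can
   call pick_view(CAPTURE_SCENE_1_VIDEO_VIEWS, 1, preference=("VC5",)).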

   A reasonable basic algorithm is for the consumer to go through
   each Capture Scene View in turn and find the collection of Video
   Captures that best matches the number of screens it has (this
   might include consideration of screens dedicated to presentation
   video display rather than "people" video) and then decide between
   alternative views in the video Capture Scenes based either on
   hard-coded preferences or user choice. Once this choice has been
   made, the consumer would then decide how to configure the
   provider's encoding groups in order to make best use of the
   available network bandwidth and its own decoding capabilities.

12.2.1. One screen Media Consumer

   MCC3, MCC4 and VC5 are all different views by themselves, not
   grouped together in a single view, so the receiving device should
   choose one of them. The choice would come down to whether to see
   the greatest number of participants simultaneously at roughly
   equal precedence (VC5), a switched view of just the loudest region
   (MCC3) or a switched view with PiPs (MCC4). An endpoint device
   with a small amount of knowledge of these differences could offer
   a dynamic choice of these options, in-call, to the user.

12.2.2. Two screen Media Consumer configuring the example

   Mixing systems with an even number of screens, "2n", and those
   with "2n+1" cameras (and vice versa) is always likely to be the
   problematic case. In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed
   per screen or whether more than 2 streams can be received and
   spread across the available screen area. Three possible behaviors
   for the 2 screen system when it learns that the far end is
   "ideally" expressed via 3 capture streams are:

   1. Fall back to receiving just a single stream (MCC3, MCC4 or VC5
      as per the 1 screen consumer case above) and either leave one
      screen blank or use it for presentation if / when a
      presentation becomes active.

   2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2
      screens, either with each capture being scaled to 2/3 of a
      screen and the center capture being split across 2 screens or,
      as would be necessary if there were large bezels on the
      screens, with each stream being scaled to 1/2 the screen width
      and height and there being a 4th "blank" panel. This 4th panel
      could potentially be used for any presentation that became
      active during the call.

   3. Receive 3 streams, decode all 3, and use control information
      indicating which was the most active to switch between showing
      the left and center streams (one per screen) and the center and
      right streams.

   For an endpoint capable of all 3 methods of working described
   above, again it might be appropriate to offer the user the choice
   of display mode.

12.2.3. Three screen Media Consumer configuring the example

   This is the most straightforward case: the Media Consumer would
   look to identify a set of streams to receive that best matched its
   available screens, and so VC0 plus VC1 plus VC2 should match
   optimally. The spatial ordering would give sufficient information
   for the correct Video Capture to be shown on the correct screen.
   The consumer would then need either to divide a single encoding
   group's capability by 3 to determine what resolution and frame
   rate to configure the provider with, or to configure the
   individual Video Captures' Encoding Groups with what makes most
   sense (taking into account the receive side decode capabilities,
   overall call bandwidth, the resolution of the screens plus any
   user preferences such as motion vs. sharpness).

12.3. Multipoint Conference utilizing Multiple Content Captures

   The use of MCCs allows the MCU to construct outgoing Advertisements
   describing complex media switching and composition scenarios. The
   following sections provide several examples.

   Note: In the examples the identities of the CLUE elements (e.g.
   Captures, Capture Scene) in the incoming Advertisements overlap.
   This is because there is no co-ordination between the endpoints.
   The MCU is responsible for making these unique in the outgoing
   advertisement.

12.3.1. Single Media Captures and MCC in the same Advertisement

   Four endpoints are involved in a Conference where CLUE is used. An
   MCU acts as a middlebox between the endpoints with a CLUE channel
   between each endpoint and the MCU. The MCU receives the following
   Advertisements.

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=AustralianConfRoom  |
   +-----------------------+---------------------------------+
   | VC1                   | Description=Audience            |
   |                       | EncodeGroupID=1                 |
   | CSV(VC1)              |                                 |
   +-----------------------+---------------------------------+

          Table 10: Advertisement received from Endpoint A

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=ChinaConfRoom       |
   +-----------------------+---------------------------------+
   | VC1                   | Description=Speaker             |
   |                       | EncodeGroupID=1                 |
   | VC2                   | Description=Audience            |
   |                       | EncodeGroupID=1                 |
   | CSV(VC1, VC2)         |                                 |
   +-----------------------+---------------------------------+

          Table 11: Advertisement received from Endpoint B

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=USAConfRoom         |
   +-----------------------+---------------------------------+
   | VC1                   | Description=Audience            |
   |                       | EncodeGroupID=1                 |
   | CSV(VC1)              |                                 |
   +-----------------------+---------------------------------+

          Table 12: Advertisement received from Endpoint C

   Note: Endpoint B above indicates that it sends two streams.

   If the MCU wanted to provide a Multiple Content Capture containing
   a round robin switched view of the audience from the 3 endpoints
   and the speaker it could construct the following advertisement:

   Advertisement sent to Endpoint F

   +=======================+=================================+
   | Capture Scene #1      | Description=AustralianConfRoom  |
   +-----------------------+---------------------------------+
   | VC1                   | Description=Audience            |
   | CSV(VC1)              |                                 |
   +=======================+=================================+
   | Capture Scene #2      | Description=ChinaConfRoom       |
   +-----------------------+---------------------------------+
   | VC2                   | Description=Speaker             |
   | VC3                   | Description=Audience            |
   | CSV(VC2, VC3)         |                                 |
   +=======================+=================================+
   | Capture Scene #3      | Description=USAConfRoom         |
   +-----------------------+---------------------------------+
   | VC4                   | Description=Audience            |
   | CSV(VC4)              |                                 |
   +=======================+=================================+
   | Capture Scene #4      |                                 |
   +-----------------------+---------------------------------+
   | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1             |
   |                       | MaxCaptures=1                   |
   |                       | EncodingGroup=1                 |
   | CSV(MCC1)             |                                 |
   +=======================+=================================+

      Table 13: Advertisement sent to Endpoint F - One Encoding

   Alternatively if the MCU wanted to provide the speaker as one media
   stream and the audiences as another it could assign an encoding
   group to VC2 in Capture Scene #2 and provide a CSV in Capture Scene
   #4 as per the example below.

   Advertisement sent to Endpoint F

   +=======================+=================================+
   | Capture Scene #1      | Description=AustralianConfRoom  |
   +-----------------------+---------------------------------+
   | VC1                   | Description=Audience            |
   | CSV(VC1)              |                                 |
   +=======================+=================================+
   | Capture Scene #2      | Description=ChinaConfRoom       |
   +-----------------------+---------------------------------+
   | VC2                   | Description=Speaker             |
   |                       | EncodingGroup=1                 |
   | VC3                   | Description=Audience            |
   | CSV(VC2, VC3)         |                                 |
   +=======================+=================================+
   | Capture Scene #3      | Description=USAConfRoom         |
   +-----------------------+---------------------------------+
   | VC4                   | Description=Audience            |
   | CSV(VC4)              |                                 |
   +=======================+=================================+
   | Capture Scene #4      |                                 |
   +-----------------------+---------------------------------+
   | MCC1(VC1,VC3,VC4)     | Policy=RoundRobin:1             |
   |                       | MaxCaptures=1                   |
   |                       | EncodingGroup=1                 |
   |                       | AllowSubset=True                |
   | MCC2(VC2)             | MaxCaptures=1                   |
   |                       | EncodingGroup=1                 |
   | CSV2(MCC1,MCC2)       |                                 |
   +=======================+=================================+

      Table 14: Advertisement sent to Endpoint F - Two Encodings

   Therefore a Consumer could choose whether or not to have a separate
   speaker related stream and could choose which endpoints to see. If
   it wanted the second stream but not the Australian conference room
   it could indicate the following captures in the Configure message:

   +-----------------------+---------------------------------+
   | MCC1(VC3,VC4)         | Encoding                        |
   | VC2                   | Encoding                        |
   +-----------------------+---------------------------------+

               Table 15: MCU case: Consumer Response

12.3.2. Several MCCs in the same Advertisement

   Multiple MCCs can be used where multiple streams are used to carry
   media from multiple endpoints. For example:

   A conference has three endpoints D, E and F. Each endpoint has
   three video captures covering the left, middle and right regions of
   each conference room. The MCU receives the following
   advertisements from D and E.

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=AustralianConfRoom  |
   +-----------------------+---------------------------------+
   | VC1                   | CaptureArea=Left                |
   |                       | EncodingGroup=1                 |
   | VC2                   | CaptureArea=Centre              |
   |                       | EncodingGroup=1                 |
   | VC3                   | CaptureArea=Right               |
   |                       | EncodingGroup=1                 |
   | CSV(VC1,VC2,VC3)      |                                 |
   +-----------------------+---------------------------------+

          Table 16: Advertisement received from Endpoint D

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=ChinaConfRoom       |
   +-----------------------+---------------------------------+
   | VC1                   | CaptureArea=Left                |
   |                       | EncodingGroup=1                 |
   | VC2                   | CaptureArea=Centre              |
   |                       | EncodingGroup=1                 |
   | VC3                   | CaptureArea=Right               |
   |                       | EncodingGroup=1                 |
   | CSV(VC1,VC2,VC3)      |                                 |
   +-----------------------+---------------------------------+

          Table 17: Advertisement received from Endpoint E

   The MCU wants to offer Endpoint F three Capture Encodings. Each
   Capture Encoding would contain all the Captures from either
   Endpoint D or Endpoint E depending on the active speaker.
   The MCU sends the following Advertisement:

   +=======================+=================================+
   | Capture Scene #1      | Description=AustralianConfRoom  |
   +-----------------------+---------------------------------+
   | VC1                   |                                 |
   | VC2                   |                                 |
   | VC3                   |                                 |
   | CSV(VC1,VC2,VC3)      |                                 |
   +=======================+=================================+
   | Capture Scene #2      | Description=ChinaConfRoom       |
   +-----------------------+---------------------------------+
   | VC4                   |                                 |
   | VC5                   |                                 |
   | VC6                   |                                 |
   | CSV(VC4,VC5,VC6)      |                                 |
   +=======================+=================================+
   | Capture Scene #3      |                                 |
   +-----------------------+---------------------------------+
   | MCC1(VC1,VC4)         | CaptureArea=Left                |
   |                       | MaxCaptures=1                   |
   |                       | SynchronisationID=1             |
   |                       | EncodingGroup=1                 |
   | MCC2(VC2,VC5)         | CaptureArea=Centre              |
   |                       | MaxCaptures=1                   |
   |                       | SynchronisationID=1             |
   |                       | EncodingGroup=1                 |
   | MCC3(VC3,VC6)         | CaptureArea=Right               |
   |                       | MaxCaptures=1                   |
   |                       | SynchronisationID=1             |
   |                       | EncodingGroup=1                 |
   | CSV(MCC1,MCC2,MCC3)   |                                 |
   +=======================+=================================+

             Table 18: Advertisement sent to Endpoint F

12.3.3. Heterogeneous conference with switching and composition

   Consider a conference between endpoints with the following
   characteristics:

   Endpoint A - 4 screens, 3 cameras

   Endpoint B - 3 screens, 3 cameras

   Endpoint C - 3 screens, 3 cameras

   Endpoint D - 3 screens, 3 cameras

   Endpoint E - 1 screen, 1 camera

   Endpoint F - 2 screens, 1 camera

   Endpoint G - 1 screen, 1 camera

   This example focuses on what the user in one of the 3-camera
   multi-screen endpoints sees. Call this person User A, at Endpoint
   A. There are 4 large display screens at Endpoint A. Whenever
   somebody at another site is speaking, all the video captures from
   that endpoint are shown on the large screens. If the talker is at
   a 3-camera site, then the video from those 3 cameras fills 3 of
   the screens. If the talker is at a single-camera site, then video
   from that camera fills one of the screens, while the other screens
   show video from other single-camera endpoints.

   User A hears audio from the 4 loudest talkers.

   User A can also see video from other endpoints, in addition to the
   current talker, although much smaller in size. Endpoint A has 4
   screens, so one of those screens shows up to 9 other Media Captures
   in a tiled fashion. When video from a 3 camera endpoint appears in
   the tiled area, video from all 3 cameras appears together across
   the screen with correct spatial relationship among those 3 images.

   +---+---+---+ +-------------+ +-------------+ +-------------+
   |   |   |   | |             | |             | |             |
   +---+---+---+ |             | |             | |             |
   |   |   |   | |             | |             | |             |
   +---+---+---+ |             | |             | |             |
   |   |   |   | |             | |             | |             |
   +---+---+---+ +-------------+ +-------------+ +-------------+

              Figure 8: Endpoint A - 4 Screen Display

   User B at Endpoint B sees a similar arrangement, except there are
   only 3 screens, so the 9 other Media Captures are spread out across
   the bottom of the 3 displays, in a picture-in-picture (PiP) format.
   When video from a 3 camera endpoint appears in the PiP area, video
   from all 3 cameras appears together across a single screen with
   correct spatial relationship.

   +-------------+ +-------------+ +-------------+
   |             | |             | |             |
   |             | |             | |             |
   |             | |             | |             |
   | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
   | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
   +-------------+ +-------------+ +-------------+

         Figure 9: Endpoint B - 3 Screen Display with PiPs

   When somebody at a different endpoint becomes the current talker,
   then User A and User B both see the video from the new talker
   appear on their large screen area, while the previous talker takes
   one of the smaller tiled or PiP areas. The person who is the
   current talker doesn't see themselves; they see the previous talker
   in their large screen area.

   One of the points of this example is that endpoints A and B each
   want to receive 3 capture encodings for their large display areas,
   and 9 encodings for their smaller areas. A and B are each able to
   send the same Configure message to the MCU, and each receive the
   same conceptual Media Captures from the MCU. The differences are
   in how they are rendered and are purely a local matter at A and B.

   The Advertisements for such a scenario are described below.

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=Endpoint x          |
   +-----------------------+---------------------------------+
   | VC1                   | EncodingGroup=1                 |
   | VC2                   | EncodingGroup=1                 |
   | VC3                   | EncodingGroup=1                 |
   | AC1                   | EncodingGroup=2                 |
   | CSV1(VC1, VC2, VC3)   |                                 |
   | CSV2(AC1)             |                                 |
   +-----------------------+---------------------------------+

   Table 19: Advertisement received at the MCU from Endpoints A to D

   +-----------------------+---------------------------------+
   | Capture Scene #1      | Description=Endpoint y          |
   +-----------------------+---------------------------------+
   | VC1                   | EncodingGroup=1                 |
   | AC1                   | EncodingGroup=2                 |
   | CSV1(VC1)             |                                 |
   | CSV2(AC1)             |                                 |
   +-----------------------+---------------------------------+

   Table 20: Advertisement received at the MCU from Endpoints E to G

   Rather than considering what is displayed, CLUE concentrates more
   on what the MCU sends. The MCU doesn't know anything about the
   number of screens an endpoint has.

   As Endpoints A to D each advertise that three Captures make up a
   Capture Scene, the MCU offers these in a "site" switching mode.
   That is, there are three Multiple Content Captures (and Capture
   Encodings) each switching between Endpoints. The MCU switches the
   applicable media into the stream based on voice activity.
   Endpoint A will not see a capture from itself.
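
   The renumbering and "site" switching just described can be
   sketched in Python. This is an illustration under assumptions,
   not the MCU's actual procedure: the function name
   build_site_switching_mccs is hypothetical, video identities are
   assumed to be numbered consecutively over all endpoints (so the
   recipient's own numbers are simply left unused), and
   single-camera endpoints are assumed to be spread one per MCC in
   order. With those assumptions the result matches the source
   lists of MCC1, MCC2 and MCC3 in Table 22.

```python
def build_site_switching_mccs(sites, recipient):
    """sites maps endpoint name -> number of video captures, in
    advertisement order. Returns the source capture lists for the
    left, center and right switched MCCs sent to recipient."""
    next_id = 1
    renumbered = {}
    for name, count in sites.items():
        # Make identities unique across all incoming Advertisements.
        renumbered[name] = [f"VC{next_id + i}" for i in range(count)]
        next_id += count

    mccs = [[], [], []]
    singles = []
    for name, ids in renumbered.items():
        if name == recipient:
            continue  # an endpoint never sees its own captures
        if len(ids) == 3:
            for slot, vc in enumerate(ids):
                mccs[slot].append(vc)  # left, center, right
        else:
            singles.extend(ids)
    # Assumption: single-camera sites are distributed round-robin.
    for slot, vc in enumerate(singles):
        mccs[slot % 3].append(vc)
    return mccs


SITES = {"A": 3, "B": 3, "C": 3, "D": 3, "E": 1, "F": 1, "G": 1}
left, center, right = build_site_switching_mccs(SITES, "A")
```

   Calling the function with recipient "B" instead would drop VC4 to
   VC6 and include VC1 to VC3 from Endpoint A.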
2662 Using the MCC concept the MCU would send the following 2663 Advertisement to endpoint A: 2665 +=======================+=================================+ 2666 | Capture Scene #1 | Description=Endpoint B | 2667 +-----------------------|---------------------------------+ 2668 | VC4 | CaptureArea=Left | 2669 | VC5 | CaptureArea=Center | 2670 | VC6 | CaptureArea=Right | 2671 | AC1 | | 2672 | CSV(VC4,VC5,VC6) | | 2673 | CSV(AC1) | | 2674 +=======================+=================================+ 2675 | Capture Scene #2 | Description=Endpoint C | 2676 +-----------------------|---------------------------------+ 2677 | VC7 | CaptureArea=Left | 2678 | VC8 | CaptureArea=Center | 2679 | VC9 | CaptureArea=Right | 2680 | AC2 | | 2681 | CSV(VC7,VC8,VC9) | | 2682 | CSV(AC2) | | 2683 +=======================+=================================+ 2684 | Capture Scene #3 | Description=Endpoint D | 2685 +-----------------------|---------------------------------+ 2686 | VC10 | CaptureArea=Left | 2687 | VC11 | CaptureArea=Center | 2688 | VC12 | CaptureArea=Right | 2689 | AC3 | | 2690 | CSV(VC10,VC11,VC12) | | 2691 | CSV(AC3) | | 2692 +=======================+=================================+ 2693 | Capture Scene #4 | Description=Endpoint E | 2694 +-----------------------|---------------------------------+ 2695 | VC13 | | 2696 | AC4 | | 2697 | CSV(VC13) | | 2698 | CSV(AC4) | | 2699 +=======================+=================================+ 2700 | Capture Scene #5 | Description=Endpoint F | 2701 +-----------------------|---------------------------------+ 2702 | VC14 | | 2703 | AC5 | | 2704 | CSV(VC14) | | 2705 | CSV(AC5) | | 2706 +=======================+=================================+ 2707 | Capture Scene #6 | Description=Endpoint G | 2708 +-----------------------|---------------------------------+ 2709 | VC15 | | 2710 | AC6 | | 2711 | CSV(VC15) | | 2712 | CSV(AC6) | | 2713 +=======================+=================================+ 2715 Table 21: Advertisement sent to endpoint 
A - Source Part 2717 The above part of the Advertisement presents information about the 2718 sources to the MCC. The information is effectively the same as the 2719 received Advertisements except that there are no Capture Encodings 2720 associated with them and the identities have been re-numbered. 2722 In addition to the source Capture information the MCU advertises 2723 "site" switching of Endpoints B to G in three streams. 2725 +=======================+=================================+ 2726 | Capture Scene #7 | Description=Output3streammix | 2727 +-----------------------|---------------------------------+ 2728 | MCC1(VC4,VC7,VC10, | CaptureArea=Left | 2729 | VC13) | MaxCaptures=1 | 2730 | | SynchronisationID=1 | 2731 | | Policy=SoundLevel:0 | 2732 | | EncodingGroup=1 | 2733 | | | 2734 | MCC2(VC5,VC8,VC11, | CaptureArea=Center | 2735 | VC14) | MaxCaptures=1 | 2736 | | SynchronisationID=1 | 2737 | | Policy=SoundLevel:0 | 2738 | | EncodingGroup=1 | 2739 | | | 2740 | MCC3(VC6,VC9,VC12, | CaptureArea=Right | 2741 | VC15) | MaxCaptures=1 | 2742 | | SynchronisationID=1 | 2743 | | Policy=SoundLevel:0 | 2744 | | EncodingGroup=1 | 2745 | | | 2746 | MCC4() (for audio) | CaptureArea=whole scene | 2747 | | MaxCaptures=1 | 2748 | | Policy=SoundLevel:0 | 2749 | | EncodingGroup=2 | 2750 | | | 2751 | MCC5() (for audio) | CaptureArea=whole scene | 2752 | | MaxCaptures=1 | 2753 | | Policy=SoundLevel:1 | 2754 | | EncodingGroup=2 | 2755 | | | 2756 | MCC6() (for audio) | CaptureArea=whole scene | 2757 | | MaxCaptures=1 | 2758 | | Policy=SoundLevel:2 | 2759 | | EncodingGroup=2 | 2760 | | | 2761 | MCC7() (for audio) | CaptureArea=whole scene | 2762 | | MaxCaptures=1 | 2763 | | Policy=SoundLevel:3 | 2764 | | EncodingGroup=2 | 2765 | | | 2766 | CSV(MCC1,MCC2,MCC3) | | 2767 | CSV(MCC4,MCC5,MCC6, | | 2768 | MCC7) | | 2769 +=======================+=================================+ 2771 Table 22: Advertisement send to endpoint A - switching part 2773 The above part describes the switched 3 
main streams that relate to site switching. MaxCaptures=1 indicates
that only one Capture from the MCC is sent at a particular time.
SynchronisationID=1 indicates that the source sending is
synchronised. The provider can choose to group together VC13, VC14,
and VC15 for the purpose of switching according to the
SynchronisationID. Therefore when the provider switches one of them
into an MCC, it can also switch the others even though they are not
part of the same Capture Scene.

All the audio for the conference is included in this Scene #7.
There isn't necessarily a one-to-one relation between any audio
capture and video capture in this scene. Typically a change in
loudest talker will cause the MCU to switch the audio streams more
quickly than switching video streams.

The MCU can also supply nine media streams showing the active and
previous eight speakers. It includes the following in the
Advertisement:

+=======================+=================================+
| Capture Scene #8      | Description=Output9stream       |
+-----------------------|---------------------------------+
| MCC8(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
|   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:0             |
|   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
|                       |                                 |
| MCC9(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
|   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:1             |
|   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
|                       |                                 |
|          to           |               to                |
|                       |                                 |
| MCC16(VC4,VC5,VC6,VC7,| MaxCaptures=1                   |
|   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:8             |
|   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
|                       |                                 |
| CSV(MCC8,MCC9,MCC10,  |                                 |
|     MCC11,MCC12,MCC13,|                                 |
|     MCC14,MCC15,MCC16)|                                 |
+=======================+=================================+

Table 23: Advertisement sent to endpoint A - 9 switched part

The above part indicates that there are 9 Capture Encodings.
Each of the Capture Encodings may contain any captures from any
source site with a maximum of one Capture at a time. Which Capture
is present is determined by the policy. The MCCs in this scene do
not have any spatial attributes.

Note: The Provider alternatively could provide each of the MCCs
above in its own Capture Scene.

If the MCU wanted to provide a composed Capture Encoding containing
all of the 9 captures it could advertise in addition:

+=======================+=================================+
| Capture Scene #9      | Description=NineTiles           |
+-----------------------|---------------------------------+
| MCC13(MCC8,MCC9,MCC10,| MaxCaptures=9                   |
|     MCC11,MCC12,MCC13,| EncodingGroup=1                 |
|     MCC14,MCC15,MCC16)|                                 |
|                       |                                 |
| CSV(MCC13)            |                                 |
+=======================+=================================+

Table 24: Advertisement sent to endpoint A - 9 composed part

As MaxCaptures is 9 it indicates that the Capture Encoding contains
information from 9 sources at a time.

The Advertisement to Endpoint B is identical to the above, except
that the captures from Endpoint A would be added and the captures
from Endpoint B would be removed. Whether the Captures are rendered
on a four screen display or a three screen display is up to the
Consumer to determine. The Consumer wants to place video captures
from the same original source endpoint together, in the correct
spatial order, but the MCCs do not have spatial attributes. So the
Consumer needs to associate incoming media packets with the
original individual captures in the advertisement (such as VC4,
VC5, and VC6) in order to know the spatial information it needs for
correct placement on the screens. The Provider can use the RTCP
CaptureId SDES item and associated RTP header extension, as
described in [I-D.ietf-clue-rtp-mapping], to convey this
information to the Consumer.
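The association step described above can be illustrated with a
short Python sketch. It shows one way a Consumer might map the
CaptureId signalled on each switched stream back to the original
capture in the Advertisement and derive a left-to-right order per
source scene. This is only a sketch under stated assumptions: the
scene labels ("scene-1", "scene-2"), the dictionary layout, the
function name place_streams, and the treatment of CaptureArea as
the symbolic values Left/Center/Right used in the tables are all
hypothetical. The actual attribute encoding is defined by the data
model, and the CaptureId transport by [I-D.ietf-clue-rtp-mapping].

```python
# Hypothetical excerpt of the source part of the Advertisement:
# capture identity -> (source scene label, CaptureArea).
# The scene labels are invented for this example.
ADVERTISED = {
    "VC4": ("scene-1", "Left"),
    "VC5": ("scene-1", "Center"),
    "VC6": ("scene-1", "Right"),
    "VC7": ("scene-2", "Left"),
    "VC8": ("scene-2", "Center"),
    "VC9": ("scene-2", "Right"),
}

# Assumed left-to-right ordering of the symbolic areas in the tables.
AREA_ORDER = {"Left": 0, "Center": 1, "Right": 2}


def place_streams(current_capture_ids):
    """Group received streams by source scene, ordered left to right.

    current_capture_ids maps an MCC identity (a stream the Consumer
    configured) to the CaptureId currently signalled on that stream.
    Returns {scene label: [MCC identities ordered left to right]}.
    """
    by_scene = {}
    for mcc, capture_id in current_capture_ids.items():
        scene, area = ADVERTISED[capture_id]
        by_scene.setdefault(scene, []).append((AREA_ORDER[area], mcc))
    # Sort each scene's streams by spatial position, then drop the key.
    return {scene: [mcc for _, mcc in sorted(items)]
            for scene, items in by_scene.items()}


if __name__ == "__main__":
    # Three switched streams currently carry the cameras of scene-1.
    print(place_streams({"MCC8": "VC5", "MCC9": "VC6", "MCC10": "VC4"}))
```

With a result of this kind the Consumer can keep captures from the
same source endpoint on adjacent screens even though the MCCs
themselves carry no spatial attributes.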
12.3.4. Heterogeneous conference with voice activated switching

This example illustrates how multipoint "voice activated switching"
behavior can be realized, with an endpoint making its own decision
about which of its outgoing video streams is considered the "active
talker" from that endpoint. Then an MCU can decide which is the
active talker among the whole conference.

Consider a conference between endpoints with the following
characteristics:

Endpoint A - 3 screens, 3 cameras

Endpoint B - 3 screens, 3 cameras

Endpoint C - 1 screen, 1 camera

This example focuses on what the user at endpoint C sees. The
user would like to see the video capture of the current talker,
without composing it with any other video capture. In this
example endpoint C is capable of receiving only a single video
stream. The following tables describe advertisements from A and B
to the MCU, and from the MCU to C, that can be used to accomplish
this.

+-----------------------+---------------------------------+
| Capture Scene #1      | Description=Endpoint x          |
+-----------------------|---------------------------------+
| VC1                   | CaptureArea=Left                |
|                       | EncodingGroup=1                 |
| VC2                   | CaptureArea=Center              |
|                       | EncodingGroup=1                 |
| VC3                   | CaptureArea=Right               |
|                       | EncodingGroup=1                 |
| MCC1(VC1,VC2,VC3)     | MaxCaptures=1                   |
|                       | CaptureArea=whole scene         |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=1                 |
| AC1                   | CaptureArea=whole scene         |
|                       | EncodingGroup=2                 |
| CSV1(VC1, VC2, VC3)   |                                 |
| CSV2(MCC1)            |                                 |
| CSV3(AC1)             |                                 |
+---------------------------------------------------------+

Table 25: Advertisement received at the MCU from Endpoints A and B

Endpoints A and B are advertising each individual video capture,
and also a switched capture MCC1 which switches between the other
three based on who is the active talker.
These endpoints do not advertise distinct audio captures associated
with each individual video capture, so it would be impossible for
the MCU (as a media consumer) to make its own determination of
which video capture is the active talker based just on information
in the audio streams.

+-----------------------+---------------------------------+
| Capture Scene #1      | Description=conference          |
+-----------------------|---------------------------------+
| MCC1()                | CaptureArea=Left                |
|                       | MaxCaptures=1                   |
|                       | SynchronisationID=1             |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=1                 |
|                       |                                 |
| MCC2()                | CaptureArea=Center              |
|                       | MaxCaptures=1                   |
|                       | SynchronisationID=1             |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=1                 |
|                       |                                 |
| MCC3()                | CaptureArea=Right               |
|                       | MaxCaptures=1                   |
|                       | SynchronisationID=1             |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=1                 |
|                       |                                 |
| MCC4()                | CaptureArea=whole scene         |
|                       | MaxCaptures=1                   |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=1                 |
|                       |                                 |
| MCC5() (for audio)    | CaptureArea=whole scene         |
|                       | MaxCaptures=1                   |
|                       | Policy=SoundLevel:0             |
|                       | EncodingGroup=2                 |
|                       |                                 |
| MCC6() (for audio)    | CaptureArea=whole scene         |
|                       | MaxCaptures=1                   |
|                       | Policy=SoundLevel:1             |
|                       | EncodingGroup=2                 |
| CSV1(MCC1,MCC2,MCC3)  |                                 |
| CSV2(MCC4)            |                                 |
| CSV3(MCC5,MCC6)       |                                 |
+---------------------------------------------------------+

Table 26: Advertisement sent from the MCU to C

The MCU advertises one scene, with four video MCCs. Three of them
in CSV1 give a left, center, right view of the conference, with
"site switching". MCC4 provides a single video capture
representing a view of the whole conference. The MCU intends for
MCC4 to be switched between all the other original source
captures.
In this example advertisement the MCU is not giving all the
information about all the other endpoints' scenes and which of
those captures is included in the MCCs. The MCU could include all
that information if it wants to give the consumers more
information, but it is not necessary for this example scenario.

The Provider advertises MCC5 and MCC6 for audio. Both are
switched captures, with different SoundLevel policies indicating
they are the top two dominant talkers. The Provider advertises
CSV3 with both MCCs, suggesting the Consumer should use both if it
can.

Endpoint C, in its Configure message to the MCU, requests to
receive MCC4 for video, and MCC5 and MCC6 for audio. In order for
the MCU to get the information it needs to construct MCC4, it has
to send Configure messages to A and B asking to receive MCC1 from
each of them, along with their AC1 audio. Now the MCU can use
audio energy information from the two incoming audio streams from
A and B to determine which of those alternatives is the current
talker. Based on that, the MCU uses either MCC1 from A or MCC1
from B as the source of MCC4 to send to C.

13. Acknowledgements

Allyn Romanow and Brian Baldino were authors of early versions.
Mark Gorzynski also contributed much to the initial approach.
Many others also contributed, including Christian Groves, Jonathan
Lennox, Paul Kyzivat, Rob Hansen, Roni Even, Christer Holmberg,
Stephen Botzko, Mary Barnes, John Leslie, and Paul Coverdale.

14. IANA Considerations

None.

15. Security Considerations

There are several potential attacks related to telepresence, and
specifically the protocols used by CLUE, in the case of
conferencing sessions, due to the natural involvement of multiple
endpoints and the many, often user-invoked, capabilities provided
by the systems.

An MCU involved in a CLUE session can experience many of the same
attacks as a conferencing system, such as one enabled by the XCON
framework [RFC6503]. Examples of attacks include the following: an
endpoint attempting to listen to sessions in which it is not
authorized to participate, an endpoint attempting to disconnect or
mute other users, and theft of service by an endpoint attempting
to create telepresence sessions it is not allowed to create. Thus,
it is RECOMMENDED that an MCU implementing the protocols necessary
to support CLUE follow the security recommendations specified in
the conference control protocol documents. In the case of CLUE,
SIP is the conferencing protocol, thus the security considerations
in [RFC4579] MUST be followed.

One primary security concern surrounding the CLUE framework
introduced in this document involves securing the actual
protocols and the associated authorization mechanisms. These
concerns apply to endpoint to endpoint sessions, as well as
sessions involving multiple endpoints and MCUs. Figure 2 in
section 5 provides a basic flow of information exchange for CLUE
and the protocols involved.

As described in section 5, CLUE uses SIP/SDP to establish the
session prior to exchanging any CLUE specific information. Thus
the security mechanisms recommended for SIP [RFC3261], including
user authentication and authorization, SHOULD be followed. In
addition, the media is based on RTP and thus existing RTP security
mechanisms SHOULD be supported, and DTLS/SRTP MUST be supported.
Media security is also discussed in [I-D.ietf-clue-signaling] and
[I-D.ietf-clue-rtp-mapping].

A separate data channel is established to transport the CLUE
protocol messages. The contents of the CLUE protocol messages are
based on information introduced in this document.
The CLUE data model [I-D.ietf-clue-data-model-schema] defines
through an XML schema the syntax to be used. One item that could
introduce privacy concerns is the xCard information described in
section 7.1.1.11. In addition, the (text) description field in the
Media Capture attribute (section 7.1.1.7) could reveal sensitive
information or specific identities. The same would be true for the
descriptions in the Capture Scene (section 7.3.1) and Capture
Scene View (section 7.3.2) attributes. One other important
consideration for the information in the xCard as well as the
description field in the Media Capture and Capture Scene View
attributes is that while the endpoints involved in the session
have been authenticated, there is no assurance that the
information in the xCard or description fields is authentic.
Thus, this information MUST NOT be used to make any authorization
decisions.

While other information in the CLUE protocol messages does not
reveal specific identities, it can reveal characteristics and
capabilities of the endpoints. That information could possibly
uniquely identify specific endpoints. It might also be possible
for an attacker to manipulate the information and disrupt the CLUE
sessions. It would also be possible to mount a DoS attack on the
CLUE endpoints if a malicious agent has access to the data
channel. Thus, it MUST be possible for the endpoints to establish
a channel which is secure against both message recovery and
message modification. Further details on this are provided in the
CLUE data channel solution document.

There are also security issues associated with the authorization
to perform actions at the CLUE endpoints to invoke specific
capabilities (e.g., re-arranging screens, sharing content, etc.).
However, the policies and security associated with these actions
are outside the scope of this document and the overall CLUE
solution.

16. Changes Since Last Version

NOTE TO THE RFC-Editor: Please remove this section prior to
publication as an RFC.

Changes from 20 to 21:

1. Clarify CLUE can be useful for multi-stream non-telepresence
   cases.
2. Remove unnecessary ambiguous sentence about optional use of
   CLUE protocol.
3. Clarify meaning if Area of Capture is not specified.
4. Remove use of "conference" where it didn't fit according to
   the definition. Use "CLUE session" or "meeting" instead.
5. Embedded Text Attribute: Remove restriction it is for video
   only.
6. Minor cleanup in section 12 examples.
7. Minor editorial corrections suggested by Christian Groves.

Changes from 19 to 20:

1. Define term "CLUE" in introduction.
2. Add MCC attribute Allow Subset Choice.
3. Remove phrase about reducing SDP size, replace with
   potentially saving consumer resources.
4. Change example of a CLUE exchange that does not require SDP
   exchange.
5. Language attribute uses RFC5646.
6. Change Member person type to Attendee. Add Observer type.
7. Clarify DTLS/SRTP MUST be supported.
8. Change SHOULD NOT to MUST NOT regarding using xCard or
   description information for authorization decisions.
9. Clarify definition of Global View.
10. Refer to signaling doc regarding interoperating with a
    device that does not support CLUE.
11. Various minor editorial changes from working group last call
    feedback.
12. Capitalize defined terms.

Changes from 18 to 19:

1. Remove the Max Capture Encodings media capture attribute.
2. Refer to RTP mapping document in the MCC example section.
3. Update references to current versions of drafts in progress.

Changes from 17 to 18:

1.
Add separate definition of Global View List.
2. Add diagram for Global View List structure.
3. Tweak definitions of Media Consumer and Provider.

Changes from 16 to 17:

1. Ticket #59 - rename Capture Scene Entry (CSE) to Capture
   Scene View (CSV)

2. Ticket #60 - rename Global CSE List to Global View List

3. Ticket #61 - Proposal for describing the coordinate system.
   Describe it better, without conflicts if cameras point in
   different directions.

4. Minor clarifications and improved wording for Synchronisation
   Identity, MCC, Simultaneous Transmission Set.

5. Add definitions for CLUE-capable device and CLUE-enabled
   call, taken from the signaling draft.

6. Update definitions of Capture Device, Media Consumer, Media
   Provider, Endpoint, MCU, MCC.

7. Replace "middle box" with "MCU".

8. Explicitly state there can also be Media Captures that are
   not included in a Capture Scene View.

9. Explicitly state "A single Encoding Group MAY refer to
   encodings for different media types."

10. In example 12.1.1 add axes and audio captures to the
    diagram, and describe placement of microphones.

11. Add references to data model and signaling drafts.

12. Split references into Normative and Informative sections.
    Add heading number for references section.

Changes from 15 to 16:

1. Remove Audio Channel Format attribute

2. Add Audio Capture Sensitivity Pattern attribute

3. Clarify audio spatial information regarding point of capture
   and point on line of capture. Area of capture does not apply
   to audio.

4. Update section 12 example for new treatment of audio spatial
   information.

5. Clean up wording of some definitions, and various places in
   sections 5 and 10.

6. Remove individual encoding parameter paragraph from section
   9.

7. Update Advertisement diagram.

8. Update Acknowledgements.

9.
References to use cases and requirements now refer to RFCs.

10. Minor editorial changes.

Changes from 14 to 15:

1. Add "=" and "<=" qualifiers to MaxCaptures attribute, and
   clarify the meaning regarding switched and composed MCC.

2. Add section 7.3.3 Global Capture Scene Entry List, and a few
   other sentences elsewhere that refer to global CSE sets.

3. Clarify: The Provider MUST be capable of encoding and sending
   all Captures (*that have an encoding group*) in a single
   Capture Scene Entry simultaneously.

4. Add voice activated switching example in section 12.

5. Change name of attributes Participant Info/Type to Person
   Info/Type.

6. Clarify the Person Info/Type attributes have the same meaning
   regardless of whether or not the capture has a Presentation
   attribute.

7. Update example section 12.1 to be consistent with the rest of
   the document, regarding MCC and capture attributes.

8. State explicitly each CSE has a unique ID.

Changes from 13 to 14:

1. Fill in section for Security Considerations.

2. Replace Role placeholder with Participant Information,
   Participant Type, and Scene Information attributes.

3. Spatial information implies nothing about how constituent
   media captures are combined into a composed MCC.

4. Clean up MCC example in Section 12.3.3. Clarify behavior of
   tiled and PIP display windows. Add audio. Add new open
   issue about associating incoming packets to original source
   capture.

5. Remove editor's note and associated statement about RTP
   multiplexing at end of section 5.

6. Remove editor's note and associated paragraph about
   overloading media channel with both CLUE and non-CLUE usage,
   in section 5.

7. In section 10, clarify intent of media encodings conforming
   to SDP, even with multiple CLUE message exchanges. Remove
   associated editor's note.

Changes from 12 to 13:

1.
Added the MCC concept including updates to existing sections
   to incorporate the MCC concept. New MCC attributes:
   MaxCaptures, SynchronisationID and Policy.

2. Removed the "composed" and "switched" Capture attributes due
   to overlap with the MCC concept.

3. Removed the "Scene-switch-policy" CSE attribute, replaced by
   MCC and SynchronisationID.

4. Editorial enhancements including numbering of the Capture
   attribute sections, tables, figures etc.

Changes from 11 to 12:

1. Ticket #44. Remove note questioning about requiring a
   Consumer to send a Configure after receiving Advertisement.

2. Ticket #43. Remove ability for consumer to choose value of
   attribute for scene-switch-policy.

3. Ticket #36. Remove computational complexity parameter,
   MaxGroupPps, from Encoding Groups.

4. Reword the Abstract and parts of sections 1 and 4 (now 5)
   based on Mary's suggestions as discussed on the list. Move
   part of the Introduction into a new section Overview &
   Motivation.

5. Add diagram of an Advertisement, in the Overview of the
   Framework/Model section.

6. Change Intended Status to Standards Track.

7. Clean up RFC2119 keyword language.

Changes from 10 to 11:

1. Add description attribute to Media Capture and Capture Scene
   Entry.

2. Remove contradiction and change the note about open issue
   regarding always responding to Advertisement with a Configure
   message.

3. Update example section, to cleanup formatting and make the
   media capture attributes and encoding parameters consistent
   with the rest of the document.

Changes from 09 to 10:

1. Several minor clarifications such as about SDP usage, Media
   Captures, Configure message.

2. Simultaneous Set can be expressed in terms of Capture Scene
   and Capture Scene Entry.

3. Removed Area of Scene attribute.

4.
Add attributes from draft-groves-clue-capture-attr-01.

5. Move some of the Media Capture attribute descriptions back
   into this document, but try to leave detailed syntax to the
   data model. Remove the OUTSOURCE sections, which are already
   incorporated into the data model document.

Changes from 08 to 09:

1. Use "document" instead of "memo".

2. Add basic call flow sequence diagram to introduction.

3. Add definitions for Advertisement and Configure messages.

4. Add definitions for Capture and Provider.

5. Update definition of Capture Scene.

6. Update definition of Individual Encoding.

7. Shorten definition of Media Capture and add key points in the
   Media Captures section.

8. Reword a bit about capture scenes in overview.

9. Reword about labeling Media Captures.

10. Remove the Consumer Capability message.

11. New example section heading for media provider behavior

12. Clarifications in the Capture Scene section.

13. Clarifications in the Simultaneous Transmission Set section.

14. Capitalize defined terms.

15. Move call flow example from introduction to overview section

16. General editorial cleanup

17. Add some editors' notes requesting input on issues

18. Summarize some sections, and propose details be outsourced
    to other documents.

Changes from 06 to 07:

1. Ticket #9. Rename Axis of Capture Point attribute to Point
   on Line of Capture. Clarify the description of this
   attribute.

2. Ticket #17. Add "capture encoding" definition. Use this new
   term throughout document as appropriate, replacing some usage
   of the terms "stream" and "encoding".

3. Ticket #18. Add Max Capture Encodings media capture
   attribute.

4. Add clarification that different capture scene entries are
   not necessarily mutually exclusive.

Changes from 05 to 06:

1.
Capture scene description attribute is a list of text strings,
   each in a different language, rather than just a single string.

2. Add new Axis of Capture Point attribute.

3. Remove appendices A.1 through A.6.

4. Clarify that the provider must use the same coordinate system
   with same scale and origin for all coordinates within the same
   capture scene.

Changes from 04 to 05:

1. Clarify limitations of "composed" attribute.

2. Add new section "capture scene entry attributes" and add the
   attribute "scene-switch-policy".

3. Add capture scene description attribute and description
   language attribute.

4. Editorial changes to examples section for consistency with the
   rest of the document.

Changes from 03 to 04:

1. Remove sentence from overview - "This constitutes a significant
   change ..."

2. Clarify a consumer can choose a subset of captures from a
   capture scene entry or a simultaneous set (in section "capture
   scene" and "consumer's choice...").

3. Reword first paragraph of Media Capture Attributes section.

4. Clarify a stereo audio capture is different from two mono audio
   captures (description of audio channel format attribute).

5. Clarify what it means when coordinate information is not
   specified for area of capture, point of capture, area of scene.

6. Change the term "producer" to "provider" to be consistent (it
   was just in two places).

7. Change name of "purpose" attribute to "content" and refer to
   RFC4796 for values.

8. Clarify simultaneous sets are part of a provider advertisement,
   and apply across all capture scenes in the advertisement.

9. Remove sentence about lip-sync between all media captures in a
   capture scene.

10.
Combine the concepts of "capture scene" and "capture set"
    into a single concept, using the term "capture scene" to
    replace the previous term "capture set", and eliminating the
    original separate capture scene concept.

17. Normative References

[I-D.ietf-clue-datachannel]
          Holmberg, C., "CLUE Protocol Data Channel",
          draft-ietf-clue-datachannel-05 (work in progress),
          November 2014.

[I-D.ietf-clue-data-model-schema]
          Presta, R., Romano, S P., "An XML Schema for the CLUE
          data model", draft-ietf-clue-data-model-schema-07 (work
          in progress), September 2014.

[I-D.ietf-clue-protocol]
          Presta, R. and S. Romano, "CLUE protocol",
          draft-ietf-clue-protocol-02 (work in progress),
          October 2014.

[I-D.ietf-clue-signaling]
          Kyzivat, P., Xiao, L., Groves, C., Hansen, R., "CLUE
          Signaling", draft-ietf-clue-signaling-04 (work in
          progress), October 2014.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G.,
          Johnston, A., Peterson, J., Sparks, R., Handley, M.,
          and E. Schooler, "SIP: Session Initiation Protocol",
          RFC 3261, June 2002.

[RFC3264] Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model
          with the Session Description Protocol (SDP)", RFC 3264,
          June 2002.

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
          Jacobson, "RTP: A Transport Protocol for Real-Time
          Applications", STD 64, RFC 3550, July 2003.

[RFC4579] Johnston, A., Levin, O., "SIP Call Control -
          Conferencing for User Agents", RFC 4579, August 2006.

18. Informative References

[I-D.ietf-clue-rtp-mapping]
          Even, R., Lennox, J., "Mapping RTP streams to CLUE media
          captures", draft-ietf-clue-rtp-mapping-03 (work in
          progress), October 2014.

[RFC4353] Rosenberg, J., "A Framework for Conferencing with the
          Session Initiation Protocol (SIP)", RFC 4353,
          February 2006.

[RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC
          5117, January 2008.

[RFC5646] Phillips, A., Davis, M., "Tags for Identifying
          Languages", RFC 5646, September 2009.

[RFC7205] Romanow, A., Botzko, S., Duckworth, M., Even, R.,
          "Use Cases for Telepresence Multistreams", RFC 7205,
          April 2014.

[RFC7262] Romanow, A., Botzko, S., Barnes, M., "Requirements
          for Telepresence Multistreams", RFC 7262, June 2014.

19. Authors' Addresses

Mark Duckworth (editor)
Polycom
Andover, MA 01810
USA

Email: mark.duckworth@polycom.com

Andrew Pepperell
Acano
Uxbridge, England
UK

Email: apeppere@gmail.com

Stephan Wenger
Vidyo, Inc.
433 Hackensack Ave.
Hackensack, N.J. 07601
USA

Email: stewe@stewe.org