idnits 2.17.1

draft-ietf-clue-framework-16.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document. If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1093 has weird spacing: '... switch betwe...'

  == Line 1910 has weird spacing: '...om left bot...'

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'SHOULD not' in this paragraph:

     A separate data channel is established to transport the CLUE protocol
     messages. The contents of the CLUE protocol messages are based on
     information introduced in this document, which is represented by an XML
     schema for this information defined in the CLUE data model [ref]. Some
     of the information which could possibly introduce privacy concerns is
     the xCard information as described in section x. In addition, the
     (text) description field in the Media Capture attribute (section
     7.1.1.7) could possibly reveal sensitive information or specific
     identities. The same would be true for the descriptions in the Capture
     Scene (section 7.3.1) and Capture Scene Entry (7.3.2) attributes.

     One other important consideration for the information in the xCard as
     well as the description field in the Media Capture and Capture Scene
     Entry attributes is that while the endpoints involved in the session
     have been authenticated, there is no assurance that the information in
     the xCard or description fields is authentic. Thus, this information
     SHOULD not be used to make any authorization decisions and the
     participants in the sessions SHOULD be made aware of this.

  -- The document date (June 27, 2014) is 3585 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC6351' is mentioned on line 844, but not defined

  == Missing Reference: 'RFC6350' is mentioned on line 855, but not defined

  == Missing Reference: 'RFC4566' is mentioned on line 1527, but not defined

  ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866)

  == Missing Reference: 'RFC 6503' is mentioned on line 2932, but not defined

  == Missing Reference: 'RFC 3261' is mentioned on line 2954, but not defined

  == Unused Reference: 'RFC4579' is defined on line 3291, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 5117
     (Obsoleted by RFC 7667)

     Summary: 1 error (**), 0 flaws (~~), 11 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    CLUE WG                                            M. Duckworth, Ed.
2    Internet Draft                                               Polycom
3    Intended status: Standards Track                        A. Pepperell
4    Expires: December 27, 2014                                     Acano
5                                                               S. Wenger
6                                                                   Vidyo
7                                                           June 27, 2014

9                  Framework for Telepresence Multi-Streams
10                     draft-ietf-clue-framework-16.txt

12   Abstract

14      This document defines a framework for a protocol to enable devices
15      in a telepresence conference to interoperate. The protocol enables
16      communication of information about multiple media streams so a
17      sending system and receiving system can make reasonable decisions
18      about transmitting, selecting and rendering the media streams.
19      This protocol is used in addition to SIP signaling for setting up a
20      telepresence session.

22   Status of this Memo

24      This Internet-Draft is submitted in full conformance with the
25      provisions of BCP 78 and BCP 79.

27      Internet-Drafts are working documents of the Internet Engineering
28      Task Force (IETF). Note that other groups may also distribute
29      working documents as Internet-Drafts. The list of current
30      Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

32      Internet-Drafts are draft documents valid for a maximum of six
33      months and may be updated, replaced, or obsoleted by other
34      documents at any time. It is inappropriate to use Internet-Drafts
35      as reference material or to cite them other than as "work in
36      progress."

38      This Internet-Draft will expire on December 27, 2014.

40   Copyright Notice

42      Copyright (c) 2013 IETF Trust and the persons identified as the
43      document authors. All rights reserved.

45      This document is subject to BCP 78 and the IETF Trust's Legal
46      Provisions Relating to IETF Documents
47      (http://trustee.ietf.org/license-info) in effect on the date of
48      publication of this document. Please review these documents
49      carefully, as they describe your rights and restrictions with
50      respect to this document. Code Components extracted from this
51      document must include Simplified BSD License text as described in
52      Section 4.e of the Trust Legal Provisions and are provided without
53      warranty as described in the Simplified BSD License.

55   Table of Contents

57      1. Introduction...................................................3
58      2. Terminology....................................................4
59      3. Definitions....................................................4
60      4. Overview & Motivation..........................................7
61      5. Overview of the Framework/Model................................9
62      6. Spatial Relationships.........................................14
63      7. Media Captures and Capture Scenes.............................16
64         7.1. Media Captures...........................................16
65            7.1.1. Media Capture Attributes............................17
66         7.2. Multiple Content Capture.................................23
67            7.2.1. MCC Attributes......................................24
68         7.3. Capture Scene............................................29
69            7.3.1. Capture Scene attributes............................31
70            7.3.2. Capture Scene Entry attributes......................32
71            7.3.3. Global Capture Scene Entry List.....................32
72      8. Simultaneous Transmission Set Constraints.....................33
73      9. Encodings.....................................................35
74         9.1. Individual Encodings.....................................35
75         9.2. Encoding Group...........................................36
76         9.3. Associating Captures with Encoding Groups................37
77      10. Consumer's Choice of Streams to Receive from the Provider....38
78         10.1. Local preference........................................41
79         10.2. Physical simultaneity restrictions......................41
80         10.3. Encoding and encoding group limits......................41
81      11. Extensibility................................................42
82      12. Examples - Using the Framework (Informative).................42
83         12.1. Provider Behavior.......................................42
84            12.1.1. Three screen Endpoint Provider.....................42
85            12.1.2. Encoding Group Example.............................49
86            12.1.3. The MCU Case.......................................50

88         12.2. Media Consumer Behavior.................................51
89            12.2.1. One screen Media Consumer..........................51
90            12.2.2. Two screen Media Consumer configuring the example..52
91            12.2.3. Three screen Media Consumer configuring the example53
92         12.3. Multipoint Conference utilizing Multiple Content Captures53
93            12.3.1. Single Media Captures and MCC in the same
94            Advertisement..............................................53
95            12.3.2. Several MCCs in the same Advertisement.............56
96            12.3.3. Heterogeneous conference with switching and
97            composition................................................58
98            12.3.4. Heterogeneous conference with voice activated
99            switching..................................................65
100     13. Acknowledgements.............................................67
101     14. IANA Considerations..........................................67
102     15. Security Considerations......................................68
103     16. Changes Since Last Version...................................69
104     17. Authors' Addresses...........................................76

106  1. Introduction

108     Current telepresence systems, though based on open standards such
109     as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
110     each other. A major factor limiting the interoperability of
111     telepresence systems is the lack of a standardized way to describe
112     and negotiate the use of the multiple streams of audio and video
113     comprising the media flows. This document provides a framework for
114     protocols to enable interoperability by handling multiple streams
115     in a standardized way. The framework is intended to support the
116     use cases described in Use Cases for Telepresence Multistreams
117     [RFC7205] and to meet the requirements in Requirements for
118     Telepresence Multistreams [RFC7262].

120     The basic session setup for the use cases is based on SIP [RFC3261]
121     and SDP offer/answer [RFC3264]. In addition to basic SIP & SDP
122     offer/answer, CLUE specific signaling is required to exchange the
123     information describing the multiple media streams. The motivation
124     for this framework, an overview of the signaling, and information
125     required to be exchanged are described in subsequent sections of
126     this document. The signaling details and data model are provided
127     in subsequent documents.

129  2. Terminology

131     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
132     "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
133     this document are to be interpreted as described in RFC 2119
134     [RFC2119].

136  3. Definitions

138     The terms defined below are used throughout this document and
139     companion documents and they are normative. In order to easily
140     identify the use of a defined term, those terms are capitalized.

142     Advertisement: a CLUE message a Media Provider sends to a Media
143     Consumer describing specific aspects of the content of the media,
144     and any restrictions it has in terms of being able to provide
145     certain Streams simultaneously.

147     Audio Capture: Media Capture for audio. Denoted as ACn in the
148     examples in this document.

150     Camera-Left and Right: For Media Captures, camera-left and camera-
151     right are from the point of view of a person observing the rendered
152     media. They are the opposite of Stage-Left and Stage-Right.

154     Capture: Same as Media Capture.

156     Capture Device: A device that converts audio and video input into
157     an electrical signal, in most cases to be fed into a media encoder.

159     Capture Encoding: A specific encoding of a Media Capture, to be
160     sent by a Media Provider to a Media Consumer via RTP.

162     Capture Scene: a structure representing a spatial region captured
163     by one or more Capture Devices, each capturing media representing a
164     portion of the region. The spatial region represented by a Capture
165     Scene may or may not correspond to a real region in physical space,
166     such as a room. A Capture Scene includes attributes and one or
167     more Capture Scene Entries, with each entry including one or more
168     Media Captures.

170     Capture Scene Entry (CSE): a list of Media Captures of the same
171     media type that together form one way to represent the entire
172     Capture Scene.

174     Conference: used as defined in [RFC4353], A Framework for
175     Conferencing within the Session Initiation Protocol (SIP).

177     Configure Message: A CLUE message a Media Consumer sends to a Media
178     Provider specifying which content and media streams it wants to
179     receive, based on the information in a corresponding Advertisement
180     message.

182     Consumer: short for Media Consumer.

184     Encoding or Individual Encoding: a set of parameters representing a
185     way to encode a Media Capture to become a Capture Encoding.

187     Encoding Group: A set of encoding parameters representing a total
188     media encoding capability to be sub-divided across potentially
189     multiple Individual Encodings.

191     Endpoint: The logical point of final termination through receiving,
192     decoding and rendering, and/or initiation through capturing,
193     encoding, and sending of media streams. An endpoint consists of
194     one or more physical devices which source and sink media streams,
195     and exactly one [RFC4353] Participant (which, in turn, includes
196     exactly one SIP User Agent). Endpoints can be anything from
197     multiscreen/multicamera rooms to handheld devices.

199     Front: the portion of the room closest to the cameras. In going
200     towards the back you move away from the cameras.

202     MCU: Multipoint Control Unit (MCU) - a device that connects two or
203     more endpoints together into one single multimedia conference
204     [RFC5117]. An MCU includes an [RFC4353]-like Mixer, without the
205     [RFC4353] requirement to send media to each participant.

207     Media: Any data that, after suitable encoding, can be conveyed over
208     RTP, including audio, video or timed text.

210     Media Capture: a source of Media, such as from one or more Capture
211     Devices or constructed from other Media streams.

213     Media Consumer: an Endpoint or middle box that receives Media
214     streams.
215     Media Provider: an Endpoint or middle box that sends Media streams.

217     Multiple Content Capture: A Capture for audio or video that
218     indicates that the Capture contains multiple audio or video
219     Captures. Single Media Captures may or may not be present in the
220     resultant Capture Encoding depending on time or space. Denoted as
221     MCCn in the example cases in this document.

223     Plane of Interest: The spatial plane containing the most relevant
224     subject matter.

226     Provider: Same as Media Provider.

228     Render: the process of generating a representation from media, such
229     as displayed motion video or sound emitted from loudspeakers.

231     Simultaneous Transmission Set: a set of Media Captures that can be
232     transmitted simultaneously from a Media Provider.

234     Single Media Capture: A capture which contains media from a single
235     source capture device, e.g. an audio capture from a single
236     microphone, a video capture from a single camera.

238     Spatial Relation: The arrangement in space of two objects, in
239     contrast to relation in time or other relationships. See also
240     Camera-Left and Right.

242     Stage-Left and Right: For Media Captures, Stage-left and Stage-
243     right are the opposite of Camera-left and Camera-right. For the
244     case of a person facing (and captured by) a camera, Stage-left and
245     Stage-right are from the point of view of that person.

247     Stream: a Capture Encoding sent from a Media Provider to a Media
248     Consumer via RTP [RFC3550].

250     Stream Characteristics: the media stream attributes commonly used
251     in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
252     resolution, profile/level etc.) as well as CLUE specific
253     attributes, such as the Capture ID or a spatial location.

255     Video Capture: Media Capture for video. Denoted as VCn in the
256     example cases in this document.

258     Video Composite: A single image that is formed, normally by an RTP
259     mixer inside an MCU, by combining visual elements from separate
260     sources.

262  4. Overview & Motivation

264     This section provides an overview of the functional elements
265     defined in this document to represent a telepresence system. The
266     motivations for the framework described in this document are also
267     provided.

269     Two key concepts introduced in this document are the terms "Media
270     Provider" and "Media Consumer". A Media Provider represents the
271     entity that sends the media and a Media Consumer represents the
272     entity that receives the media. A Media Provider provides Media in
273     the form of RTP packets; a Media Consumer consumes those RTP
274     packets. Media Providers and Media Consumers can reside in
275     Endpoints or in middleboxes such as Multipoint Control Units
276     (MCUs). A Media Provider in an Endpoint is usually associated
277     with the generation of media for Media Captures; these Media
278     Captures are typically sourced from cameras, microphones, and the
279     like. Similarly, the Media Consumer in an Endpoint is usually
280     associated with renderers, such as screens and loudspeakers. In
281     middleboxes, Media Providers and Consumers can have the form of
282     outputs and inputs, respectively, of RTP mixers, RTP translators,
283     and similar devices. Typically, telepresence devices such as
284     Endpoints and middleboxes would perform as both Media Providers
285     and Media Consumers, the former being concerned with those
286     devices' transmitted media and the latter with those devices'
287     received media. In a few circumstances, a CLUE Endpoint or middlebox
288     includes only Consumer or Provider functionality, such as
289     recorder-type Consumers or webcam-type Providers.

291     The motivations for the framework outlined in this document
292     include the following:

294     (1) Endpoints in telepresence systems typically have multiple Media
295     Capture and Media Render devices, e.g., multiple cameras and
296     screens. While previous system designs were able to set up calls
297     that would capture media using all cameras and display media on all
298     screens, for example, there was no mechanism that could associate
299     these Media Captures with each other in space and time.

301     (2) The mere fact that there are multiple capturing and rendering
302     devices, each of which may be configurable in aspects such as zoom,
303     leads to the difficulty that a variable number of such devices can
304     be used to capture different aspects of a region. The Capture
305     Scene concept allows for the description of multiple setups for
306     those multiple capture devices that could represent sensible
307     operation points of the physical capture devices in a room, chosen
308     by the operator. A Consumer can pick and choose from those
309     configurations based on its rendering abilities and inform the
310     Provider about its choices. Details are provided in section 7.

312     (3) In some cases, physical limitations or other reasons disallow
313     the concurrent use of a device in more than one setup. For
314     example, the center camera in a typical three-camera conference
315     room can set its zoom objective either to capture only the middle
316     few seats, or all seats of a room, but not both concurrently. The
317     Simultaneous Transmission Set concept allows a Provider to signal
318     such limitations. Simultaneous Transmission Sets are part of the
319     Capture Scene description, and discussed in section 8.

321     (4) Often, the devices in a room do not have the computational
322     complexity or connectivity to deal with multiple encoding options
323     simultaneously, even if each of these options is sensible in
324     certain scenarios, and even if the simultaneous transmission is
325     also sensible (i.e. in case of multicast media distribution to
326     multiple endpoints). Such constraints can be expressed by the
327     Provider using the Encoding Group concept, described in section 9.

329     (5) Due to the potentially large number of RTP flows required for a
330     Multimedia Conference involving potentially many Endpoints, each of
331     which can have many Media Captures and media renderers, it has
332     become common to multiplex multiple RTP media flows onto the same
333     transport address, so as to avoid using the port number as a
334     multiplexing point and the associated shortcomings such as
335     NAT/firewall traversal. While the actual mapping of those RTP
336     flows to the header fields of the RTP packets is not the subject of
337     this specification, the large number of possible permutations of
338     sensible options a Media Provider can make available to a Media
339     Consumer makes desirable a mechanism that narrows down the
340     number of possible options that a SIP offer-answer exchange has to
341     consider. Such information is made available using protocol
342     mechanisms specified in this document and companion documents,
343     although it should be stressed that its use in an implementation is
344     OPTIONAL. Also, there are aspects of the control of both Endpoints
345     and middleboxes/MCUs that dynamically change during the progress of
346     a call, such as audio-level based screen switching, layout changes,
347     and so on, which need to be conveyed. Note that these control
348     aspects are complementary to those specified in traditional SIP
349     based conference management such as BFCP. An exemplary call flow
350     can be found in section 5.

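As an informal illustration of the limitation described in item (3), the following Python sketch checks whether a requested group of Media Captures can be sent together, given the Simultaneous Transmission Sets a Provider has signaled. It is only a sketch under stated assumptions: the capture identifiers (VC0 to VC3), the name can_send_together, and the modeling of a Simultaneous Transmission Set as a frozenset of identifiers are hypothetical and are not taken from the CLUE data model.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical identifiers: VC0 and VC2 are the left and right cameras,
# VC1 is the center camera framed on the middle seats, and VC3 is the
# same center camera zoomed out to cover all seats.
SIMULTANEOUS_SETS: List[FrozenSet[str]] = [
    frozenset({"VC0", "VC1", "VC2"}),  # center camera zoomed in
    frozenset({"VC0", "VC3", "VC2"}),  # center camera zoomed out
]


def can_send_together(requested: Iterable[str],
                      simultaneous_sets: List[FrozenSet[str]]) -> bool:
    """Return True if all requested captures fit inside one signaled set."""
    wanted = frozenset(requested)
    return any(wanted <= allowed for allowed in simultaneous_sets)
```

In this sketch a request for VC1 and VC3 together is refused, because no single set contains both views of the center camera, while VC0 with either VC1 or VC3 is accepted.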
352     Finally, all this information needs to be conveyed, and the notion
353     of support for it needs to be established. This is done by the
354     negotiation of a "CLUE channel", a data channel negotiated early
355     during the initiation of a call. An Endpoint or MCU that rejects
356     the establishment of this data channel, by definition, does not
357     support CLUE based mechanisms, whereas an Endpoint or MCU that
358     accepts it is REQUIRED to use it to the extent specified in this
359     document and its companion documents.

361  5. Overview of the Framework/Model

363     The CLUE framework specifies how multiple media streams are to be
364     handled in a telepresence conference.

366     A Media Provider (transmitting Endpoint or MCU) describes specific
367     aspects of the content of the media and the media stream encodings
368     it can send in an Advertisement; and the Media Consumer responds to
369     the Media Provider by specifying which content and media streams it
370     wants to receive in a Configure message. The Provider then
371     transmits the asked-for content in the specified streams.

373     This Advertisement and Configure typically occur during call
374     initiation but MAY also happen at any time throughout the call,
375     whenever there is a change in what the Consumer wants to receive or
376     (perhaps less commonly) the Provider can send.

378     An Endpoint or MCU typically acts as both Provider and Consumer at
379     the same time, sending Advertisements and sending Configurations in
380     response to receiving Advertisements. (It is possible to be just
381     one or the other.)

383     The data model is based around two main concepts: a Capture and an
384     Encoding. A Media Capture (MC), such as audio or video, has
385     attributes to describe the content a Provider can send. Media
386     Captures are described in terms of CLUE-defined attributes, such as
387     spatial relationships and purpose of the capture. Providers tell
388     Consumers which Media Captures they can provide, described in terms
389     of the Media Capture attributes.

391     A Provider organizes its Media Captures into one or more Capture
392     Scenes, each representing a spatial region, such as a room. A
393     Consumer chooses which Media Captures it wants to receive from the
394     Capture Scenes.

396     In addition, the Provider can send the Consumer a description of
397     the Individual Encodings it can send in terms of identifiers which
398     relate to items in SDP.

400     The Provider can also specify constraints on its ability to provide
401     Media, and a sensible design choice for a Consumer is to take these
402     into account when choosing the content and Capture Encodings it
403     requests in the later offer-answer exchange. Some constraints are
404     due to the physical limitations of devices--for example, a camera
405     may not be able to provide zoom and non-zoom views simultaneously.
406     Other constraints are system based, such as maximum bandwidth.

408     The following diagram illustrates the information contained in an
409     Advertisement.

411     ...................................................................
412     . Provider Advertisement                  +--------------------+  .
413     .                                         | Simultaneous Sets  |  .
414     .       +------------------------+        +--------------------+  .
415     .       | Capture Scene N        |        +--------------------+  .
416     .     +-+----------------------+ |        | Global CSE List    |  .
417     .     | Capture Scene 2        | |        +--------------------+  .
418     .   +-+----------------------+ | |      +----------------------+  .
419     .   | Capture Scene 1        | | |      | Encoding Group N     |  .
420     .   |    +---------------+   | | |    +-+--------------------+ |  .
421     .   |    | Attributes    |   | | |    | Encoding Group 2     | |  .
422     .   |    +---------------+   | | |  +-+--------------------+ | |  .
423     .   |                        | | |  | Encoding Group 1     | | |  .
424     .   |    +----------------+  | | |  | parameters           | | |  .
425     .   |    | E n t r i e s  |  | | |  | bandwidth            | | |  .
426     .   |    | +---------+    |  | | |  | +-------------------+| | |  .
427     .   |    | |Attribute|    |  | | |  | | V i d e o         || | |  .
428     .   |    | +---------+    |  | | |  | | E n c o d i n g s || | |  .
429     .   |    |                |  | | |  | | Encoding 1        || | |  .
430     .   |    | Entry 1        |  | | |  | |                   || | |  .
431     .   |    | (list of MCs)  |  | |-+  | +-------------------+| | |  .
432     .   |    +----|-|--|------+  |-+    |                      | | |  .
433     .   +---------|-|--|---------+      | +-------------------+| | |  .
434     .             | |  |                | | A u d i o         || | |  .
435     .             | |  |                | | E n c o d i n g s || | |  .
436     .             v |  |                | | Encoding 1        || | |  .
437     .     +---------|--|--------+       | |                   || | |  .
438     .     | Media Capture N     |------>| +-------------------+| | |  .
439     .   +-+---------v--|------+ |       |                      | | |  .
440     .   | Media Capture 2     | |       |                      | |-+  .
441     . +-+--------------v----+ |-------->|                      | |    .
442     . | Media Capture 1     | | |       |                      |-+    .
443     . | +----------------+  |---------->|                      |      .
444     . | | Attributes     |  | |_+       +----------------------+      .
445     . | +----------------+  |_+                                       .
446     . +---------------------+                                         .
447     .                                                                 .
448     ...................................................................
449                   Figure 1: Advertisement Structure

451     A very brief outline of the call flow used by a simple system (two
452     Endpoints) in compliance with this document can be described as
453     follows, and as shown in the following figure.

455               +-----------+                     +-----------+
456               | Endpoint1 |                     | Endpoint2 |
457               +----+------+                     +-----+-----+
458                    | INVITE (BASIC SDP+CLUECHANNEL)   |
459                    |--------------------------------->|
460                    |    200 OK (BASIC SDP+CLUECHANNEL)|
461                    |<---------------------------------|
462                    | ACK                              |
463                    |--------------------------------->|
464                    |                                  |
465                    |<################################>|
466                    |     BASIC SDP MEDIA SESSION      |
467                    |<################################>|
468                    |                                  |
469                    |    CONNECT (CLUE CTRL CHANNEL)   |
470                    |=================================>|
471                    |            ...                   |
472                    |<================================>|
473                    |   CLUE CTRL CHANNEL ESTABLISHED  |
474                    |<================================>|
475                    |                                  |
476                    | ADVERTISEMENT 1                  |
477                    |*********************************>|
478                    |                  ADVERTISEMENT 2 |
479                    |<*********************************|
480                    |                                  |
481                    |                      CONFIGURE 1 |
482                    |<*********************************|
483                    | CONFIGURE 2                      |
484                    |*********************************>|
485                    |                                  |
486                    | REINVITE (UPDATED SDP)           |
487                    |--------------------------------->|
488                    |              200 OK (UPDATED SDP)|
489                    |<---------------------------------|
490                    | ACK                              |
491                    |--------------------------------->|
492                    |                                  |
493                    |<################################>|
494                    |     UPDATED SDP MEDIA SESSION    |
495                    |<################################>|
496                    |                                  |
497                    v                                  v

499                   Figure 2: Basic Information Flow

501     An initial offer/answer exchange establishes a basic media session,
502     for example audio-only, and a CLUE channel between two Endpoints.
503     With the establishment of that channel, the endpoints have
504     consented to use the CLUE protocol mechanisms and, therefore, MUST
505     adhere to the CLUE protocol suite as outlined herein.

507     Over this CLUE channel, the Provider in each Endpoint conveys its
508     characteristics and capabilities by sending an Advertisement as
509     specified herein. The Advertisement is typically not sufficient to
510     set up all media. The Consumer in the Endpoint receives the
511     information provided by the Provider, and can use it for two
512     purposes. First, it MUST construct and send a CLUE Configure
513     message to tell the Provider what the Consumer wishes to receive.
514     Second, it MAY, but is not necessarily REQUIRED to, use the
515     information provided to tailor the SDP it is going to send during
516     the following SIP offer/answer exchange, and its reaction to SDP it
517     receives in that step. It is often a sensible implementation
518     choice to do so, as the representation of the media information
519     conveyed over the CLUE channel can dramatically cut down on the
520     size of SDP messages used in the O/A exchange that follows.
521     Spatial relationships associated with the Media can be included in
522     the Advertisement, and it is often sensible for the Media Consumer
523     to take those spatial relationships into account when tailoring the
524     SDP.

526     This CLUE exchange MUST be followed by an SDP offer/answer exchange
527     that not only establishes those aspects of the media that have not
528     been "negotiated" over CLUE, but also has the side effect of
529     setting up the media transmission itself, potentially involving
530     security exchanges, ICE, and so on. This step is plain vanilla
531     SIP, with the exception that the SDP used herein, in most (but not
532     necessarily all) cases can be considerably smaller than the SDP a
533     system would typically need to exchange if there were no pre-
534     established knowledge about the Provider and Consumer
535     characteristics. (The need for cutting down SDP size is not quite
536     obvious for a point-to-point call involving simple endpoints;
537     however, when considering a large multipoint conference involving
538     many multi-screen/multi-camera endpoints, each of which can operate
539     using multiple codecs for each camera and microphone, it becomes
540     perhaps somewhat more intuitive.)

542     During the lifetime of a call, further exchanges MAY occur over the
543     CLUE channel. In some cases, those further exchanges lead to a
544     modified system behavior of Provider or Consumer (or both) without
545     any other protocol activity such as further offer/answer exchanges.
546     For example, voice-activated screen switching, signaled over the
547     CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
548     re-invites. However, in other cases, after the CLUE negotiation an
549     additional offer/answer exchange becomes necessary. For example,
550     if both sides decide to upgrade the call from a single screen to a
551     multi-screen call and more bandwidth is required for the additional
552     video channels compared to what was previously negotiated using
553     offer/answer, a new O/A exchange is REQUIRED.

555     One aspect of the protocol outlined herein and specified in more
556     detail in companion documents is that it makes available
557     information regarding the Provider's capabilities to deliver Media,
558     and attributes related to that Media such as their spatial
559     relationship, to the Consumer. The operation of the renderer
560     inside the Consumer is unspecified in that it can choose to ignore
561     some information provided by the Provider, and/or not render media
562     streams available from the Provider (although it MUST follow the
563     CLUE protocol and, therefore, MUST gracefully receive and respond
564     (through a Configure) to the Provider's information). All CLUE
565     protocol mechanisms are OPTIONAL in the Consumer in the sense that,
566     while the Consumer MUST be able to receive (and, potentially,
567     gracefully acknowledge) CLUE messages, it is free to ignore the
568     information provided therein.

570     A CLUE-implementing device interoperates with a device that does
571     not support CLUE, because the non-CLUE device does, by definition,
572     not understand the offer of a CLUE channel in the initial
573     offer/answer exchange and, therefore, will reject it. This
574     rejection MUST be used as the indication to the CLUE-implementing
575     device that the other side of the communication is not compliant
576     with CLUE, and to fall back to behavior that does not require CLUE.

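The fallback rule in the preceding paragraph can be sketched informally in Python. This is a minimal sketch under stated assumptions, not part of the framework: the AnsweredStream type, the "clue-channel" purpose label, and the function names are hypothetical, and the only protocol fact relied upon is the SDP offer/answer convention [RFC3264] that a stream rejected in an answer carries port zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnsweredStream:
    """One media description from the SDP answer (hypothetical model)."""
    purpose: str  # e.g. "audio", "video", "clue-channel" (labels are assumptions)
    port: int     # per RFC 3264, port 0 in an answer marks a rejected stream


def clue_supported(answer: List[AnsweredStream]) -> bool:
    """Return True only if the peer accepted the offered CLUE channel."""
    return any(s.purpose == "clue-channel" and s.port != 0 for s in answer)


def select_mode(answer: List[AnsweredStream]) -> str:
    """Use CLUE if the channel was accepted, otherwise fall back."""
    return "clue" if clue_supported(answer) else "fallback"
```

With this model, an answer whose CLUE channel entry has port 0 (or has no such entry at all) yields "fallback", so the basic media session negotiated in the initial offer/answer exchange continues without any CLUE messages.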
578 As for the media, Provider and Consumer have an end-to-end 579 communication relationship with respect to (RTP transported) media; 580 and the mechanisms described herein and in companion documents do 581 not change the aspects of setting up those RTP flows and sessions. 582 In other words, the RTP media sessions conform to the negotiated 583 SDP whether or not CLUE is used. 585 6. Spatial Relationships 587 In order for a Consumer to perform a proper rendering, it is often 588 necessary or at least helpful for the Consumer to have received 589 spatial information about the streams it is receiving. CLUE 590 defines a coordinate system that allows Media Providers to describe 591 the spatial relationships of their Media Captures to enable proper 592 scaling and spatially sensible rendering of their streams. The 593 coordinate system is based on a few principles: 595 o Simple systems which do not have multiple Media Captures to 596 associate spatially need not use the coordinate model. 598 o Coordinates can be either in real, physical units (millimeters), 599 have an unknown scale or have no physical scale. Systems which 600 know their physical dimensions (for example professionally 601 installed Telepresence room systems) MUST always provide those 602 real-world measurements. Systems which don't know specific 603 physical dimensions but still know relative distances MUST use 604 'unknown scale'. 'No scale' is intended to be used where Media 605 Captures from different devices (with potentially different 606 scales) will be forwarded alongside one another (e.g. in the 607 case of a middle box). 609 * "Millimeters" means the scale is in millimeters. 611 * "Unknown" means the scale is not necessarily millimeters, but 612 the scale is the same for every Capture in the Capture Scene. 
614 * "No Scale" means the scale could be different for each 615 capture- an MCU Provider that advertises two adjacent 616 captures and picks sources (which can change quickly) from 617 different endpoints might use this value; the scale could be 618 different and changing for each capture. But the areas of 619 capture still represent a spatial relation between captures. 621 o The coordinate system is Cartesian X, Y, Z with the origin at a 622 spatial location of the Provider's choosing. The Provider MUST 623 use the same coordinate system with the same scale and origin 624 for all coordinates within the same Capture Scene. 626 The direction of increasing coordinate values is: 627 X increases from Camera-Left to Camera-Right 628 Y increases from front to back 629 Z increases from low to high (i.e. floor to ceiling) 631 7. Media Captures and Capture Scenes 633 This section describes how Providers can describe the content of 634 media to Consumers. 636 7.1. Media Captures 638 Media Captures are the fundamental representations of streams that 639 a device can transmit. What a Media Capture actually represents is 640 flexible: 642 o It can represent the immediate output of a physical source (e.g. 643 camera, microphone) or 'synthetic' source (e.g. laptop computer, 644 DVD player). 646 o It can represent the output of an audio mixer or video composer 648 o It can represent a concept such as 'the loudest speaker' 650 o It can represent a conceptual position such as 'the leftmost 651 stream' 653 To identify and distinguish between multiple Capture instances 654 Captures have a unique identity. For instance: VC1, VC2 and AC1, 655 AC2, where VC1 and VC2 refer to two different video captures and 656 AC1 and AC2 refer to two different audio captures. 658 Some key points about Media Captures: 660 . A Media Capture is of a single media type (e.g. audio or 661 video) 662 . A Media Capture is defined in a Capture Scene and is given an 663 advertisement unique identity. 
The identity may be referenced 664 outside the Capture Scene that defines it through a Multiple 665 Content Capture (MCC) 666 . A Media Capture is associated with one or more Capture Scene 667 Entries 668 . A Media Capture has exactly one set of spatial information 669 . A Media Capture can be the source of one or more Capture 670 Encodings 672 Each Media Capture can be associated with attributes to describe 673 what it represents. 675 7.1.1. Media Capture Attributes 677 Media Capture Attributes describe information about the Captures. 678 A Provider can use the Media Capture Attributes to describe the 679 Captures for the benefit of the Consumer of the Advertisement 680 message. Media Capture Attributes include: 682 . Spatial information, such as point of capture, point on line 683 of capture, and area of capture, all of which, in combination 684 define the capture field of, for example, a camera; 685 . Capture multiplexing information (mono/stereo audio, maximum 686 number of simultaneous encodings per Capture and so on); and 687 . Other descriptive information to help the Consumer choose 688 between captures (description, presentation, view, priority, 689 language, person information and type). 690 . Control information for use inside the CLUE protocol suite. 692 The sub-sections below define the Capture attributes. 694 7.1.1.1. Point of Capture 696 The Point of Capture attribute is a field with a single Cartesian 697 (X, Y, Z) point value which describes the spatial location of the 698 capturing device (such as camera). For an Audio Capture with 699 multiple microphones, the Point of Capture defines the nominal mid- 700 point of the microphones. 702 7.1.1.2. Point on Line of Capture 704 The Point on Line of Capture attribute is a field with a single 705 Cartesian (X, Y, Z) point value which describes a position in space 706 of a second point on the axis of the capturing device; the first 707 point being the Point of Capture (see above). 
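As an illustration only, the two points defined in sections 7.1.1.1 and 7.1.1.2 can be turned into a direction of capture. The sketch below is not part of CLUE: the function name capture_axis and the representation of points as (X, Y, Z) tuples are assumptions made for this example.

```python
import math


def capture_axis(point_of_capture, point_on_line):
    """Return the unit vector from the Point of Capture toward the
    Point on Line of Capture (both given as (X, Y, Z) tuples)."""
    dx, dy, dz = (b - a for a, b in zip(point_of_capture, point_on_line))
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    if length == 0:
        # Identical points do not define an axis.
        raise ValueError("the two points must differ to define an axis")
    return (dx / length, dy / length, dz / length)


# Hypothetical camera at the origin looking along +Y (front to back).
axis = capture_axis((0, 0, 0), (0, 2000, 0))
```

Because only the direction matters here, the result is the same whether the coordinates are in millimeters or of unknown scale.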
709 Together, the Point of Capture and Point on Line of Capture define 710 an axis of the capturing device, for example the optical axis of a 711 camera or the axis of a microphone. The Media Consumer can use 712 this information to adjust how it renders the received media if it 713 so chooses. 715 For an Audio Capture, the Media Consumer can use this information 716 along with the Audio Capture Sensitivity Pattern to define a 3- 717 dimensional volume of capture where sounds can be expected to be 718 picked up by the microphone providing this specific audio capture. 719 If the Consumer wants to associate an Audio Capture with a Video 720 Capture, it can compare this volume with the area of capture for 721 video media to provide a check on whether the audio capture is 722 indeed spatially associated with the video capture. For example, a 723 video area of capture that fails to intersect at all with the audio 724 volume of capture, or is at such a long radial distance from the 725 microphone point of capture that the audio level would be very low, 726 would be inappropriate. 728 7.1.1.3. Area of Capture 730 The Area of Capture is a field with a set of four (X, Y, Z) points 731 as a value which describes the spatial location of what is being 732 "captured". By comparing the Area of Capture for different Media 733 Captures within the same Capture Scene a Consumer can determine the 734 spatial relationships between them and render them correctly. Area 735 of Capture does not apply to Audio Captures. 737 The four points MUST be co-planar, forming a quadrilateral, which 738 defines the Plane of Interest for the particular media capture. 740 If the Area of Capture is not specified, it means the Media Capture 741 is not spatially related to any other Media Capture. 743 For a switched capture that switches between different sections 744 within a larger area, the area of capture MUST use coordinates for 745 the larger potential area. 747 7.1.1.4. 
Mobility of Capture 749 The Mobility of Capture attribute indicates whether or not the 750 point of capture, point on line of capture, and area of capture 751 values stay the same over time, or are expected to change 752 (potentially frequently). Possible values are static, dynamic, and 753 highly dynamic. 755 An example of "dynamic" is a camera mounted on a stand which is 756 occasionally hand-carried and placed at different positions in 757 order to provide the best angle to capture a work task. A camera 758 worn by a person who moves around the room is an example of 759 "highly dynamic". In either case, the effect is that the capture 760 point, capture axis and area of capture change with time. 762 The capture point of a static capture MUST NOT move for the life of 763 the conference. The capture point of dynamic captures is 764 characterized by a change in position followed by a reasonable period 765 of stability, on the order of minutes. Highly dynamic 766 captures are characterized by a capture point that is constantly 767 moving. If the "area of capture", "capture point" and "line of 768 capture" attributes are included with dynamic or highly dynamic 769 captures they indicate spatial information at the time of the 770 Advertisement. 772 7.1.1.5. Audio Capture Sensitivity Pattern 774 The Audio Capture Sensitivity Pattern attribute applies only to 775 audio captures. This is an optional attribute. This attribute 776 gives information about the nominal sensitivity pattern of the 777 microphone which is the source of the capture. Possible values 778 include patterns such as omni, shotgun, cardioid, hyper-cardioid. 780 7.1.1.6. Max Capture Encodings 782 The Max Capture Encodings attribute is an optional attribute 783 indicating the maximum number of Capture Encodings that can be 784 simultaneously active for the Media Capture.
The number of 785 simultaneous Capture Encodings is also limited by the restrictions 786 of the Encoding Group for the Media Capture. 788 7.1.1.7. Description 790 The Description attribute is a human-readable description (which 791 could be in multiple languages) of the Capture. 793 7.1.1.8. Presentation 795 The Presentation attribute indicates that the capture originates 796 from a presentation device, that is one that provides supplementary 797 information to a conference through slides, video, still images, 798 data etc. Where more information is known about the capture it MAY 799 be expanded hierarchically to indicate the different types of 800 presentation media, e.g. presentation.slides, presentation.image 801 etc. 803 Note: It is expected that a number of keywords will be defined that 804 provide more detail on the type of presentation. 806 7.1.1.9. View 808 The View attribute is a field with enumerated values, indicating 809 what type of view the Capture relates to. The Consumer can use 810 this information to help choose which Media Captures it wishes to 811 receive. The value MUST be one of: 813 Room - Captures the entire scene 815 Table - Captures the conference table with seated people 817 Individual - Captures an individual person 819 Lectern - Captures the region of the lectern including the 820 presenter, for example in a classroom style conference room 822 Audience - Captures a region showing the audience in a classroom 823 style conference room 825 7.1.1.10. Language 827 The language attribute indicates one or more languages used in the 828 content of the Media Capture. Captures MAY be offered in different 829 languages in case of multilingual and/or accessible conferences. A 830 Consumer can use this attribute to differentiate between them and 831 pick the appropriate one. 833 Note that the Language attribute is defined and meaningful both for 834 audio and video captures. In case of audio captures, the meaning 835 is obvious. 
For a video capture, "Language" could, for example, be 836 sign interpretation or text. 838 7.1.1.11. Person Information 840 The person information attribute allows a Provider to provide 841 specific information regarding the people in a Capture (regardless 842 of whether or not the capture has a Presentation attribute). The 843 Provider may gather the information automatically or manually from 844 a variety of sources however the xCard [RFC6351] format is used to 845 convey the information. This allows various information such as 846 Identification information (section 6.2/[RFC6350]), Communication 847 Information (section 6.4/[RFC6350]) and Organizational information 848 (section 6.6/[RFC6350]) to be communicated. A Consumer may then 849 automatically (i.e. via a policy) or manually select Captures 850 based on information about who is in a Capture. It also allows a 851 Consumer to render information regarding the people participating 852 in the conference or to use it for further processing. 854 The Provider may supply a minimal set of information or a larger 855 set of information. However it MUST be compliant to [RFC6350] and 856 supply a "VERSION" and "FN" property. A Provider may supply 857 multiple xCards per Capture of any KIND (section 6.1.4/[RFC6350]). 859 In order to keep CLUE messages compact the Provider SHOULD use a 860 URI to point to any LOGO, PHOTO or SOUND contained in the xCARD 861 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 862 message. 864 7.1.1.12. Person Type 866 The person type attribute indicates the type of people contained in 867 the capture in the conference with respect to the meeting agenda 868 (regardless of whether or not the capture has a Presentation 869 attribute). As a capture may include multiple people the attribute 870 may contain multiple values. However values shall not be repeated 871 within the attribute. 
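The rule that Person Type values are not repeated can be checked mechanically. The following is a minimal sketch, not a normative procedure: the function name, the use of plain strings for values, and exact (case-sensitive) string comparison are assumptions of this example; the set of defined types is copied from the list given in this section.

```python
# Person types named in this section; anything else is treated here as a
# custom, meeting-specific role (an assumption of this sketch).
DEFINED_PERSON_TYPES = {
    "Chairman", "Vice-Chairman", "Minute Taker", "Member",
    "Presenter", "Translator", "Timekeeper",
}


def split_person_types(values):
    """Reject repeated values, then separate defined types from custom roles."""
    if len(set(values)) != len(values):
        raise ValueError("person type values must not be repeated")
    defined = [v for v in values if v in DEFINED_PERSON_TYPES]
    custom = [v for v in values if v not in DEFINED_PERSON_TYPES]
    return defined, custom


# "Sponsor" is a made-up custom role used only for illustration.
defined, custom = split_person_types(["Chairman", "Presenter", "Sponsor"])
```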
873 An Advertiser associates the person type with an individual capture 874 when it knows that a particular type is in the capture. If an 875 Advertiser cannot link a particular type with some certainty to a 876 capture then it is not included. A Consumer on reception of a 877 capture with a person type attribute knows with some certainty that 878 the capture contains that person type. The capture may contain 879 other person types but the Advertiser has not been able to 880 determine that this is the case. 882 The types of Captured people include: 884 . Chairman - the person responsible for running the conference 885 according to the agenda. 886 . Vice-Chairman - the person responsible for assisting the 887 chairman in running the meeting. 888 . Minute Taker - the person responsible for recording the 889 minutes of the conference. 890 . Member - the person has no particular responsibilities with 891 respect to running the meeting. 892 . Presenter - the person is scheduled on the agenda to make a 893 presentation in the meeting. Note: This is not related to any 894 "active speaker" functionality. 896 . Translator - the person is providing some form of translation 897 or commentary in the meeting. 898 . Timekeeper - the person is responsible for maintaining the 899 meeting schedule. 901 Furthermore the person type attribute may contain one or more 902 strings allowing the Provider to indicate custom meeting-specific 903 roles. 905 7.1.1.13. Priority 907 The priority attribute indicates a relative priority between 908 different Media Captures. The Provider sets this priority, and the 909 Consumer MAY use the priority to help decide which captures it 910 wishes to receive. 912 The "priority" attribute is an integer which indicates a relative 913 priority between Captures. For example it is possible to assign a 914 priority between two presentation Captures that would allow a 915 remote endpoint to determine which presentation is more important.
916 Priority is assigned at the individual capture level. It represents 917 the Provider's view of the relative priority between Captures with 918 a priority. The same priority number MAY be used across multiple 919 Captures. It indicates they are equally important. If no priority 920 is assigned no assumptions regarding the relative importance of the 921 Capture can be made. 923 7.1.1.14. Embedded Text 925 The Embedded Text attribute indicates that a Capture provides 926 embedded textual information. For example the video Capture MAY 927 contain speech-to-text information composed with the video image. 928 This attribute is only applicable to video Captures and 929 presentation streams with visual information. 931 7.1.1.15. Related To 933 The Related To attribute indicates the Capture contains additional 934 complementary information related to another Capture. The value 935 indicates the identity of the other Capture to which this Capture 936 is providing additional information. 938 For example, a conference can utilize translators or facilitators 939 that provide an additional audio stream (i.e. a translation or 940 description or commentary of the conference). Where multiple 941 captures are available, it may be advantageous for a Consumer to 942 select a complementary Capture instead of or in addition to a 943 Capture it relates to. 945 7.2. Multiple Content Capture 947 The MCC indicates that one or more Single Media Captures are 948 contained in one Media Capture. Only one Capture type (i.e. audio, 949 video, etc.) is allowed in each MCC instance. The MCC may contain 950 a reference to the Single Media Captures (which may have their own 951 attributes) as well as attributes associated with the MCC itself. 952 A MCC may also contain other MCCs. The MCC MAY reference Captures 953 from within the Capture Scene that defines it or from other Capture 954 Scenes. No ordering is implied by the order that Captures appear 955 within a MCC.
A MCC MAY contain no references to other Captures to 956 indicate that the MCC contains content from multiple sources but no 957 information regarding those sources is given. 959 One or more MCCs may also be specified in a CSE. This allows an 960 Advertiser to indicate that several MCC captures are used to 961 represent a capture scene. Table 14 provides an example of this 962 case. 964 As outlined in section 7.1, each instance of the MCC has its own 965 Capture identity, e.g. MCC1. It allows all the individual captures 966 contained in the MCC to be referenced by a single MCC identity. 968 The example below shows the use of a Multiple Content Capture:
970 +-----------------------+---------------------------------+
971 | Capture Scene #1      |                                 |
972 +-----------------------|---------------------------------+
973 | VC1                   | {attributes}                    |
974 | VC2                   | {attributes}                    |
975 | VCn                   | {attributes}                    |
976 | MCC1(VC1,VC2,...VCn)  | {attributes}                    |
977 | CSE(MCC1)             |                                 |
978 +---------------------------------------------------------+
980 Table 1: Multiple Content Capture concept 982 This indicates that MCC1 is a single capture that contains the 983 Captures VC1, VC2, ... VCn according to any MCC1 attributes. 985 7.2.1. MCC Attributes 987 Attributes may be associated with the MCC instance and the Single 988 Media Captures that the MCC references. A Provider should avoid 989 providing conflicting attribute values between the MCC and Single 990 Media Captures. Where there is conflict the attributes of the MCC 991 override any that may be present in the individual captures. 993 A Provider MAY include as much or as little of the original source 994 Capture information as it requires. 996 There are MCC-specific attributes that MUST only be used with 997 Multiple Content Captures. These are described in the sections 998 below. The attributes described in section 7.1.1 MAY also be used 999 with MCCs.
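One way a Consumer might apply the precedence rule stated above (MCC attributes override conflicting attributes of the individual captures) is sketched here. This is an illustration under assumptions: attributes are modelled as plain dictionaries, and the function names and sample values are hypothetical rather than taken from the CLUE data model.

```python
def effective_attributes(capture_attrs, mcc_attrs):
    """Merge attributes so that values set on the MCC take precedence."""
    merged = dict(capture_attrs)   # start from the Single Media Capture
    merged.update(mcc_attrs)       # MCC values win on conflict
    return merged


def conflicting_keys(capture_attrs, mcc_attrs):
    """List attribute names set to different values on both sides."""
    return sorted(
        key for key in capture_attrs
        if key in mcc_attrs and capture_attrs[key] != mcc_attrs[key]
    )


# Hypothetical attribute values for illustration only.
vc1 = {"Description": "Left", "View": "Individual"}
mcc1 = {"Description": "Composite", "MaxCaptures": 2}

effective = effective_attributes(vc1, mcc1)
conflicts = conflicting_keys(vc1, mcc1)
```

Reporting the conflicting names separately lets an implementation log the situation that a Provider is asked to avoid, while still resolving it in favour of the MCC.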
1001 The spatially related attributes of an MCC indicate its area of 1002 capture and point of capture within the scene, just like any other 1003 media capture. The spatial information does not imply anything 1004 about how other captures are composed within an MCC. 1006 For example: A virtual scene could be constructed for the MCC 1007 capture with two Video Captures with a "MaxCaptures" attribute set 1008 to 2 and an "Area of Capture" attribute provided with an overall 1009 area. Each of the individual Captures could then also include an 1010 "Area of Capture" attribute with a subset of the overall area. 1011 The Consumer would then know how each capture is related to others 1012 within the scene, but not the relative position of the individual 1013 captures within the composed capture.
1015 +-----------------------+---------------------------------+
1016 | Capture Scene #1      |                                 |
1017 +-----------------------|---------------------------------+
1018 | VC1                   | AreaofCapture=(0,0,0)(9,0,0)    |
1019 |                       |   (0,0,9)(9,0,9)                |
1020 | VC2                   | AreaofCapture=(10,0,0)(19,0,0)  |
1021 |                       |   (10,0,9)(19,0,9)              |
1022 | MCC1(VC1,VC2)         | MaxCaptures=2                   |
1023 |                       | AreaofCapture=(0,0,0)(19,0,0)   |
1024 |                       |   (0,0,9)(19,0,9)               |
1025 | CSE(MCC1)             |                                 |
1026 +---------------------------------------------------------+
1028 Table 2: Example of MCC and Single Media Capture attributes 1030 The sections below describe the MCC-only attributes. 1032 7.2.1.1. Maximum Number of Captures within a MCC 1034 The Maximum Number of Captures MCC attribute indicates the maximum 1035 number of individual captures that may appear in a Capture Encoding 1036 at a time. The actual number at any given time can be less than 1037 this maximum. It may be used to derive how the Single Media 1038 Captures within the MCC are composed / switched with regard to 1039 space and time.
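How a Consumer might derive the composed / switched behaviour from MaxCaptures can be sketched as follows, using the cases enumerated in the remainder of this section. This is a non-normative illustration: the function name, its parameters, and the decision to raise an error when MaxCaptures exceeds the number of captures are assumptions of this sketch.

```python
def classify_mcc(relation, max_captures, num_captures=None):
    """Classify how sources appear in an MCC capture encoding.

    relation     -- "=" or "<=", as advertised with MaxCaptures
    max_captures -- the MaxCaptures value (N)
    num_captures -- number of captures referenced by the MCC, or None
                    when the MCC does not specify its sources
    """
    if relation not in ("=", "<="):
        raise ValueError("relation must be '=' or '<='")
    if max_captures < 1:
        raise ValueError("MaxCaptures must be at least 1")
    if num_captures is not None and max_captures > num_captures:
        # Assumption: treat a value above the number of captures as an error.
        raise ValueError("MaxCaptures exceeds the number of captures")
    if num_captures is not None and max_captures == num_captures:
        # MaxCaptures matches the number of captures in the MCC.
        return "composed" if relation == "=" else "switched and composed"
    # Fewer slots than captures, or the number of captures is unspecified.
    if max_captures == 1:
        return "switched"
    return "switched and composed"
```

For instance, classify_mcc("=", 1, 3) describes an MCC that switches between three sources one at a time, while classify_mcc("=", 3, 3) describes a pure composition of all three.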
1041 A Provider can indicate that the number of captures in a MCC 1042 capture encoding is equal to ("=") the MaxCaptures value or that 1043 there may be any number of captures up to and including ("<=") the 1044 MaxCaptures value. This allows a Provider to distinguish between a 1045 MCC that purely represents a composition of sources and a MCC 1046 that represents switched or switched and composed sources. 1048 MaxCaptures MAY be set to one so that only content related to one 1049 of the sources is shown in the MCC Capture Encoding at a time or 1050 it may be set to any value up to the total number of Source Media 1051 Captures in the MCC. 1053 The bullets below describe how the setting of MaxCaptures versus the 1054 number of captures in the MCC affects how sources appear in a 1055 capture encoding: 1057 . When MaxCaptures is set to <= 1 and the number of captures in 1058 the MCC is greater than 1 (or not specified) this 1059 is a switched case. Zero or 1 captures may be switched into 1060 the capture encoding. Note: zero is allowed because of the 1061 "<=". 1062 . When MaxCaptures is set to = 1 and the number of captures in 1063 the MCC is greater than 1 (or not specified) this 1064 is a switched case. Only one capture source is contained in a 1065 capture encoding at a time. 1066 . When MaxCaptures is set to <= N (with N > 1) and the number of 1067 captures in the MCC is greater than N (or not specified) this 1068 is a switched and composed case. The capture encoding may 1069 contain purely switched sources (e.g. <=2 allows for 1 source 1070 on its own), or may contain composed and switched sources 1071 (e.g. a composition of 2 sources switched between the 1072 sources). 1073 . When MaxCaptures is set to = N (with N > 1) and the number of 1074 captures in the MCC is greater than N (or not specified) this 1075 is a switched and composed case. The capture encoding contains 1076 composed and switched sources (i.e.
a composition of N sources 1077 switched between the sources). It is not possible to have a 1078 single source. 1079 . When MaxCaptures is set to <= the number of captures in the 1080 MCC this is a switched and composed case. The capture encoding 1081 may contain media switched between any number (up to the 1082 MaxCaptures) of composed sources. 1083 . When MaxCaptures is set to = the number of captures in the 1084 MCC this is a composed case. All the sources are composed into 1085 a single capture encoding. 1087 If this attribute is not set then by default it is assumed that all 1088 source content can appear concurrently in the Capture Encoding 1089 associated with the MCC. 1091 For example: The use of MaxCaptures equal to 1 on a MCC with three 1092 Video Captures VC1, VC2 and VC3 would indicate that the Advertiser 1093 in the capture encoding would switch between VC1, VC2 and VC3 as 1094 there may be only a maximum of one capture at a time. 1096 7.2.1.2. Policy 1098 The Policy MCC Attribute indicates the criteria that the Provider 1099 uses to determine when and/or where media content appears in the 1100 Capture Encoding related to the MCC. 1102 The attribute is in the form of a token that indicates the policy 1103 and an index representing an instance of the policy. 1105 The tokens are: 1107 SoundLevel - This indicates that the content of the MCC is 1108 determined by a sound level detection algorithm. For example: the 1109 loudest (active) speaker is contained in the MCC. 1111 RoundRobin - This indicates that the content of the MCC is 1112 determined by a time-based algorithm. For example: the Provider 1113 provides content from a particular source for a period of time and 1114 then provides content from another source and so on. 1116 An index is used to represent an instance in the policy setting. An 1117 index of 0 represents the most current instance of the policy, i.e. 1118 the active speaker, 1 represents the previous instance, i.e.
the 1119 previous active speaker and so on. 1121 The following example shows a case where the Provider provides two 1122 media streams, one showing the active speaker and a second stream 1123 showing the previous speaker.
1125 +-----------------------+---------------------------------+
1126 | Capture Scene #1      |                                 |
1127 +-----------------------|---------------------------------+
1128 | VC1                   |                                 |
1129 | VC2                   |                                 |
1130 | MCC1(VC1,VC2)         | Policy=SoundLevel:0             |
1131 |                       | MaxCaptures=1                   |
1132 | MCC2(VC1,VC2)         | Policy=SoundLevel:1             |
1133 |                       | MaxCaptures=1                   |
1134 | CSE(MCC1,MCC2)        |                                 |
1135 +---------------------------------------------------------+
1137 Table 3: Example Policy MCC attribute usage 1139 7.2.1.3. Synchronisation Identity 1141 The Synchronisation Identity MCC attribute indicates how the 1142 individual captures in multiple MCC captures are synchronised. To 1143 indicate that the Capture Encodings associated with MCCs contain 1144 captures from the same source at the same time a Provider should set the 1145 same Synchronisation Identity on each of the concerned MCCs. It is 1146 the Provider that determines what the source for the Captures is, 1147 so a Provider can choose how to group together Single Media 1148 Captures for the purpose of keeping them synchronised according to 1149 the SynchronisationID attribute. For example when the Provider is 1150 in an MCU it may determine that each separate CLUE Endpoint is a 1151 remote source of media. The Synchronisation Identity may be used 1152 across media types, i.e. to synchronise audio and video related 1153 MCCs. 1155 Without this attribute it is assumed that multiple MCCs may provide 1156 content from different sources at any particular point in time.
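A Consumer that wants to know which MCCs will always carry content from the same source could group them by Synchronisation Identity. The sketch below is illustrative only: the dictionary layout and the function name are assumptions, and the sample data mirrors the MCC rows of the example in Table 4.

```python
from collections import defaultdict


def group_by_sync_id(mccs):
    """Group MCC identities by SynchronisationID.

    Returns (groups, unsynchronised): a dict mapping each identity value
    to the MCCs that share it, and a list of MCCs without the attribute.
    """
    groups = defaultdict(list)
    unsynchronised = []
    for name, attrs in mccs.items():
        sync_id = attrs.get("SynchronisationID")
        if sync_id is None:
            unsynchronised.append(name)
        else:
            groups[sync_id].append(name)
    return dict(groups), unsynchronised


# MCC attributes as advertised in the Table 4 example.
mccs = {
    "MCC1": {"SynchronisationID": 1, "MaxCaptures": 1},
    "MCC2": {"SynchronisationID": 1, "MaxCaptures": 1},
    "MCC3": {"MaxCaptures": 1},
    "MCC4": {"SynchronisationID": 1, "MaxCaptures": 1},
}

groups, unsynchronised = group_by_sync_id(mccs)
```

MCCs that end up in the same group are expected to show the same source at any given time; an MCC in the unsynchronised list may show a different source.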
1158 For example:
1160 +=======================+=================================+
1161 | Capture Scene #1      |                                 |
1162 +-----------------------|---------------------------------+
1163 | VC1                   | Description=Left                |
1164 | VC2                   | Description=Centre              |
1165 | VC3                   | Description=Right               |
1166 | AC1                   | Description=room                |
1167 | CSE(VC1,VC2,VC3)      |                                 |
1168 | CSE(AC1)              |                                 |
1169 +=======================+=================================+
1170 | Capture Scene #2      |                                 |
1171 +-----------------------|---------------------------------+
1172 | VC4                   | Description=Left                |
1173 | VC5                   | Description=Centre              |
1174 | VC6                   | Description=Right               |
1175 | AC2                   | Description=room                |
1176 | CSE(VC4,VC5,VC6)      |                                 |
1177 | CSE(AC2)              |                                 |
1178 +=======================+=================================+
1179 | Capture Scene #3      |                                 |
1180 +-----------------------|---------------------------------+
1181 | VC7                   |                                 |
1182 | AC3                   |                                 |
1183 +=======================+=================================+
1184 | Capture Scene #4      |                                 |
1185 +-----------------------|---------------------------------+
1186 | VC8                   |                                 |
1187 | AC4                   |                                 |
1188 +=======================+=================================+
1189 | Capture Scene #5      |                                 |
1190 +-----------------------|---------------------------------+
1191 | MCC1(VC1,VC4,VC7)     | SynchronisationID=1             |
1192 |                       | MaxCaptures=1                   |
1193 | MCC2(VC2,VC5,VC8)     | SynchronisationID=1             |
1194 |                       | MaxCaptures=1                   |
1195 | MCC3(VC3,VC6)         | MaxCaptures=1                   |
1196 | MCC4(AC1,AC2,AC3,AC4) | SynchronisationID=1             |
1197 |                       | MaxCaptures=1                   |
1198 | CSE(MCC1,MCC2,MCC3)   |                                 |
1199 | CSE(MCC4)             |                                 |
1200 +=======================+=================================+
1202 Table 4: Example Synchronisation Identity MCC attribute usage 1204 The above Advertisement would indicate that MCC1, MCC2, MCC3 and 1205 MCC4 make up a Capture Scene. There would be four capture 1206 encodings (one for each MCC).
Because MCC1 and MCC2 have the same 1207 SynchronisationID, each encoding from MCC1 and MCC2 respectively 1208 would together have content from only Capture Scene 1 or only 1209 Capture Scene 2 or the combination of VC7 and VC8 at a particular 1210 point in time. In this case the Provider has decided the sources 1211 to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 1212 together. The encoding from MCC3 would not be synchronised with 1213 MCC1 or MCC2. As MCC4 also has the same Synchronisation Identity 1214 as MCC1 and MCC2 the content of the audio encoding will be 1215 synchronised with the video content. 1217 7.3. Capture Scene 1219 In order for a Provider's individual Captures to be used 1220 effectively by a Consumer, the Provider organizes the Captures into 1221 one or more Capture Scenes, with the structure and contents of 1222 these Capture Scenes being sent from the Provider to the Consumer 1223 in the Advertisement. 1225 A Capture Scene is a structure representing a spatial region 1226 containing one or more Capture Devices, each capturing media 1227 representing a portion of the region. A Capture Scene includes one 1228 or more Capture Scene entries, with each entry including one or 1229 more Media Captures. A Capture Scene represents, for example, the 1230 video image of a group of people seated next to each other, along 1231 with the sound of their voices, which could be represented by some 1232 number of VCs and ACs in the Capture Scene Entries. A middle box 1233 can also describe in Capture Scenes what it constructs from media 1234 Streams it receives. 1236 A Provider MAY advertise one or more Capture Scenes. What 1237 constitutes an entire Capture Scene is up to the Provider. A 1238 simple Provider might typically use one Capture Scene for 1239 participant media (live video from the room cameras) and another 1240 Capture Scene for a computer generated presentation. 
In more 1241 complex systems, the use of additional Capture Scenes is also 1242 sensible. For example, a classroom may advertise two Capture 1243 Scenes involving live video, one including only the camera 1244 capturing the instructor (and associated audio), the other 1245 including camera(s) capturing students (and associated audio). 1247 A Capture Scene MAY (and typically will) include more than one type 1248 of media. For example, a Capture Scene can include several Capture 1249 Scene Entries for Video Captures, and several Capture Scene Entries 1250 for Audio Captures. A particular Capture MAY be included in more 1251 than one Capture Scene Entry. 1253 A Provider MAY express spatial relationships between Captures that 1254 are included in the same Capture Scene. However, there is not 1255 necessarily the same spatial relationship between Media Captures 1256 that are in different Capture Scenes. In other words, Capture 1257 Scenes can use their own spatial measurement system as outlined 1258 above in section 6. 1260 A Provider arranges Captures in a Capture Scene to help the 1261 Consumer choose which captures it wants to render. The Capture 1262 Scene Entries in a Capture Scene are different alternatives the 1263 Provider is suggesting for representing the Capture Scene. Each 1264 Capture Scene Entry is given an advertisement unique identity. The 1265 order of Capture Scene Entries within a Capture Scene has no 1266 significance. The Media Consumer can choose to receive all Media 1267 Captures from one Capture Scene Entry for each media type (e.g. 1268 audio and video), or it can pick and choose Media Captures 1269 regardless of how the Provider arranges them in Capture Scene 1270 Entries. Different Capture Scene Entries of the same media type 1271 are not necessarily mutually exclusive alternatives. 
Also note 1272 that the presence of multiple Capture Scene Entries (with 1273 potentially multiple encoding options in each entry) in a given 1274 Capture Scene does not necessarily imply that a Provider is able to 1275 serve all the associated media simultaneously (although the 1276 construction of such an over-rich Capture Scene is probably not 1277 sensible in many cases). What a Provider can send simultaneously 1278 is determined through the Simultaneous Transmission Set mechanism, 1279 described in section 8. 1281 Captures within the same Capture Scene entry MUST be of the same 1282 media type - it is not possible to mix audio and video captures in 1283 the same Capture Scene Entry, for instance. The Provider MUST be 1284 capable of encoding and sending all Captures (that have an encoding 1285 group) in a single Capture Scene Entry simultaneously. The order 1286 of Captures within a Capture Scene Entry has no significance. A 1287 Consumer can decide to receive all the Captures in a single Capture 1288 Scene Entry, but a Consumer could also decide to receive just a 1289 subset of those captures. A Consumer can also decide to receive 1290 Captures from different Capture Scene Entries, all subject to the 1291 constraints set by Simultaneous Transmission Sets, as discussed in 1292 section 8. 1294 When a Provider advertises a Capture Scene with multiple entries, 1295 it is essentially signaling that there are multiple representations 1296 of the same Capture Scene available. In some cases, these multiple 1297 representations would typically be used simultaneously (for 1298 instance a "video entry" and an "audio entry"). In some cases the 1299 entries would conceptually be alternatives (for instance an entry 1300 consisting of three Video Captures covering the whole room versus 1301 an entry consisting of just a single Video Capture covering only 1302 the center of a room). 
In this latter example, one sensible choice 1303 for a Consumer would be to indicate (through its Configure and 1304 possibly through an additional offer/answer exchange) the Captures 1305 of that Capture Scene Entry that most closely matched the 1306 Consumer's number of display devices or screen layout. 1308 The following is an example of 4 potential Capture Scene Entries 1309 for an endpoint-style Provider: 1311 1. (VC0, VC1, VC2) - left, center and right camera Video Captures 1313 2. (VC3) - Video Capture associated with loudest room segment 1315 3. (VC4) - Video Capture zoomed out view of all people in the room 1317 4. (AC0) - main audio 1319 The first entry in this Capture Scene example is a list of Video 1320 Captures which have a spatial relationship to each other. 1321 Determination of the order of these captures (VC0, VC1 and VC2) for 1322 rendering purposes is accomplished through use of their Area of 1323 Capture attributes. The second entry (VC3) and the third entry 1324 (VC4) are alternative representations of the same room's video, 1325 which might be better suited to some Consumers' rendering 1326 capabilities. The inclusion of the Audio Capture in the same 1327 Capture Scene indicates that AC0 is associated with all of those 1328 Video Captures, meaning it comes from the same spatial region. 1329 Therefore, if audio were to be rendered at all, this audio would be 1330 the correct choice irrespective of which Video Captures were 1331 chosen. 1333 7.3.1. Capture Scene attributes 1335 Capture Scene Attributes can be applied to Capture Scenes as well 1336 as to individual media captures. Attributes specified at this 1337 level apply to all constituent Captures. Capture Scene attributes 1338 include 1340 . Human-readable description of the Capture Scene, which could 1341 be in multiple languages; 1342 . xCard scene information 1343 . Scale information (millimeters, unknown, no scale), as 1344 described in Section 6. 1346 7.3.1.1. 
Scene Information 1348 The Scene information attribute provides information regarding the 1349 Capture Scene rather than individual participants. The Provider 1350 may gather the information automatically or manually from a 1351 variety of sources. The scene information attribute allows a 1352 Provider to indicate information such as: organizational or 1353 geographic information allowing a Consumer to determine which 1354 Capture Scenes are of interest in order to then perform Capture 1355 selection. It also allows a Consumer to render information 1356 regarding the Scene or to use it for further processing. 1358 As per 7.1.1.11. the xCard format is used to convey this 1359 information and the Provider may supply a minimal set of 1360 information or a larger set of information. 1362 In order to keep CLUE messages compact the Provider SHOULD use a 1363 URI to point to any LOGO, PHOTO or SOUND contained in the xCARD 1364 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 1365 message. 1367 7.3.2. Capture Scene Entry attributes 1369 A Capture Scene can include one or more Capture Scene Entries in 1370 addition to the Capture Scene wide attributes described above. 1371 Capture Scene Entry attributes apply to the Capture Scene Entry as 1372 a whole, i.e. to all Captures that are part of the Capture Scene 1373 Entry. 1375 Capture Scene Entry attributes include: 1377 . Human-readable description (which could be in multiple 1378 languages) of the Capture Scene Entry 1380 7.3.3. Global Capture Scene Entry List 1382 An Advertisement can include an optional global Capture Scene 1383 Entry list. Each item in this list is a set of one or more 1384 Capture Scene Entries of the same media type. Each set of CSEs in 1385 the list is a suggestion from the Provider to the Consumer for 1386 which CSEs provide a complete representation of the simultaneous 1387 captures provided by the Provider, across multiple scenes. 
The 1388 Provider can include multiple sets, to allow a Consumer to choose 1389 sets of captures appropriate to its capabilities or application. 1390 The choice of how to make these suggestions in the Global CSE list 1391 for what represents all the scenes for which the Provider can send 1392 media is up to the Provider. This is very similar to how each CSE 1393 represents a particular scene. 1395 As an example, suppose an advertisement has three scenes, and each 1396 scene has three CSEs, ranging from one to three video captures in 1397 each CSE. The Provider is advertising a total of nine video 1398 Captures across three scenes. The Provider can use the Global CSE 1399 list to suggest alternatives for Consumers that can't receive all 1400 nine video Captures as separate media streams. For accommodating 1401 a Consumer that wants to receive three video Captures, a Provider 1402 might suggest a single CSE with three Captures and nothing from 1403 the other two scenes. Or a Provider might suggest three different 1404 CSEs, one from each scene, with a single video Capture in each. 1406 Some additional rules: 1408 . The ordering of items (sets of CSEs) in the global CSE list 1409 is not important. 1410 . The ordering of CSEs within each set is not important. 1411 . A particular CSE may be used in multiple sets. 1412 . The Provider must be capable of encoding and sending all 1413 Captures within the CSEs of a given set simultaneously. 1415 8. Simultaneous Transmission Set Constraints 1417 In many practical cases, a Provider has constraints or limitations 1418 on its ability to send Captures simultaneously. One type of 1419 limitation is caused by the physical limitations of capture 1420 mechanisms; these constraints are represented by a simultaneous 1421 transmission set. The second type of limitation reflects the 1422 encoding resources available, such as bandwidth or video encoding 1423 throughput (macroblocks/second). 
This type of constraint is 1424 captured by encoding groups, discussed below. 1426 Some Endpoints or MCUs can send multiple Captures simultaneously; 1427 however sometimes there are constraints that limit which Captures 1428 can be sent simultaneously with other Captures. A device may not 1429 be able to be used in different ways at the same time. Provider 1430 Advertisements are made so that the Consumer can choose one of 1431 several possible mutually exclusive usages of the device. This 1432 type of constraint is expressed in a Simultaneous Transmission Set, 1433 which lists all the Captures of a particular media type (e.g. 1434 audio, video, text) that can be sent at the same time. There are 1435 different Simultaneous Transmission Sets for each media type in the 1436 Advertisement. This is easier to show in an example. 1438 Consider the example of a room system where there are three cameras, 1439 each of which can send a separate capture covering two persons: 1440 VC0, VC1, VC2. The middle camera can also zoom out (using an 1441 optical zoom lens) and show all six persons, VC3. But the middle 1442 camera cannot be used in both modes at the same time - it has to 1443 either show the space where two participants sit or all six 1444 seats, but not both at the same time. As a result, VC1 and VC3 1445 cannot be sent simultaneously. 1447 Simultaneous Transmission Sets are expressed as sets of the Media 1448 Captures that the Provider could transmit at the same time (though, 1449 in some cases, it is not intuitive to do so). If a Multiple 1450 Content Capture is included in a Simultaneous Transmission Set it 1451 indicates that the Capture Encoding associated with it could be 1452 transmitted at the same time as the other Captures within the 1453 Simultaneous Transmission Set. It does not imply that the Single 1454 Media Captures contained in the Multiple Content Capture could all 1455 be transmitted at the same time.
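As a minimal sketch of how a Consumer might test a selection against this kind of constraint, the three-camera room example can be modelled with the two sets implied by the middle camera's limitation, {VC0, VC1, VC2} and {VC0, VC3, VC2}. The names SIMULTANEOUS_SETS and selection_is_allowed are illustrative assumptions and are not defined by CLUE.

```python
# Illustrative only: models the three-camera room example, in which the
# middle camera produces either VC1 (two seats) or VC3 (all six seats),
# but never both at the same time.
SIMULTANEOUS_SETS = [
    frozenset({"VC0", "VC1", "VC2"}),
    frozenset({"VC0", "VC3", "VC2"}),
]


def selection_is_allowed(selection, simultaneous_sets=SIMULTANEOUS_SETS):
    """Return True if every chosen capture lies within one single set."""
    chosen = frozenset(selection)
    return any(chosen <= allowed for allowed in simultaneous_sets)
```

With these sets, a selection of VC0 and VC3 is accepted, while a selection containing both VC1 and VC3 is rejected because no single set contains both.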
1457 In this example the two simultaneous sets are shown in Table 5. If 1458 a Provider advertises one or more mutually exclusive Simultaneous 1459 Transmission Sets, then for each media type the Consumer MUST 1460 ensure that it chooses Media Captures that lie wholly within one of 1461 those Simultaneous Transmission Sets. 1463 +-------------------+ 1464 | Simultaneous Sets | 1465 +-------------------+ 1466 | {VC0, VC1, VC2} | 1467 | {VC0, VC3, VC2} | 1468 +-------------------+ 1470 Table 5: Two Simultaneous Transmission Sets 1472 A Provider OPTIONALLY can include the simultaneous sets in its 1473 Advertisement. These simultaneous set constraints apply across all 1474 the Capture Scenes in the Advertisement. It is a syntax 1475 conformance requirement that the simultaneous transmission sets 1476 MUST allow all the media captures in any particular Capture Scene 1477 Entry to be used simultaneously. Similarly, the simultaneous 1478 transmission sets MUST reflect the simultaneity expressed by any 1479 global CSE sets. 1481 For shorthand convenience, a Provider MAY describe a Simultaneous 1482 Transmission Set in terms of Capture Scene Entries and Capture 1483 Scenes. If a Capture Scene Entry is included in a Simultaneous 1484 Transmission Set, then all Media Captures in the Capture Scene 1485 Entry are included in the Simultaneous Transmission Set. If a 1486 Capture Scene is included in a Simultaneous Transmission Set, then 1487 all its Capture Scene Entries (of the corresponding media type) are 1488 included in the Simultaneous Transmission Set. The end result 1489 reduces to a set of Media Captures in either case. 1491 If an Advertisement does not include Simultaneous Transmission 1492 Sets, then the Provider MUST be able to provide all Capture Scenes 1493 simultaneously. If multiple capture Scene Entries are in a Capture 1494 Scene then the Consumer chooses at most one Capture Scene Entry per 1495 Capture Scene for each media type. 
Likewise, if there are no 1496 Simultaneous Transmission Sets and there is a global CSE list, then 1497 the Consumer chooses at most one set of CSEs of each media type, 1498 from the global CSE list. 1500 If an Advertisement includes multiple Capture Scene Entries in a 1501 Capture Scene then the Consumer MAY choose one Capture Scene Entry 1502 for each media type, or MAY choose individual Captures based on the 1503 Simultaneous Transmission Sets. 1505 9. Encodings 1507 Individual encodings and encoding groups are CLUE's mechanisms 1508 allowing a Provider to signal its limitations for sending Captures, 1509 or combinations of Captures, to a Consumer. Consumers can map the 1510 Captures they want to receive onto the Encodings, with encoding 1511 parameters they want. As for the relationship between the 1512 CLUE-specified mechanisms based on Encodings and the SIP Offer-Answer 1513 exchange, please refer to section 5. 1515 9.1. Individual Encodings 1517 An Individual Encoding represents a way to encode a Media Capture 1518 to become a Capture Encoding, to be sent as an encoded media stream 1519 from the Provider to the Consumer. An Individual Encoding has a 1520 set of parameters characterizing how the media is encoded. 1522 Different media types have different parameters, and different 1523 encoding algorithms may have different parameters. An Individual 1524 Encoding can be assigned to at most one Capture Encoding at any 1525 given time. 1527 Individual Encoding parameters are represented in SDP [RFC4566], 1528 not in CLUE messages. For example, for a video encoding using 1529 H.26x compression technologies, this can include parameters such 1530 as: 1532 . Maximum bandwidth; 1533 . Maximum picture size in pixels; 1534 . Maximum number of pixels to be processed per second. 1536 The bandwidth parameter is the only one that specifically relates 1537 to a CLUE Advertisement, as it can be further constrained by the 1538 maximum group bandwidth in an Encoding Group.
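To make the pixels-per-second parameter concrete, here is a small arithmetic sketch. It assumes the pixel rate is simply width times height times frame rate, and the function name is illustrative. Under that assumption, 1080p60 needs 124,416,000 pixels per second, which is the maxPps value that the video example in Section 12.1.1 gives for its highest encoding.

```python
def pixels_per_second(width, height, frame_rate):
    """Pixel rate of a video format, assuming width * height * frame rate."""
    return width * height * frame_rate


# 1080p60, 720p30 and 960x540 at 30 frames per second.
for width, height, frame_rate in [(1920, 1080, 60), (1280, 720, 30), (960, 540, 30)]:
    print(width, height, frame_rate, pixels_per_second(width, height, frame_rate))
```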
1540 9.2. Encoding Group 1542 An Encoding Group includes a set of one or more Individual 1543 Encodings, and parameters that apply to the group as a whole. By 1544 grouping multiple individual Encodings together, an Encoding Group 1545 describes additional constraints on bandwidth for the group. 1547 The Encoding Group data structure contains: 1549 . Maximum bitrate for all encodings in the group combined; 1550 . A list of identifiers for audio and video encodings, 1551 respectively, belonging to the group. 1553 When the Individual Encodings in a group are instantiated into 1554 Capture Encodings, each Capture Encoding has a bitrate that MUST be 1555 less than or equal to the max bitrate for the particular individual 1556 encoding. The "maximum bitrate for all encodings in the group" 1557 parameter gives the additional restriction that the sum of all the 1558 individual capture encoding bitrates MUST be less than or equal to 1559 this group value. 1561 The following diagram illustrates one example of the structure of a 1562 media Provider's Encoding Groups and their contents. 1564 ,-------------------------------------------------. 1565 | Media Provider | 1566 | | 1567 | ,--------------------------------------. | 1568 | | ,--------------------------------------. | 1569 | | | ,--------------------------------------. | 1570 | | | | Encoding Group | | 1571 | | | | ,-----------. | | 1572 | | | | | | ,---------. | | 1573 | | | | | | | | ,---------.| | 1574 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1575 | `.| | | | | | `---------'| | 1576 | `.| `-----------' `---------' | | 1577 | `--------------------------------------' | 1578 `-------------------------------------------------' 1580 Figure 3: Encoding Group Structure 1582 A Provider advertises one or more Encoding Groups. Each Encoding 1583 Group includes one or more Individual Encodings. Each Individual 1584 Encoding can represent a different way of encoding media.
For 1585 example one Individual Encoding may be 1080p60 video, another could 1586 be 720p30, with a third being CIF, all in, for example, H.264 1587 format. 1588 While a typical three codec/display system might have one Encoding 1589 Group per "codec box" (physical codec, connected to one camera and 1590 one screen), there are many possibilities for the number of 1591 Encoding Groups a Provider may be able to offer and for the 1592 encoding values in each Encoding Group. 1594 There is no requirement for all Encodings within an Encoding Group 1595 to be instantiated at the same time. 1597 9.3. Associating Captures with Encoding Groups 1599 Each Media Capture MAY be associated with at least one Encoding 1600 Group, which is used to instantiate that Capture into one or more 1601 Capture Encodings. Typically MCCs are assigned an Encoding Group 1602 and thus become a Capture Encoding. The Captures (including other 1603 MCCs) referenced by the MCC do not need to be assigned to an 1604 Encoding Group. This means that all the Media Captures referenced 1605 by the MCC will appear in the Capture Encoding according to any MCC 1606 attributes. This allows an Advertiser to specify Capture attributes 1607 associated with the Media Captures without the need to provide an 1608 individual Capture Encoding for each of the inputs. 1610 If an Encoding Group is assigned to a Media Capture referenced by 1611 the MCC it indicates that this Capture may also have an individual 1612 Capture Encoding. 
1614 For example: 1616 +--------------------+------------------------------------+ 1617 | Capture Scene #1 | | 1618 +--------------------+------------------------------------+ 1619 | VC1 | EncodeGroupID=1 | 1620 | VC2 | | 1621 | MCC1(VC1,VC2) | EncodeGroupID=2 | 1622 | CSE(VC1) | | 1623 | CSE(MCC1) | | 1624 +--------------------+------------------------------------+ 1626 Table 6: Example usage of Encoding with MCC and source Captures 1628 This would indicate that VC1 may be sent as its own Capture 1629 Encoding from EncodeGroupID=1 or that it may be sent as part of a 1630 Capture Encoding from EncodeGroupID=2 along with VC2. 1632 More than one Capture MAY use the same Encoding Group. 1634 The maximum number of streams that can result from a particular 1635 Encoding Group constraint is equal to the number of individual 1636 Encodings in the group. The actual number of Capture Encodings 1637 used at any time MAY be less than this maximum. Any of the 1638 Captures that use a particular Encoding Group can be encoded 1639 according to any of the Individual Encodings in the group. If 1640 there are multiple Individual Encodings in the group, then the 1641 Consumer can configure the Provider, via a Configure message, to 1642 encode a single Media Capture into multiple different Capture 1643 Encodings at the same time, subject to the Max Capture Encodings 1644 constraint, with each capture encoding following the constraints of 1645 a different Individual Encoding. 1647 It is a protocol conformance requirement that the Encoding Groups 1648 MUST allow all the Captures in a particular Capture Scene Entry to 1649 be used simultaneously. 1651 10. Consumer's Choice of Streams to Receive from the Provider 1653 After receiving the Provider's Advertisement message (that includes 1654 media captures and associated constraints), the Consumer composes 1655 its reply to the Provider in the form of a Configure message. 
The 1656 Consumer is free to use the information in the Advertisement as it 1657 chooses, but there are a few obviously sensible design choices, 1658 which are outlined below. 1660 If multiple Providers connect to the same Consumer (i.e. in an 1661 MCU-less multiparty call), it is the responsibility of the Consumer 1662 to compose Configures for each Provider that satisfy both each 1663 Provider's constraints, as expressed in the Advertisement, and 1664 the Consumer's own capabilities. 1666 In an MCU-based multiparty call, the MCU can logically terminate 1667 the Advertisement/Configure negotiation in that it can hide the 1668 characteristics of the receiving endpoint and rely on its own 1669 capabilities (transcoding/transrating/...) to create Media Streams 1670 that can be decoded at the Endpoint Consumers. The timing of an 1671 MCU's sending of Advertisements (for its outgoing ports) and 1672 Configures (for its incoming ports, in response to Advertisements 1673 received there) is up to the MCU and is implementation dependent. 1675 As a general outline, a Consumer can choose, based on the 1676 Advertisement it has received, which Captures it wishes to receive, 1677 and which Individual Encodings it wants the Provider to use to 1678 encode the Captures. 1680 On receipt of an Advertisement with an MCC the Consumer treats the 1681 MCC as per other non-MCC Captures with the following differences: 1683 - The Consumer would understand that the MCC is a Capture that 1684 includes the referenced individual Captures and that these 1685 individual Captures are delivered as part of the MCC's Capture 1686 Encoding. 1688 - The Consumer may utilise any of the attributes associated with 1689 the referenced individual Captures and any Capture Scene attributes 1690 from where the individual Captures were defined to choose Captures 1691 and for rendering decisions. 1693 - The Consumer may or may not choose to receive all the indicated 1694 captures.
Therefore it can choose to receive a subset of the Captures 1695 indicated by the MCC. 1697 For example if the Consumer receives: 1699 MCC1(VC1,VC2,VC3){attributes} 1701 A Consumer could choose all the Captures within an MCC; however, if 1702 the Consumer determines that it doesn't want VC3 it can return 1703 MCC1(VC1,VC2). If it wants all the individual Captures then it 1704 returns only the MCC identity (i.e. MCC1). If the MCC in the 1705 advertisement does not reference any individual captures, then the 1706 Consumer cannot choose what is included in the MCC; it is up to the 1707 Provider to decide. 1709 A Configure Message includes a list of Capture Encodings. These 1710 are the Capture Encodings the Consumer wishes to receive from the 1711 Provider. Each Capture Encoding refers to one Media Capture and 1712 one Individual Encoding. A Configure Message does not include 1713 references to Capture Scenes or Capture Scene Entries. 1715 For each Capture the Consumer wants to receive, it configures one 1716 or more of the Encodings in that Capture's Encoding Group. The 1717 Consumer does this by telling the Provider, in its Configure 1718 Message, which Encoding to use for each chosen Capture. Upon 1719 receipt of this Configure from the Consumer, common knowledge is 1720 established between Provider and Consumer regarding sensible 1721 choices for the media streams. The setup of the actual media 1722 channels, at least in the simplest case, is left to a following 1723 offer-answer exchange. Optimized implementations MAY speed up the 1724 reaction to the offer-answer exchange by reserving the resources at 1725 the time of finalization of the CLUE handshake. 1727 CLUE advertisements and configure messages don't necessarily 1728 require a new SDP offer-answer for every CLUE message 1729 exchange. But the resulting encodings sent via RTP must conform to 1730 the most recent SDP offer-answer result.
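The Configure contents described above, together with the Encoding Group rules of Section 9, can be sketched as a consistency check. This is a hedged illustration rather than a normative procedure: all class and function names are hypothetical, and the bandwidth field stands in for whatever bitrate the Capture Encoding ends up with (the draft places encoding parameters in SDP, not in CLUE messages).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:               # hypothetical model of an Individual Encoding
    encode_id: str
    max_bandwidth: int        # bits per second


@dataclass(frozen=True)
class EncodingGroup:          # hypothetical model of an Encoding Group
    group_id: str
    max_group_bandwidth: int  # bits per second, all encodings combined
    encodings: tuple          # tuple of Encoding


@dataclass(frozen=True)
class CaptureEncoding:        # one entry of a Configure: capture + encoding
    capture_id: str
    encode_id: str
    bandwidth: int            # assumed bitrate of this Capture Encoding


def validate_configure(configure, capture_groups, groups):
    """Return a list of problems; an empty list means no rule was violated.

    capture_groups maps capture id -> group id, groups maps group id ->
    EncodingGroup; both are assumed to come from the Advertisement.
    """
    problems, used, totals = [], set(), {}
    for ce in configure:
        group = groups.get(capture_groups.get(ce.capture_id))
        if group is None:
            problems.append(f"{ce.capture_id}: no encoding group advertised")
            continue
        enc = next((e for e in group.encodings if e.encode_id == ce.encode_id), None)
        if enc is None:
            problems.append(f"{ce.capture_id}: {ce.encode_id} not in {group.group_id}")
            continue
        if ce.encode_id in used:
            # An Individual Encoding serves at most one Capture Encoding.
            problems.append(f"{ce.encode_id}: assigned more than once")
        used.add(ce.encode_id)
        if ce.bandwidth > enc.max_bandwidth:
            problems.append(f"{ce.encode_id}: exceeds its maximum bandwidth")
        totals[group.group_id] = totals.get(group.group_id, 0) + ce.bandwidth
    for group_id, total in totals.items():
        if total > groups[group_id].max_group_bandwidth:
            problems.append(f"{group_id}: exceeds the group maximum bandwidth")
    return problems
```

Returning a list of problems rather than a boolean is a design choice made here so that a Consumer implementation could log every violated constraint at once before revising its selection.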
1732 In order to meaningfully create and send an initial Configure, the 1733 Consumer needs to have received at least one Advertisement from the 1734 Provider. 1736 In addition, the Consumer can send a Configure at any time during 1737 the call. The Configure MUST be valid according to the most 1738 recently received Advertisement. The Consumer can send a Configure 1739 either in response to a new Advertisement from the Provider or on 1740 its own, for example because of a local change in conditions 1741 (people leaving the room, connectivity changes, multipoint related 1742 considerations). 1744 When choosing which Media Streams to receive from the Provider, and 1745 the encoding characteristics of those Media Streams, the Consumer 1746 advantageously takes several things into account: its local 1747 preference, simultaneity restrictions, and encoding limits. 1749 10.1. Local preference 1751 A variety of local factors influence the Consumer's choice of 1752 Media Streams to be received from the Provider: 1754 o if the Consumer is an Endpoint, it is likely that it would 1755 choose, where possible, to receive video and audio Captures that 1756 match the number of display devices and audio system it has 1758 o if the Consumer is a middle box such as an MCU, it MAY choose to 1759 receive loudest speaker streams (in order to perform its own 1760 media composition) and avoid pre-composed video Captures 1762 o user choice (for instance, selection of a new layout) MAY result 1763 in a different set of Captures, or different encoding 1764 characteristics, being required by the Consumer 1766 10.2. Physical simultaneity restrictions 1768 Often there are physical simultaneity constraints of the Provider 1769 that affect the Provider's ability to simultaneously send all of 1770 the captures the Consumer would wish to receive. 
For instance, a 1771 middle box such as an MCU, when connected to a multi-camera room 1772 system, might prefer to receive both individual video streams of 1773 the people present in the room and an overall view of the room 1774 from a single camera. Some Endpoint systems might be able to 1775 provide both of these sets of streams simultaneously, whereas 1776 others might not (if the overall room view were produced by 1777 changing the optical zoom level on the center camera, for 1778 instance). 1780 10.3. Encoding and encoding group limits 1782 Each of the Provider's encoding groups has limits on bandwidth and 1783 computational complexity, and the constituent potential encodings 1784 have limits on the bandwidth, computational complexity, video 1785 frame rate, and resolution that can be provided. When choosing 1786 the Captures to be received from a Provider, a Consumer device 1787 MUST ensure that the encoding characteristics requested for each 1788 individual Capture fit within the capability of the encoding it 1789 is being configured to use, as well as ensuring that the combined 1790 encoding characteristics for Captures fit within the capabilities 1791 of their associated encoding groups. In some cases, this could 1792 cause an otherwise "preferred" choice of capture encodings to be 1793 passed over in favor of different Capture Encodings--for instance, 1794 if a set of three Captures could only be provided at a low 1795 resolution then a three screen device could switch to favoring a 1796 single, higher quality, Capture Encoding. 1798 11. Extensibility 1800 One important characteristic of the Framework is its 1801 extensibility. The standard for interoperability and handling 1802 multiple streams must be future-proof. The framework itself is 1803 inherently extensible through expanding the data model types.
For 1804 example: 1806 o Adding more types of media, such as telemetry, can be done by 1807 defining additional types of Captures in addition to audio and 1808 video. 1810 o Adding new functionalities, such as 3-D, may require 1811 additional attributes describing the Captures. 1813 The infrastructure is designed to be extended rather than 1814 requiring new infrastructure elements. Extension comes through 1815 adding to defined types. 1817 12. Examples - Using the Framework (Informative) 1819 This section gives some examples, first from the point of view of 1820 the Provider, then the Consumer, then some multipoint scenarios. 1822 12.1. Provider Behavior 1824 This section shows some examples in more detail of how a Provider 1825 can use the framework to represent a typical case for telepresence 1826 rooms. First an endpoint is illustrated, then an MCU case is 1827 shown. 1829 12.1.1. Three screen Endpoint Provider 1831 Consider an Endpoint with the following description: 1833 3 cameras, 3 displays, a 6 person table 1834 o Each camera can provide one Capture for each 1/3 section of the 1835 table 1837 o A single Capture representing the active speaker can be provided 1838 (voice activity based camera selection to a given encoder input 1839 port implemented locally in the Endpoint) 1841 o A single Capture representing the active speaker with the other 1842 2 Captures shown picture in picture within the stream can be 1843 provided (again, implemented inside the endpoint) 1845 o A Capture showing a zoomed out view of all 6 seats in the room 1846 can be provided 1848 The audio and video Captures for this Endpoint can be described as 1849 follows.
1851 Video Captures: 1853 o VC0- (the camera-left camera stream), encoding group=EG0, 1854 view=table 1856 o VC1- (the center camera stream), encoding group=EG1, view=table 1858 o VC2- (the camera-right camera stream), encoding group=EG2, 1859 view=table 1861 o MCC3- (the loudest panel stream), encoding group=EG1, 1862 view=table, MaxCaptures=1 1864 o MCC4- (the loudest panel stream with PiPs), encoding group=EG1, 1865 view=room, MaxCaptures=3 1867 o VC5- (the zoomed out view of all people in the room), encoding 1868 group=EG1, view=room 1870 o VC6- (presentation stream), encoding group=EG1, presentation 1872 The following diagram is a top view of the room with 3 cameras, 3 1873 displays, and 6 seats. Each camera is capturing 2 people. The 1874 six seats are not all in a straight line. 1876 ,-. d 1877 ( )`--.__ +---+ 1878 `-' / `--.__ | | 1879 ,-. | `-.._ |_-+Camera 2 (VC2) 1880 ( ).' ___..-+-''`+-+ 1881 `-' |_...---'' | | 1882 ,-.c+-..__ +---+ 1883 ( )| ``--..__ | | 1884 `-' | ``+-..|_-+Camera 1 (VC1) 1885 ,-. | __..--'|+-+ 1886 ( )| __..--' | | 1887 `-'b|..--' +---+ 1888 ,-. |``---..___ | | 1889 ( )\ ```--..._|_-+Camera 0 (VC0) 1890 `-' \ _..-''`-+ 1891 ,-. \ __.--'' | | 1892 ( ) |..-'' +---+ 1893 `-' a 1894 Figure 4: Room Layout 1896 The two points labeled b and c are intended to be at the midpoint 1897 between the seating positions, and where the fields of view of the 1898 cameras intersect. 1900 The plane of interest for VC0 is a vertical plane that intersects 1901 points 'a' and 'b'. 1903 The plane of interest for VC1 intersects points 'b' and 'c'. The 1904 plane of interest for VC2 intersects points 'c' and 'd'. 1906 This example uses an area scale of millimeters. 
1908 Areas of capture: 1910 bottom left bottom right top left top right 1911 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1912 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1913 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1914 MCC3(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1915 MCC4(-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1916 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1917 VC6 none 1919 Points of capture: 1920 VC0 (-1678,0,800) 1921 VC1 (0,0,800) 1922 VC2 (1678,0,800) 1923 MCC3 none 1924 MCC4 none 1925 VC5 (0,0,800) 1926 VC6 none 1928 In this example, the right edge of the VC0 area lines up with the 1929 left edge of the VC1 area. It doesn't have to be this way. There 1930 could be a gap or an overlap. One additional thing to note for 1931 this example is the distance from a to b is equal to the distance 1932 from b to c and the distance from c to d. All these distances are 1933 1346 mm. This is the planar width of each area of capture for VC0, 1934 VC1, and VC2. 1936 Note the text in parentheses (e.g. "the camera-left camera 1937 stream") is not explicitly part of the model, it is just 1938 explanatory text for this example, and is not included in the 1939 model with the media captures and attributes. Also, MCC4 doesn't 1940 say anything about how a capture is composed, so the media 1941 consumer can't tell based on this capture that MCC4 is composed of 1942 a "loudest panel with PiPs". 
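The 1346 mm planar width quoted above for VC0, VC1 and VC2 can be checked from the bottom-edge coordinates in the areas-of-capture table. The following is a small sketch of that check; the dictionary and function names are illustrative, and math.dist requires Python 3.8 or later.

```python
import math

# Bottom-left and bottom-right corners (x, y, z) in millimeters, copied
# from the areas-of-capture table for the three camera captures.
BOTTOM_EDGES = {
    "VC0": ((-2011, 2850, 0), (-673, 3000, 0)),
    "VC1": ((-673, 3000, 0), (673, 3000, 0)),
    "VC2": ((673, 3000, 0), (2011, 2850, 0)),
}


def planar_width_mm(capture_id):
    """Euclidean distance between the two bottom corners of an area."""
    left, right = BOTTOM_EDGES[capture_id]
    return math.dist(left, right)


for capture_id in BOTTOM_EDGES:
    print(capture_id, round(planar_width_mm(capture_id)))
```

VC1 is exactly 1346 mm wide; VC0 and VC2 are angled relative to the x axis, so their widths come out slightly above that and round to 1346 mm.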
1944 Audio Captures: 1946 o AC0 (camera-left), encoding group=EG3 1948 o AC1 (camera-right), encoding group=EG3 1950 o AC2 (center) encoding group=EG3 1952 o AC3 being a simple pre-mixed audio stream from the room (mono), 1953 encoding group=EG3 1955 o AC4 audio stream associated with the presentation video (mono) 1956 encoding group=EG3, presentation 1958 Point of capture: Point on Line of Capture: 1960 AC0 (-1342,2000,800) (-1342,2925,379) 1961 AC1 ( 1342,2000,800) ( 1342,2925,379) 1962 AC2 ( 0,2000,800) ( 0,3000,379) 1963 AC3 ( 0,2000,800) ( 0,3000,379) 1964 AC4 none 1966 The physical simultaneity information is: 1968 Simultaneous transmission set #1 {VC0, VC1, VC2, MCC3, MCC4, 1969 VC6} 1971 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1973 This constraint indicates it is not possible to use all the VCs at 1974 the same time. VC5 cannot be used at the same time as VC1 or MCC3 1975 or MCC4. Also, using every member in the set simultaneously may 1976 not make sense - for example MCC3(loudest) and MCC4 (loudest with 1977 PIP). (In addition, there are encoding constraints that make 1978 choosing all of the VCs in a set impossible. VC1, MCC3, MCC4, 1979 VC5, VC6 all use EG1 and EG1 has only 3 ENCs. This constraint 1980 shows up in the encoding groups, not in the simultaneous 1981 transmission sets.) 1982 In this example there are no restrictions on which audio captures 1983 can be sent simultaneously. 1985 Encoding Groups: 1987 This example has three encoding groups associated with the video 1988 captures. Each group can have 3 encodings, but with each 1989 potential encoding having a progressively lower specification. In 1990 this example, 1080p60 transmission is possible (as ENC0 has a 1991 maxPps value compatible with that). Significantly, as up to 3 1992 encodings are available per group, it is possible to transmit some 1993 video captures simultaneously that are not in the same entry in 1994 the capture scene. For example VC1 and MCC3 at the same time. 
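The encoding constraint mentioned in the parenthetical note above (several captures sharing EG1, which has only three encodings) can be illustrated by counting how many captures of a selection land on each encoding group. This sketch assumes one Capture Encoding per chosen capture; the names are illustrative and not part of the data model.

```python
from collections import Counter

# Capture -> encoding group, as listed for the video captures of this example.
CAPTURE_GROUP = {
    "VC0": "EG0", "VC1": "EG1", "VC2": "EG2",
    "MCC3": "EG1", "MCC4": "EG1", "VC5": "EG1", "VC6": "EG1",
}
# Each video encoding group in this example offers three encodings.
ENCODINGS_PER_GROUP = {"EG0": 3, "EG1": 3, "EG2": 3}


def overloaded_groups(selection):
    """Map each over-subscribed group to (captures requested, encodings available)."""
    demand = Counter(CAPTURE_GROUP[capture] for capture in selection)
    return {
        group: (count, ENCODINGS_PER_GROUP[group])
        for group, count in demand.items()
        if count > ENCODINGS_PER_GROUP[group]
    }
```

Applied to all of simultaneous transmission set #1, the function reports that EG1 is asked for four encodings (VC1, MCC3, MCC4, VC6) but has only three, whereas set #2 fits within every group.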
1996     It is also possible to transmit multiple capture encodings of a
1997     single video capture. For example VC0 can be encoded using ENC0
1998     and ENC1 at the same time, as long as the encoding parameters
1999     satisfy the constraints of ENC0, ENC1, and EG0, such as one at
2000     4000000 bps and one at 2000000 bps.

2002     encodeGroupID=EG0, maxGroupBandwidth=6000000
2003         encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2004                        maxPps=124416000, maxBandwidth=4000000
2005         encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2006                        maxPps=27648000, maxBandwidth=4000000
2007         encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
2008                        maxPps=15552000, maxBandwidth=4000000
2009     encodeGroupID=EG1, maxGroupBandwidth=6000000
2010         encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2011                        maxPps=124416000, maxBandwidth=4000000
2012         encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2013                        maxPps=27648000, maxBandwidth=4000000
2014         encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
2015                        maxPps=15552000, maxBandwidth=4000000
2016     encodeGroupID=EG2, maxGroupBandwidth=6000000
2017         encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2018                        maxPps=124416000, maxBandwidth=4000000
2019         encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2020                        maxPps=27648000, maxBandwidth=4000000
2021         encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
2022                        maxPps=15552000, maxBandwidth=4000000

2024               Figure 5: Example Encoding Groups for Video

2026     For audio, there are five potential encodings available, so all
2027     five audio captures can be encoded at the same time.
2029     encodeGroupID=EG3, maxGroupBandwidth=320000
2030         encodeID=ENC9, maxBandwidth=64000
2031         encodeID=ENC10, maxBandwidth=64000
2032         encodeID=ENC11, maxBandwidth=64000
2033         encodeID=ENC12, maxBandwidth=64000
2034         encodeID=ENC13, maxBandwidth=64000

2036               Figure 6: Example Encoding Group for Audio

2038     Capture Scenes:

2040     The following table represents the capture scenes for this
2041     provider. Recall that a capture scene is composed of alternative
2042     capture scene entries covering the same spatial region. Capture
2043     Scene #1 is for the main people captures, and Capture Scene #2 is
2044     for presentation.

2046     Each row in the table is a separate Capture Scene Entry.

2048        +------------------+
2049        | Capture Scene #1 |
2050        +------------------+
2051        | VC0, VC1, VC2    |
2052        | MCC3             |
2053        | MCC4             |
2054        | VC5              |
2055        | AC0, AC1, AC2    |
2056        | AC3              |
2057        +------------------+

2059        +------------------+
2060        | Capture Scene #2 |
2061        +------------------+
2062        | VC6              |
2063        | AC4              |
2064        +------------------+

2066               Table 7: Example Capture Scene Entries

2068     Different capture scenes are distinct from each other and do not
2069     overlap. A consumer can choose an entry from each capture
2070     scene. In this case the three captures VC0, VC1, and VC2 are one
2071     way of representing the video from the endpoint. These three
2072     captures should appear adjacent to each other.
2073     Alternatively, another way of representing the Capture Scene is
2074     with the capture MCC3, which automatically shows the person who is
2075     talking. The same applies to the MCC4 and VC5 alternatives.

2077     As in the video case, the different entries of audio in Capture
2078     Scene #1 represent the "same thing", in that one way to receive
2079     the audio is with the 3 audio captures (AC0, AC1, AC2), and
2080     another way is with the mixed AC3. The Media Consumer can choose
2081     an audio capture entry it is capable of receiving.
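As an illustration of choosing among the alternative entries of Capture Scene #1 in Table 7, the Python sketch below picks the largest entry that fits a consumer's stream limit. The function name pick_entry and the "largest entry that fits" policy are assumptions made for this example; a real consumer may apply any preference it likes.

```python
# Capture Scene #1 entries from Table 7, split by media type.
VIDEO_ENTRIES = [["VC0", "VC1", "VC2"], ["MCC3"], ["MCC4"], ["VC5"]]
AUDIO_ENTRIES = [["AC0", "AC1", "AC2"], ["AC3"]]


def pick_entry(entries, max_streams):
    """Hypothetical policy: the largest entry needing at most max_streams."""
    fitting = [entry for entry in entries if len(entry) <= max_streams]
    if not fitting:
        return None
    # max() keeps the first of several equally sized entries, so the
    # advertised order acts as a tie-breaker.
    return max(fitting, key=len)
```

With this policy a consumer limited to one video stream ends up with MCC3 simply because it is listed first; choosing between MCC3, MCC4 and VC5 is a local preference.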
2083     The spatial ordering is conveyed by the media capture attributes
2084     Area of Capture, Point of Capture, and Point on Line of Capture.

2086     A Media Consumer would likely want to choose a capture scene entry
2087     to receive based in part on how many streams it can simultaneously
2088     receive. A consumer that can receive three people streams would
2089     probably prefer to receive the first entry of Capture Scene #1
2090     (VC0, VC1, VC2) and not receive the other entries. A consumer
2091     that can receive only one people stream would probably choose one
2092     of the other entries.

2094     If the consumer can receive a presentation stream too, it would
2095     also choose the only video entry from Capture Scene #2 (VC6).

2097  12.1.2. Encoding Group Example

2099     This is an example of an encoding group to illustrate how it can
2100     express dependencies between encodings.

2102     encodeGroupID=EG0, maxGroupBandwidth=6000000
2103         encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2104                maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2105         encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2106                maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2107         encodeID=AUDENC0, maxBandwidth=96000
2108         encodeID=AUDENC1, maxBandwidth=96000
2109         encodeID=AUDENC2, maxBandwidth=96000

2111     Here, the encoding group is EG0. Although the encoding group is
2112     capable of transmitting up to 6Mbit/s, no individual video
2113     encoding can exceed 4Mbit/s.

2115     This encoding group also allows up to 3 audio encodings,
2116     AUDENC<0-2>. It is not required that audio and video encodings
2117     reside within the same encoding group, but if so then the group's
2118     maxGroupBandwidth value is a limit on the sum of all audio and video
2119     encodings configured by the consumer. A system that does not wish
2120     or need to combine bandwidth limitations in this way should
2121     instead use separate encoding groups for audio and video so that
2122     the bandwidth limitations on audio and video do not interact.

2124     Audio and video can be expressed in separate encoding groups, as
2125     in this illustration.

2127     encodeGroupID=EG0, maxGroupBandwidth=6000000
2128         encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2129                maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2130         encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2131                maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2132     encodeGroupID=EG1, maxGroupBandwidth=500000
2133         encodeID=AUDENC0, maxBandwidth=96000
2134         encodeID=AUDENC1, maxBandwidth=96000
2135         encodeID=AUDENC2, maxBandwidth=96000

2137  12.1.3. The MCU Case

2139     This section shows how an MCU might express its Capture Scenes,
2140     intending to offer different choices for consumers that can handle
2141     different numbers of streams. A single audio capture stream is
2142     provided for all single and multi-screen configurations that can
2143     be associated (e.g. lip-synced) with any combination of video
2144     captures at the consumer.
2146     +-----------------------+---------------------------------+
2147     | Capture Scene #1      |                                 |
2148     +-----------------------|---------------------------------+
2149     | VC0                   | VC for a single screen consumer |
2150     | VC1, VC2              | VCs for a two screen consumer   |
2151     | VC3, VC4, VC5         | VCs for a three screen consumer |
2152     | VC6, VC7, VC8, VC9    | VCs for a four screen consumer  |
2153     | AC0                   | AC representing all participants|
2154     | CSE(VC0)              |                                 |
2155     | CSE(VC1,VC2)          |                                 |
2156     | CSE(VC3,VC4,VC5)      |                                 |
2157     | CSE(VC6,VC7,VC8,VC9)  |                                 |
2158     | CSE(AC0)              |                                 |
2159     +-----------------------+---------------------------------+

2160                  Table 8: MCU main Capture Scenes

2162     If / when a presentation stream becomes active within the
2163     conference the MCU might re-advertise the available media as:

2165     +------------------+--------------------------------------+
2166     | Capture Scene #2 | note                                 |
2167     +------------------+--------------------------------------+
2168     | VC10             | video capture for presentation       |
2169     | AC1              | presentation audio to accompany VC10 |
2170     | CSE(VC10)        |                                      |
2171     | CSE(AC1)         |                                      |
2172     +------------------+--------------------------------------+

2174               Table 9: MCU presentation Capture Scene

2176  12.2. Media Consumer Behavior

2178     This section gives an example of how a Media Consumer might behave
2179     when deciding how to request streams from the three screen
2180     endpoint described in the previous section.

2182     The receive side of a call needs to balance its requirements,
2183     based on number of screens and speakers, its decoding capabilities
2184     and available bandwidth, and the provider's capabilities in order
2185     to optimally configure the provider's streams. Typically it would
2186     want to receive and decode media from each Capture Scene
2187     advertised by the Provider.
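When balancing these constraints, a consumer has to keep every encoding it requests within the limits of the provider's encoding group. The Python sketch below checks a candidate configuration against the combined audio/video group EG0 of Section 12.1.2. It is a simplified illustration: the names Encoding and fits are hypothetical, and only maxBandwidth, maxPps and maxGroupBandwidth are checked (maxWidth, maxHeight and maxFrameRate are left out for brevity).

```python
from dataclasses import dataclass


@dataclass
class Encoding:
    max_bandwidth: int  # bits per second
    max_pps: int = 0    # pixels per second; 0 for audio encodings


# Combined audio/video encoding group EG0 from Section 12.1.2.
EG0_MAX_GROUP_BANDWIDTH = 6_000_000
EG0 = {
    "VIDENC0": Encoding(4_000_000, 62_208_000),
    "VIDENC1": Encoding(4_000_000, 62_208_000),
    "AUDENC0": Encoding(96_000),
    "AUDENC1": Encoding(96_000),
    "AUDENC2": Encoding(96_000),
}


def fits(requests):
    """requests maps encodeID -> (bandwidth, width, height, frame_rate)."""
    total = 0
    for enc_id, (bandwidth, width, height, fps) in requests.items():
        enc = EG0[enc_id]
        if bandwidth > enc.max_bandwidth:
            return False
        if width * height * fps > enc.max_pps:
            return False
        total += bandwidth
    # The group limit applies to the sum of audio and video bandwidth.
    return total <= EG0_MAX_GROUP_BANDWIDTH
```

For example, two video encodings at 4000000 and 2000000 bps exactly fill the group limit, so adding a 96000 bps audio encoding from the same group no longer fits; this is the interaction that separate audio and video groups avoid.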
2189     A simple, basic algorithm might be for the consumer to go through
2190     each Capture Scene in turn and find the collection of Video
2191     Captures that best matches the number of screens it has (this
2192     might include consideration of screens dedicated to presentation
2193     video display rather than "people" video) and then decide between
2194     alternative entries in the video Capture Scenes based either on
2195     hard-coded preferences or user choice. Once this choice has been
2196     made, the consumer would then decide how to configure the
2197     provider's encoding groups in order to make best use of the
2198     available network bandwidth and its own decoding capabilities.

2200  12.2.1. One screen Media Consumer

2202     MCC3, MCC4 and VC5 are all different entries by themselves, not
2203     grouped together in a single entry, so the receiving device should
2204     choose one of those. The choice would come down to
2205     whether to see the greatest number of participants simultaneously
2206     at roughly equal precedence (VC5), a switched view of just the
2207     loudest region (MCC3) or a switched view with PiPs (MCC4). An
2208     endpoint device with a small amount of knowledge of these
2209     differences could offer a dynamic choice of these options,
2210     in-call, to the user.

2212  12.2.2. Two screen Media Consumer configuring the example

2214     Mixing systems with an even number of screens, "2n", and those
2215     with "2n+1" cameras (and vice versa) is always likely to be the
2216     problematic case. In this instance, the behavior is likely to be
2217     determined by whether a "2 screen" system is really a "2 decoder"
2218     system, i.e., whether only one received stream can be displayed
2219     per screen or whether more than 2 streams can be received and
2220     spread across the available screen area. To enumerate 3 possible
2221     behaviors here for the 2 screen system when it learns that the far
2222     end is "ideally" expressed via 3 capture streams:

2224     1. Fall back to receiving just a single stream (MCC3, MCC4 or VC5
2225        as per the 1 screen consumer case above) and either leave one
2226        screen blank or use it for presentation if / when a
2227        presentation becomes active.

2229     2. Receive 3 streams (VC0, VC1 and VC2) and display across 2
2230        screens, either with each capture being scaled to 2/3 of a
2231        screen and the center capture being split across 2 screens or,
2232        as would be necessary if there were large bezels on the
2233        screens, with each stream being scaled to 1/2 the screen width
2234        and height and there being a 4th "blank" panel. This 4th panel
2235        could potentially be used for any presentation that became
2236        active during the call.

2238     3. Receive 3 streams, decode all 3, and use control information
2239        indicating which was the most active to switch between showing
2240        the left and center streams (one per screen) and the center and
2241        right streams.

2243     For an endpoint capable of all 3 methods of working described
2244     above, again it might be appropriate to offer the user the choice
2245     of display mode.

2247  12.2.3. Three screen Media Consumer configuring the example

2249     This is the most straightforward case - the Media Consumer would
2250     look to identify a set of streams to receive that best matched its
2251     available screens and so the VC0 plus VC1 plus VC2 should match
2252     optimally. The spatial ordering would give sufficient information
2253     for the correct video capture to be shown on the correct screen,
2254     and the consumer would either need to divide a single encoding
2255     group's capability by 3 to determine what resolution and frame
2256     rate to configure the provider with or to configure the individual
2257     video captures' encoding groups with what makes most sense (taking
2258     into account the receive side decode capabilities, overall call
2259     bandwidth, the resolution of the screens plus any user preferences
2260     such as motion vs sharpness).

2262  12.3. Multipoint Conference utilizing Multiple Content Captures

2264     The use of MCCs allows the MCU to construct outgoing Advertisements
2265     describing complex media switching and composition scenarios.
2266     The following sections provide several examples.

2268     Note: In the examples the identities of the CLUE elements (e.g.
2269     Captures, Capture Scene) in the incoming Advertisements overlap.
2270     This is because there is no co-ordination between the endpoints.
2271     The MCU is responsible for making these unique in the outgoing
2272     advertisement.

2274  12.3.1. Single Media Captures and MCC in the same Advertisement

2276     Four endpoints are involved in a Conference where CLUE is used. An
2277     MCU acts as a middlebox between the endpoints with a CLUE channel
2278     between each endpoint and the MCU. The MCU receives the following
2279     Advertisements.

2281     +-----------------------+---------------------------------+
2282     | Capture Scene #1      | Description=AustralianConfRoom  |
2283     +-----------------------|---------------------------------+
2284     | VC1                   | Description=Audience            |
2285     |                       | EncodeGroupID=1                 |
2286     | CSE(VC1)              |                                 |
2287     +---------------------------------------------------------+

2289           Table 10: Advertisement received from Endpoint A

2291     +-----------------------+---------------------------------+
2292     | Capture Scene #1      | Description=ChinaConfRoom       |
2293     +-----------------------|---------------------------------+
2294     | VC1                   | Description=Speaker             |
2295     |                       | EncodeGroupID=1                 |
2296     | VC2                   | Description=Audience            |
2297     |                       | EncodeGroupID=1                 |
2298     | CSE(VC1, VC2)         |                                 |
2299     +---------------------------------------------------------+

2301           Table 11: Advertisement received from Endpoint B

2303     +-----------------------+---------------------------------+
2304     | Capture Scene #1      | Description=USAConfRoom         |
2305     +-----------------------|---------------------------------+
2306     | VC1                   | Description=Audience            |
2307     |                       | EncodeGroupID=1                 |
2308     | CSE(VC1)              |                                 |
2309     +---------------------------------------------------------+

2311           Table 12: Advertisement received from Endpoint C

2313     Note: Endpoint B above indicates that it sends two streams.

2315     If the MCU wanted to provide a Multiple Content Capture containing
2316     a round robin switched view of the audience from the 3 endpoints
2317     and the speaker it could construct the following advertisement:

2319     Advertisement sent to Endpoint F

2321     +=======================+=================================+
2322     | Capture Scene #1      | Description=AustralianConfRoom  |
2323     +-----------------------|---------------------------------+
2324     | VC1                   | Description=Audience            |
2325     | CSE(VC1)              |                                 |
2326     +=======================+=================================+
2327     | Capture Scene #2      | Description=ChinaConfRoom       |
2328     +-----------------------|---------------------------------+
2329     | VC2                   | Description=Speaker             |
2330     | VC3                   | Description=Audience            |
2331     | CSE(VC2, VC3)         |                                 |
2332     +=======================+=================================+
2333     | Capture Scene #3      | Description=USAConfRoom         |
2334     +-----------------------|---------------------------------+
2335     | VC4                   | Description=Audience            |
2336     | CSE(VC4)              |                                 |
2337     +=======================+=================================+
2338     | Capture Scene #4      |                                 |
2339     +-----------------------|---------------------------------+
2340     | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1             |
2341     |                       | MaxCaptures=1                   |
2342     |                       | EncodingGroup=1                 |
2343     | CSE(MCC1)             |                                 |
2344     +=======================+=================================+

2346       Table 13: Advertisement sent to Endpoint F - One Encoding

2348     Alternatively, if the MCU wanted to provide the speaker as one media
2349     stream and the audiences as another, it could assign an encoding
2350     group to VC2 in Capture Scene #2 and provide a CSE in Capture Scene
2351     #4 as per the example below.
2353     Advertisement sent to Endpoint F

2355     +=======================+=================================+
2356     | Capture Scene #1      | Description=AustralianConfRoom  |
2357     +-----------------------|---------------------------------+
2358     | VC1                   | Description=Audience            |
2359     | CSE(VC1)              |                                 |
2360     +=======================+=================================+
2361     | Capture Scene #2      | Description=ChinaConfRoom       |
2362     +-----------------------|---------------------------------+
2363     | VC2                   | Description=Speaker             |
2364     |                       | EncodingGroup=1                 |
2365     | VC3                   | Description=Audience            |
2366     | CSE(VC2, VC3)         |                                 |
2367     +=======================+=================================+
2368     | Capture Scene #3      | Description=USAConfRoom         |
2369     +-----------------------|---------------------------------+
2370     | VC4                   | Description=Audience            |
2371     | CSE(VC4)              |                                 |
2372     +=======================+=================================+
2373     | Capture Scene #4      |                                 |
2374     +-----------------------|---------------------------------+
2375     | MCC1(VC1,VC3,VC4)     | Policy=RoundRobin:1             |
2376     |                       | MaxCaptures=1                   |
2377     |                       | EncodingGroup=1                 |
2378     | MCC2(VC2)             | MaxCaptures=1                   |
2379     |                       | EncodingGroup=1                 |
2380     | CSE2(MCC1,MCC2)       |                                 |
2381     +=======================+=================================+

2383       Table 14: Advertisement sent to Endpoint F - Two Encodings

2385     Therefore a Consumer could choose whether or not to have a separate
2386     speaker related stream and could choose which endpoints to see. If
2387     it wanted the second stream but not the Australian conference room
2388     it could indicate the following captures in the Configure message:

2390     +-----------------------+---------------------------------+
2391     | MCC1(VC3,VC4)         | Encoding                        |
2392     | VC2                   | Encoding                        |
2393     +-----------------------+---------------------------------+

2394               Table 15: MCU case: Consumer Response

2396  12.3.2. Several MCCs in the same Advertisement

2398     Multiple MCCs can be used where multiple streams are used to carry
2399     media from multiple endpoints. For example:

2401     A conference has three endpoints D, E and F. Each endpoint has
2402     three video captures covering the left, middle and right regions of
2403     each conference room. The MCU receives the following
2404     advertisements from D and E.

2406     +-----------------------+---------------------------------+
2407     | Capture Scene #1      | Description=AustralianConfRoom  |
2408     +-----------------------|---------------------------------+
2409     | VC1                   | CaptureArea=Left                |
2410     |                       | EncodingGroup=1                 |
2411     | VC2                   | CaptureArea=Centre              |
2412     |                       | EncodingGroup=1                 |
2413     | VC3                   | CaptureArea=Right               |
2414     |                       | EncodingGroup=1                 |
2415     | CSE(VC1,VC2,VC3)      |                                 |
2416     +---------------------------------------------------------+

2418           Table 16: Advertisement received from Endpoint D

2420     +-----------------------+---------------------------------+
2421     | Capture Scene #1      | Description=ChinaConfRoom       |
2422     +-----------------------|---------------------------------+
2423     | VC1                   | CaptureArea=Left                |
2424     |                       | EncodingGroup=1                 |
2425     | VC2                   | CaptureArea=Centre              |
2426     |                       | EncodingGroup=1                 |
2427     | VC3                   | CaptureArea=Right               |
2428     |                       | EncodingGroup=1                 |
2429     | CSE(VC1,VC2,VC3)      |                                 |
2430     +---------------------------------------------------------+

2432           Table 17: Advertisement received from Endpoint E

2434     The MCU wants to offer Endpoint F three Capture Encodings. Each
2435     Capture Encoding would contain all the Captures from either
2436     Endpoint D or Endpoint E depending on the active speaker.
2437     The MCU sends the following Advertisement:

2439     +=======================+=================================+
2440     | Capture Scene #1      | Description=AustralianConfRoom  |
2441     +-----------------------|---------------------------------+
2442     | VC1                   |                                 |
2443     | VC2                   |                                 |
2444     | VC3                   |                                 |
2445     | CSE(VC1,VC2,VC3)      |                                 |
2446     +=======================+=================================+
2447     | Capture Scene #2      | Description=ChinaConfRoom       |
2448     +-----------------------|---------------------------------+
2449     | VC4                   |                                 |
2450     | VC5                   |                                 |
2451     | VC6                   |                                 |
2452     | CSE(VC4,VC5,VC6)      |                                 |
2453     +=======================+=================================+
2454     | Capture Scene #3      |                                 |
2455     +-----------------------|---------------------------------+
2456     | MCC1(VC1,VC4)         | CaptureArea=Left                |
2457     |                       | MaxCaptures=1                   |
2458     |                       | SynchronisationID=1             |
2459     |                       | EncodingGroup=1                 |
2460     | MCC2(VC2,VC5)         | CaptureArea=Centre              |
2461     |                       | MaxCaptures=1                   |
2462     |                       | SynchronisationID=1             |
2463     |                       | EncodingGroup=1                 |
2464     | MCC3(VC3,VC6)         | CaptureArea=Right               |
2465     |                       | MaxCaptures=1                   |
2466     |                       | SynchronisationID=1             |
2467     |                       | EncodingGroup=1                 |
2468     | CSE(MCC1,MCC2,MCC3)   |                                 |
2469     +=======================+=================================+

2470             Table 18: Advertisement sent to Endpoint F

2472  12.3.3. Heterogeneous conference with switching and composition

2474     Consider a conference between endpoints with the following
2475     characteristics:

2477     Endpoint A - 4 screens, 3 cameras

2479     Endpoint B - 3 screens, 3 cameras

2481     Endpoint C - 3 screens, 3 cameras

2483     Endpoint D - 3 screens, 3 cameras

2485     Endpoint E - 1 screen, 1 camera

2487     Endpoint F - 2 screens, 1 camera

2489     Endpoint G - 1 screen, 1 camera

2491     This example focuses on what the user in one of the 3-camera
2492     multi-screen endpoints sees. Call this person User A, at Endpoint A.
2493     There are 4 large display screens at Endpoint A. Whenever somebody
2494     at another site is speaking, all the video captures from that
2495     endpoint are shown on the large screens. If the talker is at a
2496     3-camera site, then the video from those 3 cameras fills 3 of the
2497     screens. If the talker is at a single-camera site, then video from
2498     that camera fills one of the screens, while the other screens show
2499     video from other single-camera endpoints.

2501     User A hears audio from the 4 loudest talkers.

2503     User A can also see video from other endpoints, in addition to the
2504     current talker, although much smaller in size. Endpoint A has 4
2505     screens, so one of those screens shows up to 9 other Media Captures
2506     in a tiled fashion. When video from a 3 camera endpoint appears in
2507     the tiled area, video from all 3 cameras appears together across
2508     the screen with correct spatial relationship among those 3 images.

2510     +---+---+---+ +-------------+ +-------------+ +-------------+
2511     |   |   |   | |             | |             | |             |
2512     +---+---+---+ |             | |             | |             |
2513     |   |   |   | |             | |             | |             |
2514     +---+---+---+ |             | |             | |             |
2515     |   |   |   | |             | |             | |             |
2516     +---+---+---+ +-------------+ +-------------+ +-------------+

2517               Figure 7: Endpoint A - 4 Screen Display

2519     User B at Endpoint B sees a similar arrangement, except there are
2520     only 3 screens, so the 9 other Media Captures are spread out across
2521     the bottom of the 3 displays, in a picture-in-picture (PIP) format.
2522     When video from a 3 camera endpoint appears in the PIP area, video
2523     from all 3 cameras appears together across a single screen with
2524     correct spatial relationship.
2526        +-------------+ +-------------+ +-------------+
2527        |             | |             | |             |
2528        |             | |             | |             |
2529        |             | |             | |             |
2530        | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2531        | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2532        +-------------+ +-------------+ +-------------+

2533          Figure 8: Endpoint B - 3 Screen Display with PiPs

2535     When somebody at a different endpoint becomes the current talker,
2536     then User A and User B both see the video from the new talker
2537     appear on their large screen area, while the previous talker takes
2538     one of the smaller tiled or PIP areas. The person who is the
2539     current talker doesn't see themselves; they see the previous talker
2540     in their large screen area.

2542     One of the points of this example is that endpoints A and B each
2543     want to receive 3 capture encodings for their large display areas,
2544     and 9 encodings for their smaller areas. A and B are each able to
2545     send the same Configure message to the MCU, and each receive
2546     the same conceptual Media Captures from the MCU. The differences
2547     are in how they are rendered and are purely a local matter at A and
2548     B.

2550     The Advertisements for such a scenario are described below.
2552     +-----------------------+---------------------------------+
2553     | Capture Scene #1      | Description=Endpoint x          |
2554     +-----------------------|---------------------------------+
2555     | VC1                   | EncodingGroup=1                 |
2556     | VC2                   | EncodingGroup=1                 |
2557     | VC3                   | EncodingGroup=1                 |
2558     | AC1                   | EncodingGroup=2                 |
2559     | CSE1(VC1, VC2, VC3)   |                                 |
2560     | CSE2(AC1)             |                                 |
2561     +---------------------------------------------------------+

2563     Table 19: Advertisement received at the MCU from Endpoints A to D

2565     +-----------------------+---------------------------------+
2566     | Capture Scene #1      | Description=Endpoint y          |
2567     +-----------------------|---------------------------------+
2568     | VC1                   | EncodingGroup=1                 |
2569     | AC1                   | EncodingGroup=2                 |
2570     | CSE1(VC1)             |                                 |
2571     | CSE2(AC1)             |                                 |
2572     +---------------------------------------------------------+

2574     Table 20: Advertisement received at the MCU from Endpoints E to G

2576     Rather than considering what is displayed, CLUE concentrates
2577     on what the MCU sends. The MCU doesn't know anything about
2578     the number of screens an endpoint has.

2580     As Endpoints A to D each advertise that three Captures make up a
2581     Capture Scene, the MCU offers these in a "site" switching mode.
2582     That is, there are three Multiple Content Captures (and
2583     Capture Encodings), each switching between Endpoints. The MCU
2584     switches the applicable media into the stream based on voice
2585     activity. Endpoint A will not see a capture from itself.
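Because every endpoint numbers its own captures from VC1, the MCU has to assign conference-wide identities before it can build an outgoing Advertisement. The Python sketch below shows one possible scheme. It is an assumption made for illustration, not something this document mandates: captures are numbered consecutively in endpoint order, and the receiving endpoint's own captures are left out. The function names renumber and sources_for are hypothetical.

```python
# Video captures advertised to the MCU by each endpoint (Tables 19 and 20).
INCOMING = {
    "A": ["VC1", "VC2", "VC3"],
    "B": ["VC1", "VC2", "VC3"],
    "C": ["VC1", "VC2", "VC3"],
    "D": ["VC1", "VC2", "VC3"],
    "E": ["VC1"],
    "F": ["VC1"],
    "G": ["VC1"],
}


def renumber(incoming):
    """Map each endpoint to conference-wide unique capture IDs."""
    unique = {}
    next_id = 1
    for endpoint in sorted(incoming):
        unique[endpoint] = []
        for _ in incoming[endpoint]:
            unique[endpoint].append(f"VC{next_id}")
            next_id += 1
    return unique


def sources_for(receiver, incoming):
    """Source captures to advertise to one endpoint, omitting its own."""
    return {ep: ids for ep, ids in renumber(incoming).items() if ep != receiver}
```

With this scheme the source captures offered to Endpoint A start at VC4 for Endpoint B and end at VC15 for Endpoint G.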
2587 Using the MCC concept the MCU would send the following 2588 Advertisement to endpoint A: 2590 +=======================+=================================+ 2591 | Capture Scene #1 | Description=Endpoint B | 2592 +-----------------------|---------------------------------+ 2593 | VC4 | Left | 2594 | VC5 | Center | 2595 | VC6 | Right | 2596 | AC1 | | 2597 | CSE(VC4,VC5,VC6) | | 2598 | CSE(AC1) | | 2599 +=======================+=================================+ 2600 | Capture Scene #2 | Description=Endpoint C | 2601 +-----------------------|---------------------------------+ 2602 | VC7 | Left | 2603 | VC8 | Center | 2604 | VC9 | Right | 2605 | AC2 | | 2606 | CSE(VC7,VC8,VC9) | | 2607 | CSE(AC2) | | 2608 +=======================+=================================+ 2609 | Capture Scene #3 | Description=Endpoint D | 2610 +-----------------------|---------------------------------+ 2611 | VC10 | Left | 2612 | VC11 | Center | 2613 | VC12 | Right | 2614 | AC3 | | 2615 | CSE(VC10,VC11,VC12) | | 2616 | CSE(AC3) | | 2617 +=======================+=================================+ 2618 | Capture Scene #4 | Description=Endpoint E | 2619 +-----------------------|---------------------------------+ 2620 | VC13 | | 2621 | AC4 | | 2622 | CSE(VC13) | | 2623 | CSE(AC4) | | 2624 +=======================+=================================+ 2625 | Capture Scene #5 | Description=Endpoint F | 2626 +-----------------------|---------------------------------+ 2627 | VC14 | | 2628 | AC5 | | 2629 | CSE(VC14) | | 2630 | CSE(AC5) | | 2631 +=======================+=================================+ 2632 | Capture Scene #6 | Description=Endpoint G | 2633 +-----------------------|---------------------------------+ 2634 | VC15 | | 2635 | AC6 | | 2636 | CSE(VC15) | | 2637 | CSE(AC6) | | 2638 +=======================+=================================+ 2640 Table 21: Advertisement sent to endpoint A - Source Part 2642 The above part of the Advertisement presents information about the 2643 sources to the 
MCC. The information is effectively the same as the 2644 received Advertisements except that there are no Capture Encodings 2645 associated with them and the identities have been re-numbered. 2647 In addition to the source Capture information the MCU advertises 2648 "site" switching of Endpoints B to G in three streams. 2650 +=======================+=================================+ 2651 | Capture Scene #7 | Description=Output3streammix | 2652 +-----------------------|---------------------------------+ 2653 | MCC1(VC4,VC7,VC10, | CaptureArea=Left | 2654 | VC13) | MaxCaptures=1 | 2655 | | SynchronisationID=1 | 2656 | | Policy=SoundLevel:0 | 2657 | | EncodingGroup=1 | 2658 | | | 2659 | MCC2(VC5,VC8,VC11, | CaptureArea=Center | 2660 | VC14) | MaxCaptures=1 | 2661 | | SynchronisationID=1 | 2662 | | Policy=SoundLevel:0 | 2663 | | EncodingGroup=1 | 2664 | | | 2665 | MCC3(VC6,VC9,VC12, | CaptureArea=Right | 2666 | VC15) | MaxCaptures=1 | 2667 | | SynchronisationID=1 | 2668 | | Policy=SoundLevel:0 | 2669 | | EncodingGroup=1 | 2670 | | | 2671 | MCC4() (for audio) | CaptureArea=whole scene | 2672 | | MaxCaptures=1 | 2673 | | Policy=SoundLevel:0 | 2674 | | EncodingGroup=2 | 2675 | | | 2676 | MCC5() (for audio) | CaptureArea=whole scene | 2677 | | MaxCaptures=1 | 2678 | | Policy=SoundLevel:1 | 2679 | | EncodingGroup=2 | 2680 | | | 2681 | MCC6() (for audio) | CaptureArea=whole scene | 2682 | | MaxCaptures=1 | 2683 | | Policy=SoundLevel:2 | 2684 | | EncodingGroup=2 | 2685 | | | 2686 | MCC7() (for audio) | CaptureArea=whole scene | 2687 | | MaxCaptures=1 | 2688 | | Policy=SoundLevel:3 | 2689 | | EncodingGroup=2 | 2690 | | | 2691 | CSE(MCC1,MCC2,MCC3) | | 2692 | CSE(MCC4,MCC5,MCC6, | | 2693 | MCC7) | | 2694 +=======================+=================================+ 2696 Table 22: Advertisement send to endpoint A - switching part 2698 The above part describes the switched 3 main streams that relate to 2699 site switching. 
MaxCaptures=1 indicates that only one Capture from 2700 the MCC is sent at a particular time. SynchronisationID=1 indicates 2701 that the source sending is synchronised. The provider can choose to 2702 group together VC13, VC14, and VC15 for the purpose of switching 2703 according to the SynchronisationID. Therefore when the provider 2704 switches one of them into an MCC, it can also switch the others 2705 even though they are not part of the same Capture Scene. 2707 All the audio for the conference is included in this Scene #7. 2708 There isn't necessarily a one to one relation between any audio 2709 capture and video capture in this scene. Typically a change in 2710 loudest talker will cause the MCU to switch the audio streams more 2711 quickly than switching video streams. 2713 The MCU can also supply nine media streams showing the active and 2714 previous eight speakers. It includes the following in the 2715 Advertisement: 2717 +=======================+=================================+ 2718 | Capture Scene #8 | Description=Output9stream | 2719 +-----------------------|---------------------------------+ 2720 | MCC8(VC4,VC5,VC6,VC7, | MaxCaptures=1 | 2721 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:0 | 2722 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2723 | | | 2724 | MCC9(VC4,VC5,VC6,VC7, | MaxCaptures=1 | 2725 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:1 | 2726 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2727 | | | 2728 to to | 2729 | | | 2730 | MCC16(VC4,VC5,VC6,VC7,| MaxCaptures=1 | 2731 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:8 | 2732 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2733 | | | 2734 | CSE(MCC8,MCC9,MCC10, | | 2735 | MCC11,MCC12,MCC13,| | 2736 | MCC14,MCC15,MCC16)| | 2737 +=======================+=================================+ 2739 Table 23: Advertisement sent to endpoint A - 9 switched part 2741 The above part indicates that there are 9 capture encodings. 
Each 2742 of the Capture Encodings may contain any captures from any source 2743 site with a maximum of one Capture at a time. Which Capture is 2744 present is determined by the policy. The MCCs in this scene do not 2745 have any spatial attributes. 2747 Note: The Provider alternatively could provide each of the MCCs 2748 above in its own Capture Scene. 2750 If the MCU wanted to provide a composed Capture Encoding containing 2751 all of the 9 captures it could Advertise in addition: 2753 +=======================+=================================+ 2754 | Capture Scene #9 | Description=NineTiles | 2755 +-----------------------|---------------------------------+ 2756 | MCC13(MCC8,MCC9,MCC10,| MaxCaptures=9 | 2757 | MCC11,MCC12,MCC13,| EncodingGroup=1 | 2758 | MCC14,MCC15,MCC16)| | 2759 | | | 2760 | CSE(MCC13) | | 2761 +=======================+=================================+ 2763 Table 24: Advertisement sent to endpoint A - 9 composed part 2765 As MaxCaptures is 9 it indicates that the capture encoding contains 2766 information from 9 sources at a time. 2768 The Advertisement to Endpoint B is identical to the above other 2769 than the captures from Endpoint A would be added and the captures 2770 from Endpoint B would be removed. Whether the Captures are rendered 2771 on a four screen display or a three screen display is up to the 2772 Consumer to determine. The Consumer wants to place video captures 2773 from the same original source endpoint together, in the correct 2774 spatial order, but the MCCs do not have spatial attributes. So the 2775 Consumer needs to associate incoming media packets with the 2776 original individual captures in the advertisement (such as VC4, 2777 VC5, and VC6) in order to know the spatial information it needs for 2778 correct placement on the screens. 2780 Editor's note: this is an open issue, about how to associate 2781 incoming packets with the original capture that is a constituent of 2782 an MCC. 
This document probably should mention it in an earlier
2783 section, after the solution is worked out in the other CLUE
2784 documents.

2786 12.3.4. Heterogeneous conference with voice activated switching

2788 This example illustrates how multipoint "voice activated switching"
2789 behavior can be realized, with an endpoint making its own decision
2790 about which of its outgoing video streams is considered the "active
2791 talker" from that endpoint. Then an MCU can decide which is the
2792 active talker among the whole conference.

2794 Consider a conference between endpoints with the following
2795 characteristics:

2797 Endpoint A - 3 screens, 3 cameras

2799 Endpoint B - 3 screens, 3 cameras

2801 Endpoint C - 1 screen, 1 camera

2803 This example focuses on what the user at endpoint C sees. The
2804 user would like to see the video capture of the current talker,
2805 without composing it with any other video capture. In this
2806 example endpoint C is capable of receiving only a single video
2807 stream. The following tables describe advertisements from A and B
2808 to the MCU, and from the MCU to C, that can be used to accomplish
2809 this.
2811 +-----------------------+---------------------------------+
2812 | Capture Scene #1      | Description=Endpoint x          |
2813 +-----------------------|---------------------------------+
2814 | VC1                   | CaptureArea=Left                |
2815 |                       | EncodingGroup=1                 |
2816 | VC2                   | CaptureArea=Center              |
2817 |                       | EncodingGroup=1                 |
2818 | VC3                   | CaptureArea=Right               |
2819 |                       | EncodingGroup=1                 |
2820 | MCC1(VC1,VC2,VC3)     | MaxCaptures=1                   |
2821 |                       | CaptureArea=whole scene         |
2822 |                       | Policy=SoundLevel:0             |
2823 |                       | EncodingGroup=1                 |
2824 | AC1                   | CaptureArea=whole scene         |
2825 |                       | EncodingGroup=2                 |
2826 | CSE1(VC1, VC2, VC3)   |                                 |
2827 | CSE2(MCC1)            |                                 |
2828 | CSE3(AC1)             |                                 |
2829 +---------------------------------------------------------+

2831 Table 25: Advertisement received at the MCU from Endpoints A and B

2833 Endpoints A and B are advertising each individual video capture,
2834 and also a switched capture MCC1 which switches between the other
2835 three based on who is the active talker. These endpoints do not
2836 advertise distinct audio captures associated with each individual
2837 video capture, so it would be impossible for the MCU (as a media
2838 consumer) to make its own determination of which video capture is
2839 the active talker based just on information in the audio streams.
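
The switched capture MCC1 in Table 25 can be pictured with a small
sketch. The Python below is only an illustration and is not part of
the CLUE framework. It assumes the endpoint has some internal
per-capture sound level estimate (the sound_level field and the
select_by_sound_level helper are hypothetical names), and it treats
the index in Policy=SoundLevel:N as a simple loudness rank, which is
a simplification of the policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Capture:
    capture_id: str     # e.g. "VC1", as in Table 25
    capture_area: str   # e.g. "Left"
    sound_level: float  # hypothetical loudness estimate, not a CLUE attribute


def select_by_sound_level(captures: List[Capture], rank: int) -> Optional[Capture]:
    """Return the capture at the given loudness rank (0 = loudest).

    Returns None when fewer than rank + 1 captures are available.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    ordered = sorted(captures, key=lambda c: c.sound_level, reverse=True)
    return ordered[rank] if rank < len(ordered) else None


# Assumed sound levels for the three cameras of one endpoint.
captures = [
    Capture("VC1", "Left", 0.2),
    Capture("VC2", "Center", 0.7),
    Capture("VC3", "Right", 0.1),
]

# MCC1 has MaxCaptures=1 and Policy=SoundLevel:0, so it carries one capture.
mcc1_source = select_by_sound_level(captures, 0)
```

With these assumed levels the endpoint would feed VC2 into MCC1. The
MCU never sees the per-capture levels; it only receives the result.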
2841 +-----------------------+---------------------------------+
2842 | Capture Scene #1      | Description=conference          |
2843 +-----------------------|---------------------------------+
2844 | MCC1()                | CaptureArea=Left                |
2845 |                       | MaxCaptures=1                   |
2846 |                       | SynchronisationID=1             |
2847 |                       | Policy=SoundLevel:0             |
2848 |                       | EncodingGroup=1                 |
2849 |                       |                                 |
2850 | MCC2()                | CaptureArea=Center              |
2851 |                       | MaxCaptures=1                   |
2852 |                       | SynchronisationID=1             |
2853 |                       | Policy=SoundLevel:0             |
2854 |                       | EncodingGroup=1                 |
2855 |                       |                                 |
2856 | MCC3()                | CaptureArea=Right               |
2857 |                       | MaxCaptures=1                   |
2858 |                       | SynchronisationID=1             |
2859 |                       | Policy=SoundLevel:0             |
2860 |                       | EncodingGroup=1                 |
2861 |                       |                                 |
2862 | MCC4()                | CaptureArea=whole scene         |
2863 |                       | MaxCaptures=1                   |
2864 |                       | Policy=SoundLevel:0             |
2865 |                       | EncodingGroup=1                 |
2866 |                       |                                 |
2867 | MCC5() (for audio)    | CaptureArea=whole scene         |
2868 |                       | MaxCaptures=1                   |
2869 |                       | Policy=SoundLevel:0             |
2870 |                       | EncodingGroup=2                 |
2871 |                       |                                 |
2872 | MCC6() (for audio)    | CaptureArea=whole scene         |
2873 |                       | MaxCaptures=1                   |
2874 |                       | Policy=SoundLevel:1             |
2875 |                       | EncodingGroup=2                 |
2876 | CSE1(MCC1,MCC2,MCC3)  |                                 |
2877 | CSE2(MCC4)            |                                 |
2878 | CSE3(MCC5,MCC6)       |                                 |
2879 +---------------------------------------------------------+

2881 Table 26: Advertisement sent from the MCU to C

2883 The MCU advertises one scene, with four video MCCs. Three of them
2884 in CSE1 give a left, center, right view of the conference, with
2885 "site switching". MCC4 provides a single video capture
2886 representing a view of the whole conference. The MCU intends for
2887 MCC4 to be switched between all the other original source
2888 captures. In this example advertisement the MCU is not giving all
2889 the information about all the other endpoints' scenes and which of
2890 those captures is included in the MCCs. The MCU could include all
2891 that information if it wants to give the consumers more
2892 information, but it is not necessary for this example scenario.
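
The "site switching" of MCC1, MCC2 and MCC3, which share
SynchronisationID=1, and the switching of MCC4 can be sketched as
follows. This is a minimal illustration under stated assumptions, not
a description of how an MCU must behave. The Site class, the
audio_energy field and the switch_sites helper are hypothetical
names, and the energy values stand in for whatever measurement the
MCU derives from each site's audio.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Site:
    name: str                         # e.g. "A" or "B"
    audio_energy: float               # hypothetical measurement of the site's audio
    captures_by_area: Dict[str, str]  # CaptureArea -> capture id advertised by the site


# Video MCCs from Table 26 that share SynchronisationID=1, keyed by CaptureArea.
SYNCHRONISED_MCCS = {"Left": "MCC1", "Center": "MCC2", "Right": "MCC3"}


def switch_sites(sites: List[Site]) -> Dict[str, Tuple[str, str]]:
    """Map each MCU video MCC to a (site, capture) pair from the loudest site."""
    if not sites:
        return {}
    active = max(sites, key=lambda s: s.audio_energy)
    # Synchronised MCCs all switch to the same site at the same time.
    mapping = {
        mcc: (active.name, active.captures_by_area[area])
        for area, mcc in SYNCHRONISED_MCCS.items()
    }
    # MCC4 is a single whole-scene view, fed by the site's own switched MCC1.
    mapping["MCC4"] = (active.name, "MCC1")
    return mapping


areas = {"Left": "VC1", "Center": "VC2", "Right": "VC3"}
sites = [Site("A", 0.3, dict(areas)), Site("B", 0.8, dict(areas))]
mapping = switch_sites(sites)
```

With the assumed energies, every entry in the mapping points at site
B, so a Consumer rendering MCC1, MCC2 and MCC3 side by side never
mixes captures from two sites.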
2894 The Provider advertises MCC5 and MCC6 for audio. Both are
2895 switched captures, with different SoundLevel policies indicating
2896 they are the top two dominant talkers. The Provider advertises
2897 CSE3 with both MCCs, suggesting the Consumer should use both if it
2898 can.

2900 Endpoint C, in its configure message to the MCU, requests to
2901 receive MCC4 for video, and MCC5 and MCC6 for audio. In order for
2902 the MCU to get the information it needs to construct MCC4, it has
2903 to send configure messages to A and B asking to receive MCC1 from
2904 each of them, along with their AC1 audio. Now the MCU can use
2905 audio energy information from the two incoming audio streams from
2906 A and B to determine which of those alternatives is the current
2907 talker. Based on that, the MCU uses either MCC1 from A or MCC1
2908 from B as the source of MCC4 to send to C.

2910 13. Acknowledgements

2912 Allyn Romanow and Brian Baldino were authors of early versions.
2913 Mark Gorzynski also contributed much to the initial approach.
2914 Many others also contributed, including Christian Groves, Jonathan
2915 Lennox, Paul Kyzivat, Rob Hansen, Roni Even, Christer Holmberg,
2916 Stephen Botzko, Mary Barnes, John Leslie, and Paul Coverdale.

2918 14. IANA Considerations

2920 None.

2922 15. Security Considerations

2924 There are several potential attacks related to telepresence, and
2925 specifically the protocols used by CLUE, in the case of
2926 conferencing sessions, due to the natural involvement of multiple
2927 endpoints and the many, often user-invoked, capabilities provided
2928 by the systems.

2930 A middle box involved in a CLUE session can experience many of the
2931 same attacks as a conferencing system such as that enabled
2932 by the XCON framework [RFC6503].
Examples of attacks include the
2933 following: an endpoint attempting to listen to sessions in which
2934 it is not authorized to participate, an endpoint attempting to
2935 disconnect or mute other users, and theft of service by an
2936 endpoint attempting to create telepresence sessions it is not
2937 allowed to create. Thus, it is RECOMMENDED that a middle box
2938 implementing the protocols necessary to support CLUE follow the
2939 security recommendations specified in the conference control
2940 protocol documents. In the case of CLUE, SIP is the default
2941 conferencing protocol, thus the security considerations in
2942 [RFC4579] MUST be followed.

2944 One primary security concern, surrounding the CLUE framework
2945 introduced in this document, involves securing the actual
2946 protocols and the associated authorization mechanisms. These
2947 concerns apply to endpoint to endpoint sessions, as well as
2948 sessions involving multiple endpoints and middle boxes. Figure 2
2949 in section 5 provides a basic flow of information exchange for
2950 CLUE and the protocols involved.

2952 As described in section 5, CLUE uses SIP/SDP to establish the
2953 session prior to exchanging any CLUE specific information. Thus
2954 the security mechanisms recommended for SIP [RFC3261], including
2955 user authentication and authorization, SHOULD be followed. In
2956 addition, the media is based on RTP and thus existing RTP security
2957 mechanisms, such as DTLS/SRTP, MUST be supported.

2959 A separate data channel is established to transport the CLUE
2960 protocol messages. The contents of the CLUE protocol messages are
2961 based on information introduced in this document, which is
2962 represented by an XML schema for this information defined in the
2963 CLUE data model [ref]. Some of the information which could
2964 possibly introduce privacy concerns is the xCard information as
2965 described in section x.
In addition, the (text) description field
2966 in the Media Capture attribute (section 7.1.1.7) could possibly
2967 reveal sensitive information or specific identities. The same
2968 would be true for the descriptions in the Capture Scene (section
2969 7.3.1) and Capture Scene Entry (7.3.2) attributes. One other
2970 important consideration for the information in the xCard as well
2971 as the description field in the Media Capture and Capture Scene
2972 Entry attributes is that while the endpoints involved in the
2973 session have been authenticated, there is no assurance that the
2974 information in the xCard or description fields is authentic.
2975 Thus, this information SHOULD NOT be used to make any
2976 authorization decisions and the participants in the sessions
2977 SHOULD be made aware of this.

2979 While other information in the CLUE protocol messages does not
2980 reveal specific identities, it can reveal characteristics and
2981 capabilities of the endpoints. That information could possibly
2982 uniquely identify specific endpoints. It might also be possible
2983 for an attacker to manipulate the information and disrupt the CLUE
2984 sessions. It would also be possible to mount a DoS attack on the
2985 CLUE endpoints if a malicious agent has access to the data
2986 channel. Thus, it MUST be possible for the endpoints to establish
2987 a channel which is secure against both message recovery and
2988 message modification. Further details on this are provided in the
2989 CLUE data channel solution document.

2991 There are also security issues associated with the authorization
2992 to perform actions at the CLUE endpoints to invoke specific
2993 capabilities (e.g., re-arranging screens, sharing content, etc.).
2994 However, the policies and security associated with these actions
2995 are outside the scope of this document and the overall CLUE
2996 solution.

2998 16.
Changes Since Last Version

3000 NOTE TO THE RFC-Editor: Please remove this section prior to
3001 publication as an RFC.

3003 Changes from 15 to 16:

3005 1. Remove Audio Channel Format attribute

3007 2. Add Audio Capture Sensitivity Pattern attribute

3009 3. Clarify audio spatial information regarding point of capture
3010    and point on line of capture. Area of capture does not apply
3011    to audio.

3013 4. Update section 12 example for new treatment of audio spatial
3014    information.

3016 5. Clean up wording of some definitions, and various places in
3017    sections 5 and 10.

3019 6. Remove individual encoding parameter paragraph from section
3020    9.

3022 7. Update Advertisement diagram.

3024 8. Update Acknowledgements.

3026 9. References to use cases and requirements now refer to RFCs.

3028 10. Minor editorial changes.

3030 Changes from 14 to 15:

3032 1. Add "=" and "<=" qualifiers to MaxCaptures attribute, and
3033    clarify the meaning regarding switched and composed MCC.

3035 2. Add section 7.3.3 Global Capture Scene Entry List, and a few
3036    other sentences elsewhere that refer to global CSE sets.

3038 3. Clarify: The Provider MUST be capable of encoding and sending
3039    all Captures (*that have an encoding group*) in a single
3040    Capture Scene Entry simultaneously.

3042 4. Add voice activated switching example in section 12.

3044 5. Change name of attributes Participant Info/Type to Person
3045    Info/Type.

3047 6. Clarify the Person Info/Type attributes have the same meaning
3048    regardless of whether or not the capture has a Presentation
3049    attribute.

3051 7. Update example section 12.1 to be consistent with the rest of
3052    the document, regarding MCC and capture attributes.

3054 8. State explicitly each CSE has a unique ID.

3056 Changes from 13 to 14:

3058 1. Fill in section for Security Considerations.

3060 2. Replace Role placeholder with Participant Information,
3061    Participant Type, and Scene Information attributes.

3063 3.
Spatial information implies nothing about how constituent
3064    media captures are combined into a composed MCC.

3066 4. Clean up MCC example in Section 12.3.3. Clarify behavior of
3067    tiled and PIP display windows. Add audio. Add new open
3068    issue about associating incoming packets to original source
3069    capture.

3071 5. Remove editor's note and associated statement about RTP
3072    multiplexing at end of section 5.

3074 6. Remove editor's note and associated paragraph about
3075    overloading media channel with both CLUE and non-CLUE usage,
3076    in section 5.

3078 7. In section 10, clarify intent of media encodings conforming
3079    to SDP, even with multiple CLUE message exchanges. Remove
3080    associated editor's note.

3082 Changes from 12 to 13:

3084 1. Added the MCC concept including updates to existing sections
3085    to incorporate the MCC concept. New MCC attributes:
3086    MaxCaptures, SynchronisationID and Policy.

3088 2. Removed the "composed" and "switched" Capture attributes due
3089    to overlap with the MCC concept.

3091 3. Removed the "Scene-switch-policy" CSE attribute, replaced by
3092    MCC and SynchronisationID.

3094 4. Editorial enhancements including numbering of the Capture
3095    attribute sections, tables, figures etc.

3097 Changes from 11 to 12:

3099 1. Ticket #44. Remove note questioning about requiring a
3100    Consumer to send a Configure after receiving Advertisement.

3102 2. Ticket #43. Remove ability for consumer to choose value of
3103    attribute for scene-switch-policy.

3105 3. Ticket #36. Remove computational complexity parameter,
3106    MaxGroupPps, from Encoding Groups.

3108 4. Reword the Abstract and parts of sections 1 and 4 (now 5)
3109    based on Mary's suggestions as discussed on the list. Move
3110    part of the Introduction into a new section Overview &
3111    Motivation.

3113 5. Add diagram of an Advertisement, in the Overview of the
3114    Framework/Model section.

3116 6. Change Intended Status to Standards Track.

3118 7.
Clean up RFC2119 keyword language.

3120 Changes from 10 to 11:

3122 1. Add description attribute to Media Capture and Capture Scene
3123    Entry.

3125 2. Remove contradiction and change the note about open issue
3126    regarding always responding to Advertisement with a Configure
3127    message.

3129 3. Update example section, to cleanup formatting and make the
3130    media capture attributes and encoding parameters consistent
3131    with the rest of the document.

3133 Changes from 09 to 10:

3135 1. Several minor clarifications such as about SDP usage, Media
3136    Captures, Configure message.

3138 2. Simultaneous Set can be expressed in terms of Capture Scene
3139    and Capture Scene Entry.

3141 3. Removed Area of Scene attribute.

3143 4. Add attributes from draft-groves-clue-capture-attr-01.

3145 5. Move some of the Media Capture attribute descriptions back
3146    into this document, but try to leave detailed syntax to the
3147    data model. Remove the OUTSOURCE sections, which are already
3148    incorporated into the data model document.

3150 Changes from 08 to 09:

3152 1. Use "document" instead of "memo".

3154 2. Add basic call flow sequence diagram to introduction.

3156 3. Add definitions for Advertisement and Configure messages.

3158 4. Add definitions for Capture and Provider.

3160 5. Update definition of Capture Scene.

3162 6. Update definition of Individual Encoding.

3164 7. Shorten definition of Media Capture and add key points in the
3165    Media Captures section.

3167 8. Reword a bit about capture scenes in overview.

3169 9. Reword about labeling Media Captures.

3171 10. Remove the Consumer Capability message.

3173 11. New example section heading for media provider behavior

3175 12. Clarifications in the Capture Scene section.

3177 13. Clarifications in the Simultaneous Transmission Set section.

3179 14. Capitalize defined terms.

3181 15. Move call flow example from introduction to overview section

3183 16. General editorial cleanup

3185 17.
Add some editors' notes requesting input on issues
3186 18. Summarize some sections, and propose details be outsourced
3187     to other documents.

3189 Changes from 06 to 07:

3191 1. Ticket #9. Rename Axis of Capture Point attribute to Point
3192    on Line of Capture. Clarify the description of this
3193    attribute.

3195 2. Ticket #17. Add "capture encoding" definition. Use this new
3196    term throughout document as appropriate, replacing some usage
3197    of the terms "stream" and "encoding".

3199 3. Ticket #18. Add Max Capture Encodings media capture
3200    attribute.

3202 4. Add clarification that different capture scene entries are
3203    not necessarily mutually exclusive.

3205 Changes from 05 to 06:

3207 1. Capture scene description attribute is a list of text strings,
3208    each in a different language, rather than just a single string.

3210 2. Add new Axis of Capture Point attribute.

3212 3. Remove appendices A.1 through A.6.

3214 4. Clarify that the provider must use the same coordinate system
3215    with same scale and origin for all coordinates within the same
3216    capture scene.

3218 Changes from 04 to 05:

3220 1. Clarify limitations of "composed" attribute.

3222 2. Add new section "capture scene entry attributes" and add the
3223    attribute "scene-switch-policy".

3225 3. Add capture scene description attribute and description
3226    language attribute.

3228 4. Editorial changes to examples section for consistency with the
3229    rest of the document.

3231 Changes from 03 to 04:

3233 1. Remove sentence from overview - "This constitutes a significant
3234    change ..."

3236 2. Clarify a consumer can choose a subset of captures from a
3237    capture scene entry or a simultaneous set (in section "capture
3238    scene" and "consumer's choice...").

3240 3. Reword first paragraph of Media Capture Attributes section.

3242 4. Clarify a stereo audio capture is different from two mono audio
3243    captures (description of audio channel format attribute).

3245 5.
Clarify what it means when coordinate information is not
3246    specified for area of capture, point of capture, area of scene.

3248 6. Change the term "producer" to "provider" to be consistent (it
3249    was just in two places).

3251 7. Change name of "purpose" attribute to "content" and refer to
3252    RFC4796 for values.

3254 8. Clarify simultaneous sets are part of a provider advertisement,
3255    and apply across all capture scenes in the advertisement.

3257 9. Remove sentence about lip-sync between all media captures in a
3258    capture scene.

3260 10. Combine the concepts of "capture scene" and "capture set"
3261     into a single concept, using the term "capture scene" to
3262     replace the previous term "capture set", and eliminating the
3263     original separate capture scene concept.

3265 Informative References

3267 Edt. Note: Decide which of these really are Normative References.

3269 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
3270           Requirement Levels", BCP 14, RFC 2119, March 1997.

3272 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G.,
3273           Johnston,
3274           A., Peterson, J., Sparks, R., Handley, M., and E.
3276           Schooler, "SIP: Session Initiation Protocol", RFC 3261,
3277           June 2002.

3279 [RFC3264] Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model
3280           with the Session Description Protocol (SDP)", RFC 3264,
3281           June 2002.

3283 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
3284           Jacobson, "RTP: A Transport Protocol for Real-Time
3285           Applications", STD 64, RFC 3550, July 2003.

3287 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the
3288           Session Initiation Protocol (SIP)", RFC 4353,
3289           February 2006.

3291 [RFC4579] Johnston, A., Levin, O., "SIP Call Control -
3292           Conferencing for User Agents", RFC 4579, August 2006.

3294 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC
3295           5117, January 2008.
3297 [RFC7205] Romanow, A., Botzko, S., Duckworth, M., Even, R.,
3298           "Use Cases for Telepresence Multistreams", RFC 7205,
3299           April 2014.

3301 [RFC7262] Romanow, A., Botzko, S., Barnes, M., "Requirements
3302           for Telepresence Multistreams", RFC 7262, June 2014.

3304 17. Authors' Addresses

3306 Mark Duckworth (editor)
3307 Polycom
3308 Andover, MA 01810
3309 USA

3311 Email: mark.duckworth@polycom.com

3313 Andrew Pepperell
3314 Acano
3315 Uxbridge, England
3316 UK
3317 Email: apeppere@gmail.com

3319 Stephan Wenger
3320 Vidyo, Inc.
3321 433 Hackensack Ave.
3322 Hackensack, N.J. 07601
3323 USA

3325 Email: stewe@stewe.org