idnits 2.17.1 draft-ietf-clue-framework-19.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1112 has weird spacing: '... switch betwe...' == Line 1960 has weird spacing: '...om left bot...' == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'SHOULD not' in this paragraph: A separate data channel is established to transport the CLUE protocol messages. The contents of the CLUE protocol messages are based on information introduced in this document, which is represented by an XML schema for this information defined in the CLUE data model [ref]. Some of the information which could possibly introduce privacy concerns is the xCard information as described in section 7.1.1.11. In addition, the (text) description field in the Media Capture attribute (section 7.1.1.7) could possibly reveal sensitive information or specific identities. The same would be true for the descriptions in the Capture Scene (section 7.3.1) and Capture Scene View (7.3.2) attributes. 
One other important consideration for the information in the xCard as well as the description field in the Media Capture and Capture Scene View attributes is that while the endpoints involved in the session have been authenticated, there is no assurance that the information in the xCard or description fields is authentic. Thus, this information SHOULD not be used to make any authorization decisions and the participants in the sessions SHOULD be made aware of this. -- The document date (December 11, 2014) is 3422 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC6351' is mentioned on line 864, but not defined == Missing Reference: 'RFC6350' is mentioned on line 875, but not defined == Missing Reference: 'RFC4566' is mentioned on line 1582, but not defined ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866) == Missing Reference: 'RFC 6503' is mentioned on line 2980, but not defined == Missing Reference: 'RFC 3261' is mentioned on line 3002, but not defined == Unused Reference: 'I-D.ietf-clue-data-model-schema' is defined on line 3367, but no explicit reference was found in the text == Unused Reference: 'I-D.ietf-clue-protocol' is defined on line 3372, but no explicit reference was found in the text == Unused Reference: 'RFC4579' is defined on line 3398, but no explicit reference was found in the text == Outdated reference: A later version (-18) exists of draft-ietf-clue-datachannel-05 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-datachannel (ref. 
'I-D.ietf-clue-datachannel') == Outdated reference: A later version (-17) exists of draft-ietf-clue-data-model-schema-07 == Outdated reference: A later version (-19) exists of draft-ietf-clue-protocol-02 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-protocol (ref. 'I-D.ietf-clue-protocol') == Outdated reference: A later version (-15) exists of draft-ietf-clue-signaling-04 ** Downref: Normative reference to an Experimental draft: draft-ietf-clue-signaling (ref. 'I-D.ietf-clue-signaling') == Outdated reference: A later version (-14) exists of draft-ietf-clue-rtp-mapping-03 -- Obsolete informational reference (is this intentional?): RFC 5117 (Obsoleted by RFC 7667) Summary: 4 errors (**), 0 flaws (~~), 18 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 CLUE WG M. Duckworth, Ed. 2 Internet Draft Polycom 3 Intended status: Standards Track A. Pepperell 4 Expires: June 11, 2015 Acano 5 S. Wenger 6 Vidyo 7 December 11, 2014 9 Framework for Telepresence Multi-Streams 10 draft-ietf-clue-framework-19.txt 12 Abstract 14 This document defines a framework for a protocol to enable devices 15 in a telepresence conference to interoperate. The protocol enables 16 communication of information about multiple media streams so a 17 sending system and receiving system can make reasonable decisions 18 about transmitting, selecting and rendering the media streams. 19 This protocol is used in addition to SIP signaling for setting up a 20 telepresence session. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. 
The list of current 30 Internet-Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six 33 months and may be updated, replaced, or obsoleted by other 34 documents at any time. It is inappropriate to use Internet-Drafts 35 as reference material or to cite them other than as "work in 36 progress." 38 This Internet-Draft will expire on June 11, 2015. 40 Copyright Notice 42 Copyright (c) 2013 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with 50 respect to this document. Code Components extracted from this 51 document must include Simplified BSD License text as described in 52 Section 4.e of the Trust Legal Provisions and are provided without 53 warranty as described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction...................................................3 58 2. Terminology....................................................4 59 3. Definitions....................................................4 60 4. Overview & Motivation..........................................7 61 5. Overview of the Framework/Model................................9 62 6. Spatial Relationships.........................................14 63 7. Media Captures and Capture Scenes.............................16 64 7.1. Media Captures...........................................16 65 7.1.1. Media Capture Attributes............................17 66 7.2. Multiple Content Capture.................................22 67 7.2.1. MCC Attributes......................................23 68 7.3. Capture Scene............................................28 69 7.3.1. 
Capture Scene attributes............................31 70 7.3.2. Capture Scene View attributes.......................32 71 7.3.3. Global View List....................................32 72 8. Simultaneous Transmission Set Constraints.....................34 73 9. Encodings.....................................................36 74 9.1. Individual Encodings.....................................36 75 9.2. Encoding Group...........................................37 76 9.3. Associating Captures with Encoding Groups................38 77 10. Consumer's Choice of Streams to Receive from the Provider....39 78 10.1. Local preference........................................42 79 10.2. Physical simultaneity restrictions......................42 80 10.3. Encoding and encoding group limits......................42 81 11. Extensibility................................................43 82 12. Examples - Using the Framework (Informative).................43 83 12.1. Provider Behavior.......................................43 84 12.1.1. Three screen Endpoint Provider.....................43 85 12.1.2. Encoding Group Example.............................50 86 12.1.3. The MCU Case.......................................51 88 12.2. Media Consumer Behavior.................................52 89 12.2.1. One screen Media Consumer..........................52 90 12.2.2. Two screen Media Consumer configuring the example..53 91 12.2.3. Three screen Media Consumer configuring the example53 92 12.3. Multipoint Conference utilizing Multiple Content Captures54 93 12.3.1. Single Media Captures and MCC in the same 94 Advertisement..............................................54 95 12.3.2. Several MCCs in the same Advertisement.............57 96 12.3.3. Heterogeneous conference with switching and 97 composition................................................58 98 12.3.4. Heterogeneous conference with voice activated 99 switching..................................................65 100 13. 
Acknowledgements.............................................68 101 14. IANA Considerations..........................................68 102 15. Security Considerations......................................68 103 16. Changes Since Last Version...................................70 104 17. Normative References.........................................77 105 18. Informative References.......................................78 106 19. Authors' Addresses...........................................79 108 1. Introduction 110 Current telepresence systems, though based on open standards such 111 as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with 112 each other. A major factor limiting the interoperability of 113 telepresence systems is the lack of a standardized way to describe 114 and negotiate the use of the multiple streams of audio and video 115 comprising the media flows. This document provides a framework for 116 protocols to enable interoperability by handling multiple streams 117 in a standardized way. The framework is intended to support the 118 use cases described in Use Cases for Telepresence Multistreams 119 [RFC7205] and to meet the requirements in Requirements for 120 Telepresence Multistreams [RFC7262]. 122 The basic session setup for the use cases is based on SIP [RFC3261] 123 and SDP offer/answer [RFC3264]. In addition to basic SIP & SDP 124 offer/answer, CLUE-specific signaling is required to exchange the 125 information describing the multiple media streams. The motivation 126 for this framework, an overview of the signaling, and the information 127 required to be exchanged are described in subsequent sections of 128 this document. Companion documents describe the signaling details 129 [I-D.ietf-clue-signaling] and the data model [I-D.ietf-clue-data- 130 model-schema]. 132 2. 
Terminology 134 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 135 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 136 this document are to be interpreted as described in RFC 2119 137 [RFC2119]. 139 3. Definitions 141 The terms defined below are used throughout this document and 142 companion documents and they are normative. In order to easily 143 identify the use of a defined term, those terms are capitalized. 145 Advertisement: a CLUE message a Media Provider sends to a Media 146 Consumer describing specific aspects of the content of the media, 147 and any restrictions it has in terms of being able to provide 148 certain Streams simultaneously. 150 Audio Capture: Media Capture for audio. Denoted as ACn in the 151 examples in this document. 153 Capture: Same as Media Capture. 155 Capture Device: A device that converts physical input, such as 156 audio, video or text, into an electrical signal, in most cases to 157 be fed into a media encoder. 159 Capture Encoding: A specific encoding of a Media Capture, to be 160 sent by a Media Provider to a Media Consumer via RTP. 162 Capture Scene: a structure representing a spatial region captured 163 by one or more Capture Devices, each capturing media representing a 164 portion of the region. The spatial region represented by a Capture 165 Scene may or may not correspond to a real region in physical space, 166 such as a room. A Capture Scene includes attributes and one or 167 more Capture Scene Views, with each view including one or more 168 Media Captures. 170 Capture Scene View (CSV): a list of Media Captures of the same 171 media type that together form one way to represent the entire 172 Capture Scene. 174 CLUE-capable device: A device that supports the CLUE data channel 175 [I-D.ietf-clue-datachannel], the CLUE protocol [I-D.ietf-clue- 176 protocol] and the principles of CLUE negotiation, and wishes to 177 upgrade the call to CLUE-enabled status. 
179 CLUE-enabled call: A call in which two CLUE-capable devices have 180 successfully negotiated support for a CLUE data channel in SDP. A 181 CLUE-enabled call is not necessarily immediately able to send CLUE- 182 controlled media; negotiation of the data channel and of the CLUE 183 protocol must complete first. Calls between two CLUE-capable 184 devices which have not yet successfully completed negotiation of 185 support for the CLUE data channel in SDP are not considered CLUE- 186 enabled. 188 Conference: used as defined in [RFC4353], A Framework for 189 Conferencing within the Session Initiation Protocol (SIP). 191 Configure Message: A CLUE message a Media Consumer sends to a Media 192 Provider specifying which content and media streams it wants to 193 receive, based on the information in a corresponding Advertisement 194 message. 196 Consumer: short for Media Consumer. 198 Encoding or Individual Encoding: a set of parameters representing a 199 way to encode a Media Capture to become a Capture Encoding. 201 Encoding Group: A set of encoding parameters representing a total 202 media encoding capability to be sub-divided across potentially 203 multiple Individual Encodings. 205 Endpoint: A CLUE-capable device which is the logical point of final 206 termination through receiving, decoding and rendering, and/or 207 initiation through capturing, encoding, and sending of media 208 streams. An endpoint consists of one or more physical devices 209 which source and sink media streams, and exactly one [RFC4353] 210 Participant (which, in turn, includes exactly one SIP User Agent). 211 Endpoints can be anything from multiscreen/multicamera rooms to 212 handheld devices. 214 Global View: A set of references to one or more Capture Scene Views 215 of the same media type that are defined within scenes of the same 216 advertisement. 
A Global View is a suggestion from the Provider to 217 the Consumer for which CSVs provide a complete representation of 218 the simultaneous captures provided by the Provider, across multiple 219 scenes. 221 Global View List: A list of Global Views included in an 222 Advertisement. A Global View List may include Global Views of 223 different media types. 225 MCU: Multipoint Control Unit (MCU) - a CLUE-capable device that 226 connects two or more endpoints together into one single multimedia 227 conference [RFC5117]. An MCU includes an [RFC4353]-like Mixer, 228 without the [RFC4353] requirement to send media to each 229 participant. 231 Media: Any data that, after suitable encoding, can be conveyed over 232 RTP, including audio, video or timed text. 234 Media Capture: a source of Media, such as from one or more Capture 235 Devices or constructed from other Media streams. 237 Media Consumer: a CLUE-capable device that intends to receive 238 Capture Encodings. 240 Media Provider: a CLUE-capable device that intends to send Capture 241 Encodings. 243 Multiple Content Capture (MCC): A Capture that mixes and/or 244 switches other Captures of a single type. (E.g. all audio or all 245 video.) Particular Media Captures may or may not be present in the 246 resultant Capture Encoding depending on time or space. Denoted as 247 MCCn in the example cases in this document. 249 Plane of Interest: The spatial plane containing the most relevant 250 subject matter. 252 Provider: Same as Media Provider. 254 Render: the process of generating a representation from media, such 255 as displayed motion video or sound emitted from loudspeakers. 257 Simultaneous Transmission Set: a set of Media Captures that can be 258 transmitted simultaneously from a Media Provider. 260 Single Media Capture: A capture which contains media from a single 261 source capture device, e.g. an audio capture from a single 262 microphone, a video capture from a single camera. 
264 Spatial Relation: The arrangement in space of two objects, in 265 contrast to relation in time or other relationships. 267 Stream: a Capture Encoding sent from a Media Provider to a Media 268 Consumer via RTP [RFC3550]. 270 Stream Characteristics: the media stream attributes commonly used 271 in non-CLUE SIP/SDP environments (such as media codec, bit rate, 272 resolution, profile/level, etc.) as well as CLUE-specific 273 attributes, such as the Capture ID or a spatial location. 275 Video Capture: Media Capture for video. Denoted as VCn in the 276 example cases in this document. 278 Video Composite: A single image that is formed, normally by an RTP 279 mixer inside an MCU, by combining visual elements from separate 280 sources. 282 4. Overview & Motivation 284 This section provides an overview of the functional elements 285 defined in this document to represent a telepresence system. The 286 motivations for the framework described in this document are also 287 provided. 289 Two key concepts introduced in this document are the terms "Media 290 Provider" and "Media Consumer". A Media Provider represents the 291 entity that sends the media and a Media Consumer represents the 292 entity that receives the media. A Media Provider provides Media in 293 the form of RTP packets; a Media Consumer consumes those RTP 294 packets. Media Providers and Media Consumers can reside in 295 Endpoints or in Multipoint Control Units (MCUs). A Media Provider 296 in an Endpoint is usually associated with the generation of media 297 for Media Captures; these Media Captures are typically sourced 298 from cameras, microphones, and the like. Similarly, the Media 299 Consumer in an Endpoint is usually associated with renderers, such 300 as screens and loudspeakers. In MCUs, Media Providers and 301 Consumers can take the form of outputs and inputs, respectively, 302 of RTP mixers, RTP translators, and similar devices. 
Typically, 303 telepresence devices such as Endpoints and MCUs would perform as 304 both Media Providers and Media Consumers, the former being 305 concerned with those devices' transmitted media and the latter 306 with those devices' received media. In a few circumstances, a 307 CLUE-capable device includes only Consumer or Provider 308 functionality, such as recorder-type Consumers or webcam-type 309 Providers. 311 The motivations for the framework outlined in this document 312 include the following: 314 (1) Endpoints in telepresence systems typically have multiple Media 315 Capture and Media Render devices, e.g., multiple cameras and 316 screens. While previous system designs were able to set up calls 317 that would capture media using all cameras and display media on all 318 screens, for example, there was no mechanism that could associate 319 these Media Captures with each other in space and time. 321 (2) The mere fact that there are multiple capturing and rendering 322 devices, each of which may be configurable in aspects such as zoom, 323 leads to the difficulty that a variable number of such devices can 324 be used to capture different aspects of a region. The Capture 325 Scene concept allows for the description of multiple setups for 326 those multiple capture devices that could represent sensible 327 operation points of the physical capture devices in a room, chosen 328 by the operator. A Consumer can pick and choose from those 329 configurations based on its rendering abilities and inform the 330 Provider about its choices. Details are provided in section 7. 332 (3) In some cases, physical limitations or other reasons disallow 333 the concurrent use of a device in more than one setup. For 334 example, the center camera in a typical three-camera conference 335 room can set its zoom objective either to capture only the middle 336 few seats, or all seats of a room, but not both concurrently. 
The 337 Simultaneous Transmission Set concept allows a Provider to signal 338 such limitations. Simultaneous Transmission Sets are part of the 339 Capture Scene description, and are discussed in section 8. 341 (4) Often, the devices in a room do not have the computational 342 capacity or connectivity to deal with multiple encoding options 343 simultaneously, even if each of these options is sensible in 344 certain scenarios, and even if the simultaneous transmission is 345 also sensible (i.e., in case of multicast media distribution to 346 multiple endpoints). Such constraints can be expressed by the 347 Provider using the Encoding Group concept, described in section 9. 349 (5) Due to the potentially large number of RTP flows required for a 350 Multimedia Conference involving potentially many Endpoints, each of 351 which can have many Media Captures and media renderers, it has 352 become common to multiplex multiple RTP media flows onto the same 353 transport address, so as to avoid using the port number as a 354 multiplexing point and the associated shortcomings such as 355 NAT/firewall traversal. While the actual mapping of those RTP 356 flows to the header fields of the RTP packets is not a subject of 357 this specification, the large number of possible permutations of 358 sensible options a Media Provider can make available to a Media 359 Consumer makes it desirable to have a mechanism that narrows down the 360 number of possible options that a SIP offer/answer exchange has to 361 consider. Such information is made available using protocol 362 mechanisms specified in this document and companion documents, 363 although it should be stressed that its use in an implementation is 364 OPTIONAL. Also, there are aspects of the control of both Endpoints 365 and MCUs that dynamically change during the progress of a call, 366 such as audio-level based screen switching, layout changes, and so 367 on, which need to be conveyed. 
Note that these control aspects are 368 complementary to those specified in traditional SIP-based 369 conference management such as BFCP. An example call flow can be 370 found in section 5. 372 Finally, all this information needs to be conveyed, and the notion 373 of support for it needs to be established. This is done by the 374 negotiation of a "CLUE channel", a data channel negotiated early 375 during the initiation of a call. An Endpoint or MCU that rejects 376 the establishment of this data channel, by definition, does not 377 support CLUE-based mechanisms, whereas an Endpoint or MCU that 378 accepts it is REQUIRED to use it to the extent specified in this 379 document and its companion documents. 381 5. Overview of the Framework/Model 383 The CLUE framework specifies how multiple media streams are to be 384 handled in a telepresence conference. 386 A Media Provider (transmitting Endpoint or MCU) describes specific 387 aspects of the content of the media and the media stream encodings 388 it can send in an Advertisement; and the Media Consumer responds to 389 the Media Provider by specifying which content and media streams it 390 wants to receive in a Configure message. The Provider then 391 transmits the asked-for content in the specified streams. 393 This Advertisement and Configure exchange typically occurs during call 394 initiation, after CLUE has been enabled in a call, but MAY also 395 happen at any time throughout the call, whenever there is a change 396 in what the Consumer wants to receive or (perhaps less commonly) the 397 Provider can send. 399 An Endpoint or MCU typically acts as both Provider and Consumer at 400 the same time, sending Advertisements and sending Configurations in 401 response to receiving Advertisements. (It is possible to be just 402 one or the other.) 404 The data model is based around two main concepts: a Capture and an 405 Encoding. 
A Media Capture (MC), such as audio or video, has 406 attributes to describe the content a Provider can send. Media 407 Captures are described in terms of CLUE-defined attributes, such as 408 spatial relationships and purpose of the capture. Providers tell 409 Consumers which Media Captures they can provide, described in terms 410 of the Media Capture attributes. 412 A Provider organizes its Media Captures into one or more Capture 413 Scenes, each representing a spatial region, such as a room. A 414 Consumer chooses which Media Captures it wants to receive from the 415 Capture Scenes. 417 In addition, the Provider can send the Consumer a description of 418 the Individual Encodings it can send in terms of identifiers which 419 relate to items in SDP. 421 The Provider can also specify constraints on its ability to provide 422 Media, and a sensible design choice for a Consumer is to take these 423 into account when choosing the content and Capture Encodings it 424 requests in the later offer-answer exchange. Some constraints are 425 due to the physical limitations of devices--for example, a camera 426 may not be able to provide zoom and non-zoom views simultaneously. 427 Other constraints are system based, such as maximum bandwidth. 429 The following diagram illustrates the information contained in an 430 Advertisement. 432 ................................................................... 433 . Provider Advertisement +--------------------+ . 434 . | Simultaneous Sets | . 435 . +------------------------+ +--------------------+ . 436 . | Capture Scene N | +--------------------+ . 437 . +-+----------------------+ | | Global View List | . 438 . | Capture Scene 2 | | +--------------------+ . 439 . +-+----------------------+ | | +----------------------+ . 440 . | Capture Scene 1 | | | | Encoding Group N | . 441 . | +---------------+ | | | +-+--------------------+ | . 442 . | | Attributes | | | | | Encoding Group 2 | | . 443 . 
| +---------------+ | | | +-+--------------------+ | | . 444 . | | | | | Encoding Group 1 | | | . 445 . | +----------------+ | | | | parameters | | | . 446 . | | V i e w s | | | | | bandwidth | | | . 447 . | | +---------+ | | | | | +-------------------+| | | . 448 . | | |Attribute| | | | | | | V i d e o || | | . 449 . | | +---------+ | | | | | | E n c o d i n g s || | | . 450 . | | | | | | | | Encoding 1 || | | . 451 . | | View 1 | | | | | | || | | . 452 . | | (list of MCs) | | |-+ | +-------------------+| | | . 453 . | +----|-|--|------+ |-+ | | | | . 454 . +---------|-|--|---------+ | +-------------------+| | | . 455 . | | | | | A u d i o || | | . 456 . | | | | | E n c o d i n g s || | | . 457 . v | | | | Encoding 1 || | | . 458 . +---------|--|--------+ | | || | | . 459 . | Media Capture N |------>| +-------------------+| | | . 460 . +-+---------v--|------+ | | | | | . 461 . | Media Capture 2 | | | | |-+ . 462 . +-+--------------v----+ |-------->| | | . 463 . | Media Capture 1 | | | | |-+ . 464 . | +----------------+ |---------->| | . 465 . | | Attributes | | |_+ +----------------------+ . 466 . | +----------------+ |_+ . 467 . +---------------------+ . 468 . . 469 ................................................................... 471 Figure 1: Advertisement Structure 473 A very brief outline of the call flow used by a simple system (two 474 Endpoints) in compliance with this document can be described as 475 follows, and as shown in the following figure. 
477 +-----------+ +-----------+ 478 | Endpoint1 | | Endpoint2 | 479 +----+------+ +-----+-----+ 480 | INVITE (BASIC SDP+CLUECHANNEL) | 481 |--------------------------------->| 482 | 200 OK (BASIC SDP+CLUECHANNEL)| 483 |<---------------------------------| 484 | ACK | 485 |--------------------------------->| 486 | | 487 |<################################>| 488 | BASIC SDP MEDIA SESSION | 489 |<################################>| 490 | | 491 | CONNECT (CLUE CTRL CHANNEL) | 492 |=================================>| 493 | ... | 494 |<================================>| 495 | CLUE CTRL CHANNEL ESTABLISHED | 496 |<================================>| 497 | | 498 | ADVERTISEMENT 1 | 499 |*********************************>| 500 | ADVERTISEMENT 2 | 501 |<*********************************| 502 | | 503 | CONFIGURE 1 | 504 |<*********************************| 505 | CONFIGURE 2 | 506 |*********************************>| 507 | | 508 | REINVITE (UPDATED SDP) | 509 |--------------------------------->| 510 | 200 OK (UPDATED SDP)| 511 |<---------------------------------| 512 | ACK | 513 |--------------------------------->| 514 | | 515 |<################################>| 516 | UPDATED SDP MEDIA SESSION | 517 |<################################>| 518 | | 519 v v 521 Figure 2: Basic Information Flow 523 An initial offer/answer exchange establishes a basic media session, 524 for example audio-only, and a CLUE channel between two Endpoints. 525 With the establishment of that channel, the endpoints have 526 consented to use the CLUE protocol mechanisms and, therefore, MUST 527 adhere to the CLUE protocol suite as outlined herein. 529 Over this CLUE channel, the Provider in each Endpoint conveys its 530 characteristics and capabilities by sending an Advertisement as 531 specified herein. The Advertisement is typically not sufficient to 532 set up all media. The Consumer in the Endpoint receives the 533 information provided by the Provider, and can use it for two 534 purposes. 
First, it MUST construct and send a CLUE Configure 535 message to tell the Provider what the Consumer wishes to receive. 536 Second, it MAY, but is not REQUIRED to, use the 537 information provided to tailor the SDP it is going to send during 538 the following SIP offer/answer exchange, and its reaction to SDP it 539 receives in that step. It is often a sensible implementation 540 choice to do so, as the representation of the media information 541 conveyed over the CLUE channel can dramatically cut down on the 542 size of SDP messages used in the O/A exchange that follows. 543 Spatial relationships associated with the Media can be included in 544 the Advertisement, and it is often sensible for the Media Consumer 545 to take those spatial relationships into account when tailoring the 546 SDP. 548 This CLUE exchange MUST be followed by an SDP offer/answer exchange 549 that not only establishes those aspects of the media that have not 550 been "negotiated" over CLUE, but also has the side effect of 551 setting up the media transmission itself, potentially involving 552 security exchanges, ICE, and so on. This step is plain vanilla 553 SIP, with the exception that the SDP used herein can, in most (but not 554 necessarily all) cases, be considerably smaller than the SDP a 555 system would typically need to exchange if there were no pre- 556 established knowledge about the Provider and Consumer 557 characteristics. (The need for cutting down SDP size is not quite 558 obvious for a point-to-point call involving simple endpoints; 559 however, when considering a large multipoint conference involving 560 many multi-screen/multi-camera endpoints, each of which can operate 561 using multiple codecs for each camera and microphone, it becomes 562 perhaps somewhat more intuitive.) 564 During the lifetime of a call, further exchanges MAY occur over the 565 CLUE channel. 
In some cases, those further exchanges lead to a 566 modified system behavior of Provider or Consumer (or both) without 567 any other protocol activity such as further offer/answer exchanges. 568 For example, voice-activated screen switching, signaled over the 569 CLUE channel, ought not to lead to heavy-handed mechanisms like SIP 570 re-invites. However, in other cases, after the CLUE negotiation an 571 additional offer/answer exchange becomes necessary. For example, 572 if both sides decide to upgrade the call from a single screen to a 573 multi-screen call and more bandwidth is required for the additional 574 video channels compared to what was previously negotiated using 575 offer/answer, a new O/A exchange is REQUIRED. 577 One aspect of the protocol outlined herein and specified in more 578 detail in companion documents is that it makes available 579 information regarding the Provider's capabilities to deliver Media, 580 and attributes related to that Media such as their spatial 581 relationship, to the Consumer. The operation of the renderer 582 inside the Consumer is unspecified in that it can choose to ignore 583 some information provided by the Provider, and/or not render media 584 streams available from the Provider (although it MUST follow the 585 CLUE protocol and, therefore, MUST gracefully receive and respond 586 (through a Configure) to the Provider's information). All CLUE 587 protocol mechanisms are OPTIONAL in the Consumer in the sense that, 588 while the Consumer MUST be able to receive (and, potentially, 589 gracefully acknowledge) CLUE messages, it is free to ignore the 590 information provided therein. 592 A CLUE-implementing device interoperates with a device that does 593 not support CLUE, because the non-CLUE device does, by definition, 594 not understand the offer of a CLUE channel in the initial 595 offer/answer exchange and, therefore, will reject it. 
This 596 rejection MUST be used as the indication to the CLUE-implementing 597 device that the other side of the communication is not compliant 598 with CLUE, and to fall back to behavior that does not require CLUE. 600 As for the media, Provider and Consumer have an end-to-end 601 communication relationship with respect to (RTP transported) media; 602 and the mechanisms described herein and in companion documents do 603 not change the aspects of setting up those RTP flows and sessions. 604 In other words, the RTP media sessions conform to the negotiated 605 SDP whether or not CLUE is used. 607 6. Spatial Relationships 609 In order for a Consumer to perform a proper rendering, it is often 610 necessary or at least helpful for the Consumer to have received 611 spatial information about the streams it is receiving. CLUE 612 defines a coordinate system that allows Media Providers to describe 613 the spatial relationships of their Media Captures to enable proper 614 scaling and spatially sensible rendering of their streams. The 615 coordinate system is based on a few principles: 617 o Simple systems which do not have multiple Media Captures to 618 associate spatially need not use the coordinate model. 620 o Coordinates can be either in real, physical units (millimeters), 621 have an unknown scale or have no physical scale. Systems which 622 know their physical dimensions (for example professionally 623 installed Telepresence room systems) MUST always provide those 624 real-world measurements. Systems which don't know specific 625 physical dimensions but still know relative distances MUST use 626 'unknown scale'. 'No scale' is intended to be used where Media 627 Captures from different devices (with potentially different 628 scales) will be forwarded alongside one another (e.g. in the 629 case of an MCU). 631 * "Millimeters" means the scale is in millimeters. 
633 * "Unknown" means the scale is not necessarily millimeters, but 634 the scale is the same for every Capture in the Capture Scene. 636 * "No Scale" means the scale could be different for each 637 capture- an MCU Provider that advertises two adjacent 638 captures and picks sources (which can change quickly) from 639 different endpoints might use this value; the scale could be 640 different and changing for each capture. But the areas of 641 capture still represent a spatial relation between captures. 643 o The coordinate system is right-handed Cartesian X, Y, Z with the 644 origin at a spatial location of the Provider's choosing. The 645 Provider MUST use the same coordinate system with the same scale 646 and origin for all coordinates within the same Capture Scene. 648 The direction of increasing coordinate values is: 649 X increases from left to right, from the point of view of an 650 observer at the front of the room looking toward the back 651 Y increases from the front of the room to the back of the room 652 Z increases from low to high (i.e. floor to ceiling) 654 Cameras in a scene typically point in the direction of increasing 655 Y, from front to back. But there could be multiple cameras 656 pointing in different directions. If the physical space does not 657 have a well-defined front and back, the provider chooses any 658 direction for X and Y consistent with right-handed coordinates. 660 7. Media Captures and Capture Scenes 662 This section describes how Providers can describe the content of 663 media to Consumers. 665 7.1. Media Captures 667 Media Captures are the fundamental representations of streams that 668 a device can transmit. What a Media Capture actually represents is 669 flexible: 671 o It can represent the immediate output of a physical source (e.g. 672 camera, microphone) or 'synthetic' source (e.g. laptop computer, 673 DVD player). 
675 o It can represent the output of an audio mixer or video composer 677 o It can represent a concept such as 'the loudest speaker' 679 o It can represent a conceptual position such as 'the leftmost 680 stream' 682 To identify and distinguish between multiple Capture instances 683 Captures have a unique identity. For instance: VC1, VC2 and AC1, 684 AC2, where VC1 and VC2 refer to two different video captures and 685 AC1 and AC2 refer to two different audio captures. 687 Some key points about Media Captures: 689 . A Media Capture is of a single media type (e.g. audio or 690 video) 691 . A Media Capture is defined in a Capture Scene and is given an 692 advertisement unique identity. The identity may be referenced 693 outside the Capture Scene that defines it through a Multiple 694 Content Capture (MCC) 695 . A Media Capture may be associated with one or more Capture 696 Scene Views 697 . A Media Capture has exactly one set of spatial information 698 . A Media Capture can be the source of at most one Capture 699 Encoding 701 Each Media Capture can be associated with attributes to describe 702 what it represents. 704 7.1.1. Media Capture Attributes 706 Media Capture Attributes describe information about the Captures. 707 A Provider can use the Media Capture Attributes to describe the 708 Captures for the benefit of the Consumer of the Advertisement 709 message. Media Capture Attributes include: 711 . Spatial information, such as point of capture, point on line 712 of capture, and area of capture, all of which, in combination 713 define the capture field of, for example, a camera 714 . Other descriptive information to help the Consumer choose 715 between captures (description, presentation, view, priority, 716 language, person information and type) 717 . Control information for use inside the CLUE protocol suite 719 The sub-sections below define the Capture attributes. 721 7.1.1.1. 
Point of Capture

723   The Point of Capture attribute is a field with a single Cartesian
724   (X, Y, Z) point value which describes the spatial location of the
725   capturing device (such as a camera).  For an Audio Capture with
726   multiple microphones, the Point of Capture defines the nominal
727   mid-point of the microphones.

729   7.1.1.2. Point on Line of Capture

731   The Point on Line of Capture attribute is a field with a single
732   Cartesian (X, Y, Z) point value which describes a position in space
733   of a second point on the axis of the capturing device, in the
734   direction it is pointing; the first point being the Point of
735   Capture (see above).

737   Together, the Point of Capture and Point on Line of Capture define
738   the direction and axis of the capturing device, for example the
739   optical axis of a camera or the axis of a microphone.  The Media
740   Consumer can use this information to adjust how it renders the
741   received media if it so chooses.

743   For an Audio Capture, the Media Consumer can use this information
744   along with the Audio Capture Sensitivity Pattern to define a
745   3-dimensional volume of capture where sounds can be expected to be
746   picked up by the microphone providing this specific audio capture.
747   If the Consumer wants to associate an Audio Capture with a Video
748   Capture, it can compare this volume with the area of capture for
749   video media to provide a check on whether the audio capture is
750   indeed spatially associated with the video capture.  For example, a
751   video area of capture that fails to intersect at all with the audio
752   volume of capture, or is at such a long radial distance from the
753   microphone point of capture that the audio level would be very low,
754   would be inappropriate.

756   7.1.1.3. Area of Capture

758   The Area of Capture is a field with a set of four (X, Y, Z) points
759   as a value which describes the spatial location of what is being
760   "captured".
This attribute applies only to video captures, not
761   other types of media.  By comparing the Area of Capture for
762   different Video Captures within the same Capture Scene, a Consumer
763   can determine the spatial relationships between them and render
764   them correctly.

766   The four points MUST be co-planar, forming a quadrilateral, which
767   defines the Plane of Interest for the particular media capture.

769   If the Area of Capture is not specified, it means the Video Capture
770   is not spatially related to any other Video Capture.

772   For a switched capture that switches between different sections
773   within a larger area, the area of capture MUST use coordinates for
774   the larger potential area.

776   7.1.1.4. Mobility of Capture

778   The Mobility of Capture attribute indicates whether or not the
779   point of capture, point on line of capture, and area of capture
780   values stay the same over time, or are expected to change
781   (potentially frequently).  Possible values are static, dynamic, and
782   highly dynamic.

784   An example of "dynamic" is a camera mounted on a stand which is
785   occasionally hand-carried and placed at different positions in
786   order to provide the best angle to capture a work task.  A camera
787   worn by a person who moves around the room is an example of
788   "highly dynamic".  In either case, the effect is that the capture
789   point, capture axis and area of capture change with time.

791   The capture point of a static capture MUST NOT move for the life of
792   the conference.  The capture point of dynamic captures is
793   characterized by a change in position followed by a reasonable
794   period of stability, on the order of minutes.  Highly dynamic
795   captures are characterized by a capture point that is constantly
796   moving.  If the "area of capture", "point of capture" and "point on
797   line of capture" attributes are included with dynamic or highly
798   dynamic captures they indicate spatial information at the time of
799   the Advertisement.
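
The co-planarity requirement on the four Area of Capture points (section 7.1.1.3) can be checked with a scalar triple product.  The following is a minimal sketch, not part of the CLUE specification: the function name area_is_coplanar, the tuple representation of a point, and the absolute tolerance tol are assumptions made for illustration, and a real implementation would pick a tolerance suited to the scale in use.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a point is an (X, Y, Z) tuple.
Point = Tuple[float, float, float]


def _sub(a: Point, b: Point) -> Point:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Point, b: Point) -> Point:
    """Right-handed cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a: Point, b: Point) -> float:
    """Dot product of a and b."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def area_is_coplanar(points: Sequence[Point], tol: float = 1e-9) -> bool:
    """Return True if the four points lie in one plane.

    The scalar triple product of the three edge vectors leaving the
    first point is zero exactly when all four points are co-planar.
    The tolerance is an assumption, not a value from the draft.
    """
    if len(points) != 4:
        raise ValueError("an Area of Capture has exactly four points")
    p0, p1, p2, p3 = points
    triple = _dot(_cross(_sub(p1, p0), _sub(p2, p0)), _sub(p3, p0))
    return abs(triple) <= tol


# Illustrative coordinates only: a flat area in the plane Y = 0,
# and a variant whose last point is lifted out of that plane.
flat = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 0, 9)]
bent = [(0, 0, 0), (9, 0, 0), (0, 0, 9), (9, 1, 9)]
print(area_is_coplanar(flat), area_is_coplanar(bent))
```

A Consumer could run such a check on a received Advertisement before using the area for layout decisions; what to do with a non-planar area is left open here.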
801 7.1.1.5. Audio Capture Sensitivity Pattern 803 The Audio Capture Sensitivity Pattern attribute applies only to 804 audio captures. This is an optional attribute. This attribute 805 gives information about the nominal sensitivity pattern of the 806 microphone which is the source of the capture. Possible values 807 include patterns such as omni, shotgun, cardioid, hyper-cardioid. 809 7.1.1.6. Description 811 The Description attribute is a human-readable description (which 812 could be in multiple languages) of the Capture. 814 7.1.1.7. Presentation 816 The Presentation attribute indicates that the capture originates 817 from a presentation device, that is one that provides supplementary 818 information to a conference through slides, video, still images, 819 data etc. Where more information is known about the capture it MAY 820 be expanded hierarchically to indicate the different types of 821 presentation media, e.g. presentation.slides, presentation.image 822 etc. 824 Note: It is expected that a number of keywords will be defined that 825 provide more detail on the type of presentation. 827 7.1.1.8. View 829 The View attribute is a field with enumerated values, indicating 830 what type of view the Capture relates to. The Consumer can use 831 this information to help choose which Media Captures it wishes to 832 receive. The value MUST be one of: 834 Room - Captures the entire scene 835 Table - Captures the conference table with seated people 837 Individual - Captures an individual person 839 Lectern - Captures the region of the lectern including the 840 presenter, for example in a classroom style conference room 842 Audience - Captures a region showing the audience in a classroom 843 style conference room 845 7.1.1.9. Language 847 The language attribute indicates one or more languages used in the 848 content of the Media Capture. Captures MAY be offered in different 849 languages in case of multilingual and/or accessible conferences. 
A 850 Consumer can use this attribute to differentiate between them and 851 pick the appropriate one. 853 Note that the Language attribute is defined and meaningful both for 854 audio and video captures. In case of audio captures, the meaning 855 is obvious. For a video capture, "Language" could, for example, be 856 sign interpretation or text. 858 7.1.1.10. Person Information 860 The person information attribute allows a Provider to provide 861 specific information regarding the people in a Capture (regardless 862 of whether or not the capture has a Presentation attribute). The 863 Provider may gather the information automatically or manually from 864 a variety of sources however the xCard [RFC6351] format is used to 865 convey the information. This allows various information such as 866 Identification information (section 6.2/[RFC6350]), Communication 867 Information (section 6.4/[RFC6350]) and Organizational information 868 (section 6.6/[RFC6350]) to be communicated. A Consumer may then 869 automatically (i.e. via a policy) or manually select Captures 870 based on information about who is in a Capture. It also allows a 871 Consumer to render information regarding the people participating 872 in the conference or to use it for further processing. 874 The Provider may supply a minimal set of information or a larger 875 set of information. However it MUST be compliant to [RFC6350] and 876 supply a "VERSION" and "FN" property. A Provider may supply 877 multiple xCards per Capture of any KIND (section 6.1.4/[RFC6350]). 879 In order to keep CLUE messages compact the Provider SHOULD use a 880 URI to point to any LOGO, PHOTO or SOUND contained in the xCARD 881 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 882 message. 884 7.1.1.11. 
Person Type

886   The person type attribute indicates the type of people contained in
887   the capture in the conference with respect to the meeting agenda
888   (regardless of whether or not the capture has a Presentation
889   attribute).  As a capture may include multiple people, the attribute
890   may contain multiple values.  However, values shall not be repeated
891   within the attribute.

893   An Advertiser associates the person type with an individual capture
894   when it knows that a particular type is in the capture.  If an
895   Advertiser cannot link a particular type with some certainty to a
896   capture then it is not included.  A Consumer on reception of a
897   capture with a person type attribute knows with some certainty that
898   the capture contains that person type.  The capture may contain
899   other person types but the Advertiser has not been able to
900   determine that this is the case.

902   The types of Captured people include:

904      . Chairman - the person responsible for running the conference
905        according to the agenda.
906      . Vice-Chairman - the person responsible for assisting the
907        chairman in running the meeting.
908      . Minute Taker - the person responsible for recording the
909        minutes of the conference.
910      . Member - the person has no particular responsibilities with
911        respect to running the meeting.
912      . Presenter - the person is scheduled on the agenda to make a
913        presentation in the meeting.  Note: This is not related to any
914        "active speaker" functionality.
915      . Translator - the person is providing some form of translation
916        or commentary in the meeting.
917      . Timekeeper - the person is responsible for maintaining the
918        meeting schedule.

920   Furthermore, the person type attribute may contain one or more
921   strings allowing the Provider to indicate custom meeting-specific
922   roles.

924   7.1.1.12. Priority

926   The priority attribute indicates a relative priority between
927   different Media Captures.
The Provider sets this priority, and the
928   Consumer MAY use the priority to help decide which captures it
929   wishes to receive.

931   The "priority" attribute is an integer which indicates a relative
932   priority between Captures.  For example, it is possible to assign a
933   priority between two presentation Captures that would allow a
934   remote endpoint to determine which presentation is more important.
935   Priority is assigned at the individual capture level.  It represents
936   the Provider's view of the relative priority between Captures with
937   a priority.  The same priority number MAY be used across multiple
938   Captures.  It indicates they are equally important.  If no priority
939   is assigned, no assumption can be made regarding the relative
940   importance of the Capture.

942   7.1.1.13. Embedded Text

944   The Embedded Text attribute indicates that a Capture provides
945   embedded textual information.  For example, the video Capture MAY
946   contain speech-to-text information composed with the video image.
947   This attribute is only applicable to video Captures and
948   presentation streams with visual information.

950   7.1.1.14. Related To

952   The Related To attribute indicates the Capture contains additional
953   complementary information related to another Capture.  The value
954   indicates the identity of the other Capture to which this Capture
955   is providing additional information.

957   For example, a conference can utilize translators or facilitators
958   that provide an additional audio stream (e.g. a translation,
959   description or commentary of the conference).  Where multiple
960   captures are available, it may be advantageous for a Consumer to
961   select a complementary Capture instead of or in addition to a
962   Capture it relates to.

964   7.2. Multiple Content Capture

966   The MCC indicates that one or more Single Media Captures are
967   contained in one Media Capture.  Only one Capture type (e.g. audio
968   or video) is allowed in each MCC instance.
The MCC may contain
969   a reference to the Single Media Captures (which may have their own
970   attributes) as well as attributes associated with the MCC itself.
971   An MCC may also contain other MCCs.  The MCC MAY reference Captures
972   from within the Capture Scene that defines it or from other Capture
973   Scenes.  No ordering is implied by the order that Captures appear
974   within an MCC.  An MCC MAY contain no references to other Captures
975   to indicate that the MCC contains content from multiple sources but
976   no information regarding those sources is given.

978   One or more MCCs may also be specified in a CSV.  This allows an
979   Advertiser to indicate that several MCC captures are used to
980   represent a capture scene.  Table 14 provides an example of this
981   case.

983   As outlined in section 7.1, each instance of the MCC has its own
984   Capture identity, e.g. MCC1.  It allows all the individual captures
985   contained in the MCC to be referenced by a single MCC identity.

987   The example below shows the use of a Multiple Content Capture:

989   +-----------------------+---------------------------------+
990   | Capture Scene #1      |                                 |
991   +-----------------------|---------------------------------+
992   | VC1                   | {attributes}                    |
993   | VC2                   | {attributes}                    |
994   | VCn                   | {attributes}                    |
995   | MCC1(VC1,VC2,...VCn)  | {attributes}                    |
996   | CSV(MCC1)             |                                 |
997   +---------------------------------------------------------+

999   Table 1: Multiple Content Capture concept

1001   This indicates that MCC1 is a single capture that contains the
1002   Captures VC1, VC2 through VCn according to any MCC1 attributes.

1004   7.2.1. MCC Attributes

1006   Attributes may be associated with the MCC instance and the Single
1007   Media Captures that the MCC references.  A Provider should avoid
1008   providing conflicting attribute values between the MCC and Single
1009   Media Captures.  Where there is conflict the attributes of the MCC
1010   override any that may be present in the individual captures.
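
The override rule above can be illustrated with a small sketch.  It is not taken from the CLUE data model: representing attributes as a plain dictionary, the function names effective_attributes and conflicting_attributes, and the sample attribute values are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: attribute name -> attribute value.
Attributes = Dict[str, object]


def conflicting_attributes(capture: Attributes, mcc: Attributes) -> List[str]:
    """Names of attributes set on both sides with different values.

    A Provider could use this to spot the conflicts it is advised
    to avoid before sending an Advertisement.
    """
    return sorted(k for k in capture if k in mcc and capture[k] != mcc[k])


def effective_attributes(capture: Attributes, mcc: Attributes) -> Attributes:
    """Attributes of a referenced capture as seen through the MCC.

    Start from the individual capture's attributes, then let every
    attribute present on the MCC override the capture's value.
    """
    merged = dict(capture)
    merged.update(mcc)
    return merged


# Illustrative values only.
vc1 = {"Description": "Left", "View": "Individual"}
mcc1 = {"Description": "Composed room"}

print(conflicting_attributes(vc1, mcc1))
print(effective_attributes(vc1, mcc1))
```

Copying the capture's dictionary before updating it keeps the original capture description intact, which matters if the same capture is referenced by more than one MCC.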
1012   A Provider MAY include as much or as little of the original source
1013   Capture information as it requires.

1015   There are MCC-specific attributes that MUST only be used with
1016   Multiple Content Captures.  These are described in the sections
1017   below.  The attributes described in section 7.1.1 MAY also be used
1018   with MCCs.

1020   The spatially related attributes of an MCC indicate its area of
1021   capture and point of capture within the scene, just like any other
1022   media capture.  The spatial information does not imply anything
1023   about how other captures are composed within an MCC.

1025   For example: A virtual scene could be constructed for the MCC
1026   capture with two Video Captures with a "MaxCaptures" attribute set
1027   to 2 and an "Area of Capture" attribute provided with an overall
1028   area.  Each of the individual Captures could then also include an
1029   "Area of Capture" attribute with a sub-set of the overall area.
1030   The Consumer would then know how each capture is related to others
1031   within the scene, but not the relative position of the individual
1032   captures within the composed capture.

1034   +-----------------------+---------------------------------+
1035   | Capture Scene #1      |                                 |
1036   +-----------------------|---------------------------------+
1037   | VC1                   | AreaofCapture=(0,0,0)(9,0,0)    |
1038   |                       |   (0,0,9)(9,0,9)                |
1039   | VC2                   | AreaofCapture=(10,0,0)(19,0,0)  |
1040   |                       |   (10,0,9)(19,0,9)              |
1041   | MCC1(VC1,VC2)         | MaxCaptures=2                   |
1042   |                       | AreaofCapture=(0,0,0)(19,0,0)   |
1043   |                       |   (0,0,9)(19,0,9)               |
1044   | CSV(MCC1)             |                                 |
1045   +---------------------------------------------------------+

1047   Table 2: Example of MCC and Single Media Capture attributes

1049   The sections below describe the MCC-only attributes.

1051   7.2.1.1. Maximum Number of Captures within a MCC

1053   The Maximum Number of Captures MCC attribute indicates the maximum
1054   number of individual captures that may appear in a Capture Encoding
1055   at a time.
The actual number at any given time can be less than
1056   this maximum.  It may be used to derive how the Single Media
1057   Captures within the MCC are composed / switched with regard to
1058   space and time.

1060   A Provider can indicate that the number of captures in a MCC
1061   capture encoding is equal "=" to the MaxCaptures value or that
1062   there may be any number of captures up to and including "<=" the
1063   MaxCaptures value.  This allows a Provider to distinguish between a
1064   MCC that purely represents a composition of sources versus a MCC
1065   that represents switched or switched and composed sources.

1067   MaxCaptures MAY be set to one so that only content related to one
1068   of the sources is shown in the MCC Capture Encoding at a time or
1069   it may be set to any value up to the total number of Source Media
1070   Captures in the MCC.

1072   The bullets below describe how the setting of MaxCaptures versus
1073   the number of captures in the MCC affects how sources appear in a
1074   capture encoding:

1076      . When MaxCaptures is set to <= 1 and the number of captures in
1077        the MCC is greater than 1 (or not specified) this
1078        is a switched case.  Zero or 1 captures may be switched into
1079        the capture encoding.  Note: zero is allowed because of the
1080        "<=".
1081      . When MaxCaptures is set to = 1 and the number of captures in
1082        the MCC is greater than 1 (or not specified) this
1083        is a switched case.  Only one capture source is contained in a
1084        capture encoding at a time.
1085      . When MaxCaptures is set to <= N (with N > 1) and the number of
1086        captures in the MCC is greater than N (or not specified) this
1087        is a switched and composed case.  The capture encoding may
1088        contain purely switched sources (e.g. <=2 allows for 1 source
1089        on its own), or may contain composed and switched sources
1090        (e.g. a composition of 2 sources switched between the
1091        sources).
1092      .
When MaxCaptures is set to = N (with N > 1) and the number of 1093 captures in the MCC is greater than N (or not specified) this 1094 is a switched and composed case. The capture encoding contains 1095 composed and switched sources (i.e. a composition of N sources 1096 switched between the sources). It is not possible to have a 1097 single source. 1098 . When MaxCaptures is set to <= to the number of captures in the 1099 MCC this is a switched and composed case. The capture encoding 1100 may contain media switched between any number (up to the 1101 MaxCaptures) of composed sources. 1102 . When MaxCaptures is set to = to the number of captures in the 1103 MCC this is a composed case. All the sources are composed into 1104 a single capture encoding. 1106 If this attribute is not set then as default it is assumed that all 1107 source content can appear concurrently in the Capture Encoding 1108 associated with the MCC. 1110 For example: The use of MaxCaptures equal to 1 on a MCC with three 1111 Video Captures VC1, VC2 and VC3 would indicate that the Advertiser 1112 in the capture encoding would switch between VC1, VC2 or VC3 as 1113 there may be only a maximum of one capture at a time. 1115 7.2.1.2. Policy 1117 The Policy MCC Attribute indicates the criteria that the Provider 1118 uses to determine when and/or where media content appears in the 1119 Capture Encoding related to the MCC. 1121 The attribute is in the form of a token that indicates the policy 1122 and index representing an instance of the policy. 1124 The tokens are: 1126 SoundLevel - This indicates that the content of the MCC is 1127 determined by a sound level detection algorithm. For example: the 1128 loudest (active) speaker is contained in the MCC. 1130 RoundRobin - This indicates that the content of the MCC is 1131 determined by a time based algorithm. 
For example: the Provider
1132   provides content from a particular source for a period of time and
1133   then provides content from another source and so on.

1135   An index is used to represent an instance in the policy setting.  An
1136   index of 0 represents the most current instance of the policy, e.g.
1137   the active speaker, 1 represents the previous instance, e.g. the
1138   previous active speaker and so on.

1140   The following example shows a case where the Provider provides two
1141   media streams, one showing the active speaker and a second stream
1142   showing the previous speaker.

1144   +-----------------------+---------------------------------+
1145   | Capture Scene #1      |                                 |
1146   +-----------------------|---------------------------------+
1147   | VC1                   |                                 |
1148   | VC2                   |                                 |
1149   | MCC1(VC1,VC2)         | Policy=SoundLevel:0             |
1150   |                       | MaxCaptures=1                   |
1151   | MCC2(VC1,VC2)         | Policy=SoundLevel:1             |
1152   |                       | MaxCaptures=1                   |
1153   | CSV(MCC1,MCC2)        |                                 |
1154   +---------------------------------------------------------+

1156   Table 3: Example Policy MCC attribute usage

1158   7.2.1.3. Synchronisation Identity

1160   The Synchronisation Identity MCC attribute indicates how the
1161   individual captures in multiple MCC captures are synchronised.  To
1162   indicate that the Capture Encodings associated with MCCs contain
1163   captures from the same source at the same time, a Provider should
1164   set the same Synchronisation Identity on each of the concerned
1165   MCCs.  It is the Provider that determines what the source for the
1166   Captures is, so a Provider can choose how to group together Single
1167   Media Captures into a combined "source" for the purpose of
1168   switching them together to keep them synchronized according to the
1169   SynchronisationID attribute.  For example, when the Provider is in
1170   an MCU it may determine that each separate CLUE Endpoint is a
1171   remote source of media.  The Synchronisation Identity may be used
1172   across media types, i.e.
to synchronize audio and video related
1173   MCCs.

1175   Without this attribute it is assumed that multiple MCCs may provide
1176   content from different sources at any particular point in time.

1178   For example:

1180   +=======================+=================================+
1181   | Capture Scene #1      |                                 |
1182   +-----------------------|---------------------------------+
1183   | VC1                   | Description=Left                |
1184   | VC2                   | Description=Centre              |
1185   | VC3                   | Description=Right               |
1186   | AC1                   | Description=room                |
1187   | CSV(VC1,VC2,VC3)      |                                 |
1188   | CSV(AC1)              |                                 |
1189   +=======================+=================================+
1190   | Capture Scene #2      |                                 |
1191   +-----------------------|---------------------------------+
1192   | VC4                   | Description=Left                |
1193   | VC5                   | Description=Centre              |
1194   | VC6                   | Description=Right               |
1195   | AC2                   | Description=room                |
1196   | CSV(VC4,VC5,VC6)      |                                 |
1197   | CSV(AC2)              |                                 |
1198   +=======================+=================================+
1199   | Capture Scene #3      |                                 |
1200   +-----------------------|---------------------------------+
1201   | VC7                   |                                 |
1202   | AC3                   |                                 |
1203   +=======================+=================================+
1204   | Capture Scene #4      |                                 |
1205   +-----------------------|---------------------------------+
1206   | VC8                   |                                 |
1207   | AC4                   |                                 |
1208   +=======================+=================================+
1209   | Capture Scene #5      |                                 |
1210   +-----------------------|---------------------------------+
1211   | MCC1(VC1,VC4,VC7)     | SynchronisationID=1             |
1212   |                       | MaxCaptures=1                   |
1213   | MCC2(VC2,VC5,VC8)     | SynchronisationID=1             |
1214   |                       | MaxCaptures=1                   |
1215   | MCC3(VC3,VC6)         | MaxCaptures=1                   |
1216   | MCC4(AC1,AC2,AC3,AC4) | SynchronisationID=1             |
1217   |                       | MaxCaptures=1                   |
1218   | CSV(MCC1,MCC2,MCC3)   |                                 |
1219   | CSV(MCC4)             |                                 |
1220   +=======================+=================================+

1222   Table 4: Example Synchronisation Identity MCC attribute usage

1224   The above Advertisement would indicate that MCC1, MCC2, MCC3 and
1225   MCC4 make up a Capture Scene.
There would be four capture 1226 encodings (one for each MCC). Because MCC1 and MCC2 have the same 1227 SynchronisationID, each encoding from MCC1 and MCC2 respectively 1228 would together have content from only Capture Scene 1 or only 1229 Capture Scene 2 or the combination of VC7 and VC8 at a particular 1230 point in time. In this case the Provider has decided the sources 1231 to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 1232 together. The encoding from MCC3 would not be synchronised with 1233 MCC1 or MCC2. As MCC4 also has the same Synchronisation Identity 1234 as MCC1 and MCC2 the content of the audio encoding will be 1235 synchronised with the video content. 1237 7.3. Capture Scene 1239 In order for a Provider's individual Captures to be used 1240 effectively by a Consumer, the Provider organizes the Captures into 1241 one or more Capture Scenes, with the structure and contents of 1242 these Capture Scenes being sent from the Provider to the Consumer 1243 in the Advertisement. 1245 A Capture Scene is a structure representing a spatial region 1246 containing one or more Capture Devices, each capturing media 1247 representing a portion of the region. A Capture Scene includes one 1248 or more Capture Scene Views (CSV), with each CSV including one or 1249 more Media Captures of the same media type. There can also be 1250 Media Captures that are not included in a Capture Scene View. A 1251 Capture Scene represents, for example, the video image of a group 1252 of people seated next to each other, along with the sound of their 1253 voices, which could be represented by some number of VCs and ACs in 1254 the Capture Scene Views. An MCU can also describe in Capture 1255 Scenes what it constructs from media Streams it receives. 1257 A Provider MAY advertise one or more Capture Scenes. What 1258 constitutes an entire Capture Scene is up to the Provider. 
A 1259 simple Provider might typically use one Capture Scene for 1260 participant media (live video from the room cameras) and another 1261 Capture Scene for a computer generated presentation. In more 1262 complex systems, the use of additional Capture Scenes is also 1263 sensible. For example, a classroom may advertise two Capture 1264 Scenes involving live video, one including only the camera 1265 capturing the instructor (and associated audio), the other 1266 including camera(s) capturing students (and associated audio). 1268 A Capture Scene MAY (and typically will) include more than one type 1269 of media. For example, a Capture Scene can include several Capture 1270 Scene Views for Video Captures, and several Capture Scene Views for 1271 Audio Captures. A particular Capture MAY be included in more than 1272 one Capture Scene View. 1274 A Provider MAY express spatial relationships between Captures that 1275 are included in the same Capture Scene. However, there is no 1276 spatial relationship between Media Captures from different Capture 1277 Scenes. In other words, Capture Scenes each use their own spatial 1278 measurement system as outlined above in section 6. 1280 A Provider arranges Captures in a Capture Scene to help the 1281 Consumer choose which captures it wants to render. The Capture 1282 Scene Views in a Capture Scene are different alternatives the 1283 Provider is suggesting for representing the Capture Scene. Each 1284 Capture Scene View is given an advertisement unique identity. The 1285 order of Capture Scene Views within a Capture Scene has no 1286 significance. The Media Consumer can choose to receive all Media 1287 Captures from one Capture Scene View for each media type (e.g. 1288 audio and video), or it can pick and choose Media Captures 1289 regardless of how the Provider arranges them in Capture Scene 1290 Views. Different Capture Scene Views of the same media type are 1291 not necessarily mutually exclusive alternatives. 
Also note that 1292 the presence of multiple Capture Scene Views (with potentially 1293 multiple encoding options in each view) in a given Capture Scene 1294 does not necessarily imply that a Provider is able to serve all the 1295 associated media simultaneously (although the construction of such 1296 an over-rich Capture Scene is probably not sensible in many cases). 1297 What a Provider can send simultaneously is determined through the 1298 Simultaneous Transmission Set mechanism, described in section 8. 1300 Captures within the same Capture Scene View MUST be of the same 1301 media type - it is not possible to mix audio and video captures in 1302 the same Capture Scene View, for instance. The Provider MUST be 1303 capable of encoding and sending all Captures (that have an encoding 1304 group) in a single Capture Scene View simultaneously. The order of 1305 Captures within a Capture Scene View has no significance. A 1306 Consumer can decide to receive all the Captures in a single Capture 1307 Scene View, but a Consumer could also decide to receive just a 1308 subset of those captures. A Consumer can also decide to receive 1309 Captures from different Capture Scene Views, all subject to the 1310 constraints set by Simultaneous Transmission Sets, as discussed in 1311 section 8. 1313 When a Provider advertises a Capture Scene with multiple CSVs, it 1314 is essentially signaling that there are multiple representations of 1315 the same Capture Scene available. In some cases, these multiple 1316 views would typically be used simultaneously (for instance a "video 1317 view" and an "audio view"). In some cases the views would 1318 conceptually be alternatives (for instance a view consisting of 1319 three Video Captures covering the whole room versus a view 1320 consisting of just a single Video Capture covering only the center 1321 of a room). 
In this latter example, one sensible choice for a 1322 Consumer would be to indicate (through its Configure and possibly 1323 through an additional offer/answer exchange) the Captures of that 1324 Capture Scene View that most closely matched the Consumer's number 1325 of display devices or screen layout. 1327 The following is an example of 4 potential Capture Scene Views for 1328 an endpoint-style Provider: 1330 1. (VC0, VC1, VC2) - left, center and right camera Video Captures 1331 2. (VC3) - Video Capture associated with loudest room segment 1333 3. (VC4) - Video Capture zoomed out view of all people in the room 1335 4. (AC0) - main audio 1337 The first view in this Capture Scene example is a list of Video 1338 Captures which have a spatial relationship to each other. 1339 Determination of the order of these captures (VC0, VC1 and VC2) for 1340 rendering purposes is accomplished through use of their Area of 1341 Capture attributes. The second view (VC3) and the third view (VC4) 1342 are alternative representations of the same room's video, which 1343 might be better suited to some Consumers' rendering capabilities. 1344 The inclusion of the Audio Capture in the same Capture Scene 1345 indicates that AC0 is associated with all of those Video Captures, 1346 meaning it comes from the same spatial region. Therefore, if audio 1347 were to be rendered at all, this audio would be the correct choice 1348 irrespective of which Video Captures were chosen. 1350 7.3.1. Capture Scene attributes 1352 Capture Scene Attributes can be applied to Capture Scenes as well 1353 as to individual media captures. Attributes specified at this 1354 level apply to all constituent Captures. Capture Scene attributes 1355 include 1357 . Human-readable description of the Capture Scene, which could 1358 be in multiple languages; 1359 . xCard scene information 1360 . Scale information (millimeters, unknown, no scale), as 1361 described in Section 6. 1363 7.3.1.1. 
Scene Information 1365 The Scene information attribute provides information regarding the 1366 Capture Scene rather than individual participants. The Provider 1367 may gather the information automatically or manually from a 1368 variety of sources. The scene information attribute allows a 1369 Provider to indicate information such as: organizational or 1370 geographic information allowing a Consumer to determine which 1371 Capture Scenes are of interest in order to then perform Capture 1372 selection. It also allows a Consumer to render information 1373 regarding the Scene or to use it for further processing. 1375 As per 7.1.1.10. the xCard format is used to convey this 1376 information and the Provider may supply a minimal set of 1377 information or a larger set of information. 1379 In order to keep CLUE messages compact the Provider SHOULD use a 1380 URI to point to any LOGO, PHOTO or SOUND contained in the xCARD 1381 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 1382 message. 1384 7.3.2. Capture Scene View attributes 1386 A Capture Scene can include one or more Capture Scene Views in 1387 addition to the Capture Scene wide attributes described above. 1388 Capture Scene View attributes apply to the Capture Scene View as a 1389 whole, i.e. to all Captures that are part of the Capture Scene 1390 View. 1392 Capture Scene View attributes include: 1394 . Human-readable description (which could be in multiple 1395 languages) of the Capture Scene View 1397 7.3.3. Global View List 1399 An Advertisement can include an optional Global View list. Each 1400 item in this list is a Global View. A Global View is a set of 1401 references to one or more Capture Scene Views of the same media 1402 type that are defined within scenes of the same advertisement. 
1403 Each Global View in the list is a suggestion from the Provider to 1404 the Consumer for which CSVs provide a complete representation of 1405 the simultaneous captures provided by the Provider, across 1406 multiple scenes. The Provider can include multiple Global Views, 1407 to allow a Consumer to choose sets of captures appropriate to its 1408 capabilities or application. The choice of how to make these 1409 suggestions in the Global View list for what represents all the 1410 scenes for which the Provider can send media is up to the 1411 Provider. This is very similar to how each CSV represents a 1412 particular scene. 1414 As an example, suppose an advertisement has three scenes, and each 1415 scene has three CSVs, ranging from one to three video captures in 1416 each CSV. The Provider is advertising a total of nine video 1417 Captures across three scenes. The Provider can use the Global 1418 View list to suggest alternatives for Consumers that can't receive 1419 all nine video Captures as separate media streams. For 1420 accommodating a Consumer that wants to receive three video 1421 Captures, a Provider might suggest a Global View containing just a 1422 single CSV with three Captures and nothing from the other two 1423 scenes. Or a Provider might suggest a Global View containing 1424 three different CSVs, one from each scene, with a single video 1425 Capture in each. 1427 Some additional rules: 1429 . The ordering of Global Views in the Global View list is not 1430 important. 1431 . The ordering of CSVs within each Global View is not 1432 important. 1433 . A particular CSV may be used in multiple Global Views. 1434 . The Provider must be capable of encoding and sending all 1435 Captures within the CSVs of a given Global View 1436 simultaneously. 1438 The following figure shows an example of the structure of Global 1439 Views in a Global View List. 1441 ........................................................ 1442 . Advertisement . 1443 . . 1444 . 
+--------------+ +-------------------------+ . 1445 . |Scene 1 | |Global View List | . 1446 . | | | | . 1447 . | CSV1 (v)<----------------- Global View (csv 1) | . 1448 . | <-------. | | . 1449 . | | *--------- Global View (csv 1,5) | . 1450 . | CSV2 (v) | | | | . 1451 . | | | | | . 1452 . | CSV3 (v)<---------*------- Global View (csv 3,5) | . 1453 . | | | | | | . 1454 . | CSV4 (a)<----------------- Global View (csv 4) | . 1455 . | <-----------. | | . 1456 . +--------------+ | | *----- Global View (csv 4,6) | . 1457 . | | | | | . 1458 . +--------------+ | | | +-------------------------+ . 1459 . |Scene 2 | | | | . 1460 . | | | | | . 1461 . | CSV5 (v)<-------' | | . 1462 . | <---------' | . 1463 . | | | (v) = video . 1464 . | CSV6 (a)<-----------' (a) = audio . 1465 . | | . 1466 . +--------------+ . 1467 `......................................................' 1469 Figure 3: Global View List Structure 1471 8. Simultaneous Transmission Set Constraints 1473 In many practical cases, a Provider has constraints or limitations 1474 on its ability to send Captures simultaneously. One type of 1475 limitation is caused by the physical limitations of capture 1476 mechanisms; these constraints are represented by a simultaneous 1477 transmission set. The second type of limitation reflects the 1478 encoding resources available, such as bandwidth or video encoding 1479 throughput (macroblocks/second). This type of constraint is 1480 captured by encoding groups, discussed below. 1482 Some Endpoints or MCUs can send multiple Captures simultaneously; 1483 however sometimes there are constraints that limit which Captures 1484 can be sent simultaneously with other Captures. A device may not 1485 be able to be used in different ways at the same time. Provider 1486 Advertisements are made so that the Consumer can choose one of 1487 several possible mutually exclusive usages of the device. 
This 1488 type of constraint is expressed in a Simultaneous Transmission Set, 1489 which lists all the Captures of a particular media type (e.g. 1490 audio, video, text) that can be sent at the same time. There are 1491 different Simultaneous Transmission Sets for each media type in the 1492 Advertisement. This is easier to show in an example. 1494 Consider the example of a room system where there are three cameras, 1495 each of which can send a separate capture covering two persons: 1496 VC0, VC1, VC2. The middle camera can also zoom out (using an 1497 optical zoom lens) and show all six persons, VC3. But the middle 1498 camera cannot be used in both modes at the same time - it has to 1499 either show the space where two participants sit or the whole six 1500 seats, but not both at the same time. As a result, VC1 and VC3 1501 cannot be sent simultaneously. 1503 Simultaneous Transmission Sets are expressed as sets of the Media 1504 Captures that the Provider could transmit at the same time (though, 1505 in some cases, it is not intuitive to do so). If a Multiple 1506 Content Capture is included in a Simultaneous Transmission Set it 1507 indicates that the Capture Encoding associated with it could be 1508 transmitted at the same time as the other Captures within the 1509 Simultaneous Transmission Set. It does not imply that the Single 1510 Media Captures contained in the Multiple Content Capture could all 1511 be transmitted at the same time. 1513 In this example the two simultaneous sets are shown in Table 5. If 1514 a Provider advertises one or more mutually exclusive Simultaneous 1515 Transmission Sets, then for each media type the Consumer MUST 1516 ensure that it chooses Media Captures that lie wholly within one of 1517 those Simultaneous Transmission Sets.
1519 +-------------------+ 1520 | Simultaneous Sets | 1521 +-------------------+ 1522 | {VC0, VC1, VC2} | 1523 | {VC0, VC3, VC2} | 1524 +-------------------+ 1526 Table 5: Two Simultaneous Transmission Sets 1528 A Provider OPTIONALLY can include the simultaneous sets in its 1529 Advertisement. These simultaneous set constraints apply across all 1530 the Capture Scenes in the Advertisement. It is a syntax 1531 conformance requirement that the simultaneous transmission sets 1532 MUST allow all the media captures in any particular Capture Scene 1533 View to be used simultaneously. Similarly, the simultaneous 1534 transmission sets MUST reflect the simultaneity expressed by any 1535 Global View. 1537 For shorthand convenience, a Provider MAY describe a Simultaneous 1538 Transmission Set in terms of Capture Scene Views and Capture 1539 Scenes. If a Capture Scene View is included in a Simultaneous 1540 Transmission Set, then all Media Captures in the Capture Scene View 1541 are included in the Simultaneous Transmission Set. If a Capture 1542 Scene is included in a Simultaneous Transmission Set, then all its 1543 Capture Scene Views (of the corresponding media type) are included 1544 in the Simultaneous Transmission Set. The end result reduces to a 1545 set of Media Captures, of a particular media type, in either case. 1547 If an Advertisement does not include Simultaneous Transmission 1548 Sets, then the Provider MUST be able to simultaneously provide all 1549 the captures from any one CSV of each media type from each capture 1550 scene. Likewise, if there are no Simultaneous Transmission Sets 1551 and there is a Global View list, then the Provider MUST be able to 1552 simultaneously provide all the captures from any particular Global 1553 View (of each media type) from the Global View list. 
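The selection rule above can be illustrated with a minimal sketch, assuming the two video sets of Table 5. The data layout, the function names and the CSV1/CSV2 identities are hypothetical and are not part of the CLUE data model. The check only applies when the Advertisement actually carries Simultaneous Transmission Sets.

```python
# Illustrative sketch only; the data layout and function names are
# assumptions, not part of the CLUE data model.

# Table 5: the two Simultaneous Transmission Sets for video.
SIMULTANEOUS_SETS = [
    frozenset({"VC0", "VC1", "VC2"}),
    frozenset({"VC0", "VC3", "VC2"}),
]

# Hypothetical Capture Scene Views, used for the shorthand expansion.
CAPTURE_SCENE_VIEWS = {
    "CSV1": ["VC0", "VC1", "VC2"],
    "CSV2": ["VC3"],
}


def expand(members, views):
    """Reduce a set given in terms of CSVs and Captures to plain Captures."""
    captures = set()
    for member in members:
        # A CSV identity stands for all Media Captures in that view;
        # anything else is treated as a Capture identity.
        captures.update(views.get(member, [member]))
    return frozenset(captures)


def selection_allowed(selection, simultaneous_sets):
    """True if the chosen Captures lie wholly within one advertised set."""
    chosen = set(selection)
    return any(chosen <= allowed for allowed in simultaneous_sets)


print(selection_allowed(["VC0", "VC1"], SIMULTANEOUS_SETS))   # True
print(selection_allowed(["VC1", "VC3"], SIMULTANEOUS_SETS))   # False
print(sorted(expand(["CSV2", "VC0", "VC2"], CAPTURE_SCENE_VIEWS)))
```

The second call fails because VC1 and VC3 both depend on the middle camera and never appear in the same set.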
1555 If an Advertisement includes multiple Capture Scene Views in a 1556 Capture Scene then the Consumer MAY choose one Capture Scene View 1557 for each media type, or MAY choose individual Captures based on the 1558 Simultaneous Transmission Sets. 1560 9. Encodings 1562 Individual encodings and encoding groups are CLUE's mechanisms 1563 allowing a Provider to signal its limitations for sending Captures, 1564 or combinations of Captures, to a Consumer. Consumers can map the 1565 Captures they want to receive onto the Encodings, with encoding 1566 parameters they want. As for the relationship between the CLUE- 1567 specified mechanisms based on Encodings and the SIP Offer-Answer 1568 exchange, please refer to section 5. 1570 9.1. Individual Encodings 1572 An Individual Encoding represents a way to encode a Media Capture 1573 to become a Capture Encoding, to be sent as an encoded media stream 1574 from the Provider to the Consumer. An Individual Encoding has a 1575 set of parameters characterizing how the media is encoded. 1577 Different media types have different parameters, and different 1578 encoding algorithms may have different parameters. An Individual 1579 Encoding can be assigned to at most one Capture Encoding at any 1580 given time. 1582 Individual Encoding parameters are represented in SDP [RFC4566], 1583 not in CLUE messages. For example, for a video encoding using 1584 H.26x compression technologies, this can include parameters such 1585 as: 1587 . Maximum bandwidth; 1588 . Maximum picture size in pixels; 1589 . Maximum number of pixels to be processed per second. 1591 The bandwidth parameter is the only one that specifically relates 1592 to a CLUE Advertisement, as it can be further constrained by the 1593 maximum group bandwidth in an Encoding Group. 1595 9.2. Encoding Group 1597 An Encoding Group includes a set of one or more Individual 1598 Encodings, and parameters that apply to the group as a whole.
By 1599 grouping multiple individual Encodings together, an Encoding Group 1600 describes additional constraints on bandwidth for the group. A 1601 single Encoding Group MAY refer to encodings for different media 1602 types. 1604 The Encoding Group data structure contains: 1606 . Maximum bitrate for all encodings in the group combined; 1607 . A list of identifiers for the Individual Encodings belonging 1608 to the group. 1610 When the Individual Encodings in a group are instantiated into 1611 Capture Encodings, each Capture Encoding has a bitrate that MUST be 1612 less than or equal to the max bitrate for the particular individual 1613 encoding. The "maximum bitrate for all encodings in the group" 1614 parameter gives the additional restriction that the sum of all the 1615 individual capture encoding bitrates MUST be less than or equal to 1616 this group value. 1618 The following diagram illustrates one example of the structure of a 1619 media Provider's Encoding Groups and their contents. 1621 ,-------------------------------------------------. 1622 | Media Provider | 1623 | | 1624 | ,--------------------------------------. | 1625 | | ,--------------------------------------. | 1626 | | | ,--------------------------------------. | 1627 | | | | Encoding Group | | 1628 | | | | ,-----------. | | 1629 | | | | | | ,---------. | | 1630 | | | | | | | | ,---------.| | 1631 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1632 | `.| | | | | | `---------'| | 1633 | `.| `-----------' `---------' | | 1634 | `--------------------------------------' | 1635 `-------------------------------------------------' 1637 Figure 4: Encoding Group Structure 1639 A Provider advertises one or more Encoding Groups. Each Encoding 1640 Group includes one or more Individual Encodings. Each Individual 1641 Encoding can represent a different way of encoding media. 
For 1642 example one Individual Encoding may be 1080p60 video, another could 1643 be 720p30, with a third being CIF, all in, for example, H.264 1644 format. 1645 While a typical three codec/display system might have one Encoding 1646 Group per "codec box" (physical codec, connected to one camera and 1647 one screen), there are many possibilities for the number of 1648 Encoding Groups a Provider may be able to offer and for the 1649 encoding values in each Encoding Group. 1651 There is no requirement for all Encodings within an Encoding Group 1652 to be instantiated at the same time. 1654 9.3. Associating Captures with Encoding Groups 1656 Each Media Capture, including MCCs, MAY be associated with one or 1657 more Encoding Groups. To be eligible for configuration, a Media 1658 Capture MUST be associated with at least one Encoding Group, which 1659 is used to instantiate that Capture into a Capture Encoding. When 1660 an MCC is configured all the Media Captures referenced by the MCC 1661 will appear in the Capture Encoding according to the attributes of 1662 the chosen encoding of the MCC. This allows an Advertiser to 1663 specify encoding attributes associated with the Media Captures 1664 without the need to provide an individual Capture Encoding for each 1665 of the inputs. 1667 If an Encoding Group is assigned to a Media Capture referenced by 1668 the MCC it indicates that this Capture may also have an individual 1669 Capture Encoding. 
1671 For example: 1673 +--------------------+------------------------------------+ 1674 | Capture Scene #1 | | 1675 +--------------------+------------------------------------+ 1676 | VC1 | EncodeGroupID=1 | 1677 | VC2 | | 1678 | MCC1(VC1,VC2) | EncodeGroupID=2 | 1679 | CSV(VC1) | | 1680 | CSV(MCC1) | | 1681 +--------------------+------------------------------------+ 1683 Table 6: Example usage of Encoding with MCC and source Captures 1685 This would indicate that VC1 may be sent as its own Capture 1686 Encoding from EncodeGroupID=1 or that it may be sent as part of a 1687 Capture Encoding from EncodeGroupID=2 along with VC2. 1689 More than one Capture MAY use the same Encoding Group. 1691 The maximum number of Capture Encodings that can result from a 1692 particular Encoding Group constraint is equal to the number of 1693 individual Encodings in the group. The actual number of Capture 1694 Encodings used at any time MAY be less than this maximum. Any of 1695 the Captures that use a particular Encoding Group can be encoded 1696 according to any of the Individual Encodings in the group. 1698 It is a protocol conformance requirement that the Encoding Groups 1699 MUST allow all the Captures in a particular Capture Scene View to 1700 be used simultaneously. 1702 10. Consumer's Choice of Streams to Receive from the Provider 1704 After receiving the Provider's Advertisement message (that includes 1705 media captures and associated constraints), the Consumer composes 1706 its reply to the Provider in the form of a Configure message. The 1707 Consumer is free to use the information in the Advertisement as it 1708 chooses, but there are a few obviously sensible design choices, 1709 which are outlined below. 1711 If multiple Providers connect to the same Consumer (i.e. 
in an 1712 MCU-less multiparty call), it is the responsibility of the Consumer 1713 to compose a Configure for each Provider that satisfies both that 1714 Provider's constraints, as expressed in its Advertisement, and 1715 the Consumer's own capabilities. 1717 In an MCU-based multiparty call, the MCU can logically terminate 1718 the Advertisement/Configure negotiation in that it can hide the 1719 characteristics of the receiving endpoint and rely on its own 1720 capabilities (transcoding/transrating/...) to create Media Streams 1721 that can be decoded at the Endpoint Consumers. The timing of an 1722 MCU's sending of Advertisements (for its outgoing ports) and 1723 Configures (for its incoming ports, in response to Advertisements 1724 received there) is up to the MCU and implementation-dependent. 1726 As a general outline, a Consumer can choose, based on the 1727 Advertisement it has received, which Captures it wishes to receive, 1728 and which Individual Encodings it wants the Provider to use to 1729 encode the Captures. 1731 On receipt of an Advertisement with an MCC the Consumer treats the 1732 MCC as per other non-MCC Captures with the following differences: 1734 - The Consumer would understand that the MCC is a Capture that 1735 includes the referenced individual Captures and that these 1736 individual Captures are delivered as part of the MCC's Capture 1737 Encoding. 1739 - The Consumer may utilise any of the attributes associated with 1740 the referenced individual Captures and any Capture Scene attributes 1741 from where the individual Captures were defined to choose Captures 1742 and for rendering decisions. 1744 - The Consumer may or may not choose to receive all the indicated 1745 captures. Therefore it can choose to receive a subset of the Captures 1746 indicated by the MCC.
1748 For example, if the Consumer receives: 1750 MCC1(VC1,VC2,VC3){attributes} 1752 A Consumer could choose all the Captures within an MCC; however, if 1753 the Consumer determines that it doesn't want VC3 it can return 1754 MCC1(VC1,VC2). If it wants all the individual Captures then it 1755 returns only the MCC identity (i.e. MCC1). If the MCC in the 1756 advertisement does not reference any individual captures, then the 1757 Consumer cannot choose what is included in the MCC; it is up to the 1758 Provider to decide. 1760 A Configure Message includes a list of Capture Encodings. These 1761 are the Capture Encodings the Consumer wishes to receive from the 1762 Provider. Each Capture Encoding refers to one Media Capture and 1763 one Individual Encoding. A Configure Message does not include 1764 references to Capture Scenes or Capture Scene Views. 1766 For each Capture the Consumer wants to receive, it configures one 1767 of the Encodings in that Capture's Encoding Group. The Consumer 1768 does this by telling the Provider, in its Configure Message, which 1769 Encoding to use for each chosen Capture. Upon receipt of this 1770 Configure from the Consumer, common knowledge is established 1771 between Provider and Consumer regarding sensible choices for the 1772 media streams. The setup of the actual media channels, at least in 1773 the simplest case, is left to a following offer-answer exchange. 1774 Optimized implementations MAY speed up the reaction to the offer- 1775 answer exchange by reserving the resources at the time of 1776 finalization of the CLUE handshake. 1778 CLUE advertisements and configure messages don't necessarily 1779 require a new SDP offer-answer for every CLUE message 1780 exchange. But the resulting encodings sent via RTP must conform to 1781 the most recent SDP offer-answer result.
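The constraints that a Configure has to respect can be sketched as follows, under stated assumptions. The class names, field names, identifiers and bitrate figures are hypothetical and are not taken from the CLUE data model or protocol documents. The per-encoding maximum bitrate is treated as an input because Individual Encoding parameters are conveyed in SDP. The sketch checks four rules from sections 9.1 to 9.3: the chosen Encoding belongs to an Encoding Group associated with the Capture, an Individual Encoding serves at most one Capture Encoding, each bitrate stays within its Encoding's maximum, and the sum per group stays within the group maximum.

```python
# Illustrative sketch only; class and field names are assumptions and do
# not come from the CLUE data model or protocol documents.
from dataclasses import dataclass


@dataclass
class EncodingGroup:
    max_group_bitrate: int   # limit for all encodings in the group combined
    encoding_ids: list       # identifiers of the member Individual Encodings


@dataclass
class CaptureEncoding:
    capture_id: str          # one Media Capture (may be an MCC)
    encoding_id: str         # one Individual Encoding
    bitrate: int             # bitrate requested by the Consumer


def check_configure(configure, groups, capture_groups, max_bitrate):
    """Return a list of problems found; an empty list means no violation.

    groups:         group id -> EncodingGroup
    capture_groups: capture id -> ids of its associated Encoding Groups
    max_bitrate:    encoding id -> maximum bitrate (conveyed in SDP)
    """
    problems = []
    used = set()
    group_totals = {}
    for ce in configure:
        # The encoding must come from a group associated with the Capture.
        group_id = next(
            (g for g in capture_groups.get(ce.capture_id, [])
             if ce.encoding_id in groups[g].encoding_ids),
            None,
        )
        if group_id is None:
            problems.append(
                f"{ce.capture_id}: {ce.encoding_id} not in its Encoding Group")
            continue
        # An Individual Encoding serves at most one Capture Encoding.
        if ce.encoding_id in used:
            problems.append(f"{ce.encoding_id}: assigned more than once")
        used.add(ce.encoding_id)
        if ce.bitrate > max_bitrate[ce.encoding_id]:
            problems.append(f"{ce.encoding_id}: exceeds its own maximum bitrate")
        group_totals[group_id] = group_totals.get(group_id, 0) + ce.bitrate
    for group_id, total in group_totals.items():
        if total > groups[group_id].max_group_bitrate:
            problems.append(f"{group_id}: group bitrate limit exceeded")
    return problems


# Hypothetical Advertisement data and a Configure that overruns the group.
groups = {"EG1": EncodingGroup(6_000_000, ["ENC0", "ENC1"])}
capture_groups = {"VC1": ["EG1"], "MCC3": ["EG1"]}
max_bitrate = {"ENC0": 4_000_000, "ENC1": 4_000_000}
configure = [
    CaptureEncoding("VC1", "ENC0", 4_000_000),
    CaptureEncoding("MCC3", "ENC1", 3_000_000),
]
print(check_configure(configure, groups, capture_groups, max_bitrate))
```

In this made-up case each Capture Encoding is within its own limit, but the two together request 7,000,000 against a group limit of 6,000,000, so the group check is the one that reports a problem.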
1783 In order to meaningfully create and send an initial Configure, the 1784 Consumer needs to have received at least one Advertisement, and an 1785 SDP offer defining the Individual Encodings, from the Provider. 1787 In addition, the Consumer can send a Configure at any time during 1788 the call. The Configure MUST be valid according to the most 1789 recently received Advertisement. The Consumer can send a Configure 1790 either in response to a new Advertisement from the Provider or on 1791 its own, for example because of a local change in conditions 1792 (people leaving the room, connectivity changes, multipoint related 1793 considerations). 1795 When choosing which Media Streams to receive from the Provider, and 1796 the encoding characteristics of those Media Streams, the Consumer 1797 advantageously takes several things into account: its local 1798 preference, simultaneity restrictions, and encoding limits. 1800 10.1. Local preference 1802 A variety of local factors influence the Consumer's choice of 1803 Media Streams to be received from the Provider: 1805 o if the Consumer is an Endpoint, it is likely that it would 1806 choose, where possible, to receive video and audio Captures that 1807 match the number of display devices and audio system it has 1809 o if the Consumer is an MCU, it MAY choose to receive loudest 1810 speaker streams (in order to perform its own media composition) 1811 and avoid pre-composed video Captures 1813 o user choice (for instance, selection of a new layout) MAY result 1814 in a different set of Captures, or different encoding 1815 characteristics, being required by the Consumer 1817 10.2. Physical simultaneity restrictions 1819 Often there are physical simultaneity constraints of the Provider 1820 that affect the Provider's ability to simultaneously send all of 1821 the captures the Consumer would wish to receive. 
For instance, an 1822 MCU, when connected to a multi-camera room system, might prefer to 1823 receive both individual video streams of the people present in the 1824 room and an overall view of the room from a single camera. Some 1825 Endpoint systems might be able to provide both of these sets of 1826 streams simultaneously, whereas others might not (if the overall 1827 room view were produced by changing the optical zoom level on the 1828 center camera, for instance). 1830 10.3. Encoding and encoding group limits 1832 Each of the Provider's encoding groups has limits on bandwidth, 1833 and the constituent potential encodings have limits on the 1834 bandwidth, computational complexity, video frame rate, and 1835 resolution that can be provided. When choosing the Captures to be 1836 received from a Provider, a Consumer device MUST ensure that the 1837 encoding characteristics requested for each individual Capture 1838 fit within the capability of the encoding it is being configured 1839 to use, as well as ensuring that the combined encoding 1840 characteristics for Captures fit within the capabilities of their 1841 associated encoding groups. In some cases, this could cause an 1842 otherwise "preferred" choice of capture encodings to be passed 1843 over in favor of different Capture Encodings--for instance, if a 1844 set of three Captures could only be provided at a low resolution 1845 then a three screen device could switch to favoring a single, 1846 higher quality, Capture Encoding. 1848 11. Extensibility 1850 One important characteristic of the Framework is its 1851 extensibility. The standard for interoperability and handling 1852 multiple streams must be future-proof. The framework itself is 1853 inherently extensible through expanding the data model types. For 1854 example: 1856 o Adding more types of media, such as telemetry, can be done by 1857 defining additional types of Captures in addition to audio and 1858 video.
1860 o Adding new functionalities, such as 3-D, may require 1861 additional attributes describing the Captures. 1863 The infrastructure is designed to be extended rather than 1864 requiring new infrastructure elements. Extension comes through 1865 adding to defined types. 1867 12. Examples - Using the Framework (Informative) 1869 This section gives some examples, first from the point of view of 1870 the Provider, then the Consumer, then some multipoint scenarios. 1872 12.1. Provider Behavior 1874 This section shows some examples in more detail of how a Provider 1875 can use the framework to represent a typical case for telepresence 1876 rooms. First an endpoint is illustrated, then an MCU case is 1877 shown. 1879 12.1.1. Three screen Endpoint Provider 1881 Consider an Endpoint with the following description: 1883 3 cameras, 3 displays, a 6 person table 1885 o Each camera can provide one Capture for each 1/3 section of the 1886 table 1888 o A single Capture representing the active speaker can be provided 1889 (voice activity based camera selection to a given encoder input 1890 port implemented locally in the Endpoint) 1892 o A single Capture representing the active speaker with the other 1893 2 Captures shown picture in picture within the stream can be 1894 provided (again, implemented inside the endpoint) 1896 o A Capture showing a zoomed out view of all 6 seats in the room 1897 can be provided 1899 The audio and video Captures for this Endpoint can be described as 1900 follows.
1902 Video Captures: 1904 o VC0- (the left camera stream), encoding group=EG0, view=table 1906 o VC1- (the center camera stream), encoding group=EG1, view=table 1908 o VC2- (the right camera stream), encoding group=EG2, view=table 1910 o MCC3- (the loudest panel stream), encoding group=EG1, 1911 view=table, MaxCaptures=1 1913 o MCC4- (the loudest panel stream with PiPs), encoding group=EG1, 1914 view=room, MaxCaptures=3 1916 o VC5- (the zoomed out view of all people in the room), encoding 1917 group=EG1, view=room 1919 o VC6- (presentation stream), encoding group=EG1, presentation 1921 The following diagram is a top view of the room with 3 cameras, 3 1922 displays, and 6 seats. Each camera is capturing 2 people. The 1923 six seats are not all in a straight line. 1925 ,-. d 1926 ( )`--.__ +---+ 1927 `-' / `--.__ | | 1928 ,-. | `-.._ |_-+Camera 2 (VC2) 1929 ( ).' <--(AC1)-+-''`+-+ 1930 `-' |_...---'' | | 1931 ,-.c+-..__ +---+ 1932 ( )| ``--..__ | | 1933 `-' | ``+-..|_-+Camera 1 (VC1) 1934 ,-. | <--(AC2)..--'|+-+ ^ 1935 ( )| __..--' | | | 1936 `-'b|..--' +---+ |X 1937 ,-. |``---..___ | | | 1938 ( )\ ```--..._|_-+Camera 0 (VC0) | 1939 `-' \ <--(AC0) ..-''`-+ | 1940 ,-. \ __.--'' | | <----------+ 1941 ( ) |..-'' +---+ Y 1942 `-' a (0,0,0) origin is under Camera 1 1944 Figure 5: Room Layout Top View 1946 The two points labeled b and c are intended to be at the midpoint 1947 between the seating positions, and where the fields of view of the 1948 cameras intersect. 1950 The plane of interest for VC0 is a vertical plane that intersects 1951 points 'a' and 'b'. 1953 The plane of interest for VC1 intersects points 'b' and 'c'. The 1954 plane of interest for VC2 intersects points 'c' and 'd'. 1956 This example uses an area scale of millimeters. 
1958     Areas of capture:

1960           bottom left     bottom right   top left          top right
1961     VC0  (-2011,2850,0)  (-673,3000,0)  (-2011,2850,757)  (-673,3000,757)
1962     VC1  ( -673,3000,0)  ( 673,3000,0)  ( -673,3000,757)  ( 673,3000,757)
1963     VC2  (  673,3000,0)  (2011,2850,0)  (  673,3000,757)  (2011,2850,757)
1964     MCC3 (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,2850,757)
1965     MCC4 (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,2850,757)
1966     VC5  (-2011,2850,0)  (2011,2850,0)  (-2011,2850,757)  (2011,2850,757)
1967     VC6  none

1969     Points of capture:
1970     VC0  (-1678,0,800)
1971     VC1  (0,0,800)
1972     VC2  (1678,0,800)
1973     MCC3 none
1974     MCC4 none
1975     VC5  (0,0,800)
1976     VC6  none

1978     In this example, the right edge of the VC0 area lines up with the
1979     left edge of the VC1 area. It doesn't have to be this way. There
1980     could be a gap or an overlap. One additional thing to note for
1981     this example is that the distance from a to b is equal to the distance
1982     from b to c and the distance from c to d. All these distances are
1983     1346 mm. This is the planar width of each area of capture for VC0,
1984     VC1, and VC2.

1986     Note the text in parentheses (e.g. "the left camera stream") is
1987     not explicitly part of the model; it is just explanatory text for
1988     this example, and is not included in the model with the media
1989     captures and attributes. Also, MCC4 doesn't say anything about
1990     how a capture is composed, so the media consumer can't tell based
1991     on this capture that MCC4 is composed of a "loudest panel with
1992     PiPs".

1994     Audio Captures:

1996     Three ceiling microphones are located between the cameras and the
1997     table, at the same height as the cameras. The microphones point
1998     down at an angle toward the seating positions.
2000     o AC0 (left), encoding group=EG3

2002     o AC1 (right), encoding group=EG3

2004     o AC2 (center), encoding group=EG3

2006     o AC3 being a simple pre-mixed audio stream from the room (mono),
2007       encoding group=EG3

2009     o AC4 audio stream associated with the presentation video (mono),
2010       encoding group=EG3, presentation

2012          Point of capture:     Point on Line of Capture:

2014     AC0  (-1342,2000,800)      (-1342,2925,379)
2015     AC1  ( 1342,2000,800)      ( 1342,2925,379)
2016     AC2  (    0,2000,800)      (    0,3000,379)
2017     AC3  (    0,2000,800)      (    0,3000,379)
2018     AC4  none

2020     The physical simultaneity information is:

2022     Simultaneous transmission set #1 {VC0, VC1, VC2, MCC3, MCC4,
2023     VC6}

2025     Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

2027     This constraint indicates it is not possible to use all the VCs at
2028     the same time. VC5 cannot be used at the same time as VC1 or MCC3
2029     or MCC4. Also, using every member in the set simultaneously may
2030     not make sense - for example MCC3 (loudest) and MCC4 (loudest with
2031     PiPs). (In addition, there are encoding constraints that make
2032     choosing all of the VCs in a set impossible. VC1, MCC3, MCC4,
2033     VC5, VC6 all use EG1 and EG1 has only 3 ENCs. This constraint
2034     shows up in the encoding groups, not in the simultaneous
2035     transmission sets.)

2037     In this example there are no restrictions on which audio captures
2038     can be sent simultaneously.

2040     Encoding Groups:

2042     This example has three encoding groups associated with the video
2043     captures. Each group can have 3 encodings, but with each
2044     potential encoding having a progressively lower specification. In
2045     this example, 1080p60 transmission is possible (as ENC0 has a
2046     maxPps value compatible with that). Significantly, as up to 3
2047     encodings are available per group, it is possible to transmit some
2048     video captures simultaneously that are not in the same view in the
2049     capture scene, for example VC1 and MCC3 at the same time.
2051     encodeGroupID=EG0, maxGroupBandwidth=6000000
2052         encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2053                  maxPps=124416000, maxBandwidth=4000000
2054         encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2055                  maxPps=27648000, maxBandwidth=4000000
2056         encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
2057                  maxPps=15552000, maxBandwidth=4000000
2058     encodeGroupID=EG1, maxGroupBandwidth=6000000
2059         encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2060                  maxPps=124416000, maxBandwidth=4000000
2061         encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2062                  maxPps=27648000, maxBandwidth=4000000
2063         encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
2064                  maxPps=15552000, maxBandwidth=4000000
2065     encodeGroupID=EG2, maxGroupBandwidth=6000000
2066         encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
2067                  maxPps=124416000, maxBandwidth=4000000
2068         encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
2069                  maxPps=27648000, maxBandwidth=4000000
2070         encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
2071                  maxPps=15552000, maxBandwidth=4000000

2073               Figure 6: Example Encoding Groups for Video

2075     For audio, there are five potential encodings available, so all
2076     five audio captures can be encoded at the same time.

2078     encodeGroupID=EG3, maxGroupBandwidth=320000
2079         encodeID=ENC9, maxBandwidth=64000
2080         encodeID=ENC10, maxBandwidth=64000
2081         encodeID=ENC11, maxBandwidth=64000
2082         encodeID=ENC12, maxBandwidth=64000
2083         encodeID=ENC13, maxBandwidth=64000

2085               Figure 7: Example Encoding Group for Audio

2087     Capture Scenes:

2089     The following table represents the capture scenes for this
2090     provider. Recall that a capture scene is composed of alternative
2091     capture scene views covering the same spatial region. Capture
2092     Scene #1 is for the main people captures, and Capture Scene #2 is
2093     for presentation.
2095     Each row in the table is a separate Capture Scene View.

2097     +------------------+
2098     | Capture Scene #1 |
2099     +------------------+
2100     | VC0, VC1, VC2    |
2101     | MCC3             |
2102     | MCC4             |
2103     | VC5              |
2104     | AC0, AC1, AC2    |
2105     | AC3              |
2106     +------------------+

2108     +------------------+
2109     | Capture Scene #2 |
2110     +------------------+
2111     | VC6              |
2112     | AC4              |
2113     +------------------+

2115     Table 7: Example Capture Scene Views

2117     Different capture scenes are distinct from each other and do not
2118     overlap. A consumer can choose a view from each capture scene.
2119     In this case the three captures VC0, VC1, and VC2 are one way of
2120     representing the video from the endpoint. These three captures
2121     should appear adjacent to each other. Alternatively, another
2122     way of representing the Capture Scene is with the capture MCC3,
2123     which automatically shows the person who is talking. The same
2124     applies to the MCC4 and VC5 alternatives.

2126     As in the video case, the different views of audio in Capture
2127     Scene #1 represent the "same thing", in that one way to receive
2128     the audio is with the 3 audio captures (AC0, AC1, AC2), and
2129     another way is with the mixed AC3. The Media Consumer can choose
2130     an audio CSV it is capable of receiving.

2132     The spatial ordering is conveyed by the media capture attributes
2133     Area of Capture, Point of Capture, and Point on Line of Capture.

2135     A Media Consumer would likely want to choose a capture scene view
2136     to receive based in part on how many streams it can simultaneously
2137     receive. A consumer that can receive three people streams would
2138     probably prefer to receive the first view of Capture Scene #1
2139     (VC0, VC1, VC2) and not receive the other views. A consumer that
2140     can receive only one people stream would probably choose one of
2141     the other views.

2143     If the consumer can receive a presentation stream too, it would
2144     also choose to receive the only video view from Capture Scene #2
         (VC6).
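
The two kinds of constraint in this example (simultaneous transmission sets, and the number of encodings in each encoding group) can be checked mechanically. The following Python sketch is illustrative only and is not part of the CLUE model: the function name `can_send_together` and the dictionary layout are assumptions made for this example, while the capture names, sets, and encoding counts are transcribed from the text above.

```python
from collections import Counter

# Values transcribed from the Three screen Endpoint Provider example.
SIMULTANEOUS_SETS = [
    {"VC0", "VC1", "VC2", "MCC3", "MCC4", "VC6"},
    {"VC0", "VC2", "VC5", "VC6"},
]
ENCODING_GROUP = {
    "VC0": "EG0", "VC1": "EG1", "VC2": "EG2", "MCC3": "EG1",
    "MCC4": "EG1", "VC5": "EG1", "VC6": "EG1",
}
ENCODINGS_PER_GROUP = {"EG0": 3, "EG1": 3, "EG2": 3}


def can_send_together(captures):
    """Hypothetical helper: True if the video captures meet both constraints."""
    chosen = set(captures)
    # Physical simultaneity: the selection must fit inside one advertised set.
    if not any(chosen <= allowed for allowed in SIMULTANEOUS_SETS):
        return False
    # Encoding limit: a group cannot serve more captures than it has encodings.
    used = Counter(ENCODING_GROUP[c] for c in chosen)
    return all(count <= ENCODINGS_PER_GROUP[group] for group, count in used.items())
```

Under these assumptions, `["VC0", "VC1", "VC2", "VC6"]` passes both checks, `["VC1", "VC5"]` fails the simultaneity check, and `["VC1", "MCC3", "MCC4", "VC6"]` fails only because EG1 has 3 encodings.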
2146  12.1.2. Encoding Group Example

2148     This is an example of an encoding group to illustrate how it can
2149     express dependencies between encodings.

2151     encodeGroupID=EG0, maxGroupBandwidth=6000000
2152         encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2153                  maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2154         encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2155                  maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2156         encodeID=AUDENC0, maxBandwidth=96000
2157         encodeID=AUDENC1, maxBandwidth=96000
2158         encodeID=AUDENC2, maxBandwidth=96000

2160     Here, the encoding group is EG0. Although the encoding group is
2161     capable of transmitting up to 6Mbit/s, no individual video
2162     encoding can exceed 4Mbit/s.

2164     This encoding group also allows up to 3 audio encodings,
2165     AUDENC<0-2>. It is not required that audio and video encodings reside
2166     within the same encoding group, but if so then the group's overall
2167     maxGroupBandwidth value is a limit on the sum of all audio and video
2168     encodings configured by the consumer. A system that does not wish
2169     or need to combine bandwidth limitations in this way should
2170     instead use separate encoding groups for audio and video in order
2171     for the bandwidth limitations on audio and video to not interact.

2173     Audio and video can be expressed in separate encoding groups, as
2174     in this illustration.

2176     encodeGroupID=EG0, maxGroupBandwidth=6000000
2177         encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
2178                  maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2179         encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
2180                  maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
2181     encodeGroupID=EG1, maxGroupBandwidth=500000
2182         encodeID=AUDENC0, maxBandwidth=96000
2183         encodeID=AUDENC1, maxBandwidth=96000
2184         encodeID=AUDENC2, maxBandwidth=96000

2186  12.1.3. The MCU Case

2188     This section shows how an MCU might express its Capture Scenes,
2189     intending to offer different choices for consumers that can handle
2190     different numbers of streams. A single audio capture stream is
2191     provided for all single and multi-screen configurations that can
2192     be associated (e.g. lip-synced) with any combination of video
2193     captures at the consumer.

2195     +-----------------------+---------------------------------+
2196     | Capture Scene #1      |                                 |
2197     +-----------------------|---------------------------------+
2198     | VC0                   | VC for a single screen consumer |
2199     | VC1, VC2              | VCs for a two screen consumer   |
2200     | VC3, VC4, VC5         | VCs for a three screen consumer |
2201     | VC6, VC7, VC8, VC9    | VCs for a four screen consumer  |
2202     | AC0                   | AC representing all participants|
2203     | CSV(VC0)              |                                 |
2204     | CSV(VC1,VC2)          |                                 |
2205     | CSV(VC3,VC4,VC5)      |                                 |
2206     | CSV(VC6,VC7,VC8,VC9)  |                                 |
2207     | CSV(AC0)              |                                 |
2208     +-----------------------+---------------------------------+

2210     Table 8: MCU main Capture Scenes

2212     If / when a presentation stream becomes active within the
2213     conference, the MCU might re-advertise the available media as:

2215     +------------------+--------------------------------------+
2216     | Capture Scene #2 | note                                 |
2217     +------------------+--------------------------------------+
2218     | VC10             | video capture for presentation       |
2219     | AC1              | presentation audio to accompany VC10 |
2220     | CSV(VC10)        |                                      |
2221     | CSV(AC1)         |                                      |
2222     +------------------+--------------------------------------+

2224     Table 9: MCU presentation Capture Scene

2226  12.2. Media Consumer Behavior

2228     This section gives an example of how a Media Consumer might behave
2229     when deciding how to request streams from the three screen
2230     endpoint described in the previous section.
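
As a rough illustration of matching views to screens, the Python sketch below lists the video Capture Scene Views of the three screen endpoint that make the fullest use of a given number of screens. It is a simplification under stated assumptions: one received stream per screen, no bandwidth or decoder limits, and a hypothetical function name `candidate_views`. The view contents are those of Capture Scene #1 in Table 7.

```python
# Video views of Capture Scene #1 for the three screen endpoint (Table 7).
ENDPOINT_VIDEO_VIEWS = [["VC0", "VC1", "VC2"], ["MCC3"], ["MCC4"], ["VC5"]]


def candidate_views(views, screens):
    """Hypothetical helper: views that fit on `screens` and use the most of them.

    Assumes one stream per screen. Several views may tie; choosing among
    them is left to preferences or to the user.
    """
    fitting = [view for view in views if len(view) <= screens]
    if not fitting:
        return []
    best = max(len(view) for view in fitting)
    return [view for view in fitting if len(view) == best]
```

With one or two screens this returns the three single-stream alternatives (MCC3, MCC4, VC5); with three screens it returns only the view containing VC0, VC1, and VC2.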
2232     The receive side of a call needs to balance its requirements,
2233     based on number of screens and speakers, its decoding capabilities
2234     and available bandwidth, and the provider's capabilities in order
2235     to optimally configure the provider's streams. Typically it would
2236     want to receive and decode media from each Capture Scene
2237     advertised by the Provider.

2239     A sane, basic algorithm might be for the consumer to go through
2240     each Capture Scene View in turn and find the collection of Video
2241     Captures that best matches the number of screens it has (this
2242     might include consideration of screens dedicated to presentation
2243     video display rather than "people" video) and then decide between
2244     alternative views in the video Capture Scenes based either on
2245     hard-coded preferences or user choice. Once this choice has been
2246     made, the consumer would then decide how to configure the
2247     provider's encoding groups in order to make best use of the
2248     available network bandwidth and its own decoding capabilities.

2250  12.2.1. One screen Media Consumer

2252     MCC3, MCC4 and VC5 are all different views by themselves, not
2253     grouped together in a single view, so the receiving device should
2254     choose one of those. The choice would come down to
2255     whether to see the greatest number of participants simultaneously
2256     at roughly equal precedence (VC5), a switched view of just the
2257     loudest region (MCC3) or a switched view with PiPs (MCC4). An
2258     endpoint device with a small amount of knowledge of these
2259     differences could offer a dynamic choice of these options,
2260     in-call, to the user.

2262  12.2.2. Two screen Media Consumer configuring the example

2264     Mixing systems with an even number of screens, "2n", and those
2265     with "2n+1" cameras (and vice versa) is always likely to be the
2266     problematic case. In this instance, the behavior is likely to be
2267     determined by whether a "2 screen" system is really a "2 decoder"
2268     system, i.e., whether only one received stream can be displayed
2269     per screen or whether more than 2 streams can be received and
2270     spread across the available screen area. To enumerate 3 possible
2271     behaviors here for the 2 screen system when it learns that the far
2272     end is "ideally" expressed via 3 capture streams:

2274     1. Fall back to receiving just a single stream (MCC3, MCC4 or VC5
2275        as per the 1 screen consumer case above) and either leave one
2276        screen blank or use it for presentation if / when a
2277        presentation becomes active.

2279     2. Receive 3 streams (VC0, VC1 and VC2) and display across 2
2280        screens (either with each capture being scaled to 2/3 of a
2281        screen and the center capture being split across 2 screens) or,
2282        as would be necessary if there were large bezels on the
2283        screens, with each stream being scaled to 1/2 the screen width
2284        and height and there being a 4th "blank" panel. This 4th panel
2285        could potentially be used for any presentation that became
2286        active during the call.

2288     3. Receive 3 streams, decode all 3, and use control information
2289        indicating which was the most active to switch between showing
2290        the left and center streams (one per screen) and the center and
2291        right streams.

2293     For an endpoint capable of all 3 methods of working described
2294     above, again it might be appropriate to offer the user the choice
2295     of display mode.

2297  12.2.3. Three screen Media Consumer configuring the example

2299     This is the most straightforward case - the Media Consumer would
2300     look to identify a set of streams to receive that best matched its
2301     available screens and so the VC0 plus VC1 plus VC2 should match
2302     optimally. The spatial ordering would give sufficient information
2303     for the correct video capture to be shown on the correct screen,
2304     and the consumer would either need to divide a single encoding
2305     group's capability by 3 to determine what resolution and frame
2306     rate to configure the provider with or to configure the individual
2307     video captures' encoding groups with what makes most sense (taking
2308     into account the receive side decode capabilities, overall call
2309     bandwidth, the resolution of the screens plus any user preferences
2310     such as motion vs sharpness).

2312  12.3. Multipoint Conference utilizing Multiple Content Captures

2314     The use of MCCs allows the MCU to construct outgoing Advertisements
2315     describing complex media switching and composition scenarios.
2316     The following sections provide several examples.

2318     Note: In the examples the identities of the CLUE elements (e.g.
2319     Captures, Capture Scene) in the incoming Advertisements overlap.
2320     This is because there is no co-ordination between the endpoints.
2321     The MCU is responsible for making these unique in the outgoing
2322     advertisement.

2324  12.3.1. Single Media Captures and MCC in the same Advertisement

2326     Four endpoints are involved in a Conference where CLUE is used. An
2327     MCU acts as a middlebox between the endpoints with a CLUE channel
2328     between each endpoint and the MCU. The MCU receives the following
2329     Advertisements.
2331     +-----------------------+---------------------------------+
2332     | Capture Scene #1      | Description=AustralianConfRoom  |
2333     +-----------------------|---------------------------------+
2334     | VC1                   | Description=Audience            |
2335     |                       | EncodeGroupID=1                 |
2336     | CSV(VC1)              |                                 |
2337     +---------------------------------------------------------+

2339     Table 10: Advertisement received from Endpoint A

2341     +-----------------------+---------------------------------+
2342     | Capture Scene #1      | Description=ChinaConfRoom       |
2343     +-----------------------|---------------------------------+
2344     | VC1                   | Description=Speaker             |
2345     |                       | EncodeGroupID=1                 |
2346     | VC2                   | Description=Audience            |
2347     |                       | EncodeGroupID=1                 |
2348     | CSV(VC1, VC2)         |                                 |
2349     +---------------------------------------------------------+

2351     Table 11: Advertisement received from Endpoint B

2353     +-----------------------+---------------------------------+
2354     | Capture Scene #1      | Description=USAConfRoom         |
2355     +-----------------------|---------------------------------+
2356     | VC1                   | Description=Audience            |
2357     |                       | EncodeGroupID=1                 |
2358     | CSV(VC1)              |                                 |
2359     +---------------------------------------------------------+

2361     Table 12: Advertisement received from Endpoint C

2363     Note: Endpoint B above indicates that it sends two streams.
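
The three received Advertisements all reuse the identity VC1, so the MCU has to renumber before it can describe them in one outgoing Advertisement. A minimal Python sketch of one possible renumbering follows. The function name `renumber` and the sequential numbering scheme are assumptions for illustration; the framework only requires that the outgoing identities be unique.

```python
def renumber(advertisements):
    """Hypothetical helper: map (endpoint, capture ID) to a unique outgoing ID.

    `advertisements` maps an endpoint label to the capture IDs it advertised,
    in the order the MCU chooses to process them.
    """
    mapping = {}
    next_index = 1
    for endpoint, captures in advertisements.items():
        for capture in captures:
            mapping[(endpoint, capture)] = f"VC{next_index}"
            next_index += 1
    return mapping


# Capture IDs as received from Endpoints A, B and C (Tables 10 to 12).
received = {"A": ["VC1"], "B": ["VC1", "VC2"], "C": ["VC1"]}
outgoing = renumber(received)
```

Here Endpoint A's VC1 stays VC1, Endpoint B's captures become VC2 and VC3, and Endpoint C's VC1 becomes VC4.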
2365     If the MCU wanted to provide a Multiple Content Capture containing
2366     a round robin switched view of the audience from the 3 endpoints
2367     and the speaker, it could construct the following advertisement:

2369     Advertisement sent to Endpoint F

2371     +=======================+=================================+
2372     | Capture Scene #1      | Description=AustralianConfRoom  |
2373     +-----------------------|---------------------------------+
2374     | VC1                   | Description=Audience            |
2375     | CSV(VC1)              |                                 |
2376     +=======================+=================================+
2377     | Capture Scene #2      | Description=ChinaConfRoom       |
2378     +-----------------------|---------------------------------+
2379     | VC2                   | Description=Speaker             |
2380     | VC3                   | Description=Audience            |
2381     | CSV(VC2, VC3)         |                                 |
2382     +=======================+=================================+
2383     | Capture Scene #3      | Description=USAConfRoom         |
2384     +-----------------------|---------------------------------+
2385     | VC4                   | Description=Audience            |
2386     | CSV(VC4)              |                                 |
2387     +=======================+=================================+
2388     | Capture Scene #4      |                                 |
2389     +-----------------------|---------------------------------+
2390     | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1             |
2391     |                       | MaxCaptures=1                   |
2392     |                       | EncodingGroup=1                 |
2393     | CSV(MCC1)             |                                 |
2394     +=======================+=================================+

2396     Table 13: Advertisement sent to Endpoint F - One Encoding

2398     Alternatively, if the MCU wanted to provide the speaker as one media
2399     stream and the audiences as another, it could assign an encoding
2400     group to VC2 in Capture Scene #2 and provide a CSV in Capture Scene
2401     #4 as per the example below.
2403     Advertisement sent to Endpoint F

2405     +=======================+=================================+
2406     | Capture Scene #1      | Description=AustralianConfRoom  |
2407     +-----------------------|---------------------------------+
2408     | VC1                   | Description=Audience            |
2409     | CSV(VC1)              |                                 |
2410     +=======================+=================================+
2411     | Capture Scene #2      | Description=ChinaConfRoom       |
2412     +-----------------------|---------------------------------+
2413     | VC2                   | Description=Speaker             |
2414     |                       | EncodingGroup=1                 |
2415     | VC3                   | Description=Audience            |
2416     | CSV(VC2, VC3)         |                                 |
2417     +=======================+=================================+
2418     | Capture Scene #3      | Description=USAConfRoom         |
2419     +-----------------------|---------------------------------+
2420     | VC4                   | Description=Audience            |
2421     | CSV(VC4)              |                                 |
2422     +=======================+=================================+
2423     | Capture Scene #4      |                                 |
2424     +-----------------------|---------------------------------+
2425     | MCC1(VC1,VC3,VC4)     | Policy=RoundRobin:1             |
2426     |                       | MaxCaptures=1                   |
2427     |                       | EncodingGroup=1                 |
2428     | MCC2(VC2)             | MaxCaptures=1                   |
2429     |                       | EncodingGroup=1                 |
2430     | CSV2(MCC1,MCC2)       |                                 |
2431     +=======================+=================================+

2433     Table 14: Advertisement sent to Endpoint F - Two Encodings

2435     Therefore a Consumer could choose whether or not to have a separate
2436     speaker related stream and could choose which endpoints to see. If
2437     it wanted the second stream but not the Australian conference room,
2438     it could indicate the following captures in the Configure message:

2440     +-----------------------+---------------------------------+
2441     | MCC1(VC3,VC4)         | Encoding                        |
2442     | VC2                   | Encoding                        |
2443     +-----------------------|---------------------------------+
2444     Table 15: MCU case: Consumer Response

2446  12.3.2. Several MCCs in the same Advertisement

2448     Multiple MCCs can be used where multiple streams are used to carry
2449     media from multiple endpoints. For example:

2451     A conference has three endpoints D, E and F. Each endpoint has
2452     three video captures covering the left, middle and right regions of
2453     each conference room. The MCU receives the following
2454     advertisements from D and E.

2456     +-----------------------+---------------------------------+
2457     | Capture Scene #1      | Description=AustralianConfRoom  |
2458     +-----------------------|---------------------------------+
2459     | VC1                   | CaptureArea=Left                |
2460     |                       | EncodingGroup=1                 |
2461     | VC2                   | CaptureArea=Centre              |
2462     |                       | EncodingGroup=1                 |
2463     | VC3                   | CaptureArea=Right               |
2464     |                       | EncodingGroup=1                 |
2465     | CSV(VC1,VC2,VC3)      |                                 |
2466     +---------------------------------------------------------+

2468     Table 16: Advertisement received from Endpoint D

2470     +-----------------------+---------------------------------+
2471     | Capture Scene #1      | Description=ChinaConfRoom       |
2472     +-----------------------|---------------------------------+
2473     | VC1                   | CaptureArea=Left                |
2474     |                       | EncodingGroup=1                 |
2475     | VC2                   | CaptureArea=Centre              |
2476     |                       | EncodingGroup=1                 |
2477     | VC3                   | CaptureArea=Right               |
2478     |                       | EncodingGroup=1                 |
2479     | CSV(VC1,VC2,VC3)      |                                 |
2480     +---------------------------------------------------------+
2481     Table 17: Advertisement received from Endpoint E

2483     The MCU wants to offer Endpoint F three Capture Encodings. Each
2484     Capture Encoding would contain all the Captures from either
2485     Endpoint D or Endpoint E depending on the active speaker.
2486     The MCU sends the following Advertisement:

2488     +=======================+=================================+
2489     | Capture Scene #1      | Description=AustralianConfRoom  |
2490     +-----------------------|---------------------------------+
2491     | VC1                   |                                 |
2492     | VC2                   |                                 |
2493     | VC3                   |                                 |
2494     | CSV(VC1,VC2,VC3)      |                                 |
2495     +=======================+=================================+
2496     | Capture Scene #2      | Description=ChinaConfRoom       |
2497     +-----------------------|---------------------------------+
2498     | VC4                   |                                 |
2499     | VC5                   |                                 |
2500     | VC6                   |                                 |
2501     | CSV(VC4,VC5,VC6)      |                                 |
2502     +=======================+=================================+
2503     | Capture Scene #3      |                                 |
2504     +-----------------------|---------------------------------+
2505     | MCC1(VC1,VC4)         | CaptureArea=Left                |
2506     |                       | MaxCaptures=1                   |
2507     |                       | SynchronisationID=1             |
2508     |                       | EncodingGroup=1                 |
2509     | MCC2(VC2,VC5)         | CaptureArea=Centre              |
2510     |                       | MaxCaptures=1                   |
2511     |                       | SynchronisationID=1             |
2512     |                       | EncodingGroup=1                 |
2513     | MCC3(VC3,VC6)         | CaptureArea=Right               |
2514     |                       | MaxCaptures=1                   |
2515     |                       | SynchronisationID=1             |
2516     |                       | EncodingGroup=1                 |
2517     | CSV(MCC1,MCC2,MCC3)   |                                 |
2518     +=======================+=================================+

2520     Table 18: Advertisement sent to Endpoint F

2522  12.3.3. Heterogeneous conference with switching and composition

2524     Consider a conference between endpoints with the following
2525     characteristics:

2527     Endpoint A - 4 screens, 3 cameras

2529     Endpoint B - 3 screens, 3 cameras

2531     Endpoint C - 3 screens, 3 cameras

2533     Endpoint D - 3 screens, 3 cameras

2535     Endpoint E - 1 screen, 1 camera

2537     Endpoint F - 2 screens, 1 camera

2539     Endpoint G - 1 screen, 1 camera

2541     This example focuses on what the user in one of the 3-camera
2542     multi-screen endpoints sees. Call this person User A, at Endpoint A.
2543     There are 4 large display screens at Endpoint A. Whenever somebody
2544     at another site is speaking, all the video captures from that
2545     endpoint are shown on the large screens. If the talker is at a
2546     3-camera site, then the video from those 3 cameras fills 3 of the
2547     screens. If the talker is at a single-camera site, then video from
2548     that camera fills one of the screens, while the other screens show
2549     video from other single-camera endpoints.

2551     User A hears audio from the 4 loudest talkers.

2553     User A can also see video from other endpoints, in addition to the
2554     current talker, although much smaller in size. Endpoint A has 4
2555     screens, so one of those screens shows up to 9 other Media Captures
2556     in a tiled fashion. When video from a 3 camera endpoint appears in
2557     the tiled area, video from all 3 cameras appears together across
2558     the screen with correct spatial relationship among those 3 images.

2560     +---+---+---+ +-------------+ +-------------+ +-------------+
2561     |   |   |   | |             | |             | |             |
2562     +---+---+---+ |             | |             | |             |
2563     |   |   |   | |             | |             | |             |
2564     +---+---+---+ |             | |             | |             |
2565     |   |   |   | |             | |             | |             |
2566     +---+---+---+ +-------------+ +-------------+ +-------------+
2567                 Figure 8: Endpoint A - 4 Screen Display

2569     User B at Endpoint B sees a similar arrangement, except there are
2570     only 3 screens, so the 9 other Media Captures are spread out across
2571     the bottom of the 3 displays, in a picture-in-picture (PIP) format.

2573     When video from a 3 camera endpoint appears in the PIP area, video
2574     from all 3 cameras appears together across a single screen with
2575     correct spatial relationship.
2577     +-------------+ +-------------+ +-------------+
2578     |             | |             | |             |
2579     |             | |             | |             |
2580     |             | |             | |             |
2581     | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2582     | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ |
2583     +-------------+ +-------------+ +-------------+
2584          Figure 9: Endpoint B - 3 Screen Display with PiPs

2586     When somebody at a different endpoint becomes the current talker,
2587     then User A and User B both see the video from the new talker
2588     appear on their large screen area, while the previous talker takes
2589     one of the smaller tiled or PIP areas. The person who is the
2590     current talker doesn't see themselves; they see the previous talker
2591     in their large screen area.

2593     One of the points of this example is that endpoints A and B each
2594     want to receive 3 capture encodings for their large display areas,
2595     and 9 encodings for their smaller areas. A and B are each able to
2596     send the same Configure message to the MCU, and each receive
2597     the same conceptual Media Captures from the MCU. The differences
2598     are in how they are rendered and are purely a local matter at A and
2599     B.

2601     The Advertisements for such a scenario are described below.
2603     +-----------------------+---------------------------------+
2604     | Capture Scene #1      | Description=Endpoint x          |
2605     +-----------------------|---------------------------------+
2606     | VC1                   | EncodingGroup=1                 |
2607     | VC2                   | EncodingGroup=1                 |
2608     | VC3                   | EncodingGroup=1                 |
2609     | AC1                   | EncodingGroup=2                 |
2610     | CSV1(VC1, VC2, VC3)   |                                 |
2611     | CSV2(AC1)             |                                 |
2612     +---------------------------------------------------------+

2614     Table 19: Advertisement received at the MCU from Endpoints A to D

2616     +-----------------------+---------------------------------+
2617     | Capture Scene #1      | Description=Endpoint y          |
2618     +-----------------------|---------------------------------+
2619     | VC1                   | EncodingGroup=1                 |
2620     | AC1                   | EncodingGroup=2                 |
2621     | CSV1(VC1)             |                                 |
2622     | CSV2(AC1)             |                                 |
2623     +---------------------------------------------------------+

2625     Table 20: Advertisement received at the MCU from Endpoints E to G

2627     Rather than considering what is displayed, CLUE concentrates more
2628     on what the MCU sends. The MCU doesn't know anything about the
2629     number of screens an endpoint has.

2631     As Endpoints A to D each advertise that three Captures make up a
2632     Capture Scene, the MCU offers these in a "site" switching mode.
2633     That is, there are three Multiple Content Captures (and
2634     Capture Encodings) each switching between Endpoints. The MCU
2635     switches the applicable media into the stream based on voice
2636     activity. Endpoint A will not see a capture from itself.
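
The "site" switching rule described above can be sketched in a few lines of Python. This is a simplification under stated assumptions: the MCU is taken to keep a list of sites ordered from the current talker backwards in time, and the function names `site_for_viewer` and `switched_captures` are hypothetical. How voice activity is actually measured is not modeled.

```python
# Captures advertised by the three camera endpoints, keyed by endpoint label
# (each of them advertises VC1, VC2 and VC3, as in Table 19).
SITE_CAPTURES = {
    site: [(site, "VC1"), (site, "VC2"), (site, "VC3")]
    for site in ("A", "B", "C", "D")
}


def site_for_viewer(sites_by_activity, viewer):
    """Hypothetical helper: most recently active site other than the viewer."""
    for site in sites_by_activity:
        if site != viewer:
            return site
    return None


def switched_captures(sites_by_activity, viewer):
    """Captures the three switched MCCs would carry for this viewer."""
    site = site_for_viewer(sites_by_activity, viewer)
    return SITE_CAPTURES.get(site, [])
```

If Endpoint A is the current talker and Endpoint C spoke before it, `switched_captures(["A", "C", "B"], "A")` yields Endpoint C's three captures, so Endpoint A sees the previous talker rather than itself.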
2638 Using the MCC concept the MCU would send the following 2639 Advertisement to endpoint A: 2641 +=======================+=================================+ 2642 | Capture Scene #1 | Description=Endpoint B | 2643 +-----------------------|---------------------------------+ 2644 | VC4 | Left | 2645 | VC5 | Center | 2646 | VC6 | Right | 2647 | AC1 | | 2648 | CSV(VC4,VC5,VC6) | | 2649 | CSV(AC1) | | 2650 +=======================+=================================+ 2651 | Capture Scene #2 | Description=Endpoint C | 2652 +-----------------------|---------------------------------+ 2653 | VC7 | Left | 2654 | VC8 | Center | 2655 | VC9 | Right | 2656 | AC2 | | 2657 | CSV(VC7,VC8,VC9) | | 2658 | CSV(AC2) | | 2659 +=======================+=================================+ 2660 | Capture Scene #3 | Description=Endpoint D | 2661 +-----------------------|---------------------------------+ 2662 | VC10 | Left | 2663 | VC11 | Center | 2664 | VC12 | Right | 2665 | AC3 | | 2666 | CSV(VC10,VC11,VC12) | | 2667 | CSV(AC3) | | 2668 +=======================+=================================+ 2669 | Capture Scene #4 | Description=Endpoint E | 2670 +-----------------------|---------------------------------+ 2671 | VC13 | | 2672 | AC4 | | 2673 | CSV(VC13) | | 2674 | CSV(AC4) | | 2675 +=======================+=================================+ 2676 | Capture Scene #5 | Description=Endpoint F | 2677 +-----------------------|---------------------------------+ 2678 | VC14 | | 2679 | AC5 | | 2680 | CSV(VC14) | | 2681 | CSV(AC5) | | 2682 +=======================+=================================+ 2683 | Capture Scene #6 | Description=Endpoint G | 2684 +-----------------------|---------------------------------+ 2685 | VC15 | | 2686 | AC6 | | 2687 | CSV(VC15) | | 2688 | CSV(AC6) | | 2689 +=======================+=================================+ 2691 Table 21: Advertisement sent to endpoint A - Source Part 2693 The above part of the Advertisement presents information about the 2694 sources to the 
MCC. The information is effectively the same as the 2695 received Advertisements except that there are no Capture Encodings 2696 associated with them and the identities have been re-numbered. 2698 In addition to the source Capture information the MCU advertises 2699 "site" switching of Endpoints B to G in three streams. 2701 +=======================+=================================+ 2702 | Capture Scene #7 | Description=Output3streammix | 2703 +-----------------------|---------------------------------+ 2704 | MCC1(VC4,VC7,VC10, | CaptureArea=Left | 2705 | VC13) | MaxCaptures=1 | 2706 | | SynchronisationID=1 | 2707 | | Policy=SoundLevel:0 | 2708 | | EncodingGroup=1 | 2709 | | | 2710 | MCC2(VC5,VC8,VC11, | CaptureArea=Center | 2711 | VC14) | MaxCaptures=1 | 2712 | | SynchronisationID=1 | 2713 | | Policy=SoundLevel:0 | 2714 | | EncodingGroup=1 | 2715 | | | 2716 | MCC3(VC6,VC9,VC12, | CaptureArea=Right | 2717 | VC15) | MaxCaptures=1 | 2718 | | SynchronisationID=1 | 2719 | | Policy=SoundLevel:0 | 2720 | | EncodingGroup=1 | 2721 | | | 2722 | MCC4() (for audio) | CaptureArea=whole scene | 2723 | | MaxCaptures=1 | 2724 | | Policy=SoundLevel:0 | 2725 | | EncodingGroup=2 | 2726 | | | 2727 | MCC5() (for audio) | CaptureArea=whole scene | 2728 | | MaxCaptures=1 | 2729 | | Policy=SoundLevel:1 | 2730 | | EncodingGroup=2 | 2731 | | | 2732 | MCC6() (for audio) | CaptureArea=whole scene | 2733 | | MaxCaptures=1 | 2734 | | Policy=SoundLevel:2 | 2735 | | EncodingGroup=2 | 2736 | | | 2737 | MCC7() (for audio) | CaptureArea=whole scene | 2738 | | MaxCaptures=1 | 2739 | | Policy=SoundLevel:3 | 2740 | | EncodingGroup=2 | 2741 | | | 2742 | CSV(MCC1,MCC2,MCC3) | | 2743 | CSV(MCC4,MCC5,MCC6, | | 2744 | MCC7) | | 2745 +=======================+=================================+ 2747 Table 22: Advertisement send to endpoint A - switching part 2749 The above part describes the switched 3 main streams that relate to 2750 site switching. 
MaxCaptures=1 indicates that only one Capture from
2751   the MCC is sent at a particular time. SynchronisationID=1 indicates
2752   that the source sending is synchronised. The provider can choose to
2753   group together VC13, VC14, and VC15 for the purpose of switching
2754   according to the SynchronisationID. Therefore, when the provider
2755   switches one of them into an MCC, it can also switch the others
2756   even though they are not part of the same Capture Scene.

2758   All the audio for the conference is included in this Scene #7.
2759   There isn't necessarily a one-to-one relation between any audio
2760   capture and video capture in this scene. Typically a change in the
2761   loudest talker will cause the MCU to switch the audio streams more
2762   quickly than switching video streams.

2764   The MCU can also supply nine media streams showing the active and
2765   previous eight speakers. It includes the following in the
2766   Advertisement:

2768   +=======================+=================================+
2769   | Capture Scene #8      | Description=Output9stream       |
2770   +-----------------------|---------------------------------+
2771   | MCC8(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
2772   |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:0             |
2773   |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2774   |                       |                                 |
2775   | MCC9(VC4,VC5,VC6,VC7, | MaxCaptures=1                   |
2776   |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:1             |
2777   |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2778   |                       |                                 |
2779   |         to            |          to                     |
2780   |                       |                                 |
2781   | MCC16(VC4,VC5,VC6,VC7,| MaxCaptures=1                   |
2782   |   VC8,VC9,VC10,VC11,  | Policy=SoundLevel:8             |
2783   |   VC12,VC13,VC14,VC15)| EncodingGroup=1                 |
2784   |                       |                                 |
2785   | CSV(MCC8,MCC9,MCC10,  |                                 |
2786   |     MCC11,MCC12,MCC13,|                                 |
2787   |     MCC14,MCC15,MCC16)|                                 |
2788   +=======================+=================================+

2790   Table 23: Advertisement sent to endpoint A - 9 switched part

2792   The above part indicates that there are 9 capture encodings.
Each
2793   of the Capture Encodings may contain any captures from any source
2794   site with a maximum of one Capture at a time. Which Capture is
2795   present is determined by the policy. The MCCs in this scene do not
2796   have any spatial attributes.

2798   Note: The Provider alternatively could provide each of the MCCs
2799   above in its own Capture Scene.

2801   If the MCU wanted to provide a composed Capture Encoding containing
2802   all of the 9 captures, it could advertise in addition:

2804   +=======================+=================================+
2805   | Capture Scene #9      | Description=NineTiles           |
2806   +-----------------------|---------------------------------+
2807   | MCC13(MCC8,MCC9,MCC10,| MaxCaptures=9                   |
2808   |     MCC11,MCC12,MCC13,| EncodingGroup=1                 |
2809   |     MCC14,MCC15,MCC16)|                                 |
2810   |                       |                                 |
2811   | CSV(MCC13)            |                                 |
2812   +=======================+=================================+

2814   Table 24: Advertisement sent to endpoint A - 9 composed part

2816   Because MaxCaptures is 9, the capture encoding contains
2817   information from 9 sources at a time.

2819   The Advertisement to Endpoint B is identical to the above, except
2820   that the captures from Endpoint A would be added and the captures
2821   from Endpoint B would be removed. Whether the Captures are rendered
2822   on a four screen display or a three screen display is up to the
2823   Consumer to determine. The Consumer wants to place video captures
2824   from the same original source endpoint together, in the correct
2825   spatial order, but the MCCs do not have spatial attributes. So the
2826   Consumer needs to associate incoming media packets with the
2827   original individual captures in the advertisement (such as VC4,
2828   VC5, and VC6) in order to know the spatial information it needs for
2829   correct placement on the screens. The Provider can use the RTCP
2830   CaptureId SDES item and associated RTP header extension, as
2831   described in [I-D.ietf-clue-rtp-mapping], to convey this
2832   information to the Consumer.
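
The association step described above can be illustrated with a small
sketch. It is not part of the CLUE framework: it assumes the Consumer
has already extracted a capture identifier (for example "VC4") from
the RTCP CaptureId SDES item or RTP header extension, and it uses a
hypothetical lookup table filled in from Table 21. The names
SOURCE_CAPTURES, SCREEN_FOR_AREA, place_capture and same_source, the
three-screen layout, and the choice of the center screen for captures
without spatial information are all assumptions made for illustration.

```python
# Illustrative only: a Consumer-side lookup built from Table 21.
# The dictionary layout and function names are assumptions, not CLUE syntax.

# Advertised source captures: capture ID -> (source endpoint, capture area).
# Endpoints E, F and G advertise no area in Table 21, hence None.
SOURCE_CAPTURES = {
    "VC4": ("Endpoint B", "Left"),
    "VC5": ("Endpoint B", "Center"),
    "VC6": ("Endpoint B", "Right"),
    "VC7": ("Endpoint C", "Left"),
    "VC8": ("Endpoint C", "Center"),
    "VC9": ("Endpoint C", "Right"),
    "VC10": ("Endpoint D", "Left"),
    "VC11": ("Endpoint D", "Center"),
    "VC12": ("Endpoint D", "Right"),
    "VC13": ("Endpoint E", None),
    "VC14": ("Endpoint F", None),
    "VC15": ("Endpoint G", None),
}

# Assumed three-screen layout, indexed left to right.
SCREEN_FOR_AREA = {"Left": 0, "Center": 1, "Right": 2}


def place_capture(capture_id, default_screen=1):
    """Return (source endpoint, screen index) for a received capture ID."""
    endpoint, area = SOURCE_CAPTURES[capture_id]
    # Captures without spatial information go to the default screen.
    return endpoint, SCREEN_FOR_AREA.get(area, default_screen)


def same_source(capture_ids):
    """Return True if all given captures come from one source endpoint."""
    return len({SOURCE_CAPTURES[c][0] for c in capture_ids}) == 1
```

With such a table, a Consumer receiving VC4, VC5 and VC6 in MCC1,
MCC2 and MCC3 could confirm that they share a source endpoint and
render them left to right; how the identifier is actually carried is
defined in [I-D.ietf-clue-rtp-mapping], not here.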
2834   12.3.4. Heterogeneous conference with voice activated switching

2836   This example illustrates how multipoint "voice activated switching"
2837   behavior can be realized, with an endpoint making its own decision
2838   about which of its outgoing video streams is considered the "active
2839   talker" from that endpoint. Then an MCU can decide which is the
2840   active talker among the whole conference.

2842   Consider a conference between endpoints with the following
2843   characteristics:

2845      Endpoint A - 3 screens, 3 cameras

2847      Endpoint B - 3 screens, 3 cameras

2849      Endpoint C - 1 screen, 1 camera

2851   This example focuses on what the user at endpoint C sees. The
2852   user would like to see the video capture of the current talker,
2853   without composing it with any other video capture. In this
2854   example endpoint C is capable of receiving only a single video
2855   stream. The following tables describe advertisements from A and B
2856   to the MCU, and from the MCU to C, that can be used to accomplish
2857   this.

2859   +-----------------------+---------------------------------+
2860   | Capture Scene #1      | Description=Endpoint x          |
2861   +-----------------------|---------------------------------+
2862   | VC1                   | CaptureArea=Left                |
2863   |                       | EncodingGroup=1                 |
2864   | VC2                   | CaptureArea=Center              |
2865   |                       | EncodingGroup=1                 |
2866   | VC3                   | CaptureArea=Right               |
2867   |                       | EncodingGroup=1                 |
2868   | MCC1(VC1,VC2,VC3)     | MaxCaptures=1                   |
2869   |                       | CaptureArea=whole scene         |
2870   |                       | Policy=SoundLevel:0             |
2871   |                       | EncodingGroup=1                 |
2872   | AC1                   | CaptureArea=whole scene         |
2873   |                       | EncodingGroup=2                 |
2874   | CSV1(VC1, VC2, VC3)   |                                 |
2875   | CSV2(MCC1)            |                                 |
2876   | CSV3(AC1)             |                                 |
2877   +---------------------------------------------------------+

2879   Table 25: Advertisement received at the MCU from Endpoints A and B

2881   Endpoints A and B are advertising each individual video capture,
2882   and also a switched capture MCC1 which switches between the other
2883   three based on who is the active talker.
These endpoints do not
2884   advertise distinct audio captures associated with each individual
2885   video capture, so it would be impossible for the MCU (as a media
2886   consumer) to make its own determination of which video capture is
2887   the active talker based just on information in the audio streams.

2889   +-----------------------+---------------------------------+
2890   | Capture Scene #1      | Description=conference          |
2891   +-----------------------|---------------------------------+
2892   | MCC1()                | CaptureArea=Left                |
2893   |                       | MaxCaptures=1                   |
2894   |                       | SynchronisationID=1             |
2895   |                       | Policy=SoundLevel:0             |
2896   |                       | EncodingGroup=1                 |
2897   |                       |                                 |
2898   | MCC2()                | CaptureArea=Center              |
2899   |                       | MaxCaptures=1                   |
2900   |                       | SynchronisationID=1             |
2901   |                       | Policy=SoundLevel:0             |
2902   |                       | EncodingGroup=1                 |
2903   |                       |                                 |
2904   | MCC3()                | CaptureArea=Right               |
2905   |                       | MaxCaptures=1                   |
2906   |                       | SynchronisationID=1             |
2907   |                       | Policy=SoundLevel:0             |
2908   |                       | EncodingGroup=1                 |
2909   |                       |                                 |
2910   | MCC4()                | CaptureArea=whole scene         |
2911   |                       | MaxCaptures=1                   |
2912   |                       | Policy=SoundLevel:0             |
2913   |                       | EncodingGroup=1                 |
2914   |                       |                                 |
2915   | MCC5() (for audio)    | CaptureArea=whole scene         |
2916   |                       | MaxCaptures=1                   |
2917   |                       | Policy=SoundLevel:0             |
2918   |                       | EncodingGroup=2                 |
2919   |                       |                                 |
2920   | MCC6() (for audio)    | CaptureArea=whole scene         |
2921   |                       | MaxCaptures=1                   |
2922   |                       | Policy=SoundLevel:1             |
2923   |                       | EncodingGroup=2                 |
2924   | CSV1(MCC1,MCC2,MCC3)  |                                 |
2925   | CSV2(MCC4)            |                                 |
2926   | CSV3(MCC5,MCC6)       |                                 |
2927   +---------------------------------------------------------+

2929   Table 26: Advertisement sent from the MCU to C

2931   The MCU advertises one scene, with four video MCCs. Three of them
2932   in CSV1 give a left, center, right view of the conference, with
2933   "site switching". MCC4 provides a single video capture
2934   representing a view of the whole conference. The MCU intends for
2935   MCC4 to be switched between all the other original source
2936   captures.
In this example advertisement the MCU is not giving all 2937 the information about all the other endpoints' scenes and which of 2938 those captures is included in the MCCs. The MCU could include all 2939 that information if it wants to give the consumers more 2940 information, but it is not necessary for this example scenario. 2942 The Provider advertises MCC5 and MCC6 for audio. Both are 2943 switched captures, with different SoundLevel policies indicating 2944 they are the top two dominant talkers. The Provider advertises 2945 CSV3 with both MCCs, suggesting the Consumer should use both if it 2946 can. 2948 Endpoint C, in its configure message to the MCU, requests to 2949 receive MCC4 for video, and MCC5 and MCC6 for audio. In order for 2950 the MCU to get the information it needs to construct MCC4, it has 2951 to send configure messages to A and B asking to receive MCC1 from 2952 each of them, along with their AC1 audio. Now the MCU can use 2953 audio energy information from the two incoming audio streams from 2954 A and B to determine which of those alternatives is the current 2955 talker. Based on that, the MCU uses either MCC1 from A or MCC1 2956 from B as the source of MCC4 to send to C. 2958 13. Acknowledgements 2960 Allyn Romanow and Brian Baldino were authors of early versions. 2961 Mark Gorzynski also contributed much to the initial approach. 2962 Many others also contributed, including Christian Groves, Jonathan 2963 Lennox, Paul Kyzivat, Rob Hansen, Roni Even, Christer Holmberg, 2964 Stephen Botzko, Mary Barnes, John Leslie, Paul Coverdale. 2966 14. IANA Considerations 2968 None. 2970 15. Security Considerations 2972 There are several potential attacks related to telepresence, and 2973 specifically the protocols used by CLUE, in the case of 2974 conferencing sessions, due to the natural involvement of multiple 2975 endpoints and the many, often user-invoked, capabilities provided 2976 by the systems. 
2978   An MCU involved in a CLUE session can experience many of the same
2979   attacks as a conferencing system such as that enabled by
2980   the XCON framework [RFC6503]. Examples of attacks include the
2981   following: an endpoint attempting to listen to sessions in which
2982   it is not authorized to participate, an endpoint attempting to
2983   disconnect or mute other users, and theft of service by an
2984   endpoint attempting to create telepresence sessions it is not
2985   allowed to create. Thus, it is RECOMMENDED that an MCU
2986   implementing the protocols necessary to support CLUE follow the
2987   security recommendations specified in the conference control
2988   protocol documents. In the case of CLUE, SIP is the default
2989   conferencing protocol, so the security considerations in
2990   [RFC4579] MUST be followed.

2992   One primary security concern surrounding the CLUE framework
2993   introduced in this document involves securing the actual
2994   protocols and the associated authorization mechanisms. These
2995   concerns apply to endpoint-to-endpoint sessions, as well as
2996   sessions involving multiple endpoints and MCUs. Figure 2 in
2997   section 5 provides a basic flow of information exchange for CLUE
2998   and the protocols involved.

3000   As described in section 5, CLUE uses SIP/SDP to establish the
3001   session prior to exchanging any CLUE-specific information. Thus
3002   the security mechanisms recommended for SIP [RFC3261], including
3003   user authentication and authorization, SHOULD be followed. In
3004   addition, the media is based on RTP and thus existing RTP security
3005   mechanisms, such as DTLS/SRTP, MUST be supported.

3007   A separate data channel is established to transport the CLUE
3008   protocol messages. The contents of the CLUE protocol messages are
3009   based on information introduced in this document, which is
3010   represented by an XML schema for this information defined in the
3011   CLUE data model [ref].
Some of the information which could
3012   possibly introduce privacy concerns is the xCard information as
3013   described in section 7.1.1.11. In addition, the (text)
3014   description field in the Media Capture attribute (section 7.1.1.7)
3015   could possibly reveal sensitive information or specific
3016   identities. The same would be true for the descriptions in the
3017   Capture Scene (section 7.3.1) and Capture Scene View (section
3018   7.3.2) attributes. One other important consideration for the
3019   information in the xCard as well as the description field in the
3020   Media Capture and Capture Scene View attributes is that while the
3021   endpoints involved in the session have been authenticated, there
3022   is no assurance that the information in the xCard or description
3023   fields is authentic. Thus, this information SHOULD NOT be used to
3024   make any authorization decisions and the participants in the
3025   sessions SHOULD be made aware of this.

3027   While other information in the CLUE protocol messages does not
3028   reveal specific identities, it can reveal characteristics and
3029   capabilities of the endpoints. That information could possibly
3030   uniquely identify specific endpoints. It might also be possible
3031   for an attacker to manipulate the information and disrupt the CLUE
3032   sessions. It would also be possible to mount a DoS attack on the
3033   CLUE endpoints if a malicious agent has access to the data
3034   channel. Thus, it MUST be possible for the endpoints to establish
3035   a channel which is secure against both message recovery and
3036   message modification. Further details on this are provided in the
3037   CLUE data channel solution document.

3039   There are also security issues associated with the authorization
3040   to perform actions at the CLUE endpoints to invoke specific
3041   capabilities (e.g., re-arranging screens, sharing content, etc.).
3042 However, the policies and security associated with these actions 3043 are outside the scope of this document and the overall CLUE 3044 solution. 3046 16. Changes Since Last Version 3048 NOTE TO THE RFC-Editor: Please remove this section prior to 3049 publication as an RFC. 3051 Changes from 18 to 19: 3053 1. Remove the Max Capture Encodings media capture attribute. 3054 2. Refer to RTP mapping document in the MCC example section. 3055 3. Update references to current versions of drafts in progress. 3056 Changes from 17 to 18: 3058 1. Add separate definition of Global View List. 3059 2. Add diagram for Global View List structure. 3060 3. Tweak definitions of Media Consumer and Provider. 3062 Changes from 16 to 17: 3064 1. Ticket #59 - rename Capture Scene Entry (CSE) to Capture 3065 Scene View (CSV) 3067 2. Ticket #60 - rename Global CSE List to Global View List 3069 3. Ticket #61 - Proposal for describing the coordinate system. 3070 Describe it better, without conflicts if cameras point in 3071 different directions. 3073 4. Minor clarifications and improved wording for Synchronisation 3074 Identity, MCC, Simultaneous Transmission Set. 3076 5. Add definitions for CLUE-capable device and CLUE-enabled 3077 call, taken from the signaling draft. 3079 6. Update definitions of Capture Device, Media Consumer, Media 3080 Provider, Endpoint, MCU, MCC. 3082 7. Replace "middle box" with "MCU". 3084 8. Explicitly state there can also be Media Captures that are 3085 not included in a Capture Scene View. 3087 9. Explicitly state "A single Encoding Group MAY refer to 3088 encodings for different media types." 3090 10. In example 12.1.1 add axes and audio captures to the 3091 diagram, and describe placement of microphones. 3093 11. Add references to data model and signaling drafts. 3095 12. Split references into Normative and Informative sections. 3096 Add heading number for references section. 3098 Changes from 15 to 16: 3100 1. Remove Audio Channel Format attribute 3102 2. 
Add Audio Capture Sensitivity Pattern attribute 3104 3. Clarify audio spatial information regarding point of capture 3105 and point on line of capture. Area of capture does not apply 3106 to audio. 3108 4. Update section 12 example for new treatment of audio spatial 3109 information. 3111 5. Clean up wording of some definitions, and various places in 3112 sections 5 and 10. 3114 6. Remove individual encoding parameter paragraph from section 3115 9. 3117 7. Update Advertisement diagram. 3119 8. Update Acknowledgements. 3121 9. References to use cases and requirements now refer to RFCs. 3123 10. Minor editorial changes. 3125 Changes from 14 to 15: 3127 1. Add "=" and "<=" qualifiers to MaxCaptures attribute, and 3128 clarify the meaning regarding switched and composed MCC. 3130 2. Add section 7.3.3 Global Capture Scene Entry List, and a few 3131 other sentences elsewhere that refer to global CSE sets. 3133 3. Clarify: The Provider MUST be capable of encoding and sending 3134 all Captures (*that have an encoding group*) in a single 3135 Capture Scene Entry simultaneously. 3137 4. Add voice activated switching example in section 12. 3139 5. Change name of attributes Participant Info/Type to Person 3140 Info/Type. 3142 6. Clarify the Person Info/Type attributes have the same meaning 3143 regardless of whether or not the capture has a Presentation 3144 attribute. 3146 7. Update example section 12.1 to be consistent with the rest of 3147 the document, regarding MCC and capture attributes. 3149 8. State explicitly each CSE has a unique ID. 3151 Changes from 13 to 14: 3153 1. Fill in section for Security Considerations. 3155 2. Replace Role placeholder with Participant Information, 3156 Participant Type, and Scene Information attributes. 3158 3. Spatial information implies nothing about how constituent 3159 media captures are combined into a composed MCC. 3161 4. Clean up MCC example in Section 12.3.3. Clarify behavior of 3162 tiled and PIP display windows. Add audio. 
Add new open 3163 issue about associating incoming packets to original source 3164 capture. 3166 5. Remove editor's note and associated statement about RTP 3167 multiplexing at end of section 5. 3169 6. Remove editor's note and associated paragraph about 3170 overloading media channel with both CLUE and non-CLUE usage, 3171 in section 5. 3173 7. In section 10, clarify intent of media encodings conforming 3174 to SDP, even with multiple CLUE message exchanges. Remove 3175 associated editor's note. 3177 Changes from 12 to 13: 3179 1. Added the MCC concept including updates to existing sections 3180 to incorporate the MCC concept. New MCC attributes: 3181 MaxCaptures, SynchronisationID and Policy. 3183 2. Removed the "composed" and "switched" Capture attributes due 3184 to overlap with the MCC concept. 3186 3. Removed the "Scene-switch-policy" CSE attribute, replaced by 3187 MCC and SynchronisationID. 3189 4. Editorial enhancements including numbering of the Capture 3190 attribute sections, tables, figures etc. 3192 Changes from 11 to 12: 3194 1. Ticket #44. Remove note questioning about requiring a 3195 Consumer to send a Configure after receiving Advertisement. 3197 2. Ticket #43. Remove ability for consumer to choose value of 3198 attribute for scene-switch-policy. 3200 3. Ticket #36. Remove computational complexity parameter, 3201 MaxGroupPps, from Encoding Groups. 3203 4. Reword the Abstract and parts of sections 1 and 4 (now 5) 3204 based on Mary's suggestions as discussed on the list. Move 3205 part of the Introduction into a new section Overview & 3206 Motivation. 3208 5. Add diagram of an Advertisement, in the Overview of the 3209 Framework/Model section. 3211 6. Change Intended Status to Standards Track. 3213 7. Clean up RFC2119 keyword language. 3215 Changes from 10 to 11: 3217 1. Add description attribute to Media Capture and Capture Scene 3218 Entry. 3220 2. 
Remove contradiction and change the note about open issue 3221 regarding always responding to Advertisement with a Configure 3222 message. 3224 3. Update example section, to cleanup formatting and make the 3225 media capture attributes and encoding parameters consistent 3226 with the rest of the document. 3228 Changes from 09 to 10: 3230 1. Several minor clarifications such as about SDP usage, Media 3231 Captures, Configure message. 3233 2. Simultaneous Set can be expressed in terms of Capture Scene 3234 and Capture Scene Entry. 3236 3. Removed Area of Scene attribute. 3238 4. Add attributes from draft-groves-clue-capture-attr-01. 3240 5. Move some of the Media Capture attribute descriptions back 3241 into this document, but try to leave detailed syntax to the 3242 data model. Remove the OUTSOURCE sections, which are already 3243 incorporated into the data model document. 3245 Changes from 08 to 09: 3247 1. Use "document" instead of "memo". 3249 2. Add basic call flow sequence diagram to introduction. 3251 3. Add definitions for Advertisement and Configure messages. 3253 4. Add definitions for Capture and Provider. 3255 5. Update definition of Capture Scene. 3257 6. Update definition of Individual Encoding. 3259 7. Shorten definition of Media Capture and add key points in the 3260 Media Captures section. 3262 8. Reword a bit about capture scenes in overview. 3264 9. Reword about labeling Media Captures. 3266 10. Remove the Consumer Capability message. 3268 11. New example section heading for media provider behavior 3270 12. Clarifications in the Capture Scene section. 3272 13. Clarifications in the Simultaneous Transmission Set section. 3274 14. Capitalize defined terms. 3276 15. Move call flow example from introduction to overview section 3278 16. General editorial cleanup 3280 17. Add some editors' notes requesting input on issues 3281 18. Summarize some sections, and propose details be outsourced 3282 to other documents. 3284 Changes from 06 to 07: 3286 1. 
Ticket #9. Rename Axis of Capture Point attribute to Point 3287 on Line of Capture. Clarify the description of this 3288 attribute. 3290 2. Ticket #17. Add "capture encoding" definition. Use this new 3291 term throughout document as appropriate, replacing some usage 3292 of the terms "stream" and "encoding". 3294 3. Ticket #18. Add Max Capture Encodings media capture 3295 attribute. 3297 4. Add clarification that different capture scene entries are 3298 not necessarily mutually exclusive. 3300 Changes from 05 to 06: 3302 1. Capture scene description attribute is a list of text strings, 3303 each in a different language, rather than just a single string. 3305 2. Add new Axis of Capture Point attribute. 3307 3. Remove appendices A.1 through A.6. 3309 4. Clarify that the provider must use the same coordinate system 3310 with same scale and origin for all coordinates within the same 3311 capture scene. 3313 Changes from 04 to 05: 3315 1. Clarify limitations of "composed" attribute. 3317 2. Add new section "capture scene entry attributes" and add the 3318 attribute "scene-switch-policy". 3320 3. Add capture scene description attribute and description 3321 language attribute. 3323 4. Editorial changes to examples section for consistency with the 3324 rest of the document. 3326 Changes from 03 to 04: 3328 1. Remove sentence from overview - "This constitutes a significant 3329 change ..." 3331 2. Clarify a consumer can choose a subset of captures from a 3332 capture scene entry or a simultaneous set (in section "capture 3333 scene" and "consumer's choice..."). 3335 3. Reword first paragraph of Media Capture Attributes section. 3337 4. Clarify a stereo audio capture is different from two mono audio 3338 captures (description of audio channel format attribute). 3340 5. Clarify what it means when coordinate information is not 3341 specified for area of capture, point of capture, area of scene. 3343 6. 
Change the term "producer" to "provider" to be consistent (it 3344 was just in two places). 3346 7. Change name of "purpose" attribute to "content" and refer to 3347 RFC4796 for values. 3349 8. Clarify simultaneous sets are part of a provider advertisement, 3350 and apply across all capture scenes in the advertisement. 3352 9. Remove sentence about lip-sync between all media captures in a 3353 capture scene. 3355 10. Combine the concepts of "capture scene" and "capture set" 3356 into a single concept, using the term "capture scene" to 3357 replace the previous term "capture set", and eliminating the 3358 original separate capture scene concept. 3360 17. Normative References 3362 [I-D.ietf-clue-datachannel] 3363 Holmberg, C., "CLUE Protocol Data Channel", draft- 3364 ietf-clue-datachannel-05 (work in progress), November 3365 2014. 3367 [I-D.ietf-clue-data-model-schema] 3368 Presta, R., Romano, S P., "An XML Schema for the CLUE 3369 data model", draft-ietf-clue-data-model-schema-07 (work 3370 in progress), September 2014. 3372 [I-D.ietf-clue-protocol] 3373 Presta, R. and S. Romano, "CLUE protocol", draft- 3374 ietf-clue-protocol-02 (work in progress), October 2014. 3376 [I-D.ietf-clue-signaling] 3377 Kyzivat, P., Xiao, L., Groves, C., Hansen, R., "CLUE 3378 Signaling", draft-ietf-clue-signaling-04 (work in 3379 progress), October 2014. 3381 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 3382 Requirement Levels", BCP 14, RFC 2119, March 1997. 3384 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., 3385 Johnston, 3386 A., Peterson, J., Sparks, R., Handley, M., and E. 3387 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 3388 June 2002. 3390 [RFC3264] Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model 3391 with the Session Description Protocol (SDP)", RFC 3264, 3392 June 2002. 3394 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 
3395             Jacobson, "RTP: A Transport Protocol for Real-Time
3396             Applications", STD 64, RFC 3550, July 2003.

3398   [RFC4579] Johnston, A., Levin, O., "SIP Call Control -
3399             Conferencing for User Agents", RFC 4579, August 2006.

3401   18. Informative References

3403   [I-D.ietf-clue-rtp-mapping]
3404             Even, R., Lennox, J., "Mapping RTP streams to CLUE media
3405             captures", draft-ietf-clue-rtp-mapping-03 (work in
3406             progress), October 2014.

3408   [RFC4353] Rosenberg, J., "A Framework for Conferencing with the
3409             Session Initiation Protocol (SIP)", RFC 4353,
3410             February 2006.

3412   [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC
3413             5117, January 2008.

3415   [RFC7205] Romanow, A., Botzko, S., Duckworth, M., Even, R.,
3416             "Use Cases for Telepresence Multistreams", RFC 7205,
3417             April 2014.

3419   [RFC7262] Romanow, A., Botzko, S., Barnes, M., "Requirements
3420             for Telepresence Multistreams", RFC 7262, June 2014.

3422   19. Authors' Addresses

3424   Mark Duckworth (editor)
3425   Polycom
3426   Andover, MA 01810
3427   USA

3429   Email: mark.duckworth@polycom.com

3431   Andrew Pepperell
3432   Acano
3433   Uxbridge, England
3434   UK

3436   Email: apeppere@gmail.com

3438   Stephan Wenger
3439   Vidyo, Inc.
3440   433 Hackensack Ave.
3441   Hackensack, N.J. 07601
3442   USA

3444   Email: stewe@stewe.org