idnits 2.17.1

draft-ietf-clue-framework-14.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1046 has weird spacing: '... switch betwe...'

  == Line 1843 has weird spacing: '...om left bot...'

  == Line 1897 has weird spacing: '...om left bot...'

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'SHOULD not' in this paragraph:

     A separate data channel is established to transport the CLUE protocol
     messages. The contents of the CLUE protocol messages are based on
     information introduced in this document, which is represented by an
     XML schema for this information defined in the CLUE data model [ref].
     Some of the information which could possibly introduce privacy
     concerns is the xCard information as described in section x. In
     addition, the (text) description field in the Media Capture attribute
     (section 7.1.1.7) could possibly reveal sensitive information or
     specific identities. The same would be true for the descriptions in
     the Capture Scene (section 7.3.1) and Capture Scene Entry (7.3.2)
     attributes. One other important consideration for the information in
     the xCard as well as the description field in the Media Capture and
     Capture Scene Entry attributes is that while the endpoints involved in
     the session have been authenticated, there is no assurance that the
     information in the xCard or description fields is authentic. Thus,
     this information SHOULD not be used to make any authorization
     decisions and the participants in the sessions SHOULD be made aware of
     this.

  -- The document date (February 10, 2014) is 3726 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC6351' is mentioned on line 840, but not defined

  == Missing Reference: 'RFC6350' is mentioned on line 851, but not defined

  == Missing Reference: 'RFC4566' is mentioned on line 1450, but not defined

  ** Obsolete undefined reference: RFC 4566 (Obsoleted by RFC 8866)

  == Missing Reference: 'RFC 6503' is mentioned on line 2746, but not defined

  == Missing Reference: 'RFC 3261' is mentioned on line 2768, but not defined

  == Unused Reference: 'RFC4579' is defined on line 3052, but no explicit
     reference was found in the text

  -- Obsolete informational reference (is this intentional?): RFC 5117
     (Obsoleted by RFC 7667)

     Summary: 1 error (**), 0 flaws (~~), 12 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1    CLUE WG                                            M. Duckworth, Ed.
2    Internet Draft                                               Polycom
3    Intended status: Standards Track                        A. Pepperell
4    Expires: August 10, 2014                                       Acano
5                                                               S. Wenger
6                                                                   Vidyo
7                                                       February 10, 2014

9                  Framework for Telepresence Multi-Streams
10                     draft-ietf-clue-framework-14.txt

12   Abstract

14   This document defines a framework for a protocol to enable devices
15   in a telepresence conference to interoperate.  The protocol enables
16   communication of information about multiple media streams so a
17   sending system and receiving system can make reasonable decisions
18   about transmitting, selecting and rendering the media streams.
19   This protocol is used in addition to SIP signaling for setting up a
20   telepresence session.

22   Status of this Memo

24   This Internet-Draft is submitted in full conformance with the
25   provisions of BCP 78 and BCP 79.

27   Internet-Drafts are working documents of the Internet Engineering
28   Task Force (IETF).  Note that other groups may also distribute
29   working documents as Internet-Drafts.  The list of current
30   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

32   Internet-Drafts are draft documents valid for a maximum of six
33   months and may be updated, replaced, or obsoleted by other
34   documents at any time.  It is inappropriate to use Internet-Drafts
35   as reference material or to cite them other than as "work in
36   progress."

38   This Internet-Draft will expire on August 10, 2014.

40   Copyright Notice

42   Copyright (c) 2013 IETF Trust and the persons identified as the
43   document authors.  All rights reserved.

45   This document is subject to BCP 78 and the IETF Trust's Legal
46   Provisions Relating to IETF Documents
47   (http://trustee.ietf.org/license-info) in effect on the date of
48   publication of this document.  Please review these documents
49   carefully, as they describe your rights and restrictions with
50   respect to this document.
     Code Components extracted from this
51   document must include Simplified BSD License text as described in
52   Section 4.e of the Trust Legal Provisions and are provided without
53   warranty as described in the Simplified BSD License.

55   Table of Contents

57   1. Introduction
58   2. Terminology
59   3. Definitions
60   4. Overview & Motivation
61   5. Overview of the Framework/Model
62   6. Spatial Relationships
63   7. Media Captures and Capture Scenes
64      7.1. Media Captures
65         7.1.1. Media Capture Attributes
66      7.2. Multiple Content Capture
67         7.2.1. MCC Attributes
68      7.3. Capture Scene
69         7.3.1. Capture Scene attributes
70         7.3.2. Capture Scene Entry attributes
71   8. Simultaneous Transmission Set Constraints
72   9. Encodings
73      9.1. Individual Encodings
74      9.2. Encoding Group
75      9.3. Associating Captures with Encoding Groups
76   10. Consumer's Choice of Streams to Receive from the Provider
77      10.1. Local preference
78      10.2. Physical simultaneity restrictions
79      10.3. Encoding and encoding group limits
80   11. Extensibility
81   12. Examples - Using the Framework (Informative)
82      12.1. Provider Behavior
83         12.1.1. Three screen Endpoint Provider
84         12.1.2. Encoding Group Example
85         12.1.3. The MCU Case
86      12.2. Media Consumer Behavior
87         12.2.1. One screen Media Consumer
88         12.2.2. Two screen Media Consumer configuring the example
89         12.2.3. Three screen Media Consumer configuring the example
90      12.3. Multipoint Conference utilizing Multiple Content Captures
91         12.3.1. Single Media Captures and MCC in the same
92                 Advertisement
93         12.3.2. Several MCCs in the same Advertisement
94         12.3.3. Heterogeneous conference with switching and
95                 composition
96   13. Acknowledgements
97   14. IANA Considerations
98   15. Security Considerations
99   16. Changes Since Last Version
100  17. Authors' Addresses

102  1. Introduction

104  Current telepresence systems, though based on open standards such
105  as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
106  each other.  A major factor limiting the interoperability of
107  telepresence systems is the lack of a standardized way to describe
108  and negotiate the use of the multiple streams of audio and video
109  comprising the media flows.  This document provides a framework for
110  protocols to enable interoperability by handling multiple streams
111  in a standardized way.  The framework is intended to support the
112  use cases described in draft-ietf-clue-telepresence-use-cases and
113  to meet the requirements in
114  draft-ietf-clue-telepresence-requirements.
116  The basic session setup for the use cases is based on SIP [RFC3261]
117  and SDP offer/answer [RFC3264].  In addition to basic SIP & SDP
118  offer/answer, CLUE specific signaling is required to exchange the
119  information describing the multiple media streams.  The motivation
120  for this framework, an overview of the signaling, and information
121  required to be exchanged is described in subsequent sections of
122  this document.  The signaling details and data model are provided
123  in subsequent documents.

125  2. Terminology

127  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
128  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
129  this document are to be interpreted as described in RFC 2119
130  [RFC2119].

132  3. Definitions

134  The terms defined below are used throughout this document and
135  companion documents and they are normative.  In order to easily
136  identify the use of a defined term, those terms are capitalized.

138  Advertisement: a CLUE message a Media Provider sends to a Media
139  Consumer describing specific aspects of the content of the media,
140  the formatting of the media streams it can send, and any
141  restrictions it has in terms of being able to provide certain
142  Streams simultaneously.

144  Audio Capture: Media Capture for audio.  Denoted as ACn in the
145  example cases in this document.

147  Camera-Left and Right: For Media Captures, camera-left and camera-
148  right are from the point of view of a person observing the rendered
149  media.  They are the opposite of Stage-Left and Stage-Right.

151  Capture: Same as Media Capture.

153  Capture Device: A device that converts audio and video input into
154  an electrical signal, in most cases to be fed into a media encoder.

156  Capture Encoding: A specific encoding of a Media Capture, to be
157  sent by a Media Provider to a Media Consumer via RTP.
159  Capture Scene: a structure representing a spatial region containing
160  one or more Capture Devices, each capturing media representing a
161  portion of the region.  The spatial region represented by a Capture
162  Scene may or may not correspond to a real region in physical space,
163  such as a room.  A Capture Scene includes attributes and one or
164  more Capture Scene Entries, with each entry including one or more
165  Media Captures.

167  Capture Scene Entry: a list of Media Captures of the same media
168  type that together form one way to represent the entire Capture
169  Scene.

171  Conference: used as defined in [RFC4353], A Framework for
172  Conferencing within the Session Initiation Protocol (SIP).

174  Configure Message: A CLUE message a Media Consumer sends to a Media
175  Provider specifying which content and media streams it wants to
176  receive, based on the information in a corresponding Advertisement
177  message.

179  Consumer: short for Media Consumer.

181  Encoding or Individual Encoding: a set of parameters representing a
182  way to encode a Media Capture to become a Capture Encoding.

184  Encoding Group: A set of encoding parameters representing a total
185  media encoding capability to be sub-divided across potentially
186  multiple Individual Encodings.

188  Endpoint: The logical point of final termination through receiving,
189  decoding and rendering, and/or initiation through capturing,
190  encoding, and sending of media streams.  An endpoint consists of
191  one or more physical devices which source and sink media streams,
192  and exactly one [RFC4353] Participant (which, in turn, includes
193  exactly one SIP User Agent).  Endpoints can be anything from
194  multiscreen/multicamera rooms to handheld devices.

196  Front: the portion of the room closest to the cameras.  In going
197  toward the back, you move away from the cameras.
199  MCU: Multipoint Control Unit (MCU) - a device that connects two or
200  more endpoints together into one single multimedia conference
201  [RFC5117].  An MCU includes an [RFC4353]-like Mixer, without the
202  [RFC4353] requirement to send media to each participant.

204  Media: Any data that, after suitable encoding, can be conveyed over
205  RTP, including audio, video or timed text.

207  Media Capture: a source of Media, such as from one or more Capture
208  Devices or constructed from other Media streams.

210  Media Consumer: an Endpoint or middle box that receives Media
211  streams.

213  Media Provider: an Endpoint or middle box that sends Media streams.

214  Model: a set of assumptions a telepresence system of a given vendor
215  adheres to and expects the remote telepresence system(s) also to
216  adhere to.

218  Multiple Content Capture: A Capture for audio or video that
219  indicates that the Capture contains multiple audio or video
220  Captures.  Single Media Captures may or may not be present in the
221  resultant Capture Encoding depending on time or space.  Denoted as
222  MCCn in the example cases in this document.

224  Plane of Interest: The spatial plane containing the most relevant
225  subject matter.

227  Provider: Same as Media Provider.

229  Render: the process of generating a representation from media,
230  such as displayed motion video or sound emitted from loudspeakers.

232  Simultaneous Transmission Set: a set of Media Captures that can be
233  transmitted simultaneously from a Media Provider.

235  Single Media Capture: A capture which contains media from a single
236  source capture device, e.g., an Audio Capture or a Video Capture.

238  Spatial Relation: The arrangement in space of two objects, in
239  contrast to relation in time or other relationships.  See also
240  Camera-Left and Right.

242  Stage-Left and Right: For Media Captures, Stage-left and Stage-
243  right are the opposite of Camera-left and Camera-right.
     For the
244  case of a person facing (and captured by) a camera, Stage-left and
245  Stage-right are from the point of view of that person.

247  Stream: a Capture Encoding sent from a Media Provider to a Media
248  Consumer via RTP [RFC3550].

250  Stream Characteristics: the media stream attributes commonly used
251  in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
252  resolution, profile/level etc.) as well as CLUE specific
253  attributes, such as the Capture ID or a spatial location.

255  Video Capture: Media Capture for video.  Denoted as VCn in the
256  example cases in this document.

258  Video Composite: A single image that is formed, normally by an RTP
259  mixer inside an MCU, by combining visual elements from separate
260  sources.

262  4. Overview & Motivation

264  This section provides an overview of the functional elements
265  defined in this document to represent a telepresence system.  The
266  motivations for the framework described in this document are also
267  provided.

269  Two key concepts introduced in this document are the terms "Media
270  Provider" and "Media Consumer".  A Media Provider represents the
271  entity that is sending the media and a Media Consumer represents
272  the entity that is receiving the media.  A Media Provider provides
273  Media in the form of RTP packets; a Media Consumer consumes those
274  RTP packets.  Media Providers and Media Consumers can reside in
275  Endpoints or in middleboxes such as Multipoint Control Units
276  (MCUs).  A Media Provider in an Endpoint is usually associated
277  with the generation of media for Media Captures; these Media
278  Captures are typically sourced from cameras, microphones, and the
279  like.  Similarly, the Media Consumer in an Endpoint is usually
280  associated with renderers, such as screens and loudspeakers.  In
281  middleboxes, Media Providers and Consumers can have the form of
282  outputs and inputs, respectively, of RTP mixers, RTP translators,
283  and similar devices.
     Typically, telepresence devices such as
284  Endpoints and middleboxes would perform as both Media Providers
285  and Media Consumers, the former being concerned with those
286  devices' transmitted media and the latter with those devices'
287  received media.  In a few circumstances, a CLUE Endpoint or
288  middlebox includes only Consumer or Provider functionality, such as
289  recorder-type Consumers or webcam-type Providers.

291  The motivations for the framework outlined in this document
292  include the following:

294  (1) Endpoints in telepresence systems typically have multiple Media
295  Capture and Media Render devices, e.g., multiple cameras and
296  screens.  While previous system designs were able to set up calls
297  that would capture media using all cameras and display media on all
298  screens, for example, there was no mechanism that could associate
299  these Media Captures with each other in space and time.

301  (2) The mere fact that there are multiple capture and rendering
302  devices, each of which may be configurable in aspects such as zoom,
303  leads to the difficulty that a variable number of such devices can
304  be used to capture different aspects of a region.  The Capture
305  Scene concept allows for the description of multiple setups for
306  those multiple capture devices that could represent sensible
307  operation points of the physical capture devices in a room, chosen
308  by the operator.  A Consumer can pick and choose from those
309  configurations based on its rendering abilities and inform the
310  Provider about its choices.  Details are provided in section 7.

312  (3) In some cases, physical limitations or other reasons disallow
313  the concurrent use of a device in more than one setup.  For
314  example, the center camera in a typical three-camera conference
315  room can set its zoom objective either to capture only the middle
316  few seats, or all seats of a room, but not both concurrently.
     The
317  Simultaneous Transmission Set concept allows a Provider to signal
318  such limitations.  Simultaneous Transmission Sets are part of the
319  Capture Scene description, and discussed in section 8.

321  (4) Often, the devices in a room do not have the computational
322  complexity or connectivity to deal with multiple encoding options
323  simultaneously, even if each of these options is sensible in
324  certain scenarios, and even if the simultaneous transmission is
325  also sensible (i.e., in case of multicast media distribution to
326  multiple endpoints).  Such constraints can be expressed by the
327  Provider using the Encoding Group concept, described in section 9.

329  (5) Due to the potentially large number of RTP flows required for a
330  Multimedia Conference involving potentially many Endpoints, each of
331  which can have many Media Captures and media renderers, it has
332  become common to multiplex multiple RTP media flows onto the same
333  transport address, so as to avoid using the port number as a
334  multiplexing point and the associated shortcomings such as
335  NAT/firewall traversal.  While the actual mapping of those RTP
336  flows to the header fields of the RTP packets is not a subject of
337  this specification, the large number of possible permutations of
338  sensible options a Media Provider can make available to a Media
339  Consumer makes it desirable to have a mechanism that narrows down
340  the number of possible options that a SIP offer/answer exchange has
341  to consider.  Such information is made available using protocol
342  mechanisms specified in this document and companion documents,
343  although it should be stressed that its use in an implementation is
344  OPTIONAL.  Also, there are aspects of the control of both Endpoints
345  and middleboxes/MCUs that dynamically change during the progress of
346  a call, such as audio-level based screen switching, layout changes,
347  and so on, which need to be conveyed.
     Note that these control
348  aspects are complementary to those specified in traditional SIP
349  based conference management such as BFCP.  An exemplary call flow
350  can be found in section 5.

352  Finally, all this information needs to be conveyed, and the notion
353  of support for it needs to be established.  This is done by the
354  negotiation of a "CLUE channel", a data channel negotiated early
355  during the initiation of a call.  An Endpoint or MCU that rejects
356  the establishment of this data channel, by definition, is not
357  supporting CLUE based mechanisms, whereas an Endpoint or MCU that
358  accepts it is REQUIRED to use it to the extent specified in this
359  document and its companion documents.

361  5. Overview of the Framework/Model

363  The CLUE framework specifies how multiple media streams are to be
364  handled in a telepresence conference.

366  A Media Provider (transmitting Endpoint or MCU) describes specific
367  aspects of the content of the media and the formatting of the media
368  streams it can send in an Advertisement; and the Media Consumer
369  responds to the Media Provider by specifying which content and
370  media streams it wants to receive in a Configure message.  The
371  Provider then transmits the asked-for content in the specified
372  streams.

374  This Advertisement and Configure MUST occur during call initiation
375  but MAY also happen at any time throughout the call, whenever there
376  is a change in what the Consumer wants to receive or (perhaps less
377  common) the Provider can send.

379  An Endpoint or MCU typically acts as both Provider and Consumer at
380  the same time, sending Advertisements and sending Configurations in
381  response to receiving Advertisements.  (It is possible to be just
382  one or the other.)

384  The data model is based around two main concepts: a Capture and an
385  Encoding.  A Media Capture (MC), such as audio or video, describes
386  the content a Provider can send.
     Media Captures are described in
387  terms of CLUE-defined attributes, such as spatial relationships and
388  purpose of the capture.  Providers tell Consumers which Media
389  Captures they can provide, described in terms of the Media Capture
390  attributes.

392  A Provider organizes its Media Captures into one or more Capture
393  Scenes, each representing a spatial region, such as a room.  A
394  Consumer chooses which Media Captures it wants to receive from each
395  Capture Scene.

397  In addition, the Provider can send the Consumer a description of
398  the Individual Encodings it can send in terms of the media
399  attributes of the Encodings, in particular, audio and video
400  parameters such as bandwidth, frame rate, macroblocks per second.
401  Note that this is OPTIONAL, and intended to minimize the number of
402  options a later SDP offer-answer would have to include in the SDP
403  in case of complex setups, as should become clearer shortly when
404  discussing an outline of the call flow.

406  The Provider can also specify constraints on its ability to provide
407  Media, and a sensible design choice for a Consumer is to take these
408  into account when choosing the content and Capture Encodings it
409  requests in the later offer-answer exchange.  Some constraints are
410  due to the physical limitations of devices--for example, a camera
411  may not be able to provide zoom and non-zoom views simultaneously.
412  Other constraints are system based, such as maximum bandwidth and
413  maximum video coding performance measured in macroblocks/second.

415  The following diagram illustrates the information contained in an
416  Advertisement.

418  ...................................................................
419  .       Provider Advertisement                                    .
420  .                                                                 .
421  .       +------------------------+        +--------------------+  .
422  .       |  Capture Scene N       |        |   Simultaneous     |  .
423  .     +-+----------------------+ |        +--------------------+  .
424  .     |    Capture Scene 2     | |                                .
425  .   +-+----------------------+ | |      +----------------------+  .
426  .   | Capture Scene 1        | | |      | Encoding Group N     |  .
427  .   |    +---------------+   | | |    +-+--------------------+ |  .
428  .   |    | Attributes    |   | | |    | Encoding Group 2     | |  .
429  .   |    +---------------+   | | |  +-+--------------------+ | |  .
430  .   |                        | | |  | Encoding Group 1     | | |  .
431  .   |    +----------------+  | | |  | parameters           | | |  .
432  .   |    |  E n t r i e s |  | | |  |                      | | |  .
433  .   |    | +---------+    |  | | |  | +-------------------+| | |  .
434  .   |    | |Attribute|    |  | | |  | |  V i d e o        || | |  .
435  .   |    | +---------+    |  | | |  | |  E n c o d i n g s|| | |  .
436  .   |    |                |  | | |  | |   Encoding 1      || | |  .
437  .   |    | Entry 1        |  | | |  | | (parameters)      || | |  .
438  .   |    | (list of MCs)  |  | |-+  | +-------------------+| | |  .
439  .   |    +----|-|--|------+  |-+    |                      | | |  .
440  .   +---------|-|--|---------+      | +-------------------+| | |  .
441  .             | |  |                | |  A u d i o        || | |  .
442  .             | |  |                | |  E n c o d i n g s|| | |  .
443  .             v |  |                | |   Encoding 1      || | |  .
444  .     +---------|--|--------+       | | (ID,maxBandwidth) || | |  .
445  .     | Media Capture N     |------>| +-------------------+| | |  .
446  .   +-+---------v--|------+ |       |                      | | |  .
447  .   |  Media Capture 2    | |       |                      | |-+  .
448  . +-+--------------v----+ |-------->|                      | |    .
449  . | Media Capture 1     | | |       |                      |-+    .
450  . |  +----------------+ |---------->|                      |      .
451  . |  | Attributes     | | |_+       +----------------------+      .
452  . |  +----------------+ |_+                                       .
453  . +---------------------+                                         .
454  .                                                                 .
455  ...................................................................
456                  Figure 1: Advertisement Structure

458  A very brief outline of the call flow used by a simple system (two
459  Endpoints) in compliance with this document can be described as
460  follows, and as shown in the following figure.
462  +-----------+                     +-----------+
463  | Endpoint1 |                     | Endpoint2 |
464  +----+------+                     +-----+-----+
465       | INVITE (BASIC SDP+CLUECHANNEL)   |
466       |--------------------------------->|
467       |    200 OK (BASIC SDP+CLUECHANNEL)|
468       |<---------------------------------|
469       | ACK                              |
470       |--------------------------------->|
471       |                                  |
472       |<################################>|
473       |     BASIC SDP MEDIA SESSION      |
474       |<################################>|
475       |                                  |
476       |    CONNECT (CLUE CTRL CHANNEL)   |
477       |=================================>|
478       |            ...                   |
479       |<================================>|
480       |   CLUE CTRL CHANNEL ESTABLISHED  |
481       |<================================>|
482       |                                  |
483       | ADVERTISEMENT 1                  |
484       |*********************************>|
485       |                  ADVERTISEMENT 2 |
486       |<*********************************|
487       |                                  |
488       |                      CONFIGURE 1 |
489       |<*********************************|
490       | CONFIGURE 2                      |
491       |*********************************>|
492       |                                  |
493       | REINVITE (UPDATED SDP)           |
494       |--------------------------------->|
495       |              200 OK (UPDATED SDP)|
496       |<---------------------------------|
497       | ACK                              |
498       |--------------------------------->|
499       |                                  |
500       |<################################>|
501       |     UPDATED SDP MEDIA SESSION    |
502       |<################################>|
503       |                                  |
504       v                                  v

506              Figure 2: Basic Information Flow

508  An initial offer/answer exchange establishes a basic media session,
509  for example audio-only, and a CLUE channel between two Endpoints.
510  With the establishment of that channel, the endpoints have
511  consented to use the CLUE protocol mechanisms and, therefore, MUST
512  adhere to the CLUE protocol suite as outlined herein.

514  Over this CLUE channel, the Provider in each Endpoint conveys its
515  characteristics and capabilities by sending an Advertisement as
516  specified herein.  The Advertisement is typically not sufficient to
517  set up all media.  The Consumer in the Endpoint receives the
518  information provided by the Provider, and can use it for two
519  purposes.
     First, it MUST construct and send a CLUE Configure
520  message to tell the Provider what the Consumer wishes to receive.
521  Second, it MAY, but is not necessarily REQUIRED to, use the
522  information provided to tailor the SDP it is going to send during
523  the following SIP offer/answer exchange, and its reaction to SDP it
524  receives in that step.  It is often a sensible implementation
525  choice to do so, as the representation of the media information
526  conveyed over the CLUE channel can dramatically cut down on the
527  size of SDP messages used in the O/A exchange that follows.
528  Spatial relationships associated with the Media can be included in
529  the Advertisement, and it is often sensible for the Media Consumer
530  to take those spatial relationships into account when tailoring the
531  SDP.

533  This CLUE exchange MUST be followed by an SDP offer/answer exchange
534  that not only establishes those aspects of the media that have not
535  been "negotiated" over CLUE, but also has the side effect of
536  setting up the media transmission itself, potentially involving
537  security exchanges, ICE, and whatnot.  This step is plain vanilla
538  SIP, with the exception that the SDP used herein, in most (but not
539  necessarily all) cases can be considerably smaller than the SDP a
540  system would typically need to exchange if there were no pre-
541  established knowledge about the Provider and Consumer
542  characteristics.  (The need for cutting down SDP size is not quite
543  obvious for a point-to-point call involving simple endpoints;
544  however, when considering a large multipoint conference involving
545  many multi-screen/multi-camera endpoints, each of which can operate
546  using multiple codecs for each camera and microphone, it becomes
547  perhaps somewhat more intuitive.)

549  During the lifetime of a call, further exchanges MAY occur over the
550  CLUE channel.
     In some cases, those further exchanges lead to
551  modified system behavior of Provider or Consumer (or both) without
552  any other protocol activity such as further offer/answer exchanges.
553  For example, voice-activated screen switching, signaled over the
554  CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
555  re-invites.  However, in other cases, after the CLUE negotiation an
556  additional offer/answer exchange becomes necessary.  For example,
557  if both sides decide to upgrade the call from a single screen to a
558  multi-screen call and more bandwidth is required for the additional
559  video channels compared to what was previously negotiated using
560  offer/answer, a new O/A exchange is REQUIRED.

562  One aspect of the protocol outlined herein and specified in more
563  detail in companion documents is that it makes available
564  information regarding the Provider's capabilities to deliver Media,
565  and attributes related to that Media such as their spatial
566  relationship, to the Consumer.  The operation of the renderer
567  inside the Consumer is unspecified in that it can choose to ignore
568  some information provided by the Provider, and/or not render media
569  streams available from the Provider (although it MUST follow the
570  CLUE protocol and, therefore, MUST gracefully receive and respond
571  (through a Configure) to the Provider's information).  All CLUE
572  protocol mechanisms are OPTIONAL in the Consumer in the sense that,
573  while the Consumer MUST be able to receive (and, potentially,
574  gracefully acknowledge) CLUE messages, it is free to ignore the
575  information provided therein.  Obviously, this is not a
576  particularly sensible design choice in almost all conceivable
577  cases.
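
As a non-normative illustration of the Advertisement and Configure
exchange described in this section, the following Python sketch
models a reduced Advertisement and a Consumer that answers it with a
Configure.  All class, field, and function names are hypothetical;
they are not taken from the CLUE data model, which is an XML schema
defined in a companion document.  The selection policy (largest
video Capture Scene Entry that fits the Consumer's screen count and
one Simultaneous Transmission Set) is an assumption made for this
example only, as is the three-camera sample Provider.

```python
from dataclasses import dataclass


@dataclass
class MediaCapture:
    capture_id: str   # hypothetical identifier, e.g. "VC0"
    media_type: str   # "audio" or "video"


@dataclass
class CaptureScene:
    scene_id: str
    entries: list     # each entry: list of capture ids representing the scene


@dataclass
class Advertisement:
    captures: dict           # capture_id -> MediaCapture
    scenes: list             # CaptureScene objects
    simultaneous_sets: list  # sets of capture ids that can be sent together


@dataclass
class Configure:
    capture_ids: list        # captures the Consumer asks to receive


def build_configure(adv, max_video):
    """Always answer with a Configure, even an empty one.

    Assumed policy: per scene, take the largest all-video entry that
    needs at most max_video streams and fits one simultaneous set.
    """
    chosen = []
    for scene in adv.scenes:
        best = None
        for ids in scene.entries:
            if any(adv.captures[c].media_type != "video" for c in ids):
                continue  # this sketch only selects video entries
            if len(ids) > max_video:
                continue  # more streams than the Consumer can render
            if not any(set(ids) <= s for s in adv.simultaneous_sets):
                continue  # Provider cannot send these captures together
            if best is None or len(ids) > len(best):
                best = ids
        if best:
            chosen.extend(best)
    return Configure(capture_ids=chosen)


def example_advertisement():
    """Hypothetical three-camera room: VC0..VC2 plus a zoomed-out VC3."""
    caps = {c: MediaCapture(c, "video") for c in ("VC0", "VC1", "VC2", "VC3")}
    scene = CaptureScene("scene1", [["VC0", "VC1", "VC2"], ["VC3"]])
    # The center camera cannot provide VC1 and VC3 at the same time.
    sets = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]
    return Advertisement(caps, [scene], sets)


if __name__ == "__main__":
    adv = example_advertisement()
    for screens in (1, 3):
        print(screens, build_configure(adv, screens).capture_ids)
```

A Consumer that ignores the advertised information entirely would
still return a Configure from build_configure(), possibly with an
empty capture list; the sketch does not model the SDP offer/answer
exchange that follows.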
579 A CLUE-implementing device interoperates with a device that does 580 not support CLUE, because the non-CLUE device does, by definition, 581 not understand the offer of a CLUE channel in the initial 582 offer/answer exchange and, therefore, will reject it. This 583 rejection MUST be used as the indication to the CLUE-implementing 584 device that the other side of the communication is not compliant 585 with CLUE, and to fall back to behavior that does not require CLUE. 587 As for the media, Provider and Consumer have an end-to-end 588 communication relationship with respect to (RTP transported) media; 589 and the mechanisms described herein and in companion documents do 590 not change the aspects of setting up those RTP flows and sessions. 591 In other words, the RTP media sessions conform to the negotiated 592 SDP whether or not CLUE is used. 594 6. Spatial Relationships 596 In order for a Consumer to perform a proper rendering, it is often 597 necessary or at least helpful for the Consumer to have received 598 spatial information about the streams it is receiving. CLUE 599 defines a coordinate system that allows Media Providers to describe 600 the spatial relationships of their Media Captures to enable proper 601 scaling and spatially sensible rendering of their streams. The 602 coordinate system is based on a few principles: 604 o Simple systems which do not have multiple Media Captures to 605 associate spatially need not use the coordinate model. 607 o Coordinates can be either in real, physical units (millimeters), 608 have an unknown scale or have no physical scale. Systems which 609 know their physical dimensions (for example professionally 610 installed Telepresence room systems) MUST always provide those 611 real-world measurements. Systems which don't know specific 612 physical dimensions but still know relative distances MUST use 613 'unknown scale'. 
'No scale' is intended to be used where Media 614 Captures from different devices (with potentially different 615 scales) will be forwarded alongside one another (e.g. in the 616 case of a middle box). 618 * "Millimeters" means the scale is in millimeters. 620 * "Unknown" means the scale is not necessarily millimeters, but 621 the scale is the same for every Capture in the Capture Scene. 623 * "No Scale" means the scale could be different for each 624 capture- an MCU provider that advertises two adjacent 625 captures and picks sources (which can change quickly) from 626 different endpoints might use this value; the scale could be 627 different and changing for each capture. But the areas of 628 capture still represent a spatial relation between captures. 630 o The coordinate system is Cartesian X, Y, Z with the origin at a 631 spatial location of the provider's choosing. The Provider MUST 632 use the same coordinate system with the same scale and origin 633 for all coordinates within the same Capture Scene. 635 The direction of increasing coordinate values is: 636 X increases from Camera-Left to Camera-Right 637 Y increases from front to back 638 Z increases from low to high (i.e. floor to ceiling) 640 7. Media Captures and Capture Scenes 642 This section describes how Providers can describe the content of 643 media to Consumers. 645 7.1. Media Captures 647 Media Captures are the fundamental representations of streams that 648 a device can transmit. What a Media Capture actually represents is 649 flexible: 651 o It can represent the immediate output of a physical source (e.g. 652 camera, microphone) or 'synthetic' source (e.g. laptop computer, 653 DVD player). 
655 o It can represent the output of an audio mixer or video composer 657 o It can represent a concept such as 'the loudest speaker' 659 o It can represent a conceptual position such as 'the leftmost 660 stream' 662 To identify and distinguish between multiple Capture instances, 663 Captures have a unique identity. For instance: VC1, VC2 and AC1, 664 AC2, where VC1 and VC2 refer to two different video captures and 665 AC1 and AC2 refer to two different audio captures. 667 Some key points about Media Captures: 669 . A Media Capture is of a single media type (e.g. audio or 670 video) 671 . A Media Capture is defined in a Capture Scene and is given an 672 advertisement-unique identity. The identity may be referenced 673 outside the Capture Scene that defines it through a Multiple 674 Content Capture (MCC) 675 . A Media Capture is associated with one or more Capture Scene 676 Entries 677 . A Media Capture has exactly one set of spatial information 678 . A Media Capture can be the source of one or more Capture 679 Encodings 681 Each Media Capture can be associated with attributes to describe 682 what it represents. 684 7.1.1. Media Capture Attributes 686 Media Capture Attributes describe information about the Captures. 687 A Provider can use the Media Capture Attributes to describe the 688 Captures for the benefit of the Consumer in the Advertisement 689 message. Media Capture Attributes include: 691 . Spatial information, such as point of capture, point on line 692 of capture, and area of capture, all of which, in combination, 693 define the capture field of, for example, a camera; 694 . Capture multiplexing information (composed/switched video, 695 mono/stereo audio, maximum number of simultaneous encodings 696 per Capture and so on); and 697 . Other descriptive information to help the Consumer choose 698 between captures (description, presentation, view, priority, 699 language, participant information and type). 700 .
Control information for use inside the CLUE protocol suite. 702 The sub-sections below define the Capture attributes. 704 7.1.1.1. Point of Capture 706 The Point of Capture attribute is a field with a single Cartesian 707 (X, Y, Z) point value which describes the spatial location of the 708 capturing device (such as a camera). 710 7.1.1.2. Point on Line of Capture 712 The Point on Line of Capture attribute is a field with a single 713 Cartesian (X, Y, Z) point value which describes a position in space 714 of a second point on the axis of the capturing device; the first 715 point is the Point of Capture (see above). 717 Together, the Point of Capture and Point on Line of Capture define 718 an axis of the capturing device, for example the optical axis of a 719 camera. The Media Consumer can use this information to adjust how 720 it renders the received media if it so chooses. 722 7.1.1.3. Area of Capture 724 The Area of Capture is a field with a set of four (X, Y, Z) points 725 as a value which describes the spatial location of what is being 726 "captured". By comparing the Area of Capture for different Media 727 Captures within the same Capture Scene, a Consumer can determine the 728 spatial relationships between them and render them correctly. 730 The four points MUST be co-planar, forming a quadrilateral, which 731 defines the Plane of Interest for the particular media capture. 733 If the Area of Capture is not specified, it means the Media Capture 734 is not spatially related to any other Media Capture. 736 For a switched capture that switches between different sections 737 within a larger area, the area of capture MUST use coordinates for 738 the larger potential area. 740 7.1.1.4. Mobility of Capture 742 The Mobility of Capture attribute indicates whether or not the 743 point of capture, point on line of capture, and area of capture 744 values stay the same over time, or are expected to change 745 (potentially frequently).
Possible values are static, dynamic, and 746 highly dynamic. 748 An example of "dynamic" is a camera mounted on a stand which is 749 occasionally hand-carried and placed at different positions in 750 order to provide the best angle to capture a work task. A camera 751 worn by a participant who moves around the room is an example of 752 "highly dynamic". In either case, the effect is that the capture 753 point, capture axis and area of capture change with time. 755 The capture point of a static capture MUST NOT move for the life of 756 the conference. The capture point of dynamic captures is 757 characterized by a change in position followed by a reasonable period 758 of stability, on the order of minutes. Highly dynamic 759 captures are characterized by a capture point that is constantly 760 moving. If the "area of capture", "capture point" and "line of 761 capture" attributes are included with dynamic or highly dynamic 762 captures, they indicate spatial information at the time of the 763 Advertisement. 765 7.1.1.5. Audio Channel Format 767 The Audio Channel Format attribute is a field with enumerated 768 values which describes the method of encoding used for audio. A 769 value of 'mono' means the Audio Capture has one channel. 'stereo' 770 means the Audio Capture has two audio channels, left and right. 772 This attribute applies only to Audio Captures. A single stereo 773 capture is different from two mono captures that have a left-right 774 spatial relationship. A stereo capture maps to a single Capture 775 Encoding, while each mono audio capture maps to a separate Capture 776 Encoding. 778 7.1.1.6. Max Capture Encodings 780 The Max Capture Encodings attribute is an optional attribute 781 indicating the maximum number of Capture Encodings that can be 782 simultaneously active for the Media Capture. The number of 783 simultaneous Capture Encodings is also limited by the restrictions 784 of the Encoding Group for the Media Capture. 786 7.1.1.7.
Description 788 The Description attribute is a human-readable description of the 789 Capture, which could be in multiple languages. 791 7.1.1.8. Presentation 793 The Presentation attribute indicates that the capture originates 794 from a presentation device, that is one that provides supplementary 795 information to a conference through slides, video, still images, 796 data etc. Where more information is known about the capture it MAY 797 be expanded hierarchically to indicate the different types of 798 presentation media, e.g. presentation.slides, presentation.image 799 etc. 801 Note: It is expected that a number of keywords will be defined that 802 provide more detail on the type of presentation. 804 7.1.1.9. View 806 The View attribute is a field with enumerated values, indicating 807 what type of view the Capture relates to. The Consumer can use 808 this information to help choose which Media Captures it wishes to 809 receive. The value MUST be one of: 811 Room - Captures the entire scene 813 Table - Captures the conference table with seated participants 815 Individual - Captures an individual participant 817 Lectern - Captures the region of the lectern including the 818 presenter, for example in a classroom style conference room 819 Audience - Captures a region showing the audience in a classroom 820 style conference room 822 7.1.1.10. Language 824 The language attribute indicates one or more languages used in the 825 content of the Media Capture. Captures MAY be offered in different 826 languages in case of multilingual and/or accessible conferences. A 827 Consumer can use this attribute to differentiate between them and 828 pick the appropriate one. 830 Note that the Language attribute is defined and meaningful both for 831 audio and video captures. In case of audio captures, the meaning 832 is obvious. For a video capture, "Language" could, for example, be 833 sign interpretation or text. 835 7.1.1.11. 
Participant Information 837 The participant information attribute allows a Provider to provide 838 specific information regarding the conference participants in a 839 Capture. The Provider may gather the information automatically or 840 manually from a variety of sources; however, the xCard [RFC6351] 841 format is used to convey the information. This allows various 842 information such as Identification information (section 843 6.2/[RFC6350]), Communication Information (section 6.4/[RFC6350]) 844 and Organizational information (section 6.6/[RFC6350]) to be 845 communicated. A Consumer may then automatically (i.e. via a 846 policy) or manually select Captures based on information about who 847 is in a Capture. It also allows a Consumer to render information 848 regarding the participants or to use it for further processing. 850 The Provider may supply a minimal set of information or a larger 851 set of information. However, it MUST be compliant with [RFC6350] and 852 supply a "VERSION" and "FN" property. A Provider may supply 853 multiple xCards per Capture of any KIND (section 6.1.4/[RFC6350]). 855 In order to keep CLUE messages compact, the Provider SHOULD use a 856 URI to point to any LOGO, PHOTO or SOUND contained in the xCard 857 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 858 message. 860 7.1.1.12. Participant Type 862 The participant type attribute indicates the type of participant(s) 863 contained in the capture in the conference with respect to the 864 meeting agenda. As a capture may include multiple participants, the 865 attribute may contain multiple values. However, values shall not be 866 repeated within the attribute. 868 An Advertiser associates the participant type with an individual 869 capture when it knows that a particular type is in the capture. If 870 an Advertiser cannot link a particular type with some certainty to 871 a capture then it is not included.
A Consumer, on reception of a 872 capture with a participant type attribute, knows with some certainty 873 that the capture contains that participant type. The capture may 874 contain other participant types but the Advertiser has not been 875 able to determine that this is the case. 877 The types of Captured participants include: 879 . Chairman - the participant responsible for running the 880 conference according to the agenda. 881 . Vice-Chairman - the participant responsible for assisting the 882 chairman in running the meeting. 883 . Minute Taker - the participant responsible for recording the 884 minutes of the conference. . Member - the participant has no 885 particular responsibilities with respect to running the 886 meeting. 887 . Presenter - the participant is scheduled on the agenda to make 888 a presentation in the meeting. Note: This is not related to 889 any "active speaker" functionality. 890 . Translator - the participant is providing some form of 891 translation or commentary in the meeting. 892 . Timekeeper - the participant is responsible for maintaining 893 the meeting schedule. 895 Furthermore, the participant type attribute may contain one or more 896 strings allowing the Provider to indicate custom meeting-specific 897 roles. 899 7.1.1.13. Priority 901 The priority attribute indicates a relative priority between 902 different Media Captures. The Provider sets this priority, and the 903 Consumer MAY use the priority to help decide which captures it 904 wishes to receive. 906 The "priority" attribute is an integer which indicates a relative 907 priority between Captures. For example, it is possible to assign a 908 priority between two presentation Captures that would allow a 909 remote endpoint to determine which presentation is more important. 910 Priority is assigned at the individual capture level. It represents 911 the Provider's view of the relative priority between Captures with 912 a priority.
The same priority number MAY be used across multiple 913 Captures. It indicates they are equally important. If no priority 914 is assigned, no assumptions regarding the relative importance of the 915 Capture can be made. 917 7.1.1.14. Embedded Text 919 The Embedded Text attribute indicates that a Capture provides 920 embedded textual information. For example, the video Capture MAY 921 contain speech-to-text information composed with the video image. 922 This attribute is only applicable to video Captures and 923 presentation streams with visual information. 925 7.1.1.15. Related To 927 The Related To attribute indicates the Capture contains additional 928 complementary information related to another Capture. The value 929 indicates the identity of the other Capture to which this Capture 930 is providing additional information. 932 For example, a conference can utilize translators or facilitators 933 that provide an additional audio stream (i.e. a translation or 934 description or commentary of the conference). Where multiple 935 captures are available, it may be advantageous for a Consumer to 936 select a complementary Capture instead of or in addition to a 937 Capture it relates to. 939 7.2. Multiple Content Capture 941 The MCC indicates that one or more Single Media Captures are 942 contained in one Media Capture. Only one Capture type (e.g. audio 943 or video) is allowed in each MCC instance. The MCC may contain 944 a reference to the Single Media Captures (which may have their own 945 attributes) as well as attributes associated with the MCC itself. 946 An MCC may also contain other MCCs. The MCC MAY reference Captures 947 from within the Capture Scene that defines it or from other Capture 948 Scenes. No ordering is implied by the order that Captures appear 949 within an MCC. An MCC MAY contain no references to other Captures to 950 indicate that the MCC contains content from multiple sources but no 951 information regarding those sources is given.
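The MCC rules in the preceding paragraph (a single media type, references to Single Media Captures or to other MCCs, no implied ordering, and an optionally empty reference list) can be sketched as a small data structure. This is only an illustrative model: the class and field names (MediaCapture, MultipleContentCapture, sources, validate) are assumptions made for this sketch and are not taken from the CLUE data model.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class MediaCapture:
    """A Single Media Capture (hypothetical model, not the CLUE schema)."""
    capture_id: str   # e.g. "VC1" or "AC1"
    media_type: str   # e.g. "audio" or "video"


@dataclass
class MultipleContentCapture:
    """An MCC: one Capture of a single media type referencing other Captures."""
    capture_id: str
    media_type: str
    # References may be Single Media Captures or other MCCs; list order has
    # no meaning. An empty list models "multiple sources, no details given".
    sources: List[Union[MediaCapture, "MultipleContentCapture"]] = field(
        default_factory=list
    )

    def validate(self) -> None:
        """Raise ValueError if any referenced Capture has another media type."""
        for src in self.sources:
            if src.media_type != self.media_type:
                raise ValueError(
                    f"{self.capture_id} is {self.media_type} but references "
                    f"{src.capture_id} of type {src.media_type}"
                )
            if isinstance(src, MultipleContentCapture):
                src.validate()  # nested MCCs follow the same rule
```

For instance, MultipleContentCapture("MCC1", "video", [MediaCapture("VC1", "video"), MediaCapture("VC2", "video")]).validate() passes, whereas adding an audio capture to the same list raises ValueError.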
953 One or more MCCs may also be specified in a CSE. This allows an 954 Advertiser to indicate that several MCC captures are used to 955 represent a capture scene. Table 14 provides an example of this 956 case. 958 As outlined in section 7.1, each instance of the MCC has its own 959 Capture identity, e.g. MCC1. It allows all the individual captures 960 contained in the MCC to be referenced by a single MCC identity. 962 The example below shows the use of a Multiple Content Capture: 964 +-----------------------+---------------------------------+ 965 | Capture Scene #1 | | 966 +-----------------------|---------------------------------+ 967 | VC1 | {attributes} | 968 | VC2 | {attributes} | 969 | VCn | {attributes} | 970 | MCC1(VC1,VC2,...VCn) | {attributes} | 971 | CSE(MCC1) | | 972 +---------------------------------------------------------+ 974 Table 1: Multiple Content Capture concept 976 This indicates that MCC1 is a single capture that contains the 977 Captures VC1, VC2, ... VCn according to any MCC1 attributes. 979 7.2.1. MCC Attributes 981 Attributes may be associated with the MCC instance and the Single 982 Media Captures that the MCC references. A Provider should avoid 983 providing conflicting attribute values between the MCC and Single 984 Media Captures. Where there is conflict, the attributes of the MCC 985 override any that may be present in the individual captures. 987 A Provider MAY include as much or as little of the original source 988 Capture information as it requires. 990 There are MCC-specific attributes that MUST only be used with 991 Multiple Content Captures. These are described in the sections 992 below. The attributes described in section 7.1.1 MAY also be used 993 with MCCs. 995 The spatially related attributes of an MCC indicate its area of 996 capture and point of capture within the scene, just like any other 997 media capture. The spatial information does not imply anything 998 about how other captures are composed within an MCC.
1000 For example: A virtual scene could be constructed for the MCC 1001 capture with two Video Captures with a "MaxCaptures" attribute set 1002 to 2 and an "Area of Capture" attribute provided with an overall 1003 area. Each of the individual Captures could then also include an 1004 "Area of Capture" attribute with a sub-set of the overall area. 1005 The Consumer would then know how each capture is related to others 1006 within the scene, but not the relative position of the individual 1007 captures within the composed capture. 1009 +-----------------------+---------------------------------+ 1010 | Capture Scene #1 | | 1011 +-----------------------|---------------------------------+ 1012 | VC1 | AreaofCapture=(0,0,0)(9,0,0) | 1013 | | (0,0,9)(9,0,9) | 1014 | VC2 | AreaofCapture=(10,0,0)(19,0,0) | 1015 | | (10,0,9)(19,0,9) | 1016 | MCC1(VC1,VC2) | MaxCaptures=2 | 1017 | | AreaofCapture=(0,0,0)(19,0,0) | 1018 | | (0,0,9)(19,0,9) | 1019 | CSE(MCC1) | | 1020 +---------------------------------------------------------+ 1022 Table 2: Example of MCC and Single Media Capture attributes 1024 The sections below describe the MCC only attributes. 1026 7.2.1.1. Maximum Number of Captures within a MCC 1028 The Maximum Number of Captures MCC attribute indicates the maximum 1029 number of individual captures that may appear in a Capture Encoding 1030 at a time. The actual number at any given time can be less than 1031 this maximum. It may be used to derive how the Single Media 1032 Captures within the MCC are composed / switched with regards to 1033 space and time. 1035 Max Captures MAY be set to one so that only content related to one 1036 of the sources are shown in the MCC Capture Encoding at a time or 1037 it may be set to any value up to the total number of Source Media 1038 Captures in the MCC. 1040 If this attribute is not set then as default it is assumed that all 1041 source content can appear concurrently in the Capture Encoding 1042 associated with the MCC. 
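The MaxCaptures rules above, including the default when the attribute is absent, can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name and the choice to raise an error for values outside the range 1..N are illustrative and not mandated by this document, and the helper assumes the MCC actually lists its N source Captures.

```python
from typing import Optional


def effective_max_captures(num_sources: int,
                           max_captures: Optional[int] = None) -> int:
    """Return how many source Captures may appear in the MCC's Capture
    Encoding at the same time (hypothetical helper)."""
    if max_captures is None:
        # Attribute not set: all source content may appear concurrently.
        return num_sources
    if not 1 <= max_captures <= num_sources:
        # Assumption of this sketch: out-of-range values are rejected.
        raise ValueError("MaxCaptures must be between 1 and the number of sources")
    return max_captures
```

With three Video Captures in the MCC, effective_max_captures(3) yields 3 (composed or otherwise concurrent content), while effective_max_captures(3, 1) yields 1 (switched content).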
1044 For example: The use of MaxCaptures equal to 1 on an MCC with three 1045 Video Captures VC1, VC2 and VC3 would indicate that the Advertiser 1046 switches the capture encoding between VC1, VC2 and VC3, as 1047 there may be only a maximum of one capture at a time. 1049 7.2.1.2. Policy 1051 The Policy MCC Attribute indicates the criteria that the Provider 1052 uses to determine when and/or where media content appears in the 1053 Capture Encoding related to the MCC. 1055 The attribute is in the form of a token that indicates the policy 1056 and an index representing an instance of the policy. 1058 The tokens are: 1060 SoundLevel - This indicates that the content of the MCC is 1061 determined by a sound level detection algorithm. For example: the 1062 loudest (active) speaker is contained in the MCC. 1064 RoundRobin - This indicates that the content of the MCC is 1065 determined by a time-based algorithm. For example: the Provider 1066 provides content from a particular source for a period of time and 1067 then provides content from another source and so on. 1069 An index is used to represent an instance in the policy setting. An 1070 index of 0 represents the most current instance of the policy, e.g. 1071 the active speaker, 1 represents the previous instance, e.g. the 1072 previous active speaker, and so on. 1074 The following example shows a case where the Provider provides two 1075 media streams, one showing the active speaker and a second stream 1076 showing the previous speaker. 1078 +-----------------------+---------------------------------+ 1079 | Capture Scene #1 | | 1080 +-----------------------|---------------------------------+ 1081 | VC1 | | 1082 | VC2 | | 1083 | MCC1(VC1,VC2) | Policy=SoundLevel:0 | 1084 | | MaxCaptures=1 | 1085 | MCC2(VC1,VC2) | Policy=SoundLevel:1 | 1086 | | MaxCaptures=1 | 1087 | CSE(MCC1,MCC2) | | 1088 +---------------------------------------------------------+ 1090 Table 3: Example Policy MCC attribute usage 1092 7.2.1.3.
Synchronisation Identity 1094 The Synchronisation Identity MCC attribute indicates how the 1095 individual captures in multiple MCC captures are synchronised. To 1096 indicate that the Capture Encodings associated with MCCs contain 1097 captures from the source at the same time a Provider should set the 1098 same Synchronisation Identity on each of the concerned MCCs. It is 1099 the provider that determines what the source for the Captures is, 1100 so a provider can choose how to group together Single Media 1101 Captures for the purpose of keeping them synchronized according to 1102 the SynchronisationID attribute. For example when the provider is 1103 in an MCU it may determine that each separate CLUE endpoint is a 1104 remote source of media. The Synchronisation Identity may be used 1105 across media types, i.e. to synchronize audio and video related 1106 MCCs. 1108 Without this attribute it is assumed that multiple MCCs may provide 1109 content from different sources at any particular point in time. 
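As a rough illustration of how a Consumer might interpret this attribute, the sketch below groups MCC identities by their Synchronisation Identity; MCCs without the attribute are left ungrouped, matching the default described above. The function name and the dictionary-based input are assumptions made for this sketch only.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_sync_id(mccs: Dict[str, Optional[int]]) -> Dict[int, List[str]]:
    """Map each SynchronisationID to the MCC identities that carry it."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for mcc_id, sync_id in mccs.items():
        if sync_id is not None:  # no attribute: MCC is not synchronised
            groups[sync_id].append(mcc_id)
    return dict(groups)


# Hypothetical input: three MCCs share SynchronisationID 1, one has none.
advertised = {"MCC1": 1, "MCC2": 1, "MCC3": None, "MCC4": 1}
print(group_by_sync_id(advertised))
```

MCCs that end up in the same group are expected to carry content from the same source at any given time; an MCC that appears in no group, such as MCC3 here, may switch independently.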
1111 For example: 1113 +=======================+=================================+ 1114 | Capture Scene #1 | | 1115 +-----------------------|---------------------------------+ 1116 | VC1 | Description=Left | 1117 | VC2 | Description=Centre | 1118 | VC3 | Description=Right | 1119 | AC1 | Description=room | 1120 | CSE(VC1,VC2,VC3) | | 1121 | CSE(AC1) | | 1122 +=======================+=================================+ 1123 | Capture Scene #2 | | 1124 +-----------------------|---------------------------------+ 1125 | VC4 | Description=Left | 1126 | VC5 | Description=Centre | 1127 | VC6 | Description=Right | 1128 | AC2 | Description=room | 1129 | CSE(VC4,VC5,VC6) | | 1130 | CSE(AC2) | | 1131 +=======================+=================================+ 1132 | Capture Scene #3 | | 1133 +-----------------------|---------------------------------+ 1134 | VC7 | | 1135 | AC3 | | 1136 +=======================+=================================+ 1137 | Capture Scene #4 | | 1138 +-----------------------|---------------------------------+ 1139 | VC8 | | 1140 | AC4 | | 1141 +=======================+=================================+ 1142 | Capture Scene #5 | | 1143 +-----------------------|---------------------------------+ 1144 | MCC1(VC1,VC4,VC7) | SynchronisationID=1 | 1145 | | MaxCaptures=1 | 1146 | MCC2(VC2,VC5,VC8) | SynchronisationID=1 | 1147 | | MaxCaptures=1 | 1148 | MCC3(VC3,VC6) | MaxCaptures=1 | 1149 | MCC4(AC1,AC2,AC3,AC4) | SynchronisationID=1 | 1150 | | MaxCaptures=1 | 1151 | CSE(MCC1,MCC2,MCC3) | | 1152 | CSE(MCC4) | | 1153 +=======================+=================================+ 1155 Table 4: Example Synchronisation Identity MCC attribute usage 1157 The above Advertisement would indicate that MCC1, MCC2, MCC3 and 1158 MCC4 make up a Capture Scene. There would be four capture 1159 encodings (one for each MCC).
Because MCC1 and MCC2 have the same 1160 SynchronisationID, each encoding from MCC1 and MCC2 respectively 1161 would together have content from only Capture Scene 1 or only 1162 Capture Scene 2 or the combination of VC7 and VC8 at a particular 1163 point in time. In this case the provider has decided the sources 1164 to be synchronized are Scene #1, Scene #2, and Scene #3 and #4 1165 together. The encoding from MCC3 would not be synchronised with 1166 MCC1 or MCC2. As MCC4 also has the same Synchronisation Identity 1167 as MCC1 and MCC2 the content of the audio encoding will be 1168 synchronised with the video content. 1170 7.3. Capture Scene 1172 In order for a Provider's individual Captures to be used 1173 effectively by a Consumer, the provider organizes the Captures into 1174 one or more Capture Scenes, with the structure and contents of 1175 these Capture Scenes being sent from the Provider to the Consumer 1176 in the Advertisement. 1178 A Capture Scene is a structure representing a spatial region 1179 containing one or more Capture Devices, each capturing media 1180 representing a portion of the region. A Capture Scene includes one 1181 or more Capture Scene entries, with each entry including one or 1182 more Media Captures. A Capture Scene represents, for example, the 1183 video image of a group of people seated next to each other, along 1184 with the sound of their voices, which could be represented by some 1185 number of VCs and ACs in the Capture Scene Entries. A middle box 1186 can also describe in Capture Scenes what it constructs from media 1187 Streams it receives. 1189 A Provider MAY advertise one or more Capture Scenes. What 1190 constitutes an entire Capture Scene is up to the Provider. A 1191 simple Provider might typically use one Capture Scene for 1192 participant media (live video from the room cameras) and another 1193 Capture Scene for a computer generated presentation. 
In more 1194 complex systems, the use of additional Capture Scenes is also 1195 sensible. For example, a classroom may advertise two Capture 1196 Scenes involving live video, one including only the camera 1197 capturing the instructor (and associated audio), the other 1198 including camera(s) capturing students (and associated audio). 1200 A Capture Scene MAY (and typically will) include more than one type 1201 of media. For example, a Capture Scene can include several Capture 1202 Scene Entries for Video Captures, and several Capture Scene Entries 1203 for Audio Captures. A particular Capture MAY be included in more 1204 than one Capture Scene Entry. 1206 A provider MAY express spatial relationships between Captures that 1207 are included in the same Capture Scene. However, there is not 1208 necessarily the same spatial relationship between Media Captures 1209 that are in different Capture Scenes. In other words, Capture 1210 Scenes can use their own spatial measurement system as outlined 1211 above in section 6. 1213 A Provider arranges Captures in a Capture Scene to help the 1214 Consumer choose which captures it wants to render. The Capture 1215 Scene Entries in a Capture Scene are different alternatives the 1216 Provider is suggesting for representing the Capture Scene. The 1217 order of Capture Scene Entries within a Capture Scene has no 1218 significance. The Media Consumer can choose to receive all Media 1219 Captures from one Capture Scene Entry for each media type (e.g. 1221 audio and video), or it can pick and choose Media Captures 1222 regardless of how the Provider arranges them in Capture Scene 1223 Entries. Different Capture Scene Entries of the same media type 1224 are not necessarily mutually exclusive alternatives. 
Also note 1225 that the presence of multiple Capture Scene Entries (with 1226 potentially multiple encoding options in each entry) in a given 1227 Capture Scene does not necessarily imply that a Provider is able to 1228 serve all the associated media simultaneously (although the 1229 construction of such an over-rich Capture Scene is probably not 1230 sensible in many cases). What a Provider can send simultaneously 1231 is determined through the Simultaneous Transmission Set mechanism, 1232 described in section 8. 1234 Captures within the same Capture Scene entry MUST be of the same 1235 media type - it is not possible to mix audio and video captures in 1236 the same Capture Scene Entry, for instance. The Provider MUST be 1237 capable of encoding and sending all Captures in a single Capture 1238 Scene Entry simultaneously. The order of Captures within a Capture 1239 Scene Entry has no significance. A Consumer can decide to receive 1240 all the Captures in a single Capture Scene Entry, but a Consumer 1241 could also decide to receive just a subset of those captures. A 1242 Consumer can also decide to receive Captures from different Capture 1243 Scene Entries, all subject to the constraints set by Simultaneous 1244 Transmission Sets, as discussed in section 8. 1246 When a Provider advertises a Capture Scene with multiple entries, 1247 it is essentially signaling that there are multiple representations 1248 of the same Capture Scene available. In some cases, these multiple 1249 representations would typically be used simultaneously (for 1250 instance a "video entry" and an "audio entry"). In some cases the 1251 entries would conceptually be alternatives (for instance an entry 1252 consisting of three Video Captures covering the whole room versus 1253 an entry consisting of just a single Video Capture covering only 1254 the center of a room). 
In this latter example, one sensible choice 1255 for a Consumer would be to indicate (through its Configure and 1256 possibly through an additional offer/answer exchange) the Captures 1257 of the Capture Scene Entry that most closely match the 1258 Consumer's number of display devices or screen layout. 1260 The following is an example of four potential Capture Scene Entries 1261 for an endpoint-style Provider: 1263 1. (VC0, VC1, VC2) - left, center and right camera Video Captures 1265 2. (VC3) - Video Capture associated with the loudest room segment 1266 3. (VC4) - Video Capture with a zoomed-out view of all people in the room 1268 4. (AC0) - main audio 1270 The first entry in this Capture Scene example is a list of Video 1271 Captures which have a spatial relationship to each other. 1272 Determination of the order of these captures (VC0, VC1 and VC2) for 1273 rendering purposes is accomplished through use of their Area of 1274 Capture attributes. The second entry (VC3) and the third entry 1275 (VC4) are alternative representations of the same room's video, 1276 which might be better suited to some Consumers' rendering 1277 capabilities. The inclusion of the Audio Capture in the same 1278 Capture Scene indicates that AC0 is associated with all of those 1279 Video Captures, meaning it comes from the same spatial region. 1280 Therefore, if audio were to be rendered at all, this audio would be 1281 the correct choice irrespective of which Video Captures were 1282 chosen. 1284 7.3.1. Capture Scene attributes 1286 Capture Scene Attributes can be applied to Capture Scenes as well 1287 as to individual media captures. Attributes specified at this 1288 level apply to all constituent Captures. Capture Scene attributes 1289 include: 1291 . Human-readable description of the Capture Scene, which could 1292 be in multiple languages; 1293 . xCard scene information; and 1294 . Scale information (millimeters, unknown, no scale), as 1295 described in Section 6. 1297 7.3.1.1.
Scene Information 1299 The Scene information attribute provides information regarding the 1300 Capture Scene rather than individual participants. The Provider 1301 may gather the information automatically or manually from a 1302 variety of sources. The scene information attribute allows a 1303 Provider to indicate information such as organizational or 1304 geographic information, allowing a Consumer to determine which 1305 Capture Scenes are of interest in order to then perform Capture 1306 selection. It also allows a Consumer to render information 1307 regarding the Scene or to use it for further processing. 1309 As per section 7.1.1.11, the xCard format is used to convey this 1310 information and the Provider may supply a minimal set of 1311 information or a larger set of information. 1313 In order to keep CLUE messages compact the Provider SHOULD use a 1314 URI to point to any LOGO, PHOTO or SOUND contained in the xCard 1315 rather than transmitting the LOGO, PHOTO or SOUND data in a CLUE 1316 message. 1318 7.3.2. Capture Scene Entry attributes 1320 A Capture Scene can include one or more Capture Scene Entries in 1321 addition to the Capture Scene-wide attributes described above. 1322 Capture Scene Entry attributes apply to the Capture Scene Entry as 1323 a whole, i.e. to all Captures that are part of the Capture Scene 1324 Entry. 1326 Capture Scene Entry attributes include: 1328 . Human-readable description of the Capture Scene Entry, which 1329 could be in multiple languages. 1331 8. Simultaneous Transmission Set Constraints 1333 In many practical cases, a Provider has constraints or limitations 1334 on its ability to send Captures simultaneously. One type of 1335 limitation is caused by the physical limitations of capture 1336 mechanisms; these constraints are represented by a simultaneous 1337 transmission set. The second type of limitation reflects the 1338 encoding resources available, such as bandwidth or video encoding 1339 throughput (macroblocks/second).
This type of constraint is 1340 captured by encoding groups, discussed below. 1342 Some Endpoints or MCUs can send multiple Captures simultaneously; 1343 however, sometimes there are constraints that limit which Captures 1344 can be sent simultaneously with other Captures. It may not be 1345 possible to use a device in different ways at the same time. Provider 1346 Advertisements are made so that the Consumer can choose one of 1347 several possible mutually exclusive usages of the device. This 1348 type of constraint is expressed in a Simultaneous Transmission Set, 1349 which lists all the Captures of a particular media type (e.g. 1350 audio, video, text) that can be sent at the same time. There are 1351 different Simultaneous Transmission Sets for each media type in the 1352 Advertisement. This is easier to show in an example. 1354 Consider the example of a room system where there are three cameras, 1355 each of which can send a separate capture covering two persons: 1356 VC0, VC1, VC2. The middle camera can also zoom out (using an 1357 optical zoom lens) and show all six persons, VC3. But the middle 1358 camera cannot be used in both modes at the same time - it has to 1359 either show the space where two participants sit or the whole six 1360 seats, but not both at the same time. As a result, VC1 and VC3 1361 cannot be sent simultaneously. 1363 Simultaneous Transmission Sets are expressed as sets of the Media 1364 Captures that the Provider could transmit at the same time (though, 1365 in some cases, it is not intuitive to do so). If a Multiple 1366 Content Capture is included in a Simultaneous Transmission Set it 1367 indicates that the Capture Encoding associated with it could be 1368 transmitted at the same time as the other Captures within the 1369 Simultaneous Transmission Set. It does not imply that the Single 1370 Media Captures contained in the Multiple Content Capture could all 1371 be transmitted at the same time.
1373 In this example the two simultaneous sets are shown in Table 5. If 1374 a Provider advertises one or more mutually exclusive Simultaneous 1375 Transmission Sets, then for each media type the Consumer MUST 1376 ensure that it chooses Media Captures that lie wholly within one of 1377 those Simultaneous Transmission Sets. 1379 +-------------------+ 1380 | Simultaneous Sets | 1381 +-------------------+ 1382 | {VC0, VC1, VC2} | 1383 | {VC0, VC3, VC2} | 1384 +-------------------+ 1386 Table 5: Two Simultaneous Transmission Sets 1388 A Provider OPTIONALLY can include the simultaneous sets in its 1389 provider Advertisement. These simultaneous set constraints apply 1390 across all the Capture Scenes in the Advertisement. It is a syntax 1391 conformance requirement that the simultaneous transmission sets 1392 MUST allow all the media captures in any particular Capture Scene 1393 Entry to be used simultaneously. 1395 For shorthand convenience, a Provider MAY describe a Simultaneous 1396 Transmission Set in terms of Capture Scene Entries and Capture 1397 Scenes. If a Capture Scene Entry is included in a Simultaneous 1398 Transmission Set, then all Media Captures in the Capture Scene 1399 Entry are included in the Simultaneous Transmission Set. If a 1400 Capture Scene is included in a Simultaneous Transmission Set, then 1401 all its Capture Scene Entries (of the corresponding media type) are 1402 included in the Simultaneous Transmission Set. The end result 1403 reduces to a set of Media Captures in either case. 1405 If an Advertisement does not include Simultaneous Transmission 1406 Sets, then the Provider MUST be able to provide all Capture Scenes 1407 simultaneously. If multiple Capture Scene Entries are in a Capture 1408 Scene then the Consumer chooses at most one Capture Scene Entry per 1409 Capture Scene for each media type.
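The rule that a Consumer's chosen Media Captures lie wholly within one advertised Simultaneous Transmission Set can be sketched as a subset test. This is only an illustration for a single media type, using the two video sets from the room-system example. The function name `within_one_set` and the constant `VIDEO_SETS` are hypothetical and are not defined by CLUE.

```python
def within_one_set(chosen, simultaneous_sets):
    """Return True if every chosen capture lies in a single advertised set.

    An Advertisement without Simultaneous Transmission Sets places no
    simultaneity restriction, so an empty list accepts any choice.
    """
    if not simultaneous_sets:
        return True
    chosen = set(chosen)
    # The choice is acceptable if it is a subset of at least one set.
    return any(chosen <= candidate for candidate in simultaneous_sets)


# Video sets from the room-system example (illustrative constant name).
VIDEO_SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]

print(within_one_set({"VC0", "VC2"}, VIDEO_SETS))  # allowed by either set
print(within_one_set({"VC1", "VC3"}, VIDEO_SETS))  # middle camera conflict
```

A real implementation would presumably run this test once per media type, since the sets are defined separately for each.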
1411 If an Advertisement includes multiple Capture Scene Entries in a 1412 Capture Scene then the Consumer MAY choose one Capture Scene Entry 1413 for each media type, or MAY choose individual Captures based on the 1414 Simultaneous Transmission Sets. 1416 9. Encodings 1418 Individual encodings and encoding groups are CLUE's mechanisms 1419 allowing a Provider to signal its limitations for sending Captures, 1420 or combinations of Captures, to a Consumer. Consumers can map the 1421 Captures they want to receive onto the Encodings, with encoding 1422 parameters they want. As for the relationship between the CLUE- 1423 specified mechanisms based on Encodings and the SIP Offer-Answer 1424 exchange, please refer to section 4. 1426 9.1. Individual Encodings 1428 An Individual Encoding represents a way to encode a Media Capture 1429 to become a Capture Encoding, to be sent as an encoded media stream 1430 from the Provider to the Consumer. An Individual Encoding has a 1431 set of parameters characterizing how the media is encoded. 1433 Different media types have different parameters, and different 1434 encoding algorithms may have different parameters. An Individual 1435 Encoding can be assigned to at most one Capture Encoding at any 1436 given time. 1438 The parameters of an Individual Encoding represent the maximum 1439 values for certain aspects of the encoding. A particular 1440 instantiation into a Capture Encoding MAY use lower values than 1441 these maximums if that is applicable for the media in question. 1442 For example, most video codec specifications require a conformant 1443 decoder to decode resolutions and frame rates smaller than what has 1444 been negotiated as a maximum, so downgrading the CLUE maximum 1445 values for macroblocks/second is appropriate. On the other hand, 1446 downgrading the sample rate of G.711 audio below 8kHz is not 1447 specified in G.711 and therefore not applicable in the sense 1448 described here. 
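As a rough sketch of the two properties just described, the check below treats an Individual Encoding's parameter as an upper bound and rejects any attempt to assign one Individual Encoding to more than one Capture Encoding at a time. The class names, field names, and the restriction to a single bandwidth parameter are assumptions made for illustration and are not part of the CLUE data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndividualEncoding:
    encode_id: str
    max_bandwidth: int  # bits per second; an upper bound, not a target


@dataclass(frozen=True)
class CaptureEncoding:
    capture_id: str
    encode_id: str
    bandwidth: int  # bits per second actually requested


def violations(encodings, capture_encodings):
    """List the problems found; an empty list means the request is acceptable."""
    limits = {enc.encode_id: enc for enc in encodings}
    problems = []
    used = set()
    for ce in capture_encodings:
        enc = limits.get(ce.encode_id)
        if enc is None:
            problems.append(f"{ce.encode_id}: unknown encoding")
            continue
        # An Individual Encoding backs at most one Capture Encoding at a time.
        if ce.encode_id in used:
            problems.append(f"{ce.encode_id}: already assigned")
        used.add(ce.encode_id)
        # Values at or below the advertised maximum are acceptable.
        if ce.bandwidth > enc.max_bandwidth:
            problems.append(f"{ce.encode_id}: bandwidth above maximum")
    return problems
```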
1450 Individual Encoding parameters are represented in SDP [RFC4566], 1451 not in CLUE messages. For example, for a video encoding using 1452 H.26x compression technologies, this can include parameters such 1453 as: 1455 . Maximum bandwidth; 1456 . Maximum picture size in pixels; 1457 . Maximum number of pixels to be processed per second; 1459 The bandwidth parameter is the only one that specifically relates 1460 to a CLUE Advertisement, as it can be further constrained by the 1461 maximum group bandwidth in an Encoding Group. 1463 9.2. Encoding Group 1465 An Encoding Group includes a set of one or more Individual 1466 Encodings, and parameters that apply to the group as a whole. By 1467 grouping multiple individual Encodings together, an Encoding Group 1468 describes additional constraints on bandwidth for the group. 1470 The Encoding Group data structure contains: 1472 . Maximum bitrate for all encodings in the group combined; 1473 . A list of identifiers for audio and video encodings, 1474 respectively, belonging to the group. 1476 When the Individual Encodings in a group are instantiated into 1477 Capture Encodings, each Capture Encoding has a bitrate that MUST be 1478 less than or equal to the maximum bitrate for the particular individual 1479 encoding. The "maximum bitrate for all encodings in the group" 1480 parameter gives the additional restriction that the sum of all the 1481 individual capture encoding bitrates MUST be less than or equal to 1482 this group value. 1484 The following diagram illustrates one example of the structure of a 1485 media provider's Encoding Groups and their contents. 1487 ,-------------------------------------------------. 1488 | Media Provider | 1489 | | 1490 | ,--------------------------------------. | 1491 | | ,--------------------------------------. | 1492 | | | ,--------------------------------------. | 1493 | | | | Encoding Group | | 1494 | | | | ,-----------. | | 1495 | | | | | | ,---------.
| | 1496 | | | | | | | | ,---------.| | 1497 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1498 | `.| | | | | | `---------'| | 1499 | `.| `-----------' `---------' | | 1500 | `--------------------------------------' | 1501 `-------------------------------------------------' 1503 Figure 3: Encoding Group Structure 1505 A Provider advertises one or more Encoding Groups. Each Encoding 1506 Group includes one or more Individual Encodings. Each Individual 1507 Encoding can represent a different way of encoding media. For 1508 example one Individual Encoding may be 1080p60 video, another could 1509 be 720p30, with a third being CIF, all in, for example, H.264 1510 format. 1511 While a typical three codec/display system might have one Encoding 1512 Group per "codec box" (physical codec, connected to one camera and 1513 one screen), there are many possibilities for the number of 1514 Encoding Groups a Provider may be able to offer and for the 1515 encoding values in each Encoding Group. 1517 There is no requirement for all Encodings within an Encoding Group 1518 to be instantiated at the same time. 1520 9.3. Associating Captures with Encoding Groups 1522 Each Media Capture MAY be associated with at least one Encoding 1523 Group, which is used to instantiate that Capture into one or more 1524 Capture Encodings. Typically MCCs are assigned an Encoding Group 1525 and thus become a Capture Encoding. The Captures (including other 1526 MCCs) referenced by the MCC do not need to be assigned to an 1527 Encoding Group. This means that all the Media Captures referenced 1528 by the MCC will appear in the Capture Encoding according to any MCC 1529 attributes. This allows an Advertiser to specify Capture attributes 1530 associated with the Media Captures without the need to provide an 1531 individual Capture Encoding for each of the inputs. 
1533 If an Encoding Group is assigned to a Media Capture referenced by 1534 the MCC it indicates that this Capture may also have an individual 1535 Capture Encoding. 1537 For example: 1539 +--------------------+------------------------------------+ 1540 | Capture Scene #1 | | 1541 +--------------------+------------------------------------+ 1542 | VC1 | EncodeGroupID=1 | 1543 | VC2 | | 1544 | MCC1(VC1,VC2) | EncodeGroupID=2 | 1545 | CSE(VC1) | | 1546 | CSE(MCC1) | | 1547 +--------------------+------------------------------------+ 1549 Table 6: Example usage of Encoding with MCC and source Captures 1551 This would indicate that VC1 may be sent as its own Capture 1552 Encoding from EncodeGroupID=1 or that it may be sent as part of a 1553 Capture Encoding from EncodeGroupID=2 along with VC2. 1555 More than one Capture MAY use the same Encoding Group. 1557 The maximum number of streams that can result from a particular 1558 Encoding Group constraint is equal to the number of individual 1559 Encodings in the group. The actual number of Capture Encodings 1560 used at any time MAY be less than this maximum. Any of the 1561 Captures that use a particular Encoding Group can be encoded 1562 according to any of the Individual Encodings in the group. If 1563 there are multiple Individual Encodings in the group, then the 1564 Consumer can configure the Provider, via a Configure message, to 1565 encode a single Media Capture into multiple different Capture 1566 Encodings at the same time, subject to the Max Capture Encodings 1567 constraint, with each capture encoding following the constraints of 1568 a different Individual Encoding. 1570 It is a protocol conformance requirement that the Encoding Groups 1571 MUST allow all the Captures in a particular Capture Scene Entry to 1572 be used simultaneously. 1574 10. 
Consumer's Choice of Streams to Receive from the Provider 1576 After receiving the Provider's Advertisement message (that includes 1577 media captures and associated constraints), the Consumer composes 1578 its reply to the Provider in the form of a Configure message. The 1579 Consumer is free to use the information in the Advertisement as it 1580 chooses, but there are a few obviously sensible design choices, 1581 which are outlined below. 1583 If multiple Providers connect to the same Consumer (i.e. in an 1584 MCU-less multiparty call), it is the responsibility of the Consumer 1585 to compose Configures for each Provider that fulfill both each 1586 Provider's constraints as expressed in the Advertisement and 1587 the Consumer's own capabilities. 1589 In an MCU-based multiparty call, the MCU can logically terminate 1590 the Advertisement/Configure negotiation in that it can hide the 1591 characteristics of the receiving endpoint and rely on its own 1592 capabilities (transcoding/transrating/...) to create Media Streams 1593 that can be decoded at the Endpoint Consumers. The timing of an 1594 MCU's sending of Advertisements (for its outgoing ports) and 1595 Configures (for its incoming ports, in response to Advertisements 1596 received there) is up to the MCU and implementation dependent. 1598 As a general outline, a Consumer can choose, based on the 1599 Advertisement it has received, which Captures it wishes to receive, 1600 and which Individual Encodings it wants the Provider to use to 1601 encode the Captures. 1603 On receipt of an Advertisement with an MCC the Consumer treats the 1604 MCC as per other non-MCC Captures with the following differences: 1606 - The Consumer would understand that the MCC is a Capture that 1607 includes the referenced individual Captures and that these 1608 individual Captures are delivered as part of the MCC's Capture 1609 Encoding.
1611 - The Consumer may utilise any of the attributes associated with 1612 the referenced individual Captures and any Capture Scene attributes 1613 from where the individual Captures were defined to choose Captures 1614 and for rendering decisions. 1616 - The Consumer may or may not choose to receive all the indicated 1617 captures. Therefore it can choose to receive a subset of the Captures 1618 indicated by the MCC. 1620 For example if the Consumer receives: 1622 MCC1(VC1,VC2,VC3){attributes} 1624 A Consumer could choose all the Captures within an MCC; however, if 1625 the Consumer determines that it doesn't want VC3 it can return 1626 MCC1(VC1,VC2). If it wants all the individual Captures then it 1627 returns only the MCC identity (i.e. MCC1). If the MCC in the 1628 advertisement does not reference any individual captures, then the 1629 Consumer cannot choose what is included in the MCC; it is up to the 1630 Provider to decide. 1632 A Configure Message includes a list of Capture Encodings. These 1633 are the Capture Encodings the Consumer wishes to receive from the 1634 Provider. Each Capture Encoding refers to one Media Capture, one 1635 Individual Encoding, and includes the encoding parameter values. A 1636 Configure Message does not include references to Capture Scenes or 1637 Capture Scene Entries. 1639 For each Capture the Consumer wants to receive, it configures one 1640 or more of the encodings in that capture's encoding group. The 1641 Consumer does this by telling the Provider, in its Configure 1642 Message, parameters such as the resolution, frame rate, bandwidth, 1643 etc. for each of the Capture Encodings for its chosen Captures. Upon 1644 receipt of this Configure from the Consumer, common knowledge is 1645 established between Provider and Consumer regarding sensible 1646 choices for the media streams and their parameters. The setup of 1647 the actual media channels, at least in the simplest case, is left 1648 to a following offer-answer exchange.
Optimized implementations 1649 MAY speed up the reaction to the offer-answer exchange by reserving 1650 the resources at the time of finalization of the CLUE handshake. 1652 CLUE advertisements and configure messages don't necessarily 1653 require a new SDP offer-answer for every CLUE message 1654 exchange. But the resulting encodings sent via RTP must conform to 1655 the most recent SDP offer-answer result. 1657 In order to meaningfully create and send an initial Configure, the 1658 Consumer needs to have received at least one Advertisement from the 1659 Provider. 1661 In addition, the Consumer can send a Configure at any time during 1662 the call. The Configure MUST be valid according to the most 1663 recently received Advertisement. The Consumer can send a Configure 1664 either in response to a new Advertisement from the Provider or on 1665 its own, for example because of a local change in conditions 1666 (people leaving the room, connectivity changes, multipoint related 1667 considerations). 1669 When choosing which Media Streams to receive from the Provider, and 1670 the encoding characteristics of those Media Streams, the Consumer 1671 advantageously takes several things into account: its local 1672 preference, simultaneity restrictions, and encoding limits. 1674 10.1. 
Local preference 1676 A variety of local factors influence the Consumer's choice of 1677 Media Streams to be received from the Provider: 1679 o if the Consumer is an Endpoint, it is likely that it would 1680 choose, where possible, to receive video and audio Captures that 1681 match the number of display devices and audio system it has 1683 o if the Consumer is a middle box such as an MCU, it MAY choose to 1684 receive loudest speaker streams (in order to perform its own 1685 media composition) and avoid pre-composed video Captures 1687 o user choice (for instance, selection of a new layout) MAY result 1688 in a different set of Captures, or different encoding 1689 characteristics, being required by the Consumer 1691 10.2. Physical simultaneity restrictions 1693 Often there are physical simultaneity constraints of the Provider 1694 that affect the Provider's ability to simultaneously send all of 1695 the captures the Consumer would wish to receive. For instance, a 1696 middle box such as an MCU, when connected to a multi-camera room 1697 system, might prefer to receive both individual video streams of 1698 the people present in the room and an overall view of the room 1699 from a single camera. Some Endpoint systems might be able to 1700 provide both of these sets of streams simultaneously, whereas 1701 others might not (if the overall room view were produced by 1702 changing the optical zoom level on the center camera, for 1703 instance). 1705 10.3. Encoding and encoding group limits 1707 Each of the Provider's encoding groups has limits on bandwidth and 1708 computational complexity, and the constituent potential encodings 1709 have limits on the bandwidth, computational complexity, video 1710 frame rate, and resolution that can be provided. 
When choosing 1711 the Captures to be received from a Provider, a Consumer device 1712 MUST ensure that the encoding characteristics requested for each 1713 individual Capture fit within the capability of the encoding it 1714 is being configured to use, as well as ensuring that the combined 1715 encoding characteristics for Captures fit within the capabilities 1716 of their associated encoding groups. In some cases, this could 1717 cause an otherwise "preferred" choice of capture encodings to be 1718 passed over in favor of different Capture Encodings--for instance, 1719 if a set of three Captures could only be provided at a low 1720 resolution then a three screen device could switch to favoring a 1721 single, higher quality, Capture Encoding. 1723 11. Extensibility 1725 One important characteristic of the Framework is its 1726 extensibility. Telepresence is a relatively new industry and 1727 while we can foresee certain directions, we also do not know 1728 everything about how it will develop. The standard for 1729 interoperability and handling multiple streams must be future- 1730 proof. The framework itself is inherently extensible through 1731 expanding the data model types. For example: 1733 o Adding more types of media, such as telemetry, can be done by 1734 defining additional types of Captures in addition to audio and 1735 video. 1737 o Adding new functionalities, such as 3-D, may require 1738 additional attributes describing the Captures. 1740 o Adding new codecs, such as H.265, can be accomplished by 1741 defining new encoding variables. 1743 The infrastructure is designed to be extended rather than 1744 requiring new infrastructure elements. Extension comes through 1745 adding to defined types. 1747 12. Examples - Using the Framework (Informative) 1749 This section gives some examples, first from the point of view of 1750 the Provider, then the Consumer. 1752 12.1.
Provider Behavior 1754 This section shows some examples in more detail of how a Provider 1755 can use the framework to represent a typical case for telepresence 1756 rooms. First an endpoint is illustrated, then an MCU case is 1757 shown. 1759 12.1.1. Three screen Endpoint Provider 1761 Consider an Endpoint with the following description: 1763 3 cameras, 3 displays, a 6 person table 1765 o Each camera can provide one Capture for each 1/3 section of the 1766 table 1768 o A single Capture representing the active speaker can be provided 1769 (voice activity based camera selection to a given encoder input 1770 port implemented locally in the Endpoint) 1772 o A single Capture representing the active speaker with the other 1773 2 Captures shown picture in picture within the stream can be 1774 provided (again, implemented inside the endpoint) 1776 o A Capture showing a zoomed out view of all 6 seats in the room 1777 can be provided 1779 The audio and video Captures for this Endpoint can be described as 1780 follows. 1782 Video Captures: 1784 o VC0- (the camera-left camera stream), encoding group=EG0, 1785 switched=false, view=table 1787 o VC1- (the center camera stream), encoding group=EG1, 1788 switched=false, view=table 1790 o VC2- (the camera-right camera stream), encoding group=EG2, 1791 switched=false, view=table 1793 o VC3- (the loudest panel stream), encoding group=EG1, 1794 switched=true, view=table 1796 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1797 composed=true, switched=true, view=room 1799 o VC5- (the zoomed out view of all people in the room), encoding 1800 group=EG1, composed=false, switched=false, view=room 1802 o VC6- (presentation stream), encoding group=EG1, presentation, 1803 switched=false 1805 The following diagram is a top view of the room with 3 cameras, 3 1806 displays, and 6 seats. Each camera is capturing 2 people. The 1807 six seats are not all in a straight line. 1809 ,-. 
d 1810 ( )`--.__ +---+ 1811 `-' / `--.__ | | 1812 ,-. | `-.._ |_-+Camera 2 (VC2) 1813 ( ).' ___..-+-''`+-+ 1814 `-' |_...---'' | | 1815 ,-.c+-..__ +---+ 1816 ( )| ``--..__ | | 1817 `-' | ``+-..|_-+Camera 1 (VC1) 1818 ,-. | __..--'|+-+ 1819 ( )| __..--' | | 1820 `-'b|..--' +---+ 1821 ,-. |``---..___ | | 1822 ( )\ ```--..._|_-+Camera 0 (VC0) 1823 `-' \ _..-''`-+ 1824 ,-. \ __.--'' | | 1825 ( ) |..-'' +---+ 1826 `-' a 1827 Figure 4: Room Layout 1829 The two points labeled b and c are intended to be at the midpoint 1830 between the seating positions, and where the fields of view of the 1831 cameras intersect. 1833 The plane of interest for VC0 is a vertical plane that intersects 1834 points 'a' and 'b'. 1836 The plane of interest for VC1 intersects points 'b' and 'c'. The 1837 plane of interest for VC2 intersects points 'c' and 'd'. 1839 This example uses an area scale of millimeters. 1841 Areas of capture: 1843 bottom left bottom right top left top right 1844 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1845 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1846 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1847 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1848 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1849 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1850 VC6 none 1852 Points of capture: 1853 VC0 (-1678,0,800) 1854 VC1 (0,0,800) 1855 VC2 (1678,0,800) 1856 VC3 none 1857 VC4 none 1858 VC5 (0,0,800) 1859 VC6 none 1861 In this example, the right edge of the VC0 area lines up with the 1862 left edge of the VC1 area. It doesn't have to be this way. There 1863 could be a gap or an overlap. One additional thing to note for 1864 this example is the distance from a to b is equal to the distance 1865 from b to c and the distance from c to d. All these distances are 1866 1346 mm. This is the planar width of each area of capture for VC0, 1867 VC1, and VC2. 
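The 1346 mm figure can be reproduced from the area-of-capture coordinates. The sketch below assumes that points a, b, c and d correspond to the bottom corners of the VC0, VC1 and VC2 areas (x and y only, since the bottom corners share z = 0). The names `POINTS` and `planar_width` are illustrative.

```python
import math

# Assumed mapping of the labeled points to the bottom corners listed in
# the areas-of-capture table (x, y in millimeters; z is 0 for all four).
POINTS = {
    "a": (-2011, 2850),
    "b": (-673, 3000),
    "c": (673, 3000),
    "d": (2011, 2850),
}


def planar_width(p, q):
    """Euclidean distance between two points in the horizontal plane."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


for left, right in (("a", "b"), ("b", "c"), ("c", "d")):
    width = planar_width(POINTS[left], POINTS[right])
    print(f"{left}-{right}: {width:.1f} mm")
```

Under this assumption the b-c segment is exactly 1346 mm, while the two angled outer segments come out a fraction of a millimeter longer and round to the same value.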
1869 Note the text in parentheses (e.g. "the camera-left camera 1870 stream") is not explicitly part of the model, it is just 1871 explanatory text for this example, and is not included in the 1872 model with the media captures and attributes. Also, the 1873 "composed" boolean attribute doesn't say anything about how a 1874 capture is composed, so the media consumer can't tell based on 1875 this attribute that VC4 is composed of a "loudest panel with 1876 PiPs". 1878 Audio Captures: 1880 o AC0 (camera-left), encoding group=EG3, content=main, channel 1881 format=mono 1883 o AC1 (camera-right), encoding group=EG3, content=main, channel 1884 format=mono 1886 o AC2 (center) encoding group=EG3, content=main, channel 1887 format=mono 1889 o AC3 being a simple pre-mixed audio stream from the room (mono), 1890 encoding group=EG3, content=main, channel format=mono 1892 o AC4 audio stream associated with the presentation video (mono) 1893 encoding group=EG3, content=slides, channel format=mono 1895 Areas of capture: 1897 bottom left bottom right top left top right 1899 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1900 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1901 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1902 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1903 AC4 none 1905 The physical simultaneity information is: 1907 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1909 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1911 This constraint indicates it is not possible to use all the VCs at 1912 the same time. VC5 cannot be used at the same time as VC1 or VC3 1913 or VC4. Also, using every member in the set simultaneously may 1914 not make sense - for example VC3(loudest) and VC4 (loudest with 1915 PIP). (In addition, there are encoding constraints that make 1916 choosing all of the VCs in a set impossible. 
VC1, VC3, VC4, VC5, 1917 VC6 all use EG1 and EG1 has only 3 ENCs. This constraint shows up 1918 in the encoding groups, not in the simultaneous transmission 1919 sets.) 1921 In this example there are no restrictions on which audio captures 1922 can be sent simultaneously. 1924 Encoding Groups: 1926 This example has three encoding groups associated with the video 1927 captures. Each group can have 3 encodings, but with each 1928 potential encoding having a progressively lower specification. In 1929 this example, 1080p60 transmission is possible (as ENC0 has a 1930 maxPps value compatible with that). Significantly, as up to 3 1931 encodings are available per group, it is possible to transmit some 1932 video captures simultaneously that are not in the same entry in 1933 the capture scene. For example VC1 and VC3 at the same time. 1935 It is also possible to transmit multiple capture encodings of a 1936 single video capture. For example VC0 can be encoded using ENC0 1937 and ENC1 at the same time, as long as the encoding parameters 1938 satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1939 4000000 bps and one at 2000000 bps. 
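The VC0 case above can be checked mechanically against the ENC0, ENC1 and EG0 values given in Figure 5. The following is only a sketch: the class and function names are hypothetical, and it assumes that maxPps bounds the product width x height x frame rate, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    max_width: int
    max_height: int
    max_frame_rate: int
    max_pps: int        # assumed: pixels per second
    max_bandwidth: int  # bits per second


@dataclass(frozen=True)
class Request:
    encode_id: str
    width: int
    height: int
    frame_rate: int
    bandwidth: int


# ENC0 and ENC1 of EG0, copied from the example encoding groups.
EG0 = {
    "ENC0": Encoding(1920, 1088, 60, 124_416_000, 4_000_000),
    "ENC1": Encoding(1280, 720, 30, 27_648_000, 4_000_000),
}
EG0_MAX_GROUP_BANDWIDTH = 6_000_000


def fits(req, enc):
    """True if one requested Capture Encoding stays within its encoding."""
    return (req.width <= enc.max_width
            and req.height <= enc.max_height
            and req.frame_rate <= enc.max_frame_rate
            and req.width * req.height * req.frame_rate <= enc.max_pps
            and req.bandwidth <= enc.max_bandwidth)


def group_ok(requests, encodings, max_group_bandwidth):
    """True if all requests fit their encodings and the group bandwidth."""
    ids = [r.encode_id for r in requests]
    if len(ids) != len(set(ids)):  # each encoding is usable only once
        return False
    if any(r.encode_id not in encodings for r in requests):
        return False
    if not all(fits(r, encodings[r.encode_id]) for r in requests):
        return False
    return sum(r.bandwidth for r in requests) <= max_group_bandwidth
```

Under the pixel-rate assumption, ENC0's maxPps equals 1920 x 1080 x 60, so a request for the full 1920 x 1088 frame at 60 frames per second would exceed it even though each individual maximum is respected.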
1941 encodeGroupID=EG0, maxGroupBandwidth=6000000 1942 encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1943 maxPps=124416000, maxBandwidth=4000000 1944 encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1945 maxPps=27648000, maxBandwidth=4000000 1946 encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30, 1947 maxPps=15552000, maxBandwidth=4000000 1948 encodeGroupID=EG1 maxGroupBandwidth=6000000 1949 encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1950 maxPps=124416000, maxBandwidth=4000000 1951 encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1952 maxPps=27648000, maxBandwidth=4000000 1953 encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30, 1954 maxPps=15552000, maxBandwidth=4000000 1955 encodeGroupID=EG2 maxGroupBandwidth=6000000 1956 encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60, 1957 maxPps=124416000, maxBandwidth=4000000 1958 encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30, 1959 maxPps=27648000, maxBandwidth=4000000 1960 encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30, 1961 maxPps=15552000, maxBandwidth=4000000 1963 Figure 5: Example Encoding Groups for Video 1965 For audio, there are five potential encodings available, so all 1966 five audio captures can be encoded at the same time. 1968 encodeGroupID=EG3, maxGroupBandwidth=320000 1969 encodeID=ENC9, maxBandwidth=64000 1970 encodeID=ENC10, maxBandwidth=64000 1971 encodeID=ENC11, maxBandwidth=64000 1972 encodeID=ENC12, maxBandwidth=64000 1973 encodeID=ENC13, maxBandwidth=64000 1975 Figure 6: Example Encoding Group for Audio 1977 Capture Scenes: 1979 The following table represents the capture scenes for this 1980 provider. Recall that a capture scene is composed of alternative 1981 capture scene entries covering the same spatial region. Capture 1982 Scene #1 is for the main people captures, and Capture Scene #2 is 1983 for presentation. 
1985 Each row in the table is a separate Capture Scene Entry. 1987 +------------------+ 1988 | Capture Scene #1 | 1989 +------------------+ 1990 | VC0, VC1, VC2 | 1991 | VC3 | 1992 | VC4 | 1993 | VC5 | 1994 | AC0, AC1, AC2 | 1995 | AC3 | 1996 +------------------+ 1998 +------------------+ 1999 | Capture Scene #2 | 2000 +------------------+ 2001 | VC6 | 2002 | AC4 | 2003 +------------------+ 2005 Table 7: Example Capture Scene Entries 2007 Different capture scenes are distinct from each other and do not 2008 overlap. A consumer can choose an entry from each capture 2009 scene. In this case the three captures VC0, VC1, and VC2 are one 2010 way of representing the video from the endpoint. These three 2011 captures should appear adjacent to each other. 2012 Alternatively, another way of representing the Capture Scene is 2013 with the capture VC3, which automatically shows the person who is 2014 talking. Similarly for the VC4 and VC5 alternatives. 2016 As in the video case, the different entries of audio in Capture 2017 Scene #1 represent the "same thing", in that one way to receive 2018 the audio is with the 3 audio captures (AC0, AC1, AC2), and 2019 another way is with the mixed AC3. The Media Consumer can choose 2020 an audio capture entry it is capable of receiving. 2022 The spatial ordering is conveyed by the media capture attributes 2023 Area of Capture and Point of Capture. 2025 A Media Consumer would likely want to choose a capture scene entry 2026 to receive based in part on how many streams it can simultaneously 2027 receive. A consumer that can receive three people streams would 2028 probably prefer to receive the first entry of Capture Scene #1 2029 (VC0, VC1, VC2) and not receive the other entries. A consumer 2030 that can receive only one people stream would probably choose one 2031 of the other entries. 2033 If the consumer can receive a presentation stream too, it would 2034 also choose to receive the video entry from Capture Scene #2 (VC6). 2036 12.1.2.
Encoding Group Example 2038 This is an example of an encoding group to illustrate how it can 2039 express dependencies between encodings. 2041 encodeGroupID=EG0 maxGroupBandwidth=6000000 2042 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 2043 maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 2044 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 2045 maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 2046 encodeID=AUDENC0, maxBandwidth=96000 2047 encodeID=AUDENC1, maxBandwidth=96000 2048 encodeID=AUDENC2, maxBandwidth=96000 2050 Here, the encoding group is EG0. Although the encoding group is 2051 capable of transmitting up to 6Mbit/s, no individual video 2052 encoding can exceed 4Mbit/s. 2054 This encoding group also allows up to 3 audio encodings, AUDENC<0- 2055 2>. It is not required that audio and video encodings reside 2056 within the same encoding group, but if they do, the group's 2057 maxGroupBandwidth value is a limit on the sum of all audio and video 2058 encodings configured by the consumer. A system that does not wish 2059 or need to combine bandwidth limitations in this way should 2060 instead use separate encoding groups for audio and video so that 2061 the bandwidth limitations on audio and video do not interact. 2063 Audio and video can be expressed in separate encoding groups, as 2064 in this illustration. 2066 encodeGroupID=EG0 maxGroupBandwidth=6000000 2067 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 2068 maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 2069 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 2070 maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000 2071 encodeGroupID=EG1 maxGroupBandwidth=500000 2072 encodeID=AUDENC0, maxBandwidth=96000 2073 encodeID=AUDENC1, maxBandwidth=96000 2074 encodeID=AUDENC2, maxBandwidth=96000 2076 12.1.3.
The MCU Case 2078 This section shows how an MCU might express its Capture Scenes, 2079 intending to offer different choices for consumers that can handle 2080 different numbers of streams. A single audio capture stream is 2081 provided for all single and multi-screen configurations that can 2082 be associated (e.g. lip-synced) with any combination of video 2083 captures at the consumer. 2085 +-----------------------+---------------------------------+ 2086 | Capture Scene #1 | | 2087 +-----------------------|---------------------------------+ 2088 | VC0 | VC for a single screen consumer | 2089 | VC1, VC2 | VCs for a two screen consumer | 2090 | VC3, VC4, VC5 | VCs for a three screen consumer | 2091 | VC6, VC7, VC8, VC9 | VCs for a four screen consumer | 2092 | AC0 | AC representing all participants| 2093 | CSE(VC0) | | 2094 | CSE(VC1,VC2) | | 2095 | CSE(VC3,VC4,VC5) | | 2096 | CSE(VC6,VC7,VC8,VC9) | | 2097 | CSE(AC0) | | 2098 +-----------------------+---------------------------------+ 2100 Table 8: MCU main Capture Scenes 2102 If / when a presentation stream becomes active within the 2103 conference the MCU might re-advertise the available media as: 2105 +------------------+--------------------------------------+ 2106 | Capture Scene #2 | note | 2107 +------------------+--------------------------------------+ 2108 | VC10 | video capture for presentation | 2109 | AC1 | presentation audio to accompany VC10 | 2110 | CSE(VC10) | | 2111 | CSE(AC1) | | 2112 +------------------+--------------------------------------+ 2114 Table 9: MCU presentation Capture Scene 2116 12.2. Media Consumer Behavior 2118 This section gives an example of how a Media Consumer might behave 2119 when deciding how to request streams from the three screen 2120 endpoint described in the previous section. 
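As a rough illustration of the kind of choice a Media Consumer faces, the sketch below picks, from the video entries of Capture Scene #1 in Table 7, the entry that uses the most screens without exceeding the number available. The function name, the list-of-lists data layout, and the tie-breaking rule (advertisement order) are assumptions made for this example only; they are not defined by CLUE, and a real consumer might equally rely on user choice or other preferences.

```python
# Hypothetical sketch: names and data layout are illustrative, not CLUE-defined.

def choose_video_entry(entries, screens):
    """Pick the entry that fills the most screens without exceeding them.

    entries: list of lists of capture IDs, one inner list per
             Capture Scene Entry, in advertisement order.
    screens: number of video streams the consumer can display.
    Returns the chosen entry, or None if no entry fits.
    """
    fitting = [entry for entry in entries if len(entry) <= screens]
    if not fitting:
        return None
    # max() returns the first maximal element, so ties are broken by
    # advertisement order (an assumption made for this sketch).
    return max(fitting, key=len)


# Video entries of Capture Scene #1 from Table 7.
scene1_video = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]

three_screen = choose_video_entry(scene1_video, 3)
one_screen = choose_video_entry(scene1_video, 1)
```

With the Table 7 entries, a consumer with three screens ends up with the (VC0, VC1, VC2) entry, while a consumer with one screen ends up with one of the single-capture entries.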
2122 The receive side of a call needs to balance its requirements, 2123 based on number of screens and speakers, its decoding capabilities 2124 and available bandwidth, and the provider's capabilities in order 2125 to optimally configure the provider's streams. Typically it would 2126 want to receive and decode media from each Capture Scene 2127 advertised by the Provider. 2129 A simple, reasonable algorithm might be for the consumer to go through 2130 each Capture Scene in turn and find the collection of Video 2131 Captures that best matches the number of screens it has (this 2132 might include consideration of screens dedicated to presentation 2133 video display rather than "people" video) and then decide between 2134 alternative entries in the video Capture Scenes based either on 2135 hard-coded preferences or user choice. Once this choice has been 2136 made, the consumer would then decide how to configure the 2137 provider's encoding groups in order to make best use of the 2138 available network bandwidth and its own decoding capabilities. 2140 12.2.1. One screen Media Consumer 2142 VC3, VC4 and VC5 are each a separate entry, not 2143 grouped together in a single entry, so the receiving device should 2144 choose one of them. The choice would come down to 2145 whether to see the greatest number of participants simultaneously 2146 at roughly equal precedence (VC5), a switched view of just the 2147 loudest region (VC3) or a switched view with PiPs (VC4). An 2148 endpoint device with a small amount of knowledge of these 2149 differences could offer a dynamic choice of these options, 2150 in-call, to the user. 2152 12.2.2. Two screen Media Consumer configuring the example 2154 Mixing systems with an even number of screens, "2n", and those 2155 with "2n+1" cameras (and vice versa) is always likely to be the 2156 problematic case.
In this instance, the behavior is likely to be 2157 determined by whether a "2 screen" system is really a "2 decoder" 2158 system, i.e., whether only one received stream can be displayed 2159 per screen or whether more than 2 streams can be received and 2160 spread across the available screen area. To enumerate 3 possible 2161 behaviors here for the 2 screen system when it learns that the far 2162 end is "ideally" expressed via 3 capture streams: 2164 1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as 2165 per the 1 screen consumer case above) and either leave one 2166 screen blank or use it for presentation if / when a 2167 presentation becomes active. 2169 2. Receive 3 streams (VC0, VC1 and VC2) and display across 2 2170 screens (either with each capture being scaled to 2/3 of a 2171 screen and the center capture being split across 2 screens) or, 2172 as would be necessary if there were large bezels on the 2173 screens, with each stream being scaled to 1/2 the screen width 2174 and height and there being a 4th "blank" panel. This 4th panel 2175 could potentially be used for any presentation that became 2176 active during the call. 2178 3. Receive 3 streams, decode all 3, and use control information 2179 indicating which was the most active to switch between showing 2180 the left and center streams (one per screen) and the center and 2181 right streams. 2183 For an endpoint capable of all 3 methods of working described 2184 above, again it might be appropriate to offer the user the choice 2185 of display mode. 2187 12.2.3. Three screen Media Consumer configuring the example 2189 This is the most straightforward case - the Media Consumer would 2190 look to identify a set of streams to receive that best matched its 2191 available screens and so the VC0 plus VC1 plus VC2 should match 2192 optimally. 
The spatial ordering would give sufficient information 2193 for the correct video capture to be shown on the correct screen, 2194 and the consumer would need either to divide a single encoding 2195 group's capability by 3 to determine what resolution and frame 2196 rate to configure the provider with, or to configure the individual 2197 video captures' encoding groups with what makes most sense (taking 2198 into account the receive side decode capabilities, overall call 2199 bandwidth, the resolution of the screens plus any user preferences 2200 such as motion vs sharpness). 2202 12.3. Multipoint Conference utilizing Multiple Content Captures 2204 The use of MCCs allows the MCU to construct outgoing Advertisements 2205 describing complex media switching and composition scenarios. 2206 The following sections provide several examples. 2208 Note: In the examples the identities of the CLUE elements (e.g. 2209 Captures, Capture Scene) in the incoming Advertisements overlap. 2210 This is because there is no co-ordination between the endpoints. 2211 The MCU is responsible for making these unique in the outgoing 2212 advertisement. 2214 12.3.1. Single Media Captures and MCC in the same Advertisement 2216 Four endpoints are involved in a Conference where CLUE is used. An 2217 MCU acts as a middlebox between the endpoints with a CLUE channel 2218 between each endpoint and the MCU. The MCU receives the following 2219 Advertisements.
2221 +-----------------------+---------------------------------+ 2222 | Capture Scene #1 | Description=AustralianConfRoom | 2223 +-----------------------|---------------------------------+ 2224 | VC1 | Description=Audience | 2225 | | EncodeGroupID=1 | 2226 | CSE(VC1) | | 2227 +---------------------------------------------------------+ 2228 Table 10: Advertisement received from Endpoint A 2230 +-----------------------+---------------------------------+ 2231 | Capture Scene #1 | Description=ChinaConfRoom | 2232 +-----------------------|---------------------------------+ 2233 | VC1 | Description=Speaker | 2234 | | EncodeGroupID=1 | 2235 | VC2 | Description=Audience | 2236 | | EncodeGroupID=1 | 2237 | CSE(VC1, VC2) | | 2238 +---------------------------------------------------------+ 2240 Table 11: Advertisement received from Endpoint B 2242 +-----------------------+---------------------------------+ 2243 | Capture Scene #1 | Description=USAConfRoom | 2244 +-----------------------|---------------------------------+ 2245 | VC1 | Description=Audience | 2246 | | EncodeGroupID=1 | 2247 | CSE(VC1) | | 2248 +---------------------------------------------------------+ 2250 Table 12: Advertisement received from Endpoint C 2252 Note: Endpoint B above indicates that it sends two streams. 
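Because Endpoints A, B and C each number their captures independently, the MCU has to assign fresh identities before it can refer to them in one outgoing Advertisement. The sketch below shows one possible way to do this, by numbering captures sequentially in the order the Advertisements are processed. The function name and the tuple/dictionary layout are hypothetical and chosen only for illustration; CLUE does not mandate any particular renumbering scheme.

```python
# Hypothetical sketch: one possible renumbering scheme, not a CLUE requirement.

def renumber_captures(advertisements):
    """Assign conference-wide unique video capture IDs.

    advertisements: list of (scene description, [local capture IDs]),
                    in the order the MCU processes them.
    Returns a list of (scene description, {local ID: outgoing ID}).
    """
    next_id = 1
    result = []
    for description, local_ids in advertisements:
        mapping = {}
        for local_id in local_ids:
            mapping[local_id] = f"VC{next_id}"
            next_id += 1
        result.append((description, mapping))
    return result


# Captures received from Endpoints A, B and C (Tables 10 to 12).
received = [
    ("AustralianConfRoom", ["VC1"]),
    ("ChinaConfRoom", ["VC1", "VC2"]),
    ("USAConfRoom", ["VC1"]),
]

outgoing = renumber_captures(received)

# Source list for an MCC that switches between every renumbered capture.
mcc1_sources = [new_id for _, mapping in outgoing for new_id in mapping.values()]
```

Under this scheme the four incoming captures become VC1 to VC4 in the outgoing Advertisement, so an MCC can list them without ambiguity.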
2254 If the MCU wanted to provide a Multiple Content Capture containing 2255 a round robin switched view of the audience from the 3 endpoints 2256 and the speaker it could construct the following advertisement: 2258 Advertisement sent to Endpoint F 2260 +=======================+=================================+ 2261 | Capture Scene #1 | Description=AustralianConfRoom | 2262 +-----------------------|---------------------------------+ 2263 | VC1 | Description=Audience | 2264 | CSE(VC1) | | 2265 +=======================+=================================+ 2266 | Capture Scene #2 | Description=ChinaConfRoom | 2267 +-----------------------|---------------------------------+ 2268 | VC2 | Description=Speaker | 2269 | VC3 | Description=Audience | 2270 | CSE(VC2, VC3) | | 2271 +=======================+=================================+ 2272 | Capture Scene #3 | Description=USAConfRoom | 2273 +-----------------------|---------------------------------+ 2274 | VC4 | Description=Audience | 2275 | CSE(VC4) | | 2276 +=======================+=================================+ 2277 | Capture Scene #4 | | 2278 +-----------------------|---------------------------------+ 2279 | MCC1(VC1,VC2,VC3,VC4) | Policy=RoundRobin:1 | 2280 | | MaxCaptures=1 | 2281 | | EncodingGroup=1 | 2282 | CSE(MCC1) | | 2283 +=======================+=================================+ 2285 Table 13: Advertisement sent to Endpoint F - One Encoding 2287 Alternatively if the MCU wanted to provide the speaker as one media 2288 stream and the audiences as another it could assign an encoding 2289 group to VC2 in Capture Scene 2 and provide a CSE in Capture Scene 2290 #4 as per the example below. 
2292 Advertisement sent to Endpoint F 2294 +=======================+=================================+ 2295 | Capture Scene #1 | Description=AustralianConfRoom | 2296 +-----------------------|---------------------------------+ 2297 | VC1 | Description=Audience | 2298 | CSE(VC1) | | 2299 +=======================+=================================+ 2300 | Capture Scene #2 | Description=ChinaConfRoom | 2301 +-----------------------|---------------------------------+ 2302 | VC2 | Description=Speaker | 2303 | | EncodingGroup=1 | 2304 | VC3 | Description=Audience | 2305 | CSE(VC2, VC3) | | 2306 +=======================+=================================+ 2307 | Capture Scene #3 | Description=USAConfRoom | 2308 +-----------------------|---------------------------------+ 2309 | VC4 | Description=Audience | 2310 | CSE(VC4) | | 2311 +=======================+=================================+ 2312 | Capture Scene #4 | | 2313 +-----------------------|---------------------------------+ 2314 | MCC1(VC1,VC3,VC4) | Policy=RoundRobin:1 | 2315 | | MaxCaptures=1 | 2316 | | EncodingGroup=1 | 2317 | MCC2(VC2) | MaxCaptures=1 | 2318 | | EncodingGroup=1 | 2319 | CSE2(MCC1,MCC2) | | 2320 +=======================+=================================+ 2322 Table 14: Advertisement sent to Endpoint F - Two Encodings 2324 Therefore a Consumer could choose whether or not to have a separate 2325 speaker related stream and could choose which endpoints to see. If 2326 it wanted the second stream but not the Australian conference room 2327 it could indicate the following captures in the Configure message: 2329 +-----------------------+---------------------------------+ 2330 | MCC1(VC3,VC4) | Encoding | 2331 | VC2 | Encoding | 2332 +-----------------------|---------------------------------+ 2333 Table 15: MCU case: Consumer Response 2335 12.3.2. Several MCCs in the same Advertisement 2337 Multiple MCCs can be used where multiple streams are used to carry 2338 media from multiple endpoints. 
For example: 2340 A conference has three endpoints D, E and F. Each endpoint has 2341 three video captures covering the left, middle and right regions of 2342 its conference room. The MCU receives the following 2343 advertisements from D and E. 2345 +-----------------------+---------------------------------+ 2346 | Capture Scene #1 | Description=AustralianConfRoom | 2347 +-----------------------|---------------------------------+ 2348 | VC1 | CaptureArea=Left | 2349 | | EncodingGroup=1 | 2350 | VC2 | CaptureArea=Centre | 2351 | | EncodingGroup=1 | 2352 | VC3 | CaptureArea=Right | 2353 | | EncodingGroup=1 | 2354 | CSE(VC1,VC2,VC3) | | 2355 +---------------------------------------------------------+ 2357 Table 16: Advertisement received from Endpoint D 2359 +-----------------------+---------------------------------+ 2360 | Capture Scene #1 | Description=ChinaConfRoom | 2361 +-----------------------|---------------------------------+ 2362 | VC1 | CaptureArea=Left | 2363 | | EncodingGroup=1 | 2364 | VC2 | CaptureArea=Centre | 2365 | | EncodingGroup=1 | 2366 | VC3 | CaptureArea=Right | 2367 | | EncodingGroup=1 | 2368 | CSE(VC1,VC2,VC3) | | 2369 +---------------------------------------------------------+ 2371 Table 17: Advertisement received from Endpoint E 2373 The MCU wants to offer Endpoint F three Capture Encodings. Each 2374 Capture Encoding would contain all the Captures from either 2375 Endpoint D or Endpoint E, depending on the active speaker.
2376 The MCU sends the following Advertisement: 2378 +=======================+=================================+ 2379 | Capture Scene #1 | Description=AustralianConfRoom | 2380 +-----------------------|---------------------------------+ 2381 | VC1 | | 2382 | VC2 | | 2383 | VC3 | | 2384 | CSE(VC1,VC2,VC3) | | 2385 +=======================+=================================+ 2386 | Capture Scene #2 | Description=ChinaConfRoom | 2387 +-----------------------|---------------------------------+ 2388 | VC4 | | 2389 | VC5 | | 2390 | VC6 | | 2391 | CSE(VC4,VC5,VC6) | | 2392 +=======================+=================================+ 2393 | Capture Scene #3 | | 2394 +-----------------------|---------------------------------+ 2395 | MCC1(VC1,VC4) | CaptureArea=Left | 2396 | | MaxCaptures=1 | 2397 | | SynchronisationID=1 | 2398 | | EncodingGroup=1 | 2399 | MCC2(VC2,VC5) | CaptureArea=Centre | 2400 | | MaxCaptures=1 | 2401 | | SynchronisationID=1 | 2402 | | EncodingGroup=1 | 2403 | MCC3(VC3,VC6) | CaptureArea=Right | 2404 | | MaxCaptures=1 | 2405 | | SynchronisationID=1 | 2406 | | EncodingGroup=1 | 2407 | CSE(MCC1,MCC2,MCC3) | | 2408 +=======================+=================================+ 2410 Table 18: Advertisement sent to Endpoint F 2412 12.3.3. Heterogeneous conference with switching and composition 2414 Consider a conference between endpoints with the following 2415 characteristics: 2417 Endpoint A - 4 screens, 3 cameras 2419 Endpoint B - 3 screens, 3 cameras 2421 Endpoint C - 3 screens, 3 cameras 2423 Endpoint D - 3 screens, 3 cameras 2425 Endpoint E - 1 screen, 1 camera 2427 Endpoint F - 2 screens, 1 camera 2429 Endpoint G - 1 screen, 1 camera 2431 This example focuses on what the user in one of the 3-camera multi- 2432 screen endpoints sees. Call this person User A, at Endpoint A. 2433 There are 4 large display screens at Endpoint A.
Whenever somebody 2434 at another site is speaking, all the video captures from that 2435 endpoint are shown on the large screens. If the talker is at a 3- 2436 camera site, then the video from those 3 cameras fills 3 of the 2437 screens. If the talker is at a single-camera site, then video from 2438 that camera fills one of the screens, while the other screens show 2439 video from other single-camera endpoints. 2441 User A hears audio from the 4 loudest talkers. 2443 User A can also see video from other endpoints, in addition to the 2444 current talker, although much smaller in size. Endpoint A has 4 2445 screens, so one of those screens shows up to 9 other Media Captures 2446 in a tiled fashion. When video from a 3 camera endpoint appears in 2447 the tiled area, video from all 3 cameras appears together across 2448 the screen with correct spatial relationship among those 3 images. 2450 +---+---+---+ +-------------+ +-------------+ +-------------+ 2451 | | | | | | | | | | 2452 +---+---+---+ | | | | | | 2453 | | | | | | | | | | 2454 +---+---+---+ | | | | | | 2455 | | | | | | | | | | 2456 +---+---+---+ +-------------+ +-------------+ +-------------+ 2457 Figure 7: Endpoint A - 4 Screen Display 2459 User B at Endpoint B sees a similar arrangement, except there are 2460 only 3 screens, so the 9 other Media Captures are spread out across 2461 the bottom of the 3 displays, in a picture-in-picture (PIP) format. 2462 When video from a 3 camera endpoint appears in the PIP area, video 2463 from all 3 cameras appears together across a single screen with 2464 correct spatial relationship. 
2466 +-------------+ +-------------+ +-------------+ 2467 | | | | | | 2468 | | | | | | 2469 | | | | | | 2470 | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | 2471 | +-+ +-+ +-+ | | +-+ +-+ +-+ | | +-+ +-+ +-+ | 2472 +-------------+ +-------------+ +-------------+ 2473 Figure 8: Endpoint B - 3 Screen Display with PiPs 2475 When somebody at a different endpoint becomes the current talker, 2476 then User A and User B both see the video from the new talker 2477 appear on their large screen area, while the previous talker takes 2478 one of the smaller tiled or PIP areas. The person who is the 2479 current talker doesn't see themselves; they see the previous talker 2480 in their large screen area. 2482 One of the points of this example is that endpoints A and B each 2483 want to receive 3 capture encodings for their large display areas, 2484 and 9 encodings for their smaller areas. A and B are able to 2485 each send the same Configure message to the MCU, and each receive 2486 the same conceptual Media Captures from the MCU. The differences 2487 are in how they are rendered and are purely a local matter at A and 2488 B. 2490 The Advertisements for such a scenario are described below.
2492 +-----------------------+---------------------------------+ 2493 | Capture Scene #1 | Description=Endpoint x | 2494 +-----------------------|---------------------------------+ 2495 | VC1 | EncodingGroup=1 | 2496 | VC2 | EncodingGroup=1 | 2497 | VC3 | EncodingGroup=1 | 2498 | AC1 | EncodingGroup=2 | 2499 | CSE1(VC1, VC2, VC3) | | 2500 | CSE2(AC1) | | 2501 +---------------------------------------------------------+ 2503 Table 19: Advertisement received at the MCU from Endpoints A to D 2505 +-----------------------+---------------------------------+ 2506 | Capture Scene #1 | Description=Endpoint y | 2507 +-----------------------|---------------------------------+ 2508 | VC1 | EncodingGroup=1 | 2509 | AC1 | EncodingGroup=2 | 2510 | CSE1(VC1) | | 2511 | CSE2(AC1) | | 2512 +---------------------------------------------------------+ 2514 Table 20: Advertisement received at the MCU from Endpoints E to G 2516 Rather than considering what is displayed, CLUE concentrates 2517 more on what the MCU sends. The MCU doesn't know anything about 2518 the number of screens an endpoint has. 2520 As Endpoints A to D each advertise that three Captures make up a 2521 Capture Scene, the MCU offers these in a "site" switching mode. 2522 That is, there are three Multiple Content Captures (and 2523 Capture Encodings) each switching between Endpoints. The MCU 2524 switches the applicable media into the stream based on voice 2525 activity. Endpoint A will not see a capture from itself.
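The "site" switching decision described above can be sketched as a small selection function: among all endpoints except the one receiving the stream, pick the one with the highest voice activity. The function name, the dictionary of per-endpoint sound levels, and the numeric values are assumptions for illustration only; how an MCU actually measures voice activity is outside the scope of this sketch.

```python
# Hypothetical sketch: the sound-level representation is an assumption.

def pick_site(sound_levels, receiver):
    """Return the loudest endpoint other than the receiver, or None.

    sound_levels: dict mapping endpoint name to a current voice
                  activity level (higher means louder).
    receiver:     endpoint the switched streams are being sent to;
                  it is excluded so it never sees itself.
    """
    candidates = {
        endpoint: level
        for endpoint, level in sound_levels.items()
        if endpoint != receiver
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Illustrative levels for the three-camera endpoints A to D.
levels = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.1}

site_for_a = pick_site(levels, "A")
site_for_b = pick_site(levels, "B")
```

In this example Endpoint A is the loudest, so it is switched into the streams sent to Endpoint B, while Endpoint A itself receives the next loudest site, Endpoint C.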
2527 Using the MCC concept the MCU would send the following 2528 Advertisement to endpoint A: 2530 +=======================+=================================+ 2531 | Capture Scene #1 | Description=Endpoint B | 2532 +-----------------------|---------------------------------+ 2533 | VC4 | Left | 2534 | VC5 | Center | 2535 | VC6 | Right | 2536 | AC1 | | 2537 | CSE(VC4,VC5,VC6) | | 2538 | CSE(AC1) | | 2539 +=======================+=================================+ 2540 | Capture Scene #2 | Description=Endpoint C | 2541 +-----------------------|---------------------------------+ 2542 | VC7 | Left | 2543 | VC8 | Center | 2544 | VC9 | Right | 2545 | AC2 | | 2546 | CSE(VC7,VC8,VC9) | | 2547 | CSE(AC2) | | 2548 +=======================+=================================+ 2549 | Capture Scene #3 | Description=Endpoint D | 2550 +-----------------------|---------------------------------+ 2551 | VC10 | Left | 2552 | VC11 | Center | 2553 | VC12 | Right | 2554 | AC3 | | 2555 | CSE(VC10,VC11,VC12) | | 2556 | CSE(AC3) | | 2557 +=======================+=================================+ 2558 | Capture Scene #4 | Description=Endpoint E | 2559 +-----------------------|---------------------------------+ 2560 | VC13 | | 2561 | AC4 | | 2562 | CSE(VC13) | | 2563 | CSE(AC4) | | 2564 +=======================+=================================+ 2565 | Capture Scene #5 | Description=Endpoint F | 2566 +-----------------------|---------------------------------+ 2567 | VC14 | | 2568 | AC5 | | 2569 | CSE(VC14) | | 2570 | CSE(AC5) | | 2571 +=======================+=================================+ 2572 | Capture Scene #6 | Description=Endpoint G | 2573 +-----------------------|---------------------------------+ 2574 | VC15 | | 2575 | AC6 | | 2576 | CSE(VC15) | | 2577 | CSE(AC6) | | 2578 +=======================+=================================+ 2580 Table 21: Advertisement sent to endpoint A - Source Part 2582 The above part of the Advertisement presents information about the 2583 sources to the 
MCC. The information is effectively the same as the 2584 received Advertisements except that there are no Capture Encodings 2585 associated with them and the identities have been re-numbered. 2587 In addition to the source Capture information the MCU advertises 2588 "site" switching of Endpoints B to G in three streams. 2590 +=======================+=================================+ 2591 | Capture Scene #7 | Description=Output3streammix | 2592 +-----------------------|---------------------------------+ 2593 | MCC1(VC4,VC7,VC10, | CaptureArea=Left | 2594 | VC13) | MaxCaptures=1 | 2595 | | SynchronisationID=1 | 2596 | | Policy=SoundLevel:0 | 2597 | | EncodingGroup=1 | 2598 | | | 2599 | MCC2(VC5,VC8,VC11, | CaptureArea=Center | 2600 | VC14) | MaxCaptures=1 | 2601 | | SynchronisationID=1 | 2602 | | Policy=SoundLevel:0 | 2603 | | EncodingGroup=1 | 2604 | | | 2605 | MCC3(VC6,VC9,VC12, | CaptureArea=Right | 2606 | VC15) | MaxCaptures=1 | 2607 | | SynchronisationID=1 | 2608 | | Policy=SoundLevel:0 | 2609 | | EncodingGroup=1 | 2610 | | | 2611 | MCC4() (for audio) | CaptureArea=whole scene | 2612 | | MaxCaptures=1 | 2613 | | Policy=SoundLevel:0 | 2614 | | EncodingGroup=2 | 2615 | | | 2616 | MCC5() (for audio) | CaptureArea=whole scene | 2617 | | MaxCaptures=1 | 2618 | | Policy=SoundLevel:1 | 2619 | | EncodingGroup=2 | 2620 | | | 2621 | MCC6() (for audio) | CaptureArea=whole scene | 2622 | | MaxCaptures=1 | 2623 | | Policy=SoundLevel:2 | 2624 | | EncodingGroup=2 | 2625 | | | 2626 | MCC7() (for audio) | CaptureArea=whole scene | 2627 | | MaxCaptures=1 | 2628 | | Policy=SoundLevel:3 | 2629 | | EncodingGroup=2 | 2630 | | | 2631 | CSE(MCC1,MCC2,MCC3) | | 2632 | CSE(MCC4,MCC5,MCC6, | | 2633 | MCC7) | | 2634 +=======================+=================================+ 2636 Table 22: Advertisement sent to endpoint A - switching part 2638 The above part describes the 3 main switched streams that relate to 2639 site switching.
MaxCaptures=1 indicates that only one Capture from 2640 the MCC is sent at a particular time. SynchronisationID=1 indicates 2641 that the source sending is synchronised. The provider can choose to 2642 group together VC13, VC14, and VC15 for the purpose of switching 2643 according to the SynchronisationID. Therefore when the provider 2644 switches one of them into an MCC, it can also switch the others 2645 even though they are not part of the same Capture Scene. 2647 All the audio for the conference is included in this Scene #7. 2648 There isn't necessarily a one to one relation between any audio 2649 capture and video capture in this scene. Typically a change in 2650 loudest talker will cause the MCU to switch the audio streams more 2651 quickly than switching video streams. 2653 The MCU can also supply nine media streams showing the active and 2654 previous eight speakers. It includes the following in the 2655 Advertisement: 2657 +=======================+=================================+ 2658 | Capture Scene #8 | Description=Output9stream | 2659 +-----------------------|---------------------------------+ 2660 | MCC4(VC4,VC5,VC6,VC7, | MaxCaptures=1 | 2661 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:0 | 2662 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2663 | | | 2664 | MCC5(VC4,VC5,VC6,VC7, | MaxCaptures=1 | 2665 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:1 | 2666 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2667 | | | 2668 to to | 2669 | | | 2670 | MCC12(VC4,VC5,VC6,VC7,| MaxCaptures=1 | 2671 | VC8,VC9,VC10,VC11, | Policy=SoundLevel:8 | 2672 | VC12,VC13,VC14,VC15)| EncodingGroup=1 | 2673 | | | 2674 | CSE(MCC4,MCC5,MCC6, | | 2675 | MCC7,MCC8,MCC9, | | 2676 | MCC10,MCC11,MCC12)| | 2677 +=======================+=================================+ 2679 Table 23: Advertisement sent to endpoint A - 9 switched part 2681 The above part indicates that there are 9 capture encodings. 
Each 2682 of the Capture Encodings may contain a capture from any source 2683 site, with a maximum of one Capture at a time. Which Capture is 2684 present is determined by the policy. The MCCs in this scene do not 2685 have any spatial attributes. 2687 Note: The Provider alternatively could provide each of the MCCs 2688 above in its own Capture Scene. 2690 If the MCU wanted to provide a composed Capture Encoding containing 2691 all of the 9 captures it could Advertise in addition: 2693 +=======================+=================================+ 2694 | Capture Scene #9 | Description=NineTiles | 2695 +-----------------------|---------------------------------+ 2696 | MCC13(MCC4,MCC5,MCC6, | MaxCaptures=9 | 2697 | MCC7,MCC8,MCC9, | EncodingGroup=1 | 2698 | MCC10,MCC11,MCC12)| | 2699 | | | 2700 | CSE(MCC13) | | 2701 +=======================+=================================+ 2703 Table 24: Advertisement sent to endpoint A - 9 composed part 2705 As MaxCaptures is 9, the capture encoding contains 2706 information from up to 9 sources at a time. 2708 The Advertisement to Endpoint B is identical to the above except 2709 that the captures from Endpoint A would be added and the captures 2710 from Endpoint B would be removed. Whether the Captures are rendered 2711 on a four screen display or a three screen display is up to the 2712 Consumer to determine. The Consumer wants to place video captures 2713 from the same original source endpoint together, in the correct 2714 spatial order, but the MCCs do not have spatial attributes. So the 2715 Consumer needs to associate incoming media packets with the 2716 original individual captures in the advertisement (such as VC4, 2717 VC5, and VC6) in order to know the spatial information it needs for 2718 correct placement on the screens. 2720 Editor's note: this is an open issue, about how to associate 2721 incoming packets with the original capture that is a constituent of 2722 an MCC.
This document probably should mention it in an earlier 2723 section, after the solution is worked out in the other CLUE 2724 documents. 2726 13. Acknowledgements 2728 Allyn Romanow and Brian Baldino were authors of early versions. 2729 Mark Gorzyinski contributed much to the approach. We want to 2730 thank Stephen Botzko for helpful discussions on audio. 2732 14. IANA Considerations 2734 None. 2736 15. Security Considerations 2738 There are several potential attacks related to telepresence, and 2739 specifically the protocols used by CLUE, in the case of 2740 conferencing sessions, due to the natural involvement of multiple 2741 endpoints and the many, often user-invoked, capabilities provided 2742 by the systems. 2744 A middle box involved in a CLUE session can experience many of the 2745 same attacks as that of a conferencing system such as that enabled 2746 by the XCON framework [RFC 6503]. Examples of attacks include the 2747 following: an endpoint attempting to listen to sessions in which 2748 it is not authorized to participate, an endpoint attempting to 2749 disconnect or mute other users, and theft of service by an 2750 endpoint in attempting to create telepresence sessions it is not 2751 allowed to create. Thus, it is RECOMMENDED that a middle box 2752 implementing the protocols necessary to support CLUE, follow the 2753 security recommendations specified in the conference control 2754 protocol documents. In the case of CLUE, SIP is the default 2755 conferencing protocol, thus the security considerations in RFC 2756 4579 MUST be followed. 2758 One primary security concern, surrounding the CLUE framework 2759 introduced in this document, involves securing the actual 2760 protocols and the associated authorization mechanisms. These 2761 concerns apply to endpoint to endpoint sessions, as well as 2762 sessions involving multiple endpoints and middle boxes. 
Figure 2 2763 in section 5 provides a basic flow of information exchange for 2764 CLUE and the protocols involved. 2766 As described in section 5, CLUE uses SIP/SDP to establish the 2767 session prior to exchanging any CLUE specific information. Thus 2768 the security mechanisms recommended for SIP [RFC 3261], including 2769 user authentication and authorization, SHOULD be followed. In 2770 addition, the media is based on RTP and thus existing RTP security 2771 mechanisms, such as DTLS/SRTP, MUST be supported. 2773 A separate data channel is established to transport the CLUE 2774 protocol messages. The contents of the CLUE protocol messages are 2775 based on information introduced in this document, which is 2776 represented by an XML schema defined in the 2777 CLUE data model [ref]. Some of the information which could 2778 possibly introduce privacy concerns is the xCard information as 2779 described in section x. In addition, the (text) description field 2780 in the Media Capture attribute (section 7.1.1.7) could possibly 2781 reveal sensitive information or specific identities. The same 2782 would be true for the descriptions in the Capture Scene (section 2783 7.3.1) and Capture Scene Entry (7.3.2) attributes. One other 2784 important consideration for the information in the xCard as well 2785 as the description field in the Media Capture and Capture Scene 2786 Entry attributes is that while the endpoints involved in the 2787 session have been authenticated, there is no assurance that the 2788 information in the xCard or description fields is authentic. 2789 Thus, this information SHOULD NOT be used to make any 2790 authorization decisions and the participants in the sessions 2791 SHOULD be made aware of this. 2793 While other information in the CLUE protocol messages does not 2794 reveal specific identities, it can reveal characteristics and 2795 capabilities of the endpoints.
That information could possibly 2796 uniquely identify specific endpoints. It might also be possible 2797 for an attacker to manipulate the information and disrupt the CLUE 2798 sessions. It would also be possible to mount a DoS attack on the 2799 CLUE endpoints if a malicious agent has access to the data 2800 channel. Thus, it MUST be possible for the endpoints to establish 2801 a channel which is secure against both message recovery and 2802 message modification. Further details on this are provided in the 2803 CLUE data channel solution document. 2805 There are also security issues associated with the authorization 2806 to perform actions at the CLUE endpoints to invoke specific 2807 capabilities (e.g., re-arranging screens, sharing content, etc.). 2808 However, the policies and security associated with these actions 2809 are outside the scope of this document and the overall CLUE 2810 solution. 2812 16. Changes Since Last Version 2814 NOTE TO THE RFC-Editor: Please remove this section prior to 2815 publication as an RFC. 2817 Changes from 13 to 14: 2819 1. Fill in section for Security Considerations. 2821 2. Replace Role placeholder with Participant Information, 2822 Participant Type, and Scene Information attributes. 2824 3. Spatial information implies nothing about how constituent 2825 media captures are combined into a composed MCC. 2827 4. Clean up MCC example in Section 12.3.3. Clarify behavior of 2828 tiled and PIP display windows. Add audio. Add new open 2829 issue about associating incoming packets to original source 2830 capture. 2832 5. Remove editor's note and associated statement about RTP 2833 multiplexing at end of section 5. 2835 6. Remove editor's note and associated paragraph about 2836 overloading media channel with both CLUE and non-CLUE usage, 2837 in section 5. 2839 7. In section 10, clarify intent of media encodings conforming 2840 to SDP, even with multiple CLUE message exchanges. Remove 2841 associated editor's note.
2843 Changes from 12 to 13: 2845 1. Added the MCC concept including updates to existing sections 2846 to incorporate the MCC concept. New MCC attributes: 2847 MaxCaptures, SynchronisationID and Policy. 2849 2. Removed the "composed" and "switched" Capture attributes due 2850 to overlap with the MCC concept. 2852 3. Removed the "Scene-switch-policy" CSE attribute, replaced by 2853 MCC and SynchronisationID. 2855 4. Editorial enhancements including numbering of the Capture 2856 attribute sections, tables, figures etc. 2858 Changes from 11 to 12: 2860 1. Ticket #44. Remove note questioning whether to require a 2861 Consumer to send a Configure after receiving Advertisement. 2863 2. Ticket #43. Remove ability for consumer to choose value of 2864 attribute for scene-switch-policy. 2866 3. Ticket #36. Remove computational complexity parameter, 2867 MaxGroupPps, from Encoding Groups. 2869 4. Reword the Abstract and parts of sections 1 and 4 (now 5) 2870 based on Mary's suggestions as discussed on the list. Move 2871 part of the Introduction into a new section Overview & 2872 Motivation. 2874 5. Add diagram of an Advertisement, in the Overview of the 2875 Framework/Model section. 2877 6. Change Intended Status to Standards Track. 2879 7. Clean up RFC2119 keyword language. 2881 Changes from 10 to 11: 2883 1. Add description attribute to Media Capture and Capture Scene 2884 Entry. 2886 2. Remove contradiction and change the note about open issue 2887 regarding always responding to Advertisement with a Configure 2888 message. 2890 3. Update example section, to clean up formatting and make the 2891 media capture attributes and encoding parameters consistent 2892 with the rest of the document. 2894 Changes from 09 to 10: 2896 1. Several minor clarifications such as about SDP usage, Media 2897 Captures, Configure message. 2899 2. Simultaneous Set can be expressed in terms of Capture Scene 2900 and Capture Scene Entry. 2902 3. Removed Area of Scene attribute. 2904 4.
Add attributes from draft-groves-clue-capture-attr-01. 2906 5. Move some of the Media Capture attribute descriptions back 2907 into this document, but try to leave detailed syntax to the 2908 data model. Remove the OUTSOURCE sections, which are already 2909 incorporated into the data model document. 2911 Changes from 08 to 09: 2913 1. Use "document" instead of "memo". 2915 2. Add basic call flow sequence diagram to introduction. 2917 3. Add definitions for Advertisement and Configure messages. 2919 4. Add definitions for Capture and Provider. 2921 5. Update definition of Capture Scene. 2923 6. Update definition of Individual Encoding. 2925 7. Shorten definition of Media Capture and add key points in the 2926 Media Captures section. 2928 8. Reword a bit about capture scenes in overview. 2930 9. Reword about labeling Media Captures. 2932 10. Remove the Consumer Capability message. 2934 11. New example section heading for media provider behavior 2936 12. Clarifications in the Capture Scene section. 2938 13. Clarifications in the Simultaneous Transmission Set section. 2940 14. Capitalize defined terms. 2942 15. Move call flow example from introduction to overview section 2944 16. General editorial cleanup 2946 17. Add some editors' notes requesting input on issues 2947 18. Summarize some sections, and propose details be outsourced 2948 to other documents. 2950 Changes from 06 to 07: 2952 1. Ticket #9. Rename Axis of Capture Point attribute to Point 2953 on Line of Capture. Clarify the description of this 2954 attribute. 2956 2. Ticket #17. Add "capture encoding" definition. Use this new 2957 term throughout document as appropriate, replacing some usage 2958 of the terms "stream" and "encoding". 2960 3. Ticket #18. Add Max Capture Encodings media capture 2961 attribute. 2963 4. Add clarification that different capture scene entries are 2964 not necessarily mutually exclusive. 2966 Changes from 05 to 06: 2968 1. 
Capture scene description attribute is a list of text strings, 2969 each in a different language, rather than just a single string. 2971 2. Add new Axis of Capture Point attribute. 2973 3. Remove appendices A.1 through A.6. 2975 4. Clarify that the provider must use the same coordinate system 2976 with same scale and origin for all coordinates within the same 2977 capture scene. 2979 Changes from 04 to 05: 2981 1. Clarify limitations of "composed" attribute. 2983 2. Add new section "capture scene entry attributes" and add the 2984 attribute "scene-switch-policy". 2986 3. Add capture scene description attribute and description 2987 language attribute. 2989 4. Editorial changes to examples section for consistency with the 2990 rest of the document. 2992 Changes from 03 to 04: 2994 1. Remove sentence from overview - "This constitutes a significant 2995 change ..." 2997 2. Clarify a consumer can choose a subset of captures from a 2998 capture scene entry or a simultaneous set (in section "capture 2999 scene" and "consumer's choice..."). 3001 3. Reword first paragraph of Media Capture Attributes section. 3003 4. Clarify a stereo audio capture is different from two mono audio 3004 captures (description of audio channel format attribute). 3006 5. Clarify what it means when coordinate information is not 3007 specified for area of capture, point of capture, area of scene. 3009 6. Change the term "producer" to "provider" to be consistent (it 3010 was just in two places). 3012 7. Change name of "purpose" attribute to "content" and refer to 3013 RFC4796 for values. 3015 8. Clarify simultaneous sets are part of a provider advertisement, 3016 and apply across all capture scenes in the advertisement. 3018 9. Remove sentence about lip-sync between all media captures in a 3019 capture scene. 3021 10. 
Combine the concepts of "capture scene" and "capture set" 3022 into a single concept, using the term "capture scene" to 3023 replace the previous term "capture set", and eliminating the 3024 original separate capture scene concept. 3026 Informative References 3028 Edt. Note: Decide which of these really are Normative References. 3030 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 3031 Requirement Levels", BCP 14, RFC 2119, March 1997. 3033 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., 3034 Johnston, 3035 A., Peterson, J., Sparks, R., Handley, M., and E. 3037 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 3038 June 2002. 3040 [RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model 3041 with the Session Description Protocol (SDP)", RFC 3264, 3042 June 2002. 3044 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 3045 Jacobson, "RTP: A Transport Protocol for Real-Time 3046 Applications", STD 64, RFC 3550, July 2003. 3048 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the 3049 Session Initiation Protocol (SIP)", RFC 4353, 3050 February 2006. 3052 [RFC4579] Johnston, A. and O. Levin, "SIP Call Control - 3053 Conferencing for User Agents", RFC 4579, August 2006. 3055 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 3056 5117, 3057 January 2008. 3059 17. Authors' Addresses 3061 Mark Duckworth (editor) 3062 Polycom 3063 Andover, MA 01810 3064 USA 3066 Email: mark.duckworth@polycom.com 3068 Andrew Pepperell 3069 Acano 3070 Uxbridge, England 3071 UK 3073 Email: apeppere@gmail.com 3074 Stephan Wenger 3075 Vidyo, Inc. 3076 433 Hackensack Ave. 3077 Hackensack, N.J. 07601 3078 USA 3080 Email: stewe@stewe.org