CLUE WG                                                M. Duckworth, Ed.
Internet Draft                                                   Polycom
Intended status: Informational                              A. Pepperell
Expires: November 16, 2013                                         Acano
                                                               S. Wenger
                                                                   Vidyo
                                                            May 16, 2013

               Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-10.txt

Abstract

   This document offers a framework for a protocol that enables
   devices in a telepresence conference to interoperate by specifying
   the relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on November 16, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Definitions
   4. Overview of the Framework/Model
   5. Spatial Relationships
   6. Media Captures and Capture Scenes
      6.1. Media Captures
         6.1.1. Media Capture Attributes
      6.2. Capture Scene
         6.2.1. Capture Scene attributes
         6.2.2. Capture Scene Entry attributes
      6.3. Simultaneous Transmission Set Constraints
   7. Encodings
      7.1. Individual Encodings
      7.2. Encoding Group
   8. Associating Captures with Encoding Groups
   9. Consumer's Choice of Streams to Receive from the Provider
      9.1. Local preference
      9.2. Physical simultaneity restrictions
      9.3. Encoding and encoding group limits
   10. Extensibility
   11. Examples - Using the Framework
      11.1. Provider Behavior
         11.1.1. Three screen Endpoint Provider
         11.1.2. Encoding Group Example
         11.1.3. The MCU Case
      11.2. Media Consumer Behavior
         11.2.1. One screen Media Consumer
         11.2.2. Two screen Media Consumer configuring the example
         11.2.3. Three screen Media Consumer configuring the example
   12. Acknowledgements
   13. IANA Considerations
   14. Security Considerations
   15. Changes Since Last Version
   16. Authors' Addresses

1. Introduction

   Current telepresence systems, though based on open standards such
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases and to meet the
   requirements in draft-ietf-clue-telepresence-requirements.

   This document conceptually distinguishes between Media Providers
   and Media Consumers.  A Media Provider provides Media in the form
   of RTP packets; a Media Consumer consumes those RTP packets.  Media
   Providers and Media Consumers can reside in Endpoints or in
   middleboxes such as Multipoint Control Units (MCUs).  A Media
   Provider in an Endpoint is usually associated with the generation
   of media for Media Captures; these Media Captures are typically
   sourced from cameras, microphones, and the like.
   Similarly, the Media Consumer in an Endpoint is usually associated
   with Renderers, such as screens and loudspeakers.  In middleboxes,
   Media Providers and Consumers can have the form of outputs and
   inputs, respectively, of RTP mixers, RTP translators, and similar
   devices.  Typically, telepresence devices such as Endpoints and
   middleboxes would perform as both Media Providers and Media
   Consumers, the former being concerned with those devices'
   transmitted media and the latter with those devices' received
   media.  In a few circumstances, a CLUE Endpoint or middlebox may
   include only Consumer or Provider functionality, such as
   recorder-type Consumers or webcam-type Providers.

   Motivations for this document (and, in fact, for the existence of
   the CLUE protocol) include:

   (1) Endpoints according to this document can, and usually do, have
   multiple Media Captures and Media Renderers, that is, for example,
   multiple cameras and screens.  While previous system designs were
   able to set up calls that would light up all screens and cameras
   (or equivalent), what was missing was a mechanism that can
   associate the Media Captures with each other in space and time.

   (2) The mere fact that there are multiple capture and rendering
   devices, each of which may be configurable in aspects such as
   zoom, leads to the difficulty that a variable number of such
   devices can be used to capture different aspects of a region.  The
   Capture Scene concept allows for the description of multiple
   setups for those multiple capture devices that could represent
   sensible operation points of the physical capture devices in a
   room, chosen by the operator.  A Consumer can pick and choose from
   those configurations based on its rendering abilities and inform
   the Provider about its choices.  Details are provided in section
   6.

   (3) In some cases, physical limitations or other reasons disallow
   the concurrent use of a device in more than one setup.  For
   example, the center camera in a typical three-camera conference
   room can set its zoom objective either to capture only the middle
   few seats, or all seats of a room, but not both concurrently.  The
   Simultaneous Transmission Set concept allows a Provider to signal
   such limitations.  Simultaneous Transmission Sets are part of the
   Capture Scene description, and are discussed in section 6.3.

   (4) Often, the devices in a room do not have the computational
   complexity or connectivity to deal with multiple encoding options
   simultaneously, even if each of these options may be sensible in
   certain environments, and even if the simultaneous transmission
   may also be sensible (e.g., in the case of multicast media
   distribution to multiple endpoints).  Such constraints can be
   expressed by the Provider using the Encoding Group concept,
   described in section 7.

   (5) Due to the potentially large number of RTP flows required for
   a Multimedia Conference involving potentially many Endpoints, each
   of which can have many Media Captures and Media Renderers, a
   sensible system design is to multiplex multiple RTP media flows
   onto the same transport address, so as to avoid using the port
   number as a multiplexing point, with its associated shortcomings
   such as NAT/firewall traversal issues.
   While the actual mapping of those RTP flows to the header fields
   of the RTP packets is not the subject of this specification, the
   large number of possible permutations of sensible options a Media
   Provider may make available to a Media Consumer makes it desirable
   to have a mechanism that narrows down the number of possible
   options that a SIP offer-answer exchange has to consider.  Such
   information is made available using protocol mechanisms specified
   in this document and companion documents, although it should be
   stressed that its use in an implementation is optional.  Also,
   there are aspects of the control of both Endpoints and
   middleboxes/MCUs that dynamically change during the progress of a
   call, such as audio-level based screen switching, layout changes,
   and so on, which need to be conveyed.  Note that these control
   aspects are complementary to those specified in traditional SIP
   based conference management such as BFCP.  An exemplary call flow
   can be found in section 4.

   Finally, all this information needs to be conveyed, and the notion
   of support for it needs to be established.  This is done by the
   negotiation of a "CLUE channel", a data channel negotiated early
   during the initiation of a call.  An Endpoint or MCU that rejects
   the establishment of this data channel, by definition, does not
   support CLUE based mechanisms, whereas an Endpoint or MCU that
   accepts it is required to use it to the extent specified in this
   document and its companion documents.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in RFC 2119
   [RFC2119].

3. Definitions

   The terms defined below are used throughout this document and
   companion documents, and they are normative.  In order to easily
   identify the use of a defined term, those terms are capitalized.

   Advertisement: a CLUE message a Media Provider sends to a Media
   Consumer describing specific aspects of the content of the media,
   the formatting of the media streams it can send, and any
   restrictions it has in terms of being able to provide certain
   Streams simultaneously.

   Audio Capture: Media Capture for audio.  Denoted as ACn in the
   example cases in this document.

   Camera-Left and Right: For Media Captures, camera-left and
   camera-right are from the point of view of a person observing the
   rendered media.  They are the opposite of Stage-Left and
   Stage-Right.

   Capture: Same as Media Capture.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media
   encoder.

   Capture Encoding: A specific encoding of a Media Capture, to be
   sent by a Media Provider to a Media Consumer via RTP.

   Capture Scene: a structure representing a spatial region
   containing one or more Capture Devices, each capturing media
   representing a portion of the region.  The spatial region
   represented by a Capture Scene may or may not correspond to a real
   region in physical space, such as a room.  A Capture Scene
   includes attributes and one or more Capture Scene Entries, with
   each entry including one or more Media Captures.

   Capture Scene Entry: a list of Media Captures of the same media
   type that together form one way to represent the entire Capture
   Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Configure Message: A CLUE message a Media Consumer sends to a
   Media Provider specifying which content and media streams it wants
   to receive, based on the information in a corresponding
   Advertisement message.

   Consumer: short for Media Consumer.

   Encoding or Individual Encoding: a set of parameters representing
   a way to encode a Media Capture to become a Capture Encoding.

   Encoding Group: A set of encoding parameters representing a total
   media encoding capability to be sub-divided across potentially
   multiple Individual Encodings.

   Endpoint: The logical point of final termination through
   receiving, decoding and rendering, and/or initiation through
   capturing, encoding, and sending of media streams.  An Endpoint
   consists of one or more physical devices which source and sink
   media streams, and exactly one [RFC4353] Participant (which, in
   turn, includes exactly one SIP User Agent).  Endpoints can be
   anything from multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras.  Moving
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353]-like Mixer, without the
   [RFC4353] requirement to send media to each participant.

   Media: Any data that, after suitable encoding, can be conveyed
   over RTP, including audio, video or timed text.

   Media Capture: a source of Media, such as from one or more Capture
   Devices or constructed from other Media streams.

   Media Consumer: an Endpoint or middle box that receives Media
   streams.

   Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given
   vendor adheres to and expects the remote telepresence system(s)
   also to adhere to.

   Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Provider: Same as Media Provider.

   Render: the process of generating a representation from Media,
   such as displayed motion video or sound emitted from loudspeakers.

   Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For Media Captures, Stage-left and
   Stage-right are the opposite of Camera-left and Camera-right.  For
   the case of a person facing (and captured by) a camera, Stage-left
   and Stage-right are from the point of view of that person.

   Stream: a Capture Encoding sent from a Media Provider to a Media
   Consumer via RTP [RFC3550].

   Stream Characteristics: the media stream attributes commonly used
   in non-CLUE SIP/SDP environments (such as media codec, bit rate,
   resolution, profile/level, etc.) as well as CLUE specific
   attributes, such as the Capture ID or a spatial location.

   Video Capture: Media Capture for video.  Denoted as VCn in the
   example cases in this document.

   Video Composite: A single image that is formed, normally by an RTP
   mixer inside an MCU, by combining visual elements from separate
   sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   A Media Provider (transmitting Endpoint or MCU) describes specific
   aspects of the content of the media and the formatting of the
   media streams it can send in an Advertisement; and the Media
   Consumer responds to the Media Provider by specifying which
   content and media streams it wants to receive in a Configure
   message.  The Provider then transmits the asked-for content in the
   specified streams.

   This Advertisement and Configure exchange occurs as a minimum
   during call initiation, but may also happen at any time throughout
   the call, whenever there is a change in what the Consumer wants to
   receive or (perhaps less commonly) in what the Provider can send.

   An Endpoint or MCU typically acts as both Provider and Consumer at
   the same time, sending Advertisements and sending Configure
   messages in response to receiving Advertisements.  (It is possible
   to be just one or the other.)

   The data model is based around two main concepts: a Capture and an
   Encoding.  A Media Capture (MC), such as audio or video, describes
   the content a Provider can send.  Media Captures are described in
   terms of CLUE-defined attributes, such as spatial relationships
   and the purpose of the capture.  Providers tell Consumers which
   Media Captures they can provide, described in terms of the Media
   Capture attributes.

   A Provider organizes its Media Captures into one or more Capture
   Scenes, each representing a spatial region, such as a room.  A
   Consumer chooses which Media Captures it wants to receive from
   each Capture Scene.

   In addition, the Provider can send the Consumer a description of
   the Individual Encodings it can send, in terms of the media
   attributes of the Encodings, in particular audio and video
   parameters such as bandwidth, frame rate, and macroblocks per
   second.  Note that this is optional, and intended to minimize the
   number of options a later SDP offer-answer exchange would need to
   include in the SDP in case of complex setups, as should become
   clearer shortly when discussing an outline of the call flow.

   The Provider can also specify constraints on its ability to
   provide Media, and a sensible design choice for a Consumer is to
   take these into account when choosing the content and Capture
   Encodings it requests in the later offer-answer exchange.  Some
   constraints are due to the physical limitations of devices - for
   example, a camera may not be able to provide zoom and non-zoom
   views simultaneously.  Other constraints are system based, such as
   maximum bandwidth and maximum macroblocks/second.

   A very brief outline of the call flow used by a simple system (two
   Endpoints) in compliance with this document can be described as
   follows, and as shown in the following figure.

      +-----------+                      +-----------+
      | Endpoint1 |                      | Endpoint2 |
      +----+------+                      +-----+-----+
           | INVITE (BASIC SDP+CLUECHANNEL)   |
           |--------------------------------->|
           |    200 OK (BASIC SDP+CLUECHANNEL)|
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |      BASIC SDP MEDIA SESSION     |
           |<################################>|
           |                                  |
           | CONNECT (CLUE CTRL CHANNEL)      |
           |=================================>|
           |               ...                |
           |<================================>|
           |   CLUE CTRL CHANNEL ESTABLISHED  |
           |<================================>|
           |                                  |
           | ADVERTISEMENT 1                  |
           |*********************************>|
           |                  ADVERTISEMENT 2 |
           |<*********************************|
           |                                  |
           |                      CONFIGURE 1 |
           |<*********************************|
           | CONFIGURE 2                      |
           |*********************************>|
           |                                  |
           | REINVITE (UPDATED SDP)           |
           |--------------------------------->|
           |             200 OK (UPDATED SDP) |
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |     UPDATED SDP MEDIA SESSION    |
           |<################################>|
           |                                  |
           v                                  v

   An initial offer/answer exchange establishes a basic media
   session, for example audio-only, and a CLUE channel between two
   Endpoints.  With the establishment of that channel, the Endpoints
   have consented to use the CLUE protocol mechanisms and have to
   adhere to them.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities by sending an Advertisement as
   specified herein (which alone will typically not be sufficient to
   set up all media).  The Consumer in the Endpoint receives the
   information provided by the Provider, and can use it for two
   purposes.  First, it constructs and sends a CLUE Configure message
   to tell the Provider what the Consumer wishes to receive.  Second,
   it can, but is not necessarily required to, use the information
   provided to tailor the SDP it is going to send during the
   following SIP offer/answer exchange, and its reaction to SDP it
   receives in that step.  It is often a sensible implementation
   choice to do so, as the representation of the media information
   conveyed over the CLUE channel can dramatically cut down on the
   size of SDP messages used in the O/A exchange that follows.
   Spatial relationships associated with the Media can be included in
   the Advertisement, and it is often sensible for the Media Consumer
   to take those spatial relationships into account when tailoring
   the SDP.

   This CLUE exchange is followed by an SDP offer/answer exchange
   that not only establishes those aspects of the media that have not
   been "negotiated" over CLUE, but also has the side effect of
   setting up the media transmission itself, potentially involving
   security exchanges, ICE, and so on.  This step is plain vanilla
   SIP, with the exception that the SDP used here can, in most cases
   (though it does not have to), be considerably smaller than the SDP
   a system would typically need to exchange if there were no
   pre-established knowledge about the Provider and Consumer
   characteristics.  (The need for cutting down SDP size may not be
   obvious for a point-to-point call involving simple endpoints;
   however, when considering a large multipoint conference involving
   many multi-screen/multi-camera endpoints, each of which can
   operate using multiple codecs for each camera and microphone, it
   becomes perhaps somewhat more intuitive.)

   During the lifetime of a call, further exchanges can occur over
   the CLUE channel.  In some cases, those further exchanges can lead
   to a modified system behavior of Provider or Consumer (or both)
   without any other protocol activity such as further offer/answer
   exchanges.
   For example, voice-activated screen switching, signaled over the
   CLUE channel, ought not to lead to heavy-handed mechanisms like
   SIP re-invites.  However, in other cases, after the CLUE
   negotiation an additional offer/answer exchange may become
   necessary.  For example, if both sides decide to upgrade the call
   from a single screen to a multi-screen call and more bandwidth is
   required for the additional video channels, that could require a
   new O/A exchange.

   Numerous optimizations may be possible, and are the implementer's
   choice.  For example, it may be sensible to establish one or more
   initial media channels during the initial offer/answer exchange,
   which would allow, for example, for a fast startup of audio.
   Depending on the system design, it may be possible to re-use this
   established channel for more advanced media negotiated only by
   CLUE mechanisms, thereby avoiding further offer/answer exchanges.

   Edt. note: The editors are not sure whether the mentioned
   overloading of established RTP channels using only CLUE messages
   is possible, or desired by the WG.  If it were, there would
   certainly be a need for specification work.  One possible issue: a
   Provider which thinks that it can switch, say, an audio codec
   algorithm by CLUE only, talks to a Consumer which thinks that it
   has to faithfully answer the Provider's Advertisement through a
   Configure, but does not dare to set up its internal resources
   until it has received its authoritative O/A exchange.  Working
   group input is solicited.

   One aspect of the protocol outlined herein and specified in
   normative detail in companion documents is that it makes available
   information regarding the Provider's capabilities to deliver
   Media, and attributes related to that Media such as their spatial
   relationship, to the Consumer.  The operation of the Renderer
   inside the Consumer is unspecified in that it can choose to ignore
   some information provided by the Provider, and/or not render media
   streams available from the Provider (although it has to follow the
   CLUE protocol and, therefore, has to gracefully receive and
   respond (through a Configure) to the Provider's information).  All
   CLUE protocol mechanisms are optional in the Consumer in the sense
   that, while the Consumer must be able to receive (and,
   potentially, gracefully acknowledge) CLUE messages, it is free to
   ignore the information provided therein.  Obviously, ignoring all
   of it is not a particularly sensible design choice.

   Legacy devices are defined herein as those Endpoints and MCUs that
   do not support the setup and use of the CLUE channel.  The notion
   of a device being a legacy device is established during the
   initial offer/answer exchange, in which the legacy device will not
   understand the offer for the CLUE channel and, therefore, will
   reject it.  This indicates to the CLUE-implementing Endpoint or
   MCU that the other side of the communication is not compliant with
   CLUE, and that it should fall back to whatever mechanism was used
   before the introduction of CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP transported)
   media; and the mechanisms described herein and in companion
   documents do not change the aspects of setting up those RTP flows
   and sessions.  In other words, the RTP media sessions conform to
   the negotiated SDP whether or not CLUE is used.  However, it
   should be noted that forms of RTP multiplexing of multiple RTP
   flows onto the same transport address are being developed
   concurrently with the CLUE suite of specifications, and it is
   widely expected that most, if not all, Endpoints or MCUs
   supporting CLUE will also support those mechanisms.  Some design
   choices made in this document reflect this coincidence in spec
   development timing.
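
   The following non-normative sketch, in Python, summarizes the
   Consumer side of the machinery described in this section: every
   received Advertisement is answered with a Configure, and the
   received information may optionally be used to tailor the
   subsequent SIP offer/answer exchange.  All class and method names
   here are invented for illustration; they are not defined by CLUE
   or its companion documents.

      # Illustrative only; not a normative part of this framework.
      class ClueConsumer:
          def __init__(self, clue_channel, sip_session):
              self.channel = clue_channel  # set up in the initial O/A
              self.sip = sip_session
              self.advertisement = None

          def on_advertisement(self, advertisement):
              # Answer the Advertisement with a Configure listing the
              # Capture Encodings this Consumer wishes to receive.
              self.advertisement = advertisement
              configure = self.choose_captures(advertisement)
              self.channel.send(configure)
              # Optionally tailor the following offer/answer exchange
              # to the configured streams, keeping the SDP small.
              self.sip.send_reinvite(self.tailor_sdp(configure))

          def choose_captures(self, advertisement):
              raise NotImplementedError  # rendering-specific policy

          def tailor_sdp(self, configure):
              raise NotImplementedError  # implementation-specific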

5. Spatial Relationships

   In order for a Consumer to perform a proper rendering, it is often
   necessary, or at least helpful, for the Consumer to have received
   spatial information about the streams it is receiving.  CLUE
   defines a coordinate system that allows Media Providers to
   describe the spatial relationships of their Media Captures to
   enable proper scaling and spatially sensible rendering of their
   streams.  The coordinate system is based on a few principles:

   o  Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o  Coordinates can either be in real, physical units
      (millimeters), have an unknown scale, or have no physical
      scale.  Systems which know their physical dimensions (for
      example, professionally installed Telepresence room systems)
      should always provide those real-world measurements.  Systems
      which don't know specific physical dimensions but still know
      relative distances should use 'unknown scale'.  'No scale' is
      intended to be used where Media Captures from different devices
      (with potentially different scales) will be forwarded alongside
      one another (e.g. in the case of a middle box).

      *  "millimeters" means the scale is in millimeters.

      *  "Unknown" means the scale is not necessarily millimeters,
         but the scale is the same for every Capture in the Capture
         Scene.

      *  "No Scale" means the scale could be different for each
         capture - an MCU provider that advertises two adjacent
         captures and picks sources (which can change quickly) from
         different endpoints might use this value; the scale could be
         different and changing for each capture.  But the areas of
         capture still represent a spatial relation between captures.

   o  The coordinate system is Cartesian X, Y, Z with the origin at a
      spatial location of the Provider's choosing.  The Provider must
      use the same coordinate system, with the same scale and origin,
      for all coordinates within the same Capture Scene.

   The direction of increasing coordinate values is:
      X increases from Camera-Left to Camera-Right
      Y increases from Front to back
      Z increases from low to high
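
   As a non-normative illustration, the coordinate rules above could
   be represented as follows.  The type names are invented; only the
   axis and scale semantics come from this section.

      from dataclasses import dataclass

      SCALES = ("millimeters", "unknown", "no scale")

      @dataclass(frozen=True)
      class Point:
          x: float  # increases from Camera-Left to Camera-Right
          y: float  # increases from Front to back
          z: float  # increases from low to high

      def comparable(capture_a, capture_b):
          # Origin and scale are only guaranteed to be shared within
          # a single Capture Scene, so coordinates may only be
          # compared between Captures of the same Capture Scene.
          return capture_a.scene is capture_b.scene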

6. Media Captures and Capture Scenes

   This section describes how Providers can describe the content of
   media to Consumers.

6.1. Media Captures

   Media Captures are the fundamental representations of streams
   that a device can transmit.  What a Media Capture actually
   represents is flexible:

   o  It can represent the immediate output of a physical source
      (e.g. camera, microphone) or 'synthetic' source (e.g. laptop
      computer, DVD player).

   o  It can represent the output of an audio mixer or video
      composer.

   o  It can represent a concept such as 'the loudest speaker'.

   o  It can represent a conceptual position such as 'the leftmost
      stream'.

   To identify and distinguish between multiple instances, video and
   audio captures are labeled, for instance VC1, VC2 and AC1, AC2,
   where VC1 and VC2 refer to two different video captures and AC1
   and AC2 refer to two different audio captures.

   Some key points about Media Captures:

   .  A Media Capture is of a single media type (e.g. audio or
      video)
   .  A Media Capture is associated with exactly one Capture Scene
   .  A Media Capture is associated with one or more Capture Scene
      Entries
   .  A Media Capture has exactly one set of spatial information
   .  A Media Capture may be the source of one or more Capture
      Encodings

   Each Media Capture can be associated with attributes to describe
   what it represents.

6.1.1. Media Capture Attributes

   Media Capture Attributes describe information about the Captures.
   A Provider can use the Media Capture Attributes to describe the
   Captures for the benefit of the Consumer in the Advertisement
   message.  Media Capture Attributes include:

   .  spatial information, such as point of capture, point on line
      of capture, and area of capture, all of which, in combination,
      define the capture field of, for example, a camera;
   .  capture multiplexing information (composed/switched video,
      mono/stereo audio, maximum number of simultaneous encodings
      per Capture, and so on);
   .  other descriptive information to help the Consumer choose
      between captures (presentation, view, priority, language,
      role); and
   .  control information for use inside the CLUE protocol suite.

   Point of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes the spatial location of the capturing device (such as a
   camera).

   Point on Line of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes a position in space of a second point on the axis of
   the capturing device, the first point being the Point of Capture
   (see above).

   Together, the Point of Capture and Point on Line of Capture
   define an axis of the capturing device, for example the optical
   axis of a camera.  The Media Consumer can use this information to
   adjust how it renders the received media if it so chooses.

   Area of Capture:

   A field with a set of four (X, Y, Z) points as a value which
   describes the spatial location of what is being "captured".  By
   comparing the Area of Capture for different Media Captures within
   the same Capture Scene, a Consumer can determine the spatial
   relationships between them and render them correctly (one
   possible use is sketched below).

   The four points should be co-planar, forming a quadrilateral,
   which defines the Plane of Interest for the particular media
   capture.

   If the Area of Capture is not specified, it means the Media
   Capture is not spatially related to any other Media Capture.

   For a switched capture that switches between different sections
   within a larger area, the area of capture should use coordinates
   for the larger potential area.

   Mobility of Capture:

   This attribute indicates whether the point of capture, point on
   line of capture, and area of capture values will stay the same or
   are expected to change frequently.  Possible values are static,
   dynamic, and highly dynamic.

   For example, a camera may be placed at different positions in
   order to provide the best angle to capture a work task, or the
   capture may come from a camera worn by a participant.  This would
   have the effect of changing the capture point, capture axis and
   area of capture.  In order that the Consumer can choose to render
   the capture appropriately, the Provider can include this
   attribute to indicate whether the camera location is dynamic or
   not.

   The capture point of a static capture does not move for the life
   of the conference.  The capture point of dynamic captures is
   characterized by a change in position followed by a reasonable
   period of stability.  Highly dynamic captures are characterized
   by a capture point that is constantly moving.  If the "area of
   capture", "capture point" and "line of capture" attributes are
   included with dynamic or highly dynamic captures, they indicate
   the spatial information at the time of the Advertisement; no
   assumptions should be made about future spatial information.
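
   The following sketch shows one way (of many) in which a Consumer
   might use the spatial attributes above: ordering the Video
   Captures of a single Capture Scene from camera-left to
   camera-right by the X extent of their Areas of Capture.  The
   helper names and object layout are invented for illustration.

      def x_center(area):
          # 'area' holds the four co-planar (X, Y, Z) corner points
          # of an Area of Capture.
          return sum(p.x for p in area) / 4.0

      def left_to_right(captures):
          # A Capture without an Area of Capture is not spatially
          # related to the others, so it is left out of the ordering.
          spatial = [c for c in captures
                     if c.area_of_capture is not None]
          return sorted(spatial,
                        key=lambda c: x_center(c.area_of_capture))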

   Composed:

   A boolean field which indicates whether or not the Media Capture
   is a mix (audio) or composition (video) of streams.

   This attribute is useful for a media consumer to avoid nesting a
   composed video capture into another composed capture or
   rendering.  This attribute is not intended to describe the layout
   a media provider uses when composing video streams.

   Switched:

   A boolean field which indicates whether or not the Media Capture
   represents the (dynamic) most appropriate subset of a 'whole'.
   What is 'most appropriate' is up to the Provider and could be the
   active speaker, a lecturer or a VIP.

   Audio Channel Format:

   A field with enumerated values which describes the method of
   encoding used for audio.  A value of 'mono' means the Audio
   Capture has one channel.  'stereo' means the Audio Capture has
   two audio channels, left and right.

   This attribute applies only to Audio Captures.  A single stereo
   capture is different from two mono captures that have a
   left-right spatial relationship.  A stereo capture maps to a
   single Capture Encoding, while each mono audio capture maps to a
   separate Capture Encoding.

   Max Capture Encodings:

   An optional attribute indicating the maximum number of Capture
   Encodings that can be simultaneously active for the Media
   Capture.  The number of simultaneous Capture Encodings is also
   limited by the restrictions of the Encoding Group for the Media
   Capture.

   Presentation:

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data, etc.  Where more information is known about the capture, it
   may be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image,
   etc.

   Note: It is expected that a number of keywords will be defined
   that provide more detail on the type of presentation.

   View:

   A field with enumerated values, indicating what type of view the
   capture relates to.  The Consumer can use this information to
   help choose which Media Captures it wishes to receive.  The value
   can be one of:

   Room - Captures the entire scene

   Table - Captures the conference table with seated participants

   Individual - Captures an individual participant

   Lectern - Captures the region of the lectern including the
   presenter in a classroom style conference

   Audience - Captures a region showing the audience in a classroom
   style conference

   Language:

   This attribute indicates one or more languages used in the
   content of the media capture.
   Captures may be offered in different languages in case of
   multilingual and/or accessible conferences, so a Consumer can use
   this attribute to differentiate between them.  For example, it
   may provide a language associated with an audio capture, or a
   language associated with a video capture when sign interpretation
   or text is used.

   Role:

   Edt. Note -- this is a placeholder for a role attribute, as
   discussed in draft-groves-clue-capture-attr.  We expect to
   continue discussing the role attribute in the context of that
   draft, and follow-on drafts, before adding it to this framework
   document.

   Priority:

   This attribute indicates a relative priority between different
   Media Captures.  The Provider sets this priority, and the
   Consumer may use the priority to help decide which captures it
   wishes to receive.

   The "priority" attribute is an integer which indicates a relative
   priority between captures.  For example, it is possible to assign
   a priority between two presentation captures that would allow a
   remote endpoint to determine which presentation is more
   important.  Priority is assigned at the individual capture level.
   It represents the Provider's view of the relative priority
   between captures with a priority.  The same priority number may
   be used across multiple captures, indicating that they are
   equally important.  If no priority is assigned, no assumptions
   regarding the relative importance of the capture can be made.

   Embedded Text:

   This attribute indicates that a capture provides embedded textual
   information.  For example, the video capture may contain
   speech-to-text information composed with the video image.  This
   attribute is only applicable to video captures and presentation
   streams with visual information.

   Related To:

   This attribute indicates the capture contains additional
   complementary information related to another capture.  The value
   indicates the other capture to which this capture is providing
   additional information.

   For example, a conference can utilise translators or facilitators
   that provide an additional audio stream (i.e. a translation,
   description or commentary of the conference).  Where multiple
   captures are available, it may be advantageous for a Consumer to
   select a complementary capture instead of, or in addition to, the
   capture it relates to.
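
   As a hypothetical example of using the descriptive attributes
   above, a Consumer could choose between otherwise equivalent Audio
   Captures as follows.  The attribute names mirror this section,
   but the object layout is invented; also note that the framework
   does not define whether larger or smaller priority integers are
   more important (this sketch assumes larger means more important).

      def pick_audio(captures, preferred_language="en"):
          audio = [c for c in captures if c.media_type == "audio"]
          in_language = [c for c in audio
                         if preferred_language in (c.language or [])]
          candidates = in_language or audio
          # Captures without an assigned priority sort last.
          return max(candidates,
                     key=lambda c: -1 if c.priority is None
                                   else c.priority,
                     default=None)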

6.2. Capture Scene

   In order for a Provider's individual Captures to be used
   effectively by a Consumer, the Provider organizes the Captures
   into one or more Capture Scenes, with the structure and contents
   of these Capture Scenes being sent from the Provider to the
   Consumer in the Advertisement.

   A Capture Scene is a structure representing a spatial region
   containing one or more Capture Devices, each capturing media
   representing a portion of the region.  A Capture Scene includes
   one or more Capture Scene Entries, with each entry including one
   or more Media Captures.  A Capture Scene represents, for example,
   the video image of a group of people seated next to each other,
   along with the sound of their voices, which could be represented
   by some number of VCs and ACs in the Capture Scene Entries.  A
   middle box may also express Capture Scenes that it constructs
   from media Streams it receives.

   A Provider may advertise multiple Capture Scenes or just a single
   Capture Scene.  What constitutes an entire Capture Scene is up to
   the Provider.  A Provider might typically use one Capture Scene
   for participant media (live video from the room cameras) and
   another Capture Scene for a computer generated presentation.  In
   more complex systems, the use of additional Capture Scenes is
   also sensible.  For example, a classroom may advertise two
   Capture Scenes involving live video, one including only the
   camera capturing the instructor (and associated audio), the other
   including camera(s) capturing students (and associated audio).

   A Capture Scene may (and typically will) include more than one
   type of media.  For example, a Capture Scene can include several
   Capture Scene Entries for Video Captures, and several Capture
   Scene Entries for Audio Captures.  A particular Capture may be
   included in more than one Capture Scene Entry.

   A Provider can express spatial relationships between Captures
   that are included in the same Capture Scene.  However, there is
   not necessarily the same spatial relationship between Media
   Captures that are in different Capture Scenes.  In other words,
   Capture Scenes can each use their own spatial measurement system,
   as outlined above in section 5.

   A Provider arranges Captures in a Capture Scene to help the
   Consumer choose which captures it wants.  The Capture Scene
   Entries in a Capture Scene are different alternatives the
   Provider is suggesting for representing the Capture Scene.  The
   order of Capture Scene Entries within a Capture Scene has no
   significance.  The Media Consumer can choose to receive all Media
   Captures from one Capture Scene Entry for each media type (e.g.
   audio and video), or it can pick and choose Media Captures
   regardless of how the Provider arranges them in Capture Scene
   Entries.  Different Capture Scene Entries of the same media type
   are not necessarily mutually exclusive alternatives.  Also note
   that the presence of multiple Capture Scene Entries (with
   potentially multiple encoding options in each entry) in a given
   Capture Scene does not necessarily imply that a Provider is able
   to serve all the associated media simultaneously (although the
   construction of such an over-rich Capture Scene is probably not
   sensible in many cases).  What a Provider can send simultaneously
   is determined through the Simultaneous Transmission Set
   mechanism, described in section 6.3.

   Captures within the same Capture Scene Entry must be of the same
   media type - it is not possible to mix audio and video captures
   in the same Capture Scene Entry, for instance.  The Provider must
   be capable of encoding and sending all Captures in a single
   Capture Scene Entry simultaneously.  The order of Captures within
   a Capture Scene Entry has no significance.  A Consumer may decide
   to receive all the Captures in a single Capture Scene Entry, but
   a Consumer could also decide to receive just a subset of those
   captures.  A Consumer can also decide to receive Captures from
   different Capture Scene Entries, all subject to the constraints
   set by Simultaneous Transmission Sets, as discussed in section
   6.3.

   When a Provider advertises a Capture Scene with multiple entries,
   it is essentially signaling that there are multiple
   representations of the same Capture Scene available.  In some
   cases, these multiple representations would typically be used
   simultaneously (for instance a "video entry" and an "audio
   entry").  In some cases the entries would conceptually be
   alternatives (for instance an entry consisting of three Video
   Captures covering the whole room versus an entry consisting of
   just a single Video Capture covering only the center of a room).
   In this latter example, one sensible choice for a Consumer would
   be to indicate (through its Configure and possibly through an
   additional offer/answer exchange) the Captures of the Capture
   Scene Entry that most closely matches the Consumer's number of
   display devices or screen layout.

   The following is an example of 4 potential Capture Scene Entries
   for an endpoint-style Provider:

   1. (VC0, VC1, VC2) - left, center and right camera Video Captures

   2. (VC3) - Video Capture associated with loudest room segment

   3. (VC4) - Video Capture zoomed out view of all people in the
      room

   4. (AC0) - main audio

   The first entry in this Capture Scene example is a list of Video
   Captures which have a spatial relationship to each other.
   Determination of the order of these captures (VC0, VC1 and VC2)
   for rendering purposes is accomplished through use of their Area
   of Capture attributes.  The second entry (VC3) and the third
   entry (VC4) are alternative representations of the same room's
   video, which might be better suited to some Consumers' rendering
   capabilities.  The inclusion of the Audio Capture in the same
   Capture Scene indicates that AC0 is associated with all of those
   Video Captures, meaning it comes from the same spatial region.
   Therefore, if audio were to be rendered at all, this audio would
   be the correct choice irrespective of which Video Captures were
   chosen.
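
   Written out as a plain data structure, the example Capture Scene
   above might look as follows.  The layout is illustrative only;
   the concrete syntax of an Advertisement is defined in companion
   documents.

      example_scene = {
          "description": "example endpoint room",   # invented text
          "scale": "millimeters",
          "entries": [
              {"media": "video",
               "captures": ["VC0", "VC1", "VC2"]},  # three cameras
              {"media": "video",
               "captures": ["VC3"]},  # loudest room segment
              {"media": "video",
               "captures": ["VC4"]},  # zoomed out view
              {"media": "audio",
               "captures": ["AC0"]},  # main audio
          ],
      }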

6.2.1. Capture Scene attributes

   Attributes can be applied to Capture Scenes as well as to
   individual Media Captures.  Attributes specified at this level
   apply to all constituent Captures.  Capture Scene attributes
   include:

   .  Human-readable description of the Capture Scene, which could
      be in multiple languages;
   .  Scale information (millimeters, unknown, no scale), as
      described in Section 5.

6.2.2. Capture Scene Entry attributes

   A Capture Scene can include one or more Capture Scene Entries in
   addition to the Capture Scene-wide attributes described above.
   Capture Scene Entry attributes apply to the Capture Scene Entry
   as a whole, i.e. to all Captures that are part of the Capture
   Scene Entry.

   Capture Scene Entry attributes include:

   .  Scene-switch-policy: {site-switch, segment-switch}

   A Media Provider uses this scene-switch-policy attribute to
   indicate its support for different switching policies.  In the
   Provider's Advertisement, this attribute can have multiple
   values, which means the Provider supports each of the indicated
   policies.

   The Consumer, when it requests media captures from this Capture
   Scene Entry, should also include this attribute, but with only a
   single value (from among the values indicated by the Provider)
   indicating the Consumer's choice of which policy it wants the
   Provider to use.  The Consumer must choose the same value for all
   the Media Captures in the Capture Scene Entry.  If the Provider
   does not support any of these policies, it should omit this
   attribute.

   The "site-switch" policy means all captures are switched at the
   same time to keep captures from the same endpoint site together.
   Let's say the speaker is at site A and everyone else is at a
   "remote" site.

   When the room at site A is shown, all the camera images from site
   A are forwarded to the remote sites.  Therefore, at each
   receiving remote site, all the screens display camera images from
   site A.  This can be used to preserve full-size image display,
   and also to provide full visual context of the displayed far end,
   site A.  In site switching, there is a fixed relation between the
   cameras in each room and the displays in remote rooms.  The room
   or participants being shown is switched from time to time based
   on who is speaking or by manual control.

   The "segment-switch" policy means different captures can switch
   at different times, and can be coming from different endpoints.
   Still using site A as the speaker's site, and "remote" to refer
   to all the other sites: in segment switching, rather than sending
   all the images from site A, only the image containing the speaker
   at site A is shown.  The camera images of the current speaker and
   previous speakers (if any) are forwarded to the other sites in
   the conference.

   Therefore, the screens in each site are usually displaying images
   from different remote sites - the current speaker at site A and
   the previous ones.  This strategy can be used to preserve
   full-size image display, and also to capture the non-verbal
   communication between the speakers.  In segment switching, the
   display depends on the activity in the remote rooms - generally,
   but not necessarily, based on audio / speech detection.
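
   A minimal, non-normative sketch of this negotiation from the
   Consumer side: the Provider advertises every policy it supports,
   and the Consumer echoes back exactly one of those values (or
   none, if the attribute was absent).

      def choose_policy(advertised, preferred="segment-switch"):
          # 'advertised' is the list of scene-switch-policy values
          # from the Capture Scene Entry; empty if omitted.
          if not advertised:
              return None      # Provider supports neither policy
          if preferred in advertised:
              return preferred
          return advertised[0]  # fall back to whatever is offered

      assert choose_policy(["site-switch", "segment-switch"]) \
          == "segment-switch"
      assert choose_policy(["site-switch"]) == "site-switch"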

6.3. Simultaneous Transmission Set Constraints

   The Provider may have constraints or limitations on its ability
   to send Captures.  One type is caused by the physical limitations
   of capture mechanisms; these constraints are represented by a
   Simultaneous Transmission Set.  The second type of limitation
   reflects the encoding resources available - bandwidth and
   macroblocks/second.  This type of constraint is captured by
   Encoding Groups, discussed below.

   Some Endpoints or MCUs can send multiple Captures simultaneously;
   however, sometimes there are constraints that limit which
   Captures can be sent simultaneously with other Captures.  A
   device may not be able to be used in different ways at the same
   time.  Provider Advertisements are made so that the Consumer can
   choose one of several possible mutually exclusive usages of the
   device.  This type of constraint is expressed in a Simultaneous
   Transmission Set, which lists all the Captures of a particular
   media type (e.g. audio, video, text) that can be sent at the same
   time.  There are different Simultaneous Transmission Sets for
   each media type in the Advertisement.  This is easier to show in
   an example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons each - VC0, VC1, VC2.  The middle camera can also zoom
   out (using an optical zoom lens) and show all six persons, VC3.
   But the middle camera cannot be used in both modes at the same
   time - it has to either show the space where two participants sit
   or the whole six seats, but not both at the same time.

   Simultaneous Transmission Sets are expressed as sets of the Media
   Captures that the Provider could transmit at the same time
   (though it may not make sense to do so).  In this example, the
   two simultaneous sets are shown in Table 1.  If a Provider
   advertises one or more mutually exclusive Simultaneous
   Transmission Sets, then for each media type the Consumer must
   ensure that it chooses Media Captures that lie wholly within one
   of those Simultaneous Transmission Sets.

      +-------------------+
      | Simultaneous Sets |
      +-------------------+
      | {VC0, VC1, VC2}   |
      | {VC0, VC3, VC2}   |
      +-------------------+

        Table 1: Two Simultaneous Transmission Sets

   A Provider can optionally include the simultaneous sets in its
   Advertisement.  These simultaneous set constraints apply across
   all the Capture Scenes in the Advertisement.  It is a syntax
   conformance requirement that the Simultaneous Transmission Sets
   must allow all the Media Captures in any particular Capture Scene
   Entry to be used simultaneously.

   For shorthand convenience, a Provider may describe a Simultaneous
   Transmission Set in terms of Capture Scene Entries and Capture
   Scenes.  If a Capture Scene Entry is included in a Simultaneous
   Transmission Set, then all Media Captures in the Capture Scene
   Entry are included in the Simultaneous Transmission Set.  If a
   Capture Scene is included in a Simultaneous Transmission Set,
   then all its Capture Scene Entries (of the corresponding media
   type) are included in the Simultaneous Transmission Set.  The end
   result reduces to a set of Media Captures in any case.

   If an Advertisement does not include Simultaneous Transmission
   Sets, then all Capture Scenes can be provided simultaneously.  If
   multiple Capture Scene Entries are in a Capture Scene, then the
   Consumer chooses at most one Capture Scene Entry per Capture
   Scene for each media type.

   If an Advertisement includes multiple Capture Scene Entries in a
   Capture Scene, then the Consumer should choose one Capture Scene
   Entry for each media type, but may choose individual Captures
   based on the Simultaneous Transmission Sets.
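
   The conformance rule above reduces to a simple subset check,
   sketched here (non-normatively) against the sets of Table 1.

      SIMULTANEOUS_SETS = [
          {"VC0", "VC1", "VC2"},  # all three cameras zoomed in
          {"VC0", "VC3", "VC2"},  # middle camera zoomed out (VC3)
      ]

      def selection_allowed(chosen, simultaneous_sets):
          # For a given media type, the chosen Media Captures must
          # lie wholly within one Simultaneous Transmission Set.
          if not simultaneous_sets:
              return True  # no sets advertised: no restriction
          return any(chosen <= s for s in simultaneous_sets)

      assert selection_allowed({"VC0", "VC3"}, SIMULTANEOUS_SETS)
      # The middle camera cannot be used in both modes at once:
      assert not selection_allowed({"VC1", "VC3"},
                                   SIMULTANEOUS_SETS)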

7. Encodings

   Individual Encodings and Encoding Groups are CLUE's mechanisms
   allowing a Provider to signal its limitations for sending
   Captures, or combinations of Captures, to a Consumer.  Consumers
   can map the Captures they want to receive onto the Encodings,
   with the encoding parameters they want.  As for the relationship
   between the CLUE-specified mechanisms based on Encodings and the
   SIP offer-answer exchange, please refer to section 4.

7.1. Individual Encodings

   An Individual Encoding represents a way to encode a Media Capture
   to become a Capture Encoding, to be sent as an encoded media
   stream from the Provider to the Consumer.  An Individual Encoding
   has a set of parameters characterizing how the media is encoded.

   Different media types have different parameters, and different
   encoding algorithms may have different parameters.  An Individual
   Encoding can be assigned to at most one Capture Encoding at any
   given time.

   The parameters of an Individual Encoding represent the maximum
   values for certain aspects of the encoding.  A particular
   instantiation into a Capture Encoding might use lower values than
   these maximums.

   In general, the parameters of an Individual Encoding have been
   chosen to represent those negotiable parameters of media codecs
   of the media type that greatly influence computational
   complexity, while abstracting from the details of the particular
   media codecs used.  The parameters have been chosen with those
   media codecs in mind that have seen wide deployment in the video
   conferencing and Telepresence industry.

   For video codecs (using H.26x compression technologies), those
   parameters include:

   .  Maximum bitrate;
   .  Maximum picture size in pixels;
   .  Maximum number of pixels to be processed per second; and
   .  CLUE-protocol internal information.

   For audio codecs, so far only one parameter has been identified:

   .  Maximum bitrate.

   Edt. note: the maximum number of pixels per second is currently
   expressed as H.264maxmbps.

   Edt. note: it would be desirable to make the computational
   complexity mechanism codec independent, so as to allow expressing
   that, say, H.264 codecs are less complex than H.265 codecs and,
   therefore, the same hardware can process higher pixel rates for
   H.264 than for H.265.  To be discussed in the WG.

7.2. Encoding Group

   An Encoding Group includes a set of one or more Individual
   Encodings, and parameters that apply to the group as a whole.  By
   grouping multiple Individual Encodings together, an Encoding
   Group describes additional constraints on bandwidth and other
   parameters for the group.

   The Encoding Group data structure contains:

   .  Maximum bitrate for all encodings in the group combined;
   .  Maximum number of pixels per second for all video encodings
      of the group combined; and
   .  A list of identifiers for the audio and video encodings,
      respectively, belonging to the group.

   When the Individual Encodings in a group are instantiated into
   Capture Encodings, each Capture Encoding has a bitrate that must
   be less than or equal to the maximum bitrate for the particular
   Individual Encoding.  The "maximum bitrate for all encodings in
   the group" parameter gives the additional restriction that the
   sum of all the individual Capture Encoding bitrates must be less
   than or equal to this group value.

   Likewise, the sum of the pixels per second of each instantiated
   encoding in the group must not exceed the group value.

   The following diagram illustrates one example of the structure of
   a Media Provider's Encoding Groups and their contents.

   ,-------------------------------------------------.
   |                 Media Provider                  |
   |                                                 |
   |  ,--------------------------------------.      |
   |  | ,--------------------------------------.    |
   |  | | ,--------------------------------------.  |
   |  | | |            Encoding Group            |  |
   |  | | | ,-----------.                        |  |
   |  | | | |           | ,---------.            |  |
   |  | | | |           | |         | ,---------.|  |
   |  | | | | Encoding1 | |Encoding2| |Encoding3||  |
   |  `.| | |           | |         | `---------'|  |
   |    `.| `-----------' `---------'            |  |
   |      `--------------------------------------'  |
   `-------------------------------------------------'

              Figure 1: Encoding Group Structure

   A Provider advertises one or more Encoding Groups.  Each Encoding
   Group includes one or more Individual Encodings.  Each Individual
   Encoding can represent a different way of encoding media.  For
   example, one Individual Encoding may be 1080p60 video, another
   could be 720p30, with a third being CIF, all in, for example,
   H.264 format.
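
   The bitrate and pixel-rate rules above can be checked as follows;
   this is an illustrative sketch, not a normative algorithm, and
   the field names are invented.

      def group_configuration_valid(group, configured):
          # 'configured' is a list of (encoding, bitrate,
          # pixels_per_second) tuples, one per requested Capture
          # Encoding; in this invented layout, audio entries carry
          # zero pixels per second.
          for encoding, bitrate, pps in configured:
              # Each Capture Encoding must stay within the maxima of
              # the Individual Encoding it instantiates.
              if bitrate > encoding.max_bitrate:
                  return False
              if pps > encoding.max_pixels_per_second:
                  return False
          # The sums across the group must stay within the
          # group-wide maxima.
          if sum(b for _, b, _ in configured) > group.max_bitrate:
              return False
          if sum(p for _, _, p in configured) \
                  > group.max_pixels_per_second:
              return False
          return True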
1198 While a typical three codec/display system might have one Encoding
1199 Group per "codec box" (physical codec, connected to one camera and
1200 one screen), there are many possibilities for the number of
1201 Encoding Groups a Provider may be able to offer and for the
1202 encoding values in each Encoding Group.

1204 There is no requirement for all Encodings within an Encoding Group
1205 to be instantiated at the same time.

1207 8. Associating Captures with Encoding Groups

1209 Every Capture is associated with an Encoding Group, which is used
1210 to instantiate that Capture into one or more Capture Encodings.
1211 More than one Capture may use the same Encoding Group.

1213 The maximum number of streams that can result from a particular
1214 Encoding Group constraint is equal to the number of Individual
1215 Encodings in the group. The actual number of Capture Encodings
1216 used at any time may be less than this maximum. Any of the
1217 Captures that use a particular Encoding Group can be encoded
1218 according to any of the Individual Encodings in the group. If
1219 there are multiple Individual Encodings in the group, then the
1220 Consumer can configure the Provider, via a Configure message, to
1221 encode a single Media Capture into multiple different Capture
1222 Encodings at the same time, subject to the Max Capture Encodings
1223 constraint, with each Capture Encoding following the constraints of
1224 a different Individual Encoding.

1226 It is a protocol conformance requirement that the Encoding Groups
1227 must allow all the Captures in a particular Capture Scene Entry to
1228 be used simultaneously.

1230 9. Consumer's Choice of Streams to Receive from the Provider

1232 After receiving the Provider's Advertisement message (that includes
1233 media captures and associated constraints), the Consumer composes
1234 its reply to the Provider in the form of a Configure message. The
1235 Consumer is free to use the information in the Advertisement as it
1236 chooses, but there are a few obviously sensible design choices,
1237 which are outlined below.

1239 If multiple Providers connect to the same Consumer (i.e., in an
1240 MCU-less multiparty call), it is the responsibility of the Consumer
1241 to compose Configures for each Provider that fulfill both that
1242 Provider's constraints, as expressed in its Advertisement, and the
1243 Consumer's own capabilities.

1245 In an MCU-based multiparty call, the MCU can logically terminate
1246 the Advertisement/Configure negotiation in that it can hide the
1247 characteristics of the receiving endpoint and rely on its own
1248 capabilities (transcoding/transrating/...) to create Media Streams
1249 that can be decoded at the Endpoint Consumers. The timing of an
1250 MCU's sending of Advertisements (for its outgoing ports) and
1251 Configures (for its incoming ports, in response to Advertisements
1252 received there) is up to the MCU and implementation dependent.

1254 As a general outline, a Consumer can choose, based on the
1255 Advertisement it has received, which Captures it wishes to receive,
1256 and which Individual Encodings it wants the Provider to use to
1257 encode the Captures. Each Capture has an Encoding Group ID
1258 attribute which specifies which Individual Encodings are available
1259 to be used for that Capture.
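To make the association concrete, the following sketch (illustrative
only; hypothetical names, not CLUE syntax) shows one Media Capture
instantiated into two simultaneous Capture Encodings drawn from its
Encoding Group:

   # Illustrative sketch (hypothetical names): one Capture, two
   # Capture Encodings from the same Encoding Group.
   group = {"id": "EG0", "encodings": ["ENC0", "ENC1", "ENC2"]}
   capture = {"id": "VC0", "encodingGroupID": "EG0"}

   # A Configure could request, for example:
   capture_encodings = [
       {"capture": "VC0", "encoding": "ENC0"},  # e.g., a 1080p30 stream
       {"capture": "VC0", "encoding": "ENC1"},  # e.g., a 720p30 stream
   ]

   # No more streams can result from the group than it has Individual
   # Encodings, and each Individual Encoding backs at most one stream.
   assert len(capture_encodings) <= len(group["encodings"])
   assert all(ce["capture"] == capture["id"] for ce in capture_encodings)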
1261 A Configure Message includes a list of Capture Encodings. These
1262 are the Capture Encodings the Consumer wishes to receive from the
1263 Provider. Each Capture Encoding refers to one Media Capture and
1264 one Individual Encoding, and includes the encoding parameter
1265 values. For each Media Capture in the message, the Consumer may
1266 also specify the value of any attributes for which the Provider has
1267 offered a choice, for example the value for the Scene-switch-policy
1268 attribute. A Configure Message does not include references to
1269 Capture Scenes or Capture Scene Entries.

1271 For each Capture the Consumer wants to receive, it configures one
1272 or more of the encodings in that Capture's Encoding Group. The
1273 Consumer does this by telling the Provider, in its Configure
1274 Message, parameters such as the resolution, frame rate, bandwidth,
1275 etc. for each Capture Encoding of its chosen Captures. Upon
1276 receipt of this Configure from the Consumer, common knowledge is
1277 established between Provider and Consumer regarding sensible
1278 choices for the media streams and their parameters. The setup of
1279 the actual media channels, at least in the simplest case, is left
1280 to a following offer-answer exchange. Optimized implementations
1281 may speed up the reaction to the offer-answer exchange by reserving
1282 the resources at the time of finalization of the CLUE handshake.
1283 Even more advanced devices may choose to establish media streams
1284 without an offer-answer exchange, for example by overloading
1285 existing 5-tuple connections with the negotiated media.

1287 The Consumer must have received at least one Advertisement from the
1288 Provider to be able to create and send a Configure. Each
1289 Advertisement is acknowledged by a corresponding Configure.

1291 In addition, the Consumer can send a Configure at any time during
1292 the call. The Configure must be valid according to the most
1293 recently received Advertisement. The Consumer can send a Configure
1294 either in response to a new Advertisement from the Provider or on
1295 its own, for example because of a local change in conditions
1296 (people leaving the room, connectivity changes, multipoint related
1297 considerations).

1299 The Consumer need not send a new Configure message to the Provider
1300 when it receives a new Advertisement from the Provider unless the
1301 contents of the new Advertisement cause the Consumer's current
1302 Configure message to become invalid.

1304 Edt. Note: The editors solicit input from the working group as to
1305 whether or not a Consumer must respond to every Advertisement with
1306 a new Configure message.

1308 When choosing which Media Streams to receive from the Provider, and
1309 the encoding characteristics of those Media Streams, the Consumer
1310 advantageously takes several things into account: its local
1311 preference, simultaneity restrictions, and encoding limits.
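Returning to the validity rule above (a Configure must remain valid
against the most recently received Advertisement), a small sketch
(illustrative only; hypothetical names) shows the resulting decision
on receipt of a new Advertisement:

   # Illustrative sketch (hypothetical names): does a new
   # Advertisement invalidate the current Configure?
   def needs_new_configure(adv, current_capture_encodings):
       offered = {(c, e) for c in adv["captures"]
                         for e in adv["encodings_for"][c]}
       return not all((ce["capture"], ce["encoding"]) in offered
                      for ce in current_capture_encodings)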
1314 9.1. Local preference

1315 A variety of local factors influence the Consumer's choice of
1316 Media Streams to be received from the Provider:

1318 o if the Consumer is an Endpoint, it is likely that it would
1319   choose, where possible, to receive video and audio Captures that
1320   match the number of display devices and the audio system it has

1322 o if the Consumer is a middle box such as an MCU, it may choose to
1323   receive loudest speaker streams (in order to perform its own
1324   media composition) and avoid pre-composed video Captures

1326 o user choice (for instance, selection of a new layout) may result
1327   in a different set of Captures, or different encoding
1328   characteristics, being required by the Consumer

1330 9.2. Physical simultaneity restrictions

1332 There may be physical simultaneity constraints imposed by the
1333 Provider that affect the Provider's ability to simultaneously send
1334 all of the Captures the Consumer would wish to receive. For
1335 instance, a middle box such as an MCU, when connected to a multi-
1336 camera room system, might prefer to receive both individual video
1337 streams of the people present in the room and an overall view of
1338 the room from a single camera. Some Endpoint systems might be
1339 able to provide both of these sets of streams simultaneously,
1340 whereas others may not (if the overall room view were produced by
1341 changing the optical zoom level on the center camera, for
1342 instance).

1344 9.3. Encoding and encoding group limits

1346 Each of the Provider's Encoding Groups has limits on bandwidth and
1347 computational complexity, and the constituent potential encodings
1348 have limits on the bandwidth, computational complexity, video
1349 frame rate, and resolution that can be provided. When choosing
1350 the Captures to be received from a Provider, a Consumer device
1351 must ensure that the encoding characteristics requested for each
1352 individual Capture fit within the capability of the encoding it
1353 is being configured to use, as well as ensuring that the combined
1354 encoding characteristics for Captures fit within the capabilities
1355 of their associated Encoding Groups. In some cases, this could
1356 cause an otherwise "preferred" choice of Capture Encodings to be
1357 passed over in favor of different Capture Encodings - for
1358 instance, if a set of three Captures could only be provided at a
1359 low resolution, then a three screen device could switch to
1360 favoring a single, higher-quality Capture Encoding.

1362 10. Extensibility

1364 One of the most important characteristics of the Framework is its
1365 extensibility. Telepresence is a relatively new industry and
1366 while we can foresee certain directions, we also do not know
1367 everything about how it will develop. The standard for
1368 interoperability and handling multiple streams must be future-
1369 proof. The framework itself is inherently extensible through
1370 expanding the data model types. For example:

1372 o Adding more types of media, such as telemetry, can be done by
1373   defining additional types of Captures in addition to audio and
1374   video.

1376 o Adding new functionalities, such as 3-D, may require additional
1377   attributes describing the Captures.

1379 o Adding a new codec, such as H.265, can be accomplished by
1380   defining new encoding variables.

1382 The infrastructure is designed to be extended rather than
1383 requiring new infrastructure elements. Extension comes through
1384 adding to defined types.
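As an informal illustration of the first bullet in the list above
(not a normative data model; all class names are hypothetical), a
new media type can be introduced as one more Capture type alongside
audio and video, leaving the rest of the machinery untouched:

   # Illustrative sketch (hypothetical classes, not the CLUE data
   # model): extension by adding a new Capture type.
   class MediaCapture:
       def __init__(self, capture_id, attributes=None):
           self.capture_id = capture_id
           self.attributes = attributes or {}  # extensible attributes

   class VideoCapture(MediaCapture): pass
   class AudioCapture(MediaCapture): pass

   # A future extension defines one more Capture type; existing
   # scenes, entries, and encoding machinery continue to apply.
   class TelemetryCapture(MediaCapture): pass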
1386 11. Examples - Using the Framework

1387 Edt. Note: these examples are currently out of date with respect
1388 to H264Mbps codepoints, which will be fixed in the next release
1389 once an agreement about codec computational complexity has been
1390 reached. Other than that, the examples are still valid.

1392 Edt. Note: remove syntax-like details in these examples and focus
1393 on concepts for this document. Syntax examples with XML should be
1394 in the data model document or a dedicated example document.

1396 This section gives some examples, first from the point of view of
1397 the Provider, then the Consumer.

1399 11.1. Provider Behavior

1401 This section shows in more detail some examples of how a Provider
1402 can use the framework to represent a typical case for telepresence
1403 rooms. First an Endpoint is illustrated, then an MCU case is
1404 shown.

1406 11.1.1. Three screen Endpoint Provider

1408 Consider an Endpoint with the following description:

1410 3 cameras, 3 displays, a 6 person table

1412 o Each camera can provide one Capture for each 1/3 section of the
1413   table

1415 o A single Capture representing the active speaker can be provided
1416   (voice activity based camera selection to a given encoder input
1417   port implemented locally in the Endpoint)

1419 o A single Capture representing the active speaker with the other
1420   2 Captures shown picture-in-picture within the stream can be
1421   provided (again, implemented inside the Endpoint)

1423 o A Capture showing a zoomed out view of all 6 seats in the room
1424   can be provided

1426 The audio and video Captures for this Endpoint can be described as
1427 follows.

1429 Video Captures:

1431 o VC0- (the camera-left camera stream), encoding group=EG0,
1432   content=main, switched=false

1434 o VC1- (the center camera stream), encoding group=EG1,
1435   content=main, switched=false

1437 o VC2- (the camera-right camera stream), encoding group=EG2,
1438   content=main, switched=false

1440 o VC3- (the loudest panel stream), encoding group=EG1,
1441   content=main, switched=true

1443 o VC4- (the loudest panel stream with PiPs), encoding group=EG1,
1444   content=main, composed=true, switched=true

1446 o VC5- (the zoomed out view of all people in the room), encoding
1447   group=EG1, content=main, composed=false, switched=false

1449 o VC6- (presentation stream), encoding group=EG1, content=slides,
1450   switched=false

1452 The following diagram is a top view of the room with 3 cameras, 3
1453 displays, and 6 seats. Each camera captures 2 people. The six
1454 seats are not all in a straight line.

1456              ,-. d
1457             ( )`--.__          +---+
1458              `-' /     `--.__  |   |
1459              ,-. |           `-.._|_-+Camera 2 (VC2)
1460             ( ).'    ___..-+-''`+-+
1461              `-' |_...---''     |   |
1462              ,-.c+-..__         +---+
1463             ( )|       ``--..__ |   |
1464              `-' |            ``+-..|_-+Camera 1 (VC1)
1465              ,-. |      __..--'|+-+
1466             ( )| __..--'       |   |
1467              `-'b|..--'        +---+
1468              ,-. |``---..___   |   |
1469             ( )\            ```--..._|_-+Camera 0 (VC0)
1470              `-' \      _..-''`-+
1471              ,-.  \ __.--''    |   |
1472             ( )  |..-''        +---+
1473              `-' a

1475 The two points labeled b and c are intended to be at the midpoint
1476 between the seating positions, and where the fields of view of the
1477 cameras intersect.

1479 The plane of interest for VC0 is a vertical plane that intersects
1480 points 'a' and 'b'.

1482 The plane of interest for VC1 intersects points 'b' and 'c'. The
1483 plane of interest for VC2 intersects points 'c' and 'd'.

1485 This example uses an area scale of millimeters.
1487 Areas of capture:

1489        bottom left       bottom right      top left            top right
1490 VC0  (-2011,2850,0)   ( -673,3000,0)   (-2011,2850,757)   ( -673,3000,757)
1491 VC1  ( -673,3000,0)   (  673,3000,0)   ( -673,3000,757)   (  673,3000,757)
1492 VC2  (  673,3000,0)   ( 2011,2850,0)   (  673,3000,757)   ( 2011,2850,757)
1493 VC3  (-2011,2850,0)   ( 2011,2850,0)   (-2011,2850,757)   ( 2011,2850,757)
1494 VC4  (-2011,2850,0)   ( 2011,2850,0)   (-2011,2850,757)   ( 2011,2850,757)
1495 VC5  (-2011,2850,0)   ( 2011,2850,0)   (-2011,2850,757)   ( 2011,2850,757)
1496 VC6  none

1498 Points of capture:
1499 VC0  (-1678,0,800)
1500 VC1  (    0,0,800)
1501 VC2  ( 1678,0,800)
1502 VC3  none
1503 VC4  none
1504 VC5  (    0,0,800)
1505 VC6  none

1507 In this example, the right edge of the VC0 area lines up with the
1508 left edge of the VC1 area. It doesn't have to be this way; there
1509 could be a gap or an overlap. One additional thing to note for
1510 this example is that the distance from a to b is equal to the
1511 distance from b to c and to the distance from c to d. All these
1512 distances are 1346 mm (for VC0, for example, the horizontal span
1513 is 2011 - 673 = 1338 mm, the depth difference is 3000 - 2850 =
1514 150 mm, and sqrt(1338^2 + 150^2) is approximately 1346 mm). This
1515 is the planar width of each area of capture for VC0, VC1, and VC2.

1516 Note the text in parentheses (e.g. "the camera-left camera
1517 stream") is not explicitly part of the model; it is just
1518 explanatory text for this example and is not included in the
1519 model with the media captures and attributes. Also, the
1520 "composed" boolean attribute doesn't say anything about how a
1521 capture is composed, so the media consumer can't tell based on
1522 this attribute that VC4 is composed of a "loudest panel with
1523 PiPs".

1524 Audio Captures:

1526 o AC0 (camera-left), encoding group=EG3, content=main, channel
1527   format=mono

1529 o AC1 (camera-right), encoding group=EG3, content=main, channel
1530   format=mono

1532 o AC2 (center), encoding group=EG3, content=main, channel
1533   format=mono

1535 o AC3 being a simple pre-mixed audio stream from the room (mono),
1536   encoding group=EG3, content=main, channel format=mono

1538 o AC4 audio stream associated with the presentation video (mono),
1539   encoding group=EG3, content=slides, channel format=mono

1541 Areas of capture:

1543        bottom left       bottom right      top left            top right
1545 AC0  (-2011,2850,0)   ( -673,3000,0)   (-2011,2850,757)   ( -673,3000,757)
1546 AC1  (  673,3000,0)   ( 2011,2850,0)   (  673,3000,757)   ( 2011,2850,757)
1547 AC2  ( -673,3000,0)   (  673,3000,0)   ( -673,3000,757)   (  673,3000,757)
1548 AC3  (-2011,2850,0)   ( 2011,2850,0)   (-2011,2850,757)   ( 2011,2850,757)
1549 AC4  none

1551 The physical simultaneity information is:

1553 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1555 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1557 This constraint indicates it is not possible to use all the VCs at
1558 the same time. VC5 cannot be used at the same time as VC1 or VC3
1559 or VC4. Also, using every member in a set simultaneously may not
1560 make sense - for example VC3 (loudest) and VC4 (loudest with
1561 PiPs). (In addition, there are encoding constraints that make
1562 choosing all of the VCs in a set impossible. VC1, VC3, VC4, VC5,
1563 and VC6 all use EG1, and EG1 has only 3 ENCs. This constraint
1564 shows up in the encoding groups, not in the simultaneous
1565 transmission sets.)

1567 In this example there are no restrictions on which audio captures
1568 can be sent simultaneously.
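Applying the membership rule of section 6.3 to these two sets gives
a quick consistency check (illustrative sketch only; hypothetical
names):

   # Illustrative sketch: the example's two Simultaneous Transmission
   # Sets and a membership check.
   set1 = {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"}
   set2 = {"VC0", "VC2", "VC5", "VC6"}

   def valid(chosen):
       return chosen <= set1 or chosen <= set2

   assert valid({"VC0", "VC2", "VC5"})   # zoomed-out view, from set #2
   assert not valid({"VC1", "VC5"})      # VC5 excludes VC1, as noted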
1570 Encoding Groups:

1572 This example has three encoding groups associated with the video
1573 captures. Each group can have 3 encodings, but with each
1574 potential encoding having a progressively lower specification.
1575 In this example, 1080p60 transmission is possible (as ENC0 has a
1576 maxMbps value compatible with that) as long as it is the only
1577 active encoding in the group (as maxMbps for the entire encoding
1578 group is also 489600). Significantly, as up to 3 encodings are
1579 available per group, it is possible to transmit some video
1580 captures simultaneously that are not in the same entry in the
1581 capture scene, for example VC1 and VC3 at the same time.

1583 It is also possible to transmit multiple capture encodings of a
1584 single video capture. For example VC0 can be encoded using ENC0
1585 and ENC1 at the same time, as long as the encoding parameters
1586 satisfy the constraints of ENC0, ENC1, and EG0, such as one at
1587 1080p30 and one at 720p30.

1589 encodeGroupID=EG0, maxGroupH264Mbps=489600,
1590                    maxGroupBandwidth=6000000
1591   encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1592                  maxH264Mbps=489600, maxBandwidth=4000000
1593   encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1594                  maxH264Mbps=108000, maxBandwidth=4000000
1595   encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1596                  maxH264Mbps=61200, maxBandwidth=4000000
1597 encodeGroupID=EG1, maxGroupH264Mbps=489600,
1598                    maxGroupBandwidth=6000000
1599   encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1600                  maxH264Mbps=489600, maxBandwidth=4000000
1601   encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1602                  maxH264Mbps=108000, maxBandwidth=4000000
1603   encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1604                  maxH264Mbps=61200, maxBandwidth=4000000
1605 encodeGroupID=EG2, maxGroupH264Mbps=489600,
1606                    maxGroupBandwidth=6000000
1607   encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1608                  maxH264Mbps=489600, maxBandwidth=4000000
1609   encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1610                  maxH264Mbps=108000, maxBandwidth=4000000
1611   encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1612                  maxH264Mbps=61200, maxBandwidth=4000000

1614 Figure 2: Example Encoding Groups for Video

1616 For audio, there are five potential encodings available, so all
1617 five audio captures can be encoded at the same time.

1619 encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
1620   encodeID=ENC9, maxBandwidth=64000
1621   encodeID=ENC10, maxBandwidth=64000
1622   encodeID=ENC11, maxBandwidth=64000
1623   encodeID=ENC12, maxBandwidth=64000
1624   encodeID=ENC13, maxBandwidth=64000

1626 Figure 3: Example Encoding Group for Audio

1628 Capture Scenes:

1630 The following table represents the Capture Scenes for this
1631 Provider. Recall that a Capture Scene is composed of alternative
1632 Capture Scene Entries covering the same spatial region. Capture
1633 Scene #1 is for the main people captures, and Capture Scene #2 is
1634 for presentation. Each row in the table is a separate Capture
1636 Scene Entry.

1638 +------------------+
1639 | Capture Scene #1 |
1640 +------------------+
1641 | VC0, VC1, VC2    |
1642 | VC3              |
1643 | VC4              |
1644 | VC5              |
1645 | AC0, AC1, AC2    |
1646 | AC3              |
1647 +------------------+

1649 +------------------+
1650 | Capture Scene #2 |
1651 +------------------+
1652 | VC6              |
1653 | AC4              |
1654 +------------------+

1656 Different Capture Scenes are distinct from each other and non-
1657 overlapping; a Consumer can choose an entry from each Capture
1658 Scene. In this case the three captures VC0, VC1, and VC2 are one
1659 way of representing the video from the Endpoint. These three
1660 captures should appear adjacent to each other. Alternatively,
1661 another way of representing the Capture Scene is with the capture
1662 VC3, which automatically shows the person who is talking.
1663 Similarly for the VC4 and VC5 alternatives.
1665 As in the video case, the different entries of audio in Capture
1666 Scene #1 represent the "same thing", in that one way to receive
1667 the audio is with the 3 audio captures (AC0, AC1, AC2), and
1668 another way is with the mixed AC3. The Media Consumer can choose
1669 an audio capture entry it is capable of receiving.

1671 The spatial ordering is conveyed by the media capture attributes
1672 area of capture and point of capture.

1674 A Media Consumer would likely want to choose a capture scene entry
1675 to receive based in part on how many streams it can simultaneously
1676 receive. A consumer that can receive three people streams would
1677 probably prefer to receive the first entry of Capture Scene #1
1678 (VC0, VC1, VC2) and not receive the other entries. A consumer
1679 that can receive only one people stream would probably choose one
1680 of the other entries.

1682 If the consumer can receive a presentation stream too, it would
1683 also choose to receive the only entry from Capture Scene #2 (VC6).

1685 11.1.2. Encoding Group Example

1687 This is an example of an encoding group to illustrate how it can
1688 express dependencies between encodings.

1690 encodeGroupID=EG0, maxGroupH264Mbps=489600,
1691                    maxGroupBandwidth=6000000
1692   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1693                     maxFrameRate=60, maxH264Mbps=244800,
1694                     maxBandwidth=4000000
1695   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1696                     maxFrameRate=60, maxH264Mbps=244800,
1697                     maxBandwidth=4000000
1698   encodeID=AUDENC0, maxBandwidth=96000
1699   encodeID=AUDENC1, maxBandwidth=96000
1700   encodeID=AUDENC2, maxBandwidth=96000

1702 Here, the encoding group is EG0. It can transmit up to two
1703 1080p30 capture encodings simultaneously (the H264Mbps value for
1704 1080p30 is 244800), although each individual encoding is capable
1705 of a maxFrameRate of 60 frames per second (fps). To achieve the
1706 maximum resolution (1920 x 1088) the frame rate is limited to 30
1707 fps. However, 60 fps can be achieved at a lower resolution if
1708 required by the consumer. Although the encoding group is capable
1709 of transmitting up to 6 Mbit/s, no individual video encoding can
1710 exceed 4 Mbit/s.

1711 This encoding group also allows up to 3 audio encodings, AUDENC<0-
1712 2>. It is not required that audio and video encodings reside
1713 within the same encoding group, but if they do, then the group's
1714 overall maxBandwidth value is a limit on the sum of all audio and
1715 video encodings configured by the consumer. A system that does
1716 not wish or need to combine bandwidth limitations in this way
1717 should instead use separate encoding groups for audio and video,
1718 so that the bandwidth limitations on audio and video do not
1719 interact.

1720 Audio and video can be expressed in separate encoding groups, as
1721 in this illustration.

1723 encodeGroupID=EG0, maxGroupH264Mbps=489600,
1724                    maxGroupBandwidth=6000000
1725   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1726                     maxFrameRate=60, maxH264Mbps=244800,
1727                     maxBandwidth=4000000
1728   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1729                     maxFrameRate=60, maxH264Mbps=244800,
1730                     maxBandwidth=4000000
1731 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
1732   encodeID=AUDENC0, maxBandwidth=96000
1733   encodeID=AUDENC1, maxBandwidth=96000
1734   encodeID=AUDENC2, maxBandwidth=96000
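A quick arithmetic check of the EG0 video limits above (illustrative
only; it assumes H.264 16x16 macroblocks, so that H264Mbps counts
macroblocks per second) shows why two simultaneous 1080p30 encodings
fit while a single 1080p60 encoding does not:

   # Illustrative check of the EG0 limits (H.264 16x16 macroblocks).
   MAX_GROUP_MBPS = 489600   # maxGroupH264Mbps
   MAX_ENC_MBPS = 244800     # maxH264Mbps per VIDENC

   mbps_1080p30 = (1920 // 16) * (1088 // 16) * 30   # = 244800
   mbps_1080p60 = (1920 // 16) * (1088 // 16) * 60   # = 489600

   assert mbps_1080p30 <= MAX_ENC_MBPS          # one 1080p30: fits
   assert 2 * mbps_1080p30 <= MAX_GROUP_MBPS    # two at once: fits exactly
   assert mbps_1080p60 > MAX_ENC_MBPS           # 1080p60 exceeds a single
                                                # encoding's maximum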
1736 11.1.3. The MCU Case

1738 This section shows how an MCU might express its Capture Scenes,
1739 intending to offer different choices for consumers that can handle
1740 different numbers of streams. A single audio capture stream is
1741 provided for all single and multi-screen configurations; it can be
1742 associated (e.g., lip-synced) with any combination of video
1743 captures at the consumer.

1745 +--------------------+---------------------------------------------+
1746 | Capture Scene #1   | note                                        |
1747 +--------------------+---------------------------------------------+
1748 | VC0                | video capture for single screen consumer    |
1749 | VC1, VC2           | video capture for 2 screen consumer         |
1750 | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1751 | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1752 | AC0                | audio capture representing all participants |
1753 +--------------------+---------------------------------------------+

1764 If/when a presentation stream becomes active within the
1765 conference, the MCU might re-advertise the available media as:

1767 +------------------+--------------------------------------+
1768 | Capture Scene #2 | note                                 |
1769 +------------------+--------------------------------------+
1770 | VC10             | video capture for presentation       |
1771 | AC1              | presentation audio to accompany VC10 |
1772 +------------------+--------------------------------------+

1774 11.2. Media Consumer Behavior

1776 This section gives an example of how a Media Consumer might behave
1777 when deciding how to request streams from the three screen
1778 endpoint described in the previous section.

1780 The receive side of a call needs to balance its requirements
1781 (based on its number of screens and loudspeakers), its decoding
1782 capabilities, the available bandwidth, and the provider's
1783 capabilities in order to optimally configure the provider's
1784 streams. Typically it would want to receive and decode media from
1785 each Capture Scene advertised by the Provider.

1787 A sane, basic algorithm might be for the consumer to go through
1788 each Capture Scene in turn and find the collection of Video
1789 Captures that best matches the number of screens it has (this
1790 might include consideration of screens dedicated to presentation
1791 video display rather than "people" video) and then decide between
1792 alternative entries in the video Capture Scenes based either on
1793 hard-coded preferences or user choice. Once this choice has been
1794 made, the consumer would then decide how to configure the
1795 provider's encoding groups in order to make best use of the
1796 available network bandwidth and its own decoding capabilities.
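A minimal sketch of that selection step (illustrative only;
hypothetical names and data layout) could be:

   # Illustrative sketch (hypothetical names): per Capture Scene,
   # pick the video entry whose stream count best matches the
   # number of available screens.
   def choose_entries(capture_scenes, num_screens):
       chosen = []
       for scene in capture_scenes:
           video_entries = [e for e in scene if e["type"] == "video"]
           if video_entries:
               chosen.append(min(video_entries,
                                 key=lambda e: abs(len(e["captures"])
                                                   - num_screens)))
       return chosen

   scene1 = [{"type": "video", "captures": ["VC0", "VC1", "VC2"]},
             {"type": "video", "captures": ["VC3"]},
             {"type": "video", "captures": ["VC4"]},
             {"type": "video", "captures": ["VC5"]}]
   assert choose_entries([scene1], 3)[0]["captures"] == ["VC0", "VC1", "VC2"]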
1798 11.2.1. One screen Media Consumer

1800 VC3, VC4 and VC5 are all different entries by themselves, not
1801 grouped together in a single entry, so the receiving device should
1802 choose one of those. The choice would come down to whether to see
1803 the greatest number of participants simultaneously at roughly
1804 equal precedence (VC5), a switched view of just the loudest region
1805 (VC3), or a switched view with PiPs (VC4). An endpoint device
1806 with knowledge of these differences could offer a dynamic choice
1807 of these options, in-call, to the user.

1810 11.2.2. Two screen Media Consumer configuring the example

1812 Mixing systems with an even number of screens, "2n", and those
1813 with "2n+1" cameras (and vice versa) is always likely to be the
1814 problematic case. In this instance, the behavior is likely to be
1815 determined by whether a "2 screen" system is really a "2 decoder"
1816 system, i.e., whether only one received stream can be displayed
1817 per screen or whether more than 2 streams can be received and
1818 spread across the available screen area. To enumerate 3 possible
1819 behaviors here for the 2 screen system when it learns that the far
1820 end is "ideally" expressed via 3 capture streams:

1822 1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as
1823    per the 1 screen consumer case above) and either leave one
1824    screen blank or use it for presentation if/when a presentation
1825    becomes active.

1827 2. Receive 3 streams (VC0, VC1 and VC2) and display them across
1828    the 2 screens, either with each capture scaled to 2/3 of a
1829    screen and the center capture split across the 2 screens or,
1830    as would be necessary if there were large bezels on the
1831    screens, with each stream scaled to 1/2 the screen width and
1832    height and a 4th "blank" panel. This 4th panel could
1833    potentially be used for any presentation that became active
1834    during the call.

1836 3. Receive 3 streams, decode all 3, and use control information
1837    indicating which was the most active to switch between showing
1838    the left and center streams (one per screen) and the center and
1839    right streams.

1841 For an endpoint capable of all 3 methods of working described
1842 above, it might again be appropriate to offer the user the choice
1843 of display mode.

1845 11.2.3. Three screen Media Consumer configuring the example

1847 This is the most straightforward case - the Media Consumer would
1848 look to identify a set of streams to receive that best matched its
1849 available screens, and so VC0 plus VC1 plus VC2 should match
1850 optimally. The spatial ordering gives sufficient information for
1851 the correct video capture to be shown on the correct screen. The
1852 consumer would either need to divide a single encoding group's
1853 capability by 3 to determine what resolution and frame rate to
1854 configure the provider with, or configure the individual video
1855 captures' encoding groups with whatever makes most sense (taking
1856 into account the receive-side decode capabilities, overall call
1857 bandwidth, the resolution of the screens, plus any user
1858 preferences such as motion vs. sharpness).

1860 12. Acknowledgements

1862 Allyn Romanow and Brian Baldino were authors of early versions.
1863 Mark Gorzyinski contributed much to the approach. We want to
1864 thank Stephen Botzko for helpful discussions on audio.

1866 13. IANA Considerations

1868 TBD

1870 14. Security Considerations

1872 TBD

1874 15. Changes Since Last Version

1876 NOTE TO THE RFC-Editor: Please remove this section prior to
1877 publication as an RFC.

1879 Changes from 09 to 10:

1881 1. Several minor clarifications such as about SDP usage, Media
1882    Captures, Configure message.

1884 2. Simultaneous Set can be expressed in terms of Capture Scene
1885    and Capture Scene Entry.

1887 3. Removed Area of Scene attribute.

1889 4. Add attributes from draft-groves-clue-capture-attr-01.

1891 5. Move some of the Media Capture attribute descriptions back
1892    into this document, but try to leave detailed syntax to the
1893    data model. Remove the OUTSOURCE sections, which are already
1894    incorporated into the data model document.

1896 Changes from 08 to 09:

1898 6. Use "document" instead of "memo".
1900 7. Add basic call flow sequence diagram to introduction.

1902 8. Add definitions for Advertisement and Configure messages.

1904 9. Add definitions for Capture and Provider.

1906 10. Update definition of Capture Scene.

1908 11. Update definition of Individual Encoding.

1910 12. Shorten definition of Media Capture and add key points in
1911     the Media Captures section.

1913 13. Reword a bit about capture scenes in overview.

1915 14. Reword about labeling Media Captures.

1917 15. Remove the Consumer Capability message.

1919 16. New example section heading for media provider behavior.

1921 17. Clarifications in the Capture Scene section.

1923 18. Clarifications in the Simultaneous Transmission Set section.

1925 19. Capitalize defined terms.

1927 20. Move call flow example from introduction to overview section.

1929 21. General editorial cleanup.

1931 22. Add some editors' notes requesting input on issues.

1933 23. Summarize some sections, and propose details be outsourced
1934     to other documents.

1936 Changes from 06 to 07:

1938 1. Ticket #9. Rename Axis of Capture Point attribute to Point
1939    on Line of Capture. Clarify the description of this
1940    attribute.

1942 2. Ticket #17. Add "capture encoding" definition. Use this new
1943    term throughout document as appropriate, replacing some usage
1944    of the terms "stream" and "encoding".

1946 3. Ticket #18. Add Max Capture Encodings media capture
1947    attribute.

1949 4. Add clarification that different capture scene entries are
1950    not necessarily mutually exclusive.

1952 Changes from 05 to 06:

1954 1. Capture scene description attribute is a list of text strings,
1955    each in a different language, rather than just a single string.

1957 2. Add new Axis of Capture Point attribute.

1959 3. Remove appendices A.1 through A.6.

1961 4. Clarify that the provider must use the same coordinate system
1962    with same scale and origin for all coordinates within the same
1963    capture scene.

1965 Changes from 04 to 05:

1967 1. Clarify limitations of "composed" attribute.

1969 2. Add new section "capture scene entry attributes" and add the
1970    attribute "scene-switch-policy".

1972 3. Add capture scene description attribute and description
1973    language attribute.

1975 4. Editorial changes to examples section for consistency with the
1976    rest of the document.

1978 Changes from 03 to 04:

1980 1. Remove sentence from overview - "This constitutes a significant
1981    change ..."

1983 2. Clarify a consumer can choose a subset of captures from a
1984    capture scene entry or a simultaneous set (in section "capture
1985    scene" and "consumer's choice...").

1987 3. Reword first paragraph of Media Capture Attributes section.

1989 4. Clarify a stereo audio capture is different from two mono audio
1990    captures (description of audio channel format attribute).

1992 5. Clarify what it means when coordinate information is not
1993    specified for area of capture, point of capture, area of scene.

1995 6. Change the term "producer" to "provider" to be consistent (it
1996    was just in two places).

1998 7. Change name of "purpose" attribute to "content" and refer to
1999    RFC4796 for values.

2001 8. Clarify simultaneous sets are part of a provider advertisement,
2002    and apply across all capture scenes in the advertisement.

2004 9. Remove sentence about lip-sync between all media captures in a
2005    capture scene.
2007 10. Combine the concepts of "capture scene" and "capture set"
2008     into a single concept, using the term "capture scene" to
2009     replace the previous term "capture set", and eliminating the
2010     original separate capture scene concept.

2012 Informative References

2014 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2015           Requirement Levels", BCP 14, RFC 2119, March 1997.

2017 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
2018           A., Peterson, J., Sparks, R., Handley, M., and E.
2019           Schooler, "SIP: Session Initiation Protocol", RFC 3261,
2020           June 2002.

2023 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
2024           Jacobson, "RTP: A Transport Protocol for Real-Time
2025           Applications", STD 64, RFC 3550, July 2003.

2027 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the
2028           Session Initiation Protocol (SIP)", RFC 4353,
2029           February 2006.

2031 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
2033           January 2008.

2035 16. Authors' Addresses

2037 Mark Duckworth (editor)
2038 Polycom
2039 Andover, MA 01810
2040 USA

2042 Email: mark.duckworth@polycom.com

2044 Andrew Pepperell
2045 Acano
2046 Uxbridge, England
2047 UK

2049 Email: apeppere@gmail.com

2051 Stephan Wenger
2052 Vidyo, Inc.
2053 433 Hackensack Ave.
2054 Hackensack, N.J. 07601
2055 USA

2057 Email: stewe@stewe.org