CLUE WG                                             M. Duckworth, Ed.
Internet Draft                                                Polycom
Intended status: Informational                           A. Pepperell
Expires: November 16, 2013                                      Acano
                                                             S. Wenger
                                                                 Vidyo
                                                         July 15, 2013

             Framework for Telepresence Multi-Streams
                 draft-ietf-clue-framework-11.txt

Abstract

   This document offers a framework for a protocol that enables
   devices in a telepresence conference to interoperate by specifying
   the relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on November 16, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction...................................................3
   2. Terminology....................................................5
   3. Definitions....................................................5
   4. Overview of the Framework/Model................................8
   5. Spatial Relationships.........................................13
   6. Media Captures and Capture Scenes.............................14
      6.1. Media Captures...........................................14
         6.1.1. Media Capture Attributes............................15
      6.2. Capture Scene............................................19
         6.2.1. Capture Scene attributes............................22
         6.2.2. Capture Scene Entry attributes......................22
      6.3. Simultaneous Transmission Set Constraints................24
   7. Encodings.....................................................25
      7.1. Individual Encodings.....................................25
      7.2. Encoding Group...........................................27
   8. Associating Captures with Encoding Groups.....................28
   9. Consumer's Choice of Streams to Receive from the Provider.....29
      9.1. Local preference.........................................31
      9.2. Physical simultaneity restrictions.......................31
      9.3. Encoding and encoding group limits.......................31
   10. Extensibility................................................32
   11. Examples - Using the Framework...............................32
      11.1. Provider Behavior.......................................33
         11.1.1. Three screen Endpoint Provider.....................33
         11.1.2. Encoding Group Example.............................40
         11.1.3. The MCU Case.......................................41
      11.2. Media Consumer Behavior.................................41
         11.2.1. One screen Media Consumer..........................42
         11.2.2. Two screen Media Consumer configuring the example..42
         11.2.3. Three screen Media Consumer configuring the
                 example............................................43
   12. Acknowledgements.............................................43
   13. IANA Considerations..........................................44
   14. Security Considerations......................................44
   15. Changes Since Last Version...................................44
   16. Authors' Addresses...........................................48

1. Introduction

   Current telepresence systems, though based on open standards such
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other.  A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows.  This draft provides a framework for a
   protocol to enable interoperability by handling multiple streams in
   a standardized way.  It is intended to support the use cases
   described in draft-ietf-clue-telepresence-use-cases and to meet the
   requirements in draft-ietf-clue-telepresence-requirements.

   This document conceptually distinguishes between Media Providers
   and Media Consumers.  A Media Provider provides Media in the form
   of RTP packets; a Media Consumer consumes those RTP packets.  Media
   Providers and Media Consumers can reside in Endpoints or in
   middleboxes such as Multipoint Control Units (MCUs).  A Media
   Provider in an Endpoint is usually associated with the generation
   of media for Media Captures; these Media Captures are typically
   sourced from cameras, microphones, and the like.
   Similarly, the Media Consumer in an Endpoint is usually associated
   with Renderers, such as screens and loudspeakers.  In middleboxes,
   Media Providers and Consumers can take the form of outputs and
   inputs, respectively, of RTP mixers, RTP translators, and similar
   devices.  Typically, telepresence devices such as Endpoints and
   middleboxes would perform as both Media Providers and Media
   Consumers, the former being concerned with those devices'
   transmitted media and the latter with those devices' received
   media.  In a few circumstances, a CLUE Endpoint or middlebox may
   include only Consumer or Provider functionality, such as
   recorder-type Consumers or webcam-type Providers.

   Motivations for this document (and, in fact, for the existence of
   the CLUE protocol) include:

   (1) Endpoints according to this document can, and usually do, have
   multiple Media Captures and Media Renderers, for example, multiple
   cameras and screens.  While previous system designs were able to
   set up calls that would light up all screens and cameras (or
   equivalent), what was missing was a mechanism that can associate
   the Media Captures with each other in space and time.

   (2) The mere fact that there are multiple capture and rendering
   devices, each of which may be configurable in aspects such as zoom,
   leads to the difficulty that a variable number of such devices can
   be used to capture different aspects of a region.  The Capture
   Scene concept allows for the description of multiple setups for
   those multiple capture devices that could represent sensible
   operating points of the physical capture devices in a room, chosen
   by the operator.  A Consumer can pick and choose from those
   configurations based on its rendering abilities and inform the
   Provider about its choices.  Details are provided in section 6.

   (3) In some cases, physical limitations or other reasons disallow
   the concurrent use of a device in more than one setup.  For
   example, the center camera in a typical three-camera conference
   room can set its zoom objective either to capture only the middle
   few seats or all seats of a room, but not both concurrently.  The
   Simultaneous Transmission Set concept allows a Provider to signal
   such limitations.  Simultaneous Transmission Sets are part of the
   Capture Scene description, and are discussed in section 6.3.

   (4) Often, the devices in a room do not have the computational
   complexity or connectivity to deal with multiple encoding options
   simultaneously, even if each of these options may be sensible in
   certain environments, and even if the simultaneous transmission may
   also be sensible (e.g. in the case of multicast media distribution
   to multiple endpoints).  Such constraints can be expressed by the
   Provider using the Encoding Group concept, described in section 7.

   (5) Due to the potentially large number of RTP flows required for a
   Multimedia Conference involving potentially many Endpoints, each of
   which can have many Media Captures and Media Renderers, a sensible
   system design is to multiplex multiple RTP media flows onto the
   same transport address, so as to avoid using the port number as a
   multiplexing point and the associated shortcomings such as
   NAT/firewall traversal.
   While the actual mapping of those RTP flows to the header fields of
   the RTP packets is not the subject of this specification, the large
   number of possible permutations of sensible options a Media
   Provider may make available to a Media Consumer makes it desirable
   to have a mechanism that narrows down the number of possible
   options that a SIP offer-answer exchange has to consider.  Such
   information is made available using protocol mechanisms specified
   in this document and companion documents, although it should be
   stressed that its use in an implementation is optional.  Also,
   there are aspects of the control of both Endpoints and
   middleboxes/MCUs that dynamically change during the progress of a
   call, such as audio-level based screen switching, layout changes,
   and so on, which need to be conveyed.  Note that these control
   aspects are complementary to those specified in traditional SIP
   based conference management such as BFCP.  An exemplary call flow
   can be found in section 4.

   Finally, all this information needs to be conveyed, and the notion
   of support for it needs to be established.  This is done by the
   negotiation of a "CLUE channel", a data channel negotiated early
   during the initiation of a call.  An Endpoint or MCU that rejects
   the establishment of this data channel is, by definition, not
   supporting CLUE based mechanisms, whereas an Endpoint or MCU that
   accepts it is required to use it to the extent specified in this
   document and its companion documents.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3. Definitions

   The terms defined below are used throughout this document and
   companion documents, and they are normative.  In order to easily
   identify the use of a defined term, those terms are capitalized.

   Advertisement: a CLUE message a Media Provider sends to a Media
   Consumer describing specific aspects of the content of the media,
   the formatting of the media streams it can send, and any
   restrictions it has in terms of being able to provide certain
   Streams simultaneously.

   Audio Capture: Media Capture for audio.  Denoted as ACn in the
   example cases in this document.

   Camera-Left and Right: For Media Captures, camera-left and camera-
   right are from the point of view of a person observing the rendered
   media.  They are the opposite of Stage-Left and Stage-Right.

   Capture: Same as Media Capture.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.

   Capture Encoding: A specific encoding of a Media Capture, to be
   sent by a Media Provider to a Media Consumer via RTP.

   Capture Scene: a structure representing a spatial region containing
   one or more Capture Devices, each capturing media representing a
   portion of the region.  The spatial region represented by a Capture
   Scene may or may not correspond to a real region in physical space,
   such as a room.  A Capture Scene includes attributes and one or
   more Capture Scene Entries, with each entry including one or more
   Media Captures.

   Capture Scene Entry: a list of Media Captures of the same media
   type that together form one way to represent the entire Capture
   Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Configure Message: A CLUE message a Media Consumer sends to a Media
   Provider specifying which content and media streams it wants to
   receive, based on the information in a corresponding Advertisement
   message.

   Consumer: short for Media Consumer.

   Encoding or Individual Encoding: a set of parameters representing a
   way to encode a Media Capture to become a Capture Encoding.

   Encoding Group: A set of encoding parameters representing a total
   media encoding capability to be sub-divided across potentially
   multiple Individual Encodings.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An Endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  Endpoints can be anything from
   multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras.  Going
   towards the back, you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353]-like Mixer, without the
   [RFC4353] requirement to send media to each participant.

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   Media Capture: a source of Media, such as from one or more Capture
   Devices or constructed from other Media streams.

   Media Consumer: an Endpoint or middlebox that receives Media
   streams.

   Media Provider: an Endpoint or middlebox that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Provider: Same as Media Provider.

   Render: the process of generating a representation from media, such
   as displayed motion video or sound emitted from loudspeakers.

   Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also
   Camera-Left and Right.

   Stage-Left and Right: For Media Captures, Stage-left and Stage-
   right are the opposite of Camera-left and Camera-right.  For the
   case of a person facing (and captured by) a camera, Stage-left and
   Stage-right are from the point of view of that person.

   Stream: a Capture Encoding sent from a Media Provider to a Media
   Consumer via RTP [RFC3550].

   Stream Characteristics: the media stream attributes commonly used
   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
   resolution, profile/level etc.) as well as CLUE specific
   attributes, such as the Capture ID or a spatial location.

   Video Capture: Media Capture for video.  Denoted as VCn in the
   example cases in this document.

   Video Composite: A single image that is formed, normally by an RTP
   mixer inside an MCU, by combining visual elements from separate
   sources.

4. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   A Media Provider (transmitting Endpoint or MCU) describes specific
   aspects of the content of the media and the formatting of the media
   streams it can send in an Advertisement; and the Media Consumer
   responds to the Media Provider by specifying which content and
   media streams it wants to receive in a Configure message.  The
   Provider then transmits the asked-for content in the specified
   streams.

   This Advertisement and Configure exchange occurs at a minimum
   during call initiation, but may also happen at any time throughout
   the call, whenever there is a change in what the Consumer wants to
   receive or (perhaps less commonly) in what the Provider can send.

   An Endpoint or MCU typically acts as both Provider and Consumer at
   the same time, sending Advertisements and sending Configure
   messages in response to receiving Advertisements.  (It is possible
   to be just one or the other.)

   The data model is based around two main concepts: a Capture and an
   Encoding.  A Media Capture (MC), such as audio or video, describes
   the content a Provider can send.  Media Captures are described in
   terms of CLUE-defined attributes, such as spatial relationships and
   the purpose of the capture.  Providers tell Consumers which Media
   Captures they can provide, described in terms of the Media Capture
   attributes.

   A Provider organizes its Media Captures into one or more Capture
   Scenes, each representing a spatial region, such as a room.  A
   Consumer chooses which Media Captures it wants to receive from each
   Capture Scene.

   In addition, the Provider can send the Consumer a description of
   the Individual Encodings it can send, in terms of the media
   attributes of the Encodings, in particular audio and video
   parameters such as bandwidth, frame rate, and macroblocks per
   second.  Note that this is optional, and intended to minimize the
   number of options a later SDP offer-answer exchange would need to
   include in the SDP in case of complex setups, as should become
   clearer shortly when discussing an outline of the call flow.

   The Provider can also specify constraints on its ability to provide
   Media, and a sensible design choice for a Consumer is to take these
   into account when choosing the content and Capture Encodings it
   requests in the later offer-answer exchange.  Some constraints are
   due to the physical limitations of devices - for example, a camera
   may not be able to provide zoom and non-zoom views simultaneously.
   Other constraints are system based, such as maximum bandwidth and
   maximum macroblocks/second.

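   Purely as an illustration of this data model (the concrete CLUE
   message syntax is specified in companion documents; the class and
   field names below are invented for this sketch), the concepts so
   far might be written in Python as:

      # Illustrative sketch only; the normative CLUE data model and
      # message syntax are defined in companion documents.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class MediaCapture:
          capture_id: str                  # e.g. "VC0" or "AC0"
          media_type: str                  # "audio" or "video"
          attributes: Dict[str, object] = field(default_factory=dict)
          encoding_group_id: str = ""      # see sections 7 and 8

      @dataclass
      class CaptureSceneEntry:
          capture_ids: List[str]           # captures of one media type

      @dataclass
      class CaptureScene:
          entries: List[CaptureSceneEntry]
          attributes: Dict[str, object] = field(default_factory=dict)

      @dataclass
      class Advertisement:                 # Provider -> Consumer
          scenes: List[CaptureScene]       # plus simultaneous sets and
                                           # encoding groups (6.3, 7)

      @dataclass
      class Configure:                     # Consumer -> Provider
          capture_encodings: List[Dict[str, object]]  # chosen streams
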
   A very brief outline of the call flow used by a simple system (two
   Endpoints) in compliance with this document can be described as
   follows, and as shown in the following figure.

      +-----------+                     +-----------+
      | Endpoint1 |                     | Endpoint2 |
      +----+------+                     +-----+-----+
           | INVITE (BASIC SDP+CLUECHANNEL)   |
           |--------------------------------->|
           |    200 OK (BASIC SDP+CLUECHANNEL)|
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |     BASIC SDP MEDIA SESSION      |
           |<################################>|
           |                                  |
           |   CONNECT (CLUE CTRL CHANNEL)    |
           |=================================>|
           |              ...                 |
           |<================================>|
           |  CLUE CTRL CHANNEL ESTABLISHED   |
           |<================================>|
           |                                  |
           |         ADVERTISEMENT 1          |
           |*********************************>|
           |         ADVERTISEMENT 2          |
           |<*********************************|
           |                                  |
           |           CONFIGURE 1            |
           |<*********************************|
           |           CONFIGURE 2            |
           |*********************************>|
           |                                  |
           |     REINVITE (UPDATED SDP)       |
           |--------------------------------->|
           |              200 OK (UPDATED SDP)|
           |<---------------------------------|
           | ACK                              |
           |--------------------------------->|
           |                                  |
           |<################################>|
           |    UPDATED SDP MEDIA SESSION     |
           |<################################>|
           |                                  |
           v                                  v

   An initial offer/answer exchange establishes a basic media session,
   for example audio-only, and a CLUE channel between two Endpoints.
   With the establishment of that channel, the endpoints have
   consented to use the CLUE protocol mechanisms and have to adhere to
   them.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities by sending an Advertisement as
   specified herein (which will typically not be sufficient to set up
   all media).  The Consumer in the Endpoint receives the information
   provided by the Provider, and can use it for two purposes.  First,
   it constructs and sends a CLUE Configure message to tell the
   Provider what the Consumer wishes to receive.  Second, it can, but
   is not necessarily required to, use the information provided to
   tailor the SDP it is going to send during the following SIP
   offer/answer exchange, and its reaction to the SDP it receives in
   that step.  It is often a sensible implementation choice to do so,
   as the representation of the media information conveyed over the
   CLUE channel can dramatically cut down on the size of the SDP
   messages used in the O/A exchange that follows.  Spatial
   relationships associated with the Media can be included in the
   Advertisement, and it is often sensible for the Media Consumer to
   take those spatial relationships into account when tailoring the
   SDP.

   This CLUE exchange is followed by an SDP offer/answer exchange that
   not only establishes those aspects of the media that have not been
   "negotiated" over CLUE, but also has the side effect of setting up
   the media transmission itself, potentially involving security
   exchanges, ICE, and whatnot.  This step is plain vanilla SIP, with
   the exception that the SDP used herein can in most cases (but need
   not) be considerably smaller than the SDP a system would typically
   need to exchange if there were no pre-established knowledge about
   the Provider and Consumer characteristics.  (The need for cutting
   down SDP size may not be obvious for a point-to-point call
   involving simple endpoints; however, when considering a large
   multipoint conference involving many multi-screen/multi-camera
   endpoints, each of which can operate using multiple codecs for each
   camera and microphone, it becomes perhaps somewhat more intuitive.)

   During the lifetime of a call, further exchanges can occur over the
   CLUE channel.  In some cases, those further exchanges can lead to a
   modified system behavior of Provider or Consumer (or both) without
   any other protocol activity such as further offer/answer exchanges.
   For example, voice-activated screen switching, signaled over the
   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
   re-invites.  However, in other cases, an additional offer/answer
   exchange may become necessary after the CLUE negotiation.  For
   example, if both sides decide to upgrade the call from a single
   screen to a multi-screen call and more bandwidth is required for
   the additional video channels, that could require a new O/A
   exchange.

   Numerous optimizations may be possible, and are the implementer's
   choice.  For example, it may be sensible to establish one or more
   initial media channels during the initial offer/answer exchange,
   which would allow, for example, for a fast startup of audio.
   Depending on the system design, it may be possible to re-use this
   established channel for more advanced media negotiated only by CLUE
   mechanisms, thereby avoiding further offer/answer exchanges.

   Edt. note: The editors are not sure whether the mentioned
   overloading of established RTP channels using only CLUE messages is
   possible, or desired by the WG.  If it were, certainly there is
   need for specification work.  One possible issue: a Provider that
   thinks it can switch, say, an audio codec algorithm by CLUE only,
   talks to a Consumer that thinks it has to faithfully answer the
   Provider's Advertisement through a Configure, but does not dare set
   up its internal resources until such time as it has completed the
   authoritative O/A exchange.  Working group input is solicited.

   One aspect of the protocol outlined herein, and specified in
   normative detail in companion documents, is that it makes
   information regarding the Provider's capabilities to deliver Media,
   and attributes related to that Media such as their spatial
   relationship, available to the Consumer.  The operation of the
   Renderer inside the Consumer is unspecified in that it can choose
   to ignore some information provided by the Provider, and/or not
   render media streams available from the Provider (although it has
   to follow the CLUE protocol and, therefore, has to gracefully
   receive and respond (through a Configure) to the Provider's
   information).  All CLUE protocol mechanisms are optional in the
   Consumer in the sense that, while the Consumer must be able to
   receive (and, potentially, gracefully acknowledge) CLUE messages,
   it is free to ignore the information provided therein.  Obviously,
   this is not a particularly sensible design choice.

   Legacy devices are defined herein as those Endpoints and MCUs that
   do not support the setup and use of the CLUE channel.  The notion
   of a device being a legacy device is established during the initial
   offer/answer exchange, in which the legacy device will not
   understand the offer for the CLUE channel and, therefore, reject
   it.  This is the indication for the CLUE-implementing Endpoint or
   MCU that the other side of the communication is not compliant with
   CLUE, and that it should fall back to whatever mechanism was used
   before the introduction of CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP transported) media;
   and the mechanisms described herein and in companion documents do
   not change the aspects of setting up those RTP flows and sessions.
   In other words, the RTP media sessions conform to the negotiated
   SDP whether or not CLUE is used.  However, it should be noted that
   forms of RTP multiplexing of multiple RTP flows onto the same
   transport address are being developed concurrently with the CLUE
   suite of specifications, and it is widely expected that most, if
   not all, Endpoints or MCUs supporting CLUE will also support those
   mechanisms.  Some design choices made in this document reflect this
   coincidence in spec development timing.

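   As a minimal sketch of the legacy determination described above
   (the helper and field names are hypothetical placeholders for an
   implementation's SDP machinery, not part of CLUE):

      # Hypothetical sketch; offer_answer() and its result are
      # placeholders for an implementation's SDP offer/answer logic.
      def negotiate(offer_answer):
          answer = offer_answer(offer_clue_data_channel=True)
          if answer.clue_channel_accepted:
              # Peer consented to CLUE: proceed with the
              # Advertisement/Configure exchange over the CLUE channel.
              return "clue"
          # Peer rejected the CLUE data channel: by definition a
          # legacy device, so fall back to pre-CLUE behavior.
          return "legacy"
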
5. Spatial Relationships

   In order for a Consumer to perform a proper rendering, it is often
   necessary, or at least helpful, for the Consumer to have received
   spatial information about the streams it is receiving.  CLUE
   defines a coordinate system that allows Media Providers to describe
   the spatial relationships of their Media Captures to enable proper
   scaling and spatially sensible rendering of their streams.  The
   coordinate system is based on a few principles:

   o  Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o  Coordinates can either be in real, physical units (millimeters),
      have an unknown scale, or have no physical scale.  Systems which
      know their physical dimensions (for example professionally
      installed Telepresence room systems) should always provide those
      real-world measurements.  Systems which don't know specific
      physical dimensions but still know relative distances should use
      'unknown scale'.  'No scale' is intended to be used where Media
      Captures from different devices (with potentially different
      scales) will be forwarded alongside one another (e.g. in the
      case of a middlebox).

      *  "Millimeters" means the scale is in millimeters.

      *  "Unknown" means the scale is not necessarily millimeters, but
         the scale is the same for every Capture in the Capture Scene.

      *  "No Scale" means the scale could be different for each
         capture - an MCU provider that advertises two adjacent
         captures and picks sources (which can change quickly) from
         different endpoints might use this value; the scale could be
         different and changing for each capture.  But the areas of
         capture still represent a spatial relation between captures.

   o  The coordinate system is Cartesian X, Y, Z with the origin at a
      spatial location of the provider's choosing.  The Provider must
      use the same coordinate system, with the same scale and origin,
      for all coordinates within the same Capture Scene.

   The direction of increasing coordinate values is:
      X increases from Camera-Left to Camera-Right
      Y increases from Front to back
      Z increases from low to high

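   As an illustration only (the type and constant names below are
   invented for this sketch, not CLUE syntax), the conventions above
   might look like this in code:

      # Sketch of the CLUE coordinate conventions; names invented.
      from dataclasses import dataclass

      MILLIMETERS, UNKNOWN_SCALE, NO_SCALE = "mm", "unknown", "noscale"

      @dataclass
      class Point:
          x: float  # increases Camera-Left -> Camera-Right
          y: float  # increases Front -> back
          z: float  # increases low -> high

      def directly_comparable(same_capture_scene: bool) -> bool:
          # One coordinate system, scale and origin per Capture Scene:
          # only coordinates from the same scene can be compared
          # directly; across scenes no common origin may be assumed.
          return same_capture_scene
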
6. Media Captures and Capture Scenes

   This section describes how Providers can describe the content of
   media to Consumers.

6.1. Media Captures

   Media Captures are the fundamental representations of streams that
   a device can transmit.  What a Media Capture actually represents is
   flexible:

   o  It can represent the immediate output of a physical source (e.g.
      camera, microphone) or 'synthetic' source (e.g. laptop computer,
      DVD player).

   o  It can represent the output of an audio mixer or video composer.

   o  It can represent a concept such as 'the loudest speaker'.

   o  It can represent a conceptual position such as 'the leftmost
      stream'.

   To identify and distinguish between multiple instances, video and
   audio captures are labeled.  For instance: VC1, VC2 and AC1, AC2,
   where VC1 and VC2 refer to two different video captures and AC1
   and AC2 refer to two different audio captures.

   Some key points about Media Captures:

   .  A Media Capture is of a single media type (e.g. audio or video).
   .  A Media Capture is associated with exactly one Capture Scene.
   .  A Media Capture is associated with one or more Capture Scene
      Entries.
   .  A Media Capture has exactly one set of spatial information.
   .  A Media Capture may be the source of one or more Capture
      Encodings.

   Each Media Capture can be associated with attributes to describe
   what it represents.

6.1.1. Media Capture Attributes

   Media Capture Attributes describe information about the Captures.
   A Provider can use the Media Capture Attributes to describe the
   Captures for the benefit of the Consumer in the Advertisement
   message.  Media Capture Attributes include:

   .  spatial information, such as point of capture, point on line
      of capture, and area of capture, all of which, in combination,
      define the capture field of, for example, a camera;
   .  capture multiplexing information (composed/switched video,
      mono/stereo audio, maximum number of simultaneous encodings
      per Capture and so on);
   .  other descriptive information to help the Consumer choose
      between captures (description, presentation, view, priority,
      language, role); and
   .  control information for use inside the CLUE protocol suite.

   Point of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes the spatial location of the capturing device (such as a
   camera).

   Point on Line of Capture:

   A field with a single Cartesian (X, Y, Z) point value which
   describes a position in space of a second point on the axis of the
   capturing device; the first point being the Point of Capture (see
   above).

   Together, the Point of Capture and Point on Line of Capture define
   an axis of the capturing device, for example the optical axis of a
   camera.  The Media Consumer can use this information to adjust how
   it renders the received media if it so chooses.

   Area of Capture:

   A field with a set of four (X, Y, Z) points as a value which
   describes the spatial location of what is being "captured".  By
   comparing the Area of Capture for different Media Captures within
   the same Capture Scene, a consumer can determine the spatial
   relationships between them and render them correctly.

   The four points should be co-planar, forming a quadrilateral, which
   defines the Plane of Interest for the particular media capture.

   If the Area of Capture is not specified, it means the Media Capture
   is not spatially related to any other Media Capture.

   For a switched capture that switches between different sections
   within a larger area, the area of capture should use coordinates
   for the larger potential area.

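   To make these fields concrete, here is a small illustrative sketch
   (helper names invented, not CLUE syntax) of how a Consumer might
   use them:

      # Illustrative only; helper names are invented for this sketch.
      def capture_axis(point_of_capture, point_on_line_of_capture):
          # Direction of the device axis (e.g. a camera's optical
          # axis): the vector from the first point through the second.
          (x0, y0, z0) = point_of_capture
          (x1, y1, z1) = point_on_line_of_capture
          return (x1 - x0, y1 - y0, z1 - z0)

      def centroid(area_of_capture):
          # area_of_capture: four roughly co-planar (x, y, z) corners.
          xs, ys, zs = zip(*area_of_capture)
          return (sum(xs) / 4.0, sum(ys) / 4.0, sum(zs) / 4.0)

      def camera_left_to_right(captures):
          # Order captures of one Capture Scene by centroid X (camera-
          # left has the smallest X), e.g. to lay them out on a row of
          # screens in a spatially sensible way.
          return sorted(captures, key=lambda c: centroid(c["area"])[0])
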
   Mobility of Capture:

   This attribute indicates whether the point of capture, point on
   line of capture, and area of capture values will stay the same, or
   are expected to change frequently.  Possible values are static,
   dynamic, and highly dynamic.

   For example, a camera may be placed at different positions in order
   to provide the best angle to capture a work task, or the capture
   may come from a camera worn by a participant.  This would have the
   effect of changing the capture point, capture axis and area of
   capture.  In order that the Consumer can choose to render the
   capture appropriately, the Provider can include this attribute to
   indicate whether or not the camera location is dynamic.

   The capture point of a static capture does not move for the life of
   the conference.  The capture point of a dynamic capture is
   characterised by a change in position followed by a reasonable
   period of stability.  A highly dynamic capture is characterised by
   a capture point that is constantly moving.  If the "area of
   capture", "capture point" and "line of capture" attributes are
   included with dynamic or highly dynamic captures, they indicate the
   spatial information at the time of the Advertisement.  No
   information regarding future spatial information should be assumed.

   Composed:

   A boolean field which indicates whether or not the Media Capture is
   a mix (audio) or composition (video) of streams.

   This attribute is useful for a media consumer to avoid nesting a
   composed video capture into another composed capture or rendering.
   This attribute is not intended to describe the layout a media
   provider uses when composing video streams.

   Switched:

   A boolean field which indicates whether or not the Media Capture
   represents the (dynamic) most appropriate subset of a 'whole'.
   What is 'most appropriate' is up to the provider and could be the
   active speaker, a lecturer or a VIP.

   Audio Channel Format:

   A field with enumerated values which describes the method of
   encoding used for audio.  A value of 'mono' means the Audio Capture
   has one channel.  'Stereo' means the Audio Capture has two audio
   channels, left and right.

   This attribute applies only to Audio Captures.  A single stereo
   capture is different from two mono captures that have a left-right
   spatial relationship.  A stereo capture maps to a single Capture
   Encoding, while each mono audio capture maps to a separate Capture
   Encoding.

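   A minimal sketch of that rule (data layout invented for the
   example):

      # Sketch; layout invented.  Each Audio Capture maps to exactly
      # one Capture Encoding, regardless of its channel count.
      stereo_room = [{"id": "AC0", "channel_format": "stereo"}]
      mono_pair = [{"id": "AC1", "channel_format": "mono"},   # left
                   {"id": "AC2", "channel_format": "mono"}]   # right

      def capture_encodings_needed(audio_captures):
          return len(audio_captures)  # one Capture Encoding each

      assert capture_encodings_needed(stereo_room) == 1
      assert capture_encodings_needed(mono_pair) == 2
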
   Max Capture Encodings:

   An optional attribute indicating the maximum number of Capture
   Encodings that can be simultaneously active for the Media Capture.
   The number of simultaneous Capture Encodings is also limited by the
   restrictions of the Encoding Group for the Media Capture.

   Description:

   Human-readable description of the Capture, which could be in
   multiple languages.

   Presentation:

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data etc.  Where more information is known about the capture, it
   may be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

   View:

   A field with enumerated values, indicating what type of view the
   capture relates to.  The Consumer can use this information to help
   choose which Media Captures it wishes to receive.  The value can be
   one of:

   Room - Captures the entire scene

   Table - Captures the conference table with seated participants

   Individual - Captures an individual participant

   Lectern - Captures the region of the lectern including the
   presenter in a classroom style conference

   Audience - Captures a region showing the audience in a classroom
   style conference

   Language:

   This attribute indicates one or more languages used in the content
   of the media capture.  Captures may be offered in different
   languages in the case of multilingual and/or accessible
   conferences, so a Consumer can use this attribute to differentiate
   between them.

   This indicates which language is associated with the capture.  For
   example, it may provide a language associated with an audio capture
   or a language associated with a video capture when sign
   interpretation or text is used.

   Role:

   Edt. Note -- this is a placeholder for a role attribute, as
   discussed in draft-groves-clue-capture-attr.  We expect to continue
   discussing the role attribute in the context of that draft, and
   follow-on drafts, before adding it to this framework document.

   Priority:

   This attribute indicates a relative priority between different
   Media Captures.  The Provider sets this priority, and the Consumer
   may use the priority to help decide which captures it wishes to
   receive.

   The "priority" attribute is an integer which indicates a relative
   priority between captures.  For example, it is possible to assign a
   priority between two presentation captures that would allow a
   remote endpoint to determine which presentation is more important.
   Priority is assigned at the individual capture level.  It
   represents the Provider's view of the relative priority between
   captures with a priority.  The same priority number may be used
   across multiple captures, indicating that they are equally
   important.  If no priority is assigned, no assumptions regarding
   the relative importance of the capture can be made.

   Embedded Text:

   This attribute indicates that a capture provides embedded textual
   information.  For example, the video capture may contain speech-to-
   text information composed with the video image.  This attribute is
   only applicable to video captures and presentation streams with
   visual information.

   Related To:

   This attribute indicates that the capture contains additional
   complementary information related to another capture.  The value
   indicates the other capture to which this capture is providing
   additional information.

   For example, a conference can utilise translators or facilitators
   that provide an additional audio stream (i.e. a translation or
   description or commentary of the conference).  Where multiple
   captures are available, it may be advantageous for a Consumer to
   select a complementary capture instead of, or in addition to, the
   capture it relates to.

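   As an illustration (data layout and helper name invented for this
   sketch) of how a Consumer might use these descriptive attributes
   when deciding what to request:

      # Illustrative Consumer-side selection; layout invented.
      def pick_presentation(captures, preferred_language="en"):
          candidates = [c for c in captures
                        if c.get("presentation")
                        and preferred_language in
                            c.get("language", [preferred_language])]
          # "priority" is a relative integer set by the Provider; this
          # sketch assumes a larger value means a more important
          # capture, and ranks captures without a priority lowest.
          return max(candidates,
                     key=lambda c: c.get("priority", 0),
                     default=None)
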
6.2. Capture Scene

   In order for a Provider's individual Captures to be used
   effectively by a Consumer, the Provider organizes the Captures into
   one or more Capture Scenes, with the structure and contents of
   these Capture Scenes being sent from the Provider to the Consumer
   in the Advertisement.

   A Capture Scene is a structure representing a spatial region
   containing one or more Capture Devices, each capturing media
   representing a portion of the region.  A Capture Scene includes one
   or more Capture Scene Entries, with each entry including one or
   more Media Captures.  A Capture Scene represents, for example, the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs in the Capture Scene Entries.  A middlebox
   may also express Capture Scenes that it constructs from media
   Streams it receives.

   A Provider may advertise multiple Capture Scenes or just a single
   Capture Scene.  What constitutes an entire Capture Scene is up to
   the Provider.  A Provider might typically use one Capture Scene for
   participant media (live video from the room cameras) and another
   Capture Scene for a computer generated presentation.  In more
   complex systems, the use of additional Capture Scenes is also
   sensible.  For example, a classroom may advertise two Capture
   Scenes involving live video, one including only the camera
   capturing the instructor (and associated audio), the other
   including camera(s) capturing students (and associated audio).

   A Capture Scene may (and typically will) include more than one type
   of media.  For example, a Capture Scene can include several Capture
   Scene Entries for Video Captures, and several Capture Scene Entries
   for Audio Captures.  A particular Capture may be included in more
   than one Capture Scene Entry.

   A Provider can express spatial relationships between Captures that
   are included in the same Capture Scene.  However, there is not
   necessarily the same spatial relationship between Media Captures
   that are in different Capture Scenes.  In other words, Capture
   Scenes can each use their own spatial measurement system, as
   outlined above in section 5.

   A Provider arranges Captures in a Capture Scene to help the
   Consumer choose which captures it wants.  The Capture Scene Entries
   in a Capture Scene are different alternatives the Provider is
   suggesting for representing the Capture Scene.  The order of
   Capture Scene Entries within a Capture Scene has no significance.
   The Media Consumer can choose to receive all Media Captures from
   one Capture Scene Entry for each media type (e.g. audio and video),
   or it can pick and choose Media Captures regardless of how the
   Provider arranges them in Capture Scene Entries.  Different Capture
   Scene Entries of the same media type are not necessarily mutually
   exclusive alternatives.  Also note that the presence of multiple
   Capture Scene Entries (with potentially multiple encoding options
   in each entry) in a given Capture Scene does not necessarily imply
   that a Provider is able to serve all the associated media
   simultaneously (although the construction of such an over-rich
   Capture Scene is probably not sensible in many cases).  What a
   Provider can send simultaneously is determined through the
   Simultaneous Transmission Set mechanism, described in section 6.3.

   Captures within the same Capture Scene Entry must be of the same
   media type - it is not possible to mix audio and video captures in
   the same Capture Scene Entry, for instance.  The Provider must be
   capable of encoding and sending all Captures in a single Capture
   Scene Entry simultaneously.  The order of Captures within a Capture
   Scene Entry has no significance.
   A Consumer may decide to receive all the Captures in a single
   Capture Scene Entry, but a Consumer could also decide to receive
   just a subset of those captures.  A Consumer can also decide to
   receive Captures from different Capture Scene Entries, all subject
   to the constraints set by Simultaneous Transmission Sets, as
   discussed in section 6.3.

   When a Provider advertises a Capture Scene with multiple entries,
   it is essentially signaling that there are multiple representations
   of the same Capture Scene available.  In some cases, these multiple
   representations would typically be used simultaneously (for
   instance a "video entry" and an "audio entry").  In some cases the
   entries would conceptually be alternatives (for instance an entry
   consisting of three Video Captures covering the whole room versus
   an entry consisting of just a single Video Capture covering only
   the center of a room).  In this latter example, one sensible choice
   for a Consumer would be to indicate (through its Configure and
   possibly through an additional offer/answer exchange) the Captures
   of that Capture Scene Entry that most closely matched the
   Consumer's number of display devices or screen layout.

   The following is an example of 4 potential Capture Scene Entries
   for an endpoint-style Provider:

   1. (VC0, VC1, VC2) - left, center and right camera Video Captures

   2. (VC3) - Video Capture associated with loudest room segment

   3. (VC4) - Video Capture zoomed out view of all people in the room

   4. (AC0) - main audio

   The first entry in this Capture Scene example is a list of Video
   Captures which have a spatial relationship to each other.
   Determination of the order of these captures (VC0, VC1 and VC2) for
   rendering purposes is accomplished through use of their Area of
   Capture attributes.  The second entry (VC3) and the third entry
   (VC4) are alternative representations of the same room's video,
   which might be better suited to some Consumers' rendering
   capabilities.  The inclusion of the Audio Capture in the same
   Capture Scene indicates that AC0 is associated with all of those
   Video Captures, meaning it comes from the same spatial region.
   Therefore, if audio were to be rendered at all, this audio would be
   the correct choice irrespective of which Video Captures were
   chosen.

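   Written out as data (layout invented for this sketch), the example
   scene and one plausible Consumer policy for choosing a video entry
   might look like:

      # The four example entries above, as illustrative data.
      scene_entries = {
          "video": [["VC0", "VC1", "VC2"],  # left, center, right
                    ["VC3"],                # loudest room segment
                    ["VC4"]],               # zoomed-out view
          "audio": [["AC0"]],               # main audio
      }

      def pick_video_entry(entries, num_screens):
          # One sensible policy: the entry whose number of captures
          # best matches the Consumer's number of display devices.
          return min(entries, key=lambda e: abs(len(e) - num_screens))

      assert pick_video_entry(scene_entries["video"], 3) \
          == ["VC0", "VC1", "VC2"]
      assert pick_video_entry(scene_entries["video"], 1) == ["VC3"]
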
6.2.1. Capture Scene attributes

   Attributes can be applied to Capture Scenes as well as to
   individual Media Captures.  Attributes specified at this level
   apply to all constituent Captures.  Capture Scene attributes
   include:

   .  Human-readable description of the Capture Scene, which could
      be in multiple languages;
   .  Scale information (millimeters, unknown, no scale), as
      described in section 5.

6.2.2. Capture Scene Entry attributes

   A Capture Scene can include one or more Capture Scene Entries in
   addition to the Capture Scene wide attributes described above.
   Capture Scene Entry attributes apply to the Capture Scene Entry as
   a whole, i.e. to all Captures that are part of the Capture Scene
   Entry.

   Capture Scene Entry attributes include:

   .  Human-readable description of the Capture Scene Entry, which
      could be in multiple languages;
   .  Scene-switch-policy: {site-switch, segment-switch}

   A media provider uses this scene-switch-policy attribute to
   indicate its support for different switching policies.  In the
   provider's Advertisement, this attribute can have multiple values,
   which means the provider supports each of the indicated policies.
   The consumer, when it requests media captures from this Capture
   Scene Entry, should also include this attribute, but with only the
   single value (from among the values indicated by the provider)
   indicating the Consumer's choice of which policy it wants the
   provider to use.  The Consumer must choose the same value for all
   the Media Captures in the Capture Scene Entry.  If the provider
   does not support any of these policies, it should omit this
   attribute.

   The "site-switch" policy means all captures are switched at the
   same time, to keep captures from the same endpoint site together.
   Let's say the speaker is at site A and everyone else is at a
   "remote" site.

   When the room at site A is shown, all the camera images from site A
   are forwarded to the remote sites.  Therefore, at each receiving
   remote site, all the screens display camera images from site A.
   This can be used to preserve full size image display, and also to
   provide the full visual context of the displayed far end, site A.
   In site switching, there is a fixed relation between the cameras in
   each room and the displays in remote rooms.  The room or
   participants being shown is switched from time to time based on who
   is speaking or by manual control.

   The "segment-switch" policy means different captures can switch at
   different times, and can be coming from different endpoints.  Still
   using site A as where the speaker is, and "remote" to refer to all
   the other sites, in segment switching, rather than sending all the
   images from site A, only the image containing the speaker at site A
   is shown.  The camera images of the current speaker and previous
   speakers (if any) are forwarded to the other sites in the
   conference.

   Therefore, the screens in each site are usually displaying images
   from different remote sites - the current speaker at site A and the
   previous ones.  This strategy can be used to preserve full size
   image display, and also to capture the non-verbal communication
   between the speakers.  In segment switching, the display depends on
   the activity in the remote rooms - generally, but not necessarily,
   based on audio / speech detection.

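   A small sketch (helper name invented) of the negotiation rule
   above, in which the Consumer echoes exactly one of the policies the
   Provider advertised:

      # Sketch of the scene-switch-policy rule; helper name invented.
      def choose_switch_policy(advertised, preferred="segment-switch"):
          # The Consumer picks a single value from the Provider's list
          # and must use it for all Media Captures in the Capture
          # Scene Entry.
          if not advertised:
              return None  # Provider omitted the attribute
          return preferred if preferred in advertised else advertised[0]

      assert choose_switch_policy(["site-switch", "segment-switch"]) \
          == "segment-switch"
      assert choose_switch_policy(["site-switch"]) == "site-switch"
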
6.3. Simultaneous Transmission Set Constraints

   The Provider may have constraints or limitations on its ability to
   send Captures.  One type is caused by the physical limitations of
   capture mechanisms; these constraints are represented by a
   Simultaneous Transmission Set.  The second type of limitation
   reflects the encoding resources available - bandwidth and
   macroblocks/second.  This type of constraint is captured by
   Encoding Groups, discussed below.

   Some Endpoints or MCUs can send multiple Captures simultaneously;
   however, sometimes there are constraints that limit which Captures
   can be sent simultaneously with other Captures.  A device may not
   be able to be used in different ways at the same time.  Provider
   Advertisements are made so that the Consumer can choose one of
   several possible mutually exclusive usages of the device.  This
   type of constraint is expressed in a Simultaneous Transmission Set,
   which lists all the Captures of a particular media type (e.g.
   audio, video, text) that can be sent at the same time.  There are
   different Simultaneous Transmission Sets for each media type in the
   Advertisement.  This is easier to show in an example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons each - VC0, VC1, VC2.  The middle camera can also zoom out
   (using an optical zoom lens) and show all six persons, VC3.  But
   the middle camera cannot be used in both modes at the same time -
   it has to either show the space where two participants sit or the
   whole six seats, but not both at the same time.

   Simultaneous Transmission Sets are expressed as sets of the Media
   Captures that the Provider could transmit at the same time (though
   it may not make sense to do so).  In this example, the two
   simultaneous sets are shown in Table 1.  If a Provider advertises
   one or more mutually exclusive Simultaneous Transmission Sets, then
   for each media type the Consumer must ensure that it chooses Media
   Captures that lie wholly within one of those Simultaneous
   Transmission Sets.

      +-------------------+
      | Simultaneous Sets |
      +-------------------+
      |  {VC0, VC1, VC2}  |
      |  {VC0, VC3, VC2}  |
      +-------------------+

      Table 1: Two Simultaneous Transmission Sets

   A Provider optionally can include the simultaneous sets in its
   Advertisement.  These simultaneous set constraints apply across all
   the Capture Scenes in the Advertisement.  It is a syntax
   conformance requirement that the Simultaneous Transmission Sets
   must allow all the Media Captures in any particular Capture Scene
   Entry to be used simultaneously.

   For shorthand convenience, a Provider may describe a Simultaneous
   Transmission Set in terms of Capture Scene Entries and Capture
   Scenes.  If a Capture Scene Entry is included in a Simultaneous
   Transmission Set, then all Media Captures in the Capture Scene
   Entry are included in the Simultaneous Transmission Set.  If a
   Capture Scene is included in a Simultaneous Transmission Set, then
   all its Capture Scene Entries (of the corresponding media type) are
   included in the Simultaneous Transmission Set.  The end result
   reduces to a set of Media Captures in any case.

   If an Advertisement does not include Simultaneous Transmission
   Sets, then all Capture Scenes can be provided simultaneously.  If
   multiple Capture Scene Entries are in a Capture Scene, then the
   Consumer chooses at most one Capture Scene Entry per Capture Scene
   for each media type.

   If an Advertisement includes multiple Capture Scene Entries in a
   Capture Scene, then the Consumer should choose one Capture Scene
   Entry for each media type, but may choose individual Captures based
   on the Simultaneous Transmission Sets.

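   The Consumer-side check implied by these rules can be sketched as
   follows (data layout invented): a selection of captures for one
   media type is valid only if it lies wholly within at least one
   advertised Simultaneous Transmission Set.

      # Sketch; layout invented.  The two video sets from Table 1:
      simultaneous_sets = [{"VC0", "VC1", "VC2"},
                           {"VC0", "VC3", "VC2"}]

      def selection_allowed(chosen, sets):
          # Chosen captures must fit wholly inside one set.
          return any(set(chosen) <= s for s in sets)

      assert selection_allowed(["VC0", "VC1"], simultaneous_sets)
      # VC1 and VC3 both need the middle camera, so this must fail:
      assert not selection_allowed(["VC1", "VC3"], simultaneous_sets)
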
7. Encodings

   Individual Encodings and Encoding Groups are CLUE's mechanisms
   allowing a Provider to signal its limitations for sending Captures,
   or combinations of Captures, to a Consumer.  Consumers can map the
   Captures they want to receive onto the Encodings, with the encoding
   parameters they want.  As for the relationship between the CLUE-
   specified mechanisms based on Encodings and the SIP Offer-Answer
   exchange, please refer to section 4.

7.1. Individual Encodings

   An Individual Encoding represents a way to encode a Media Capture
   to become a Capture Encoding, to be sent as an encoded media stream
   from the Provider to the Consumer.  An Individual Encoding has a
   set of parameters characterizing how the media is encoded.

   Different media types have different parameters, and different
   encoding algorithms may have different parameters.  An Individual
   Encoding can be assigned to at most one Capture Encoding at any
   given time.

   The parameters of an Individual Encoding represent the maximum
   values for certain aspects of the encoding.  A particular
   instantiation into a Capture Encoding might use lower values than
   these maximums.

   In general, the parameters of an Individual Encoding have been
   chosen to represent those negotiable parameters of media codecs of
   the media type that greatly influence computational complexity,
   while abstracting from the details of the particular media codecs
   used.  The parameters have been chosen with those media codecs in
   mind that have seen wide deployment in the video conferencing and
   Telepresence industry.

   For video codecs (using H.26x compression technologies), those
   parameters include:

   .  Maximum bitrate;
   .  Maximum picture size in pixels;
   .  Maximum number of pixels to be processed per second; and
   .  CLUE-protocol internal information.

   For audio codecs, so far only one parameter has been identified:

   .  Maximum bitrate.

   Edt. note: the maximum number of pixels per second is currently
   expressed as H.264maxmbps.

   Edt. note: it would be desirable to make the computational
   complexity mechanism codec independent, so as to allow for
   expressing that, say, H.264 codecs are less complex than H.265
   codecs, and, therefore, the same hardware can process higher pixel
   rates for H.264 than for H.265.  To be discussed in the WG.

7.2. Encoding Group

   An Encoding Group includes a set of one or more Individual
   Encodings, and parameters that apply to the group as a whole.  By
   grouping multiple Individual Encodings together, an Encoding Group
   describes additional constraints on bandwidth and other parameters
   for the group.

   The Encoding Group data structure contains:

   .  Maximum bitrate for all encodings in the group combined;
   .  Maximum number of pixels per second for all video encodings of
      the group combined; and
   .  A list of identifiers for the audio and video encodings,
      respectively, belonging to the group.

   When the Individual Encodings in a group are instantiated into
   Capture Encodings, each Capture Encoding has a bitrate that must be
   less than or equal to the max bitrate for the particular Individual
   Encoding.  The "maximum bitrate for all encodings in the group"
   parameter gives the additional restriction that the sum of all the
   individual Capture Encoding bitrates must be less than or equal to
   this group value.

   Likewise, the sum of the pixels per second of each instantiated
   encoding in the group must not exceed the group value.

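   These group-wide limits amount to a simple validation over the
   Capture Encodings a Consumer asks to instantiate, sketched below
   (names invented for the example):

      # Sketch; names invented.  Each requested encoding carries the
      # actual bitrate and pixel rate the Consumer configured.
      def group_limits_ok(requested, group_max_bitrate, group_max_pps):
          # Per-encoding maximums are checked against the Individual
          # Encodings; here only the group-wide sums are verified.
          total_bitrate = sum(r["bitrate"] for r in requested)
          total_pps = sum(r.get("pixels_per_second", 0)
                          for r in requested)
          return (total_bitrate <= group_max_bitrate
                  and total_pps <= group_max_pps)
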
1174 The following diagram illustrates one example of the structure of a
1175 media provider's Encoding Groups and their contents.

1177 ,-------------------------------------------------.
1178 | Media Provider                                  |
1179 |                                                 |
1180 | ,--------------------------------------.        |
1181 | | ,--------------------------------------.      |
1182 | | | ,--------------------------------------.    |
1183 | | | | Encoding Group                       |    |
1184 | | | | ,-----------.                        |    |
1185 | | | | |           | ,---------.            |    |
1186 | | | | |           | |         | ,---------.|    |
1187 | | | | | Encoding1 | |Encoding2| |Encoding3||    |
1188 | `.| | |           | |         | `---------'|    |
1189 |   `.| `-----------' `---------'            |    |
1190 |     `--------------------------------------'    |
1191 `-------------------------------------------------'

1193 Figure 1: Encoding Group Structure

1195 A Provider advertises one or more Encoding Groups. Each Encoding
1196 Group includes one or more Individual Encodings. Each Individual
1197 Encoding can represent a different way of encoding media. For
1198 example, one Individual Encoding may be 1080p60 video, another
1199 could be 720p30, with a third being CIF, all in, for example,
1200 H.264 format.

1202 While a typical three codec/display system might have one Encoding
1203 Group per "codec box" (physical codec, connected to one camera and
1204 one screen), there are many possibilities for the number of
1205 Encoding Groups a Provider may be able to offer and for the
1206 encoding values in each Encoding Group.

1208 There is no requirement for all Encodings within an Encoding Group
1209 to be instantiated at the same time.

1211 8. Associating Captures with Encoding Groups

1213 Every Capture is associated with an Encoding Group, which is used
1214 to instantiate that Capture into one or more Capture Encodings.
1215 More than one Capture may use the same Encoding Group.

1217 The maximum number of streams that can result from a particular
1218 Encoding Group constraint is equal to the number of Individual
1219 Encodings in the group. The actual number of Capture Encodings
1220 used at any time may be less than this maximum. Any of the
1221 Captures that use a particular Encoding Group can be encoded
1222 according to any of the Individual Encodings in the group. If
1223 there are multiple Individual Encodings in the group, then the
1224 Consumer can configure the Provider, via a Configure message, to
1225 encode a single Media Capture into multiple different Capture
1226 Encodings at the same time, subject to the Max Capture Encodings
1227 constraint, with each Capture Encoding following the constraints
1228 of a different Individual Encoding.

1230 It is a protocol conformance requirement that the Encoding Groups
1231 must allow all the Captures in a particular Capture Scene Entry to
1232 be used simultaneously.
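The association can be pictured as a simple mapping. The following
non-normative sketch (in Python, reusing identifiers in the style
of the examples in Section 11; it is illustrative only) derives
which Individual Encodings may instantiate a given Capture:

   # Non-normative sketch: each Capture names one Encoding Group; any
   # Individual Encoding in that group may instantiate the Capture.
   CAPTURE_TO_GROUP = {"VC0": "EG0", "VC1": "EG1", "VC3": "EG1"}
   GROUP_TO_ENCODINGS = {"EG0": ["ENC0", "ENC1", "ENC2"],
                         "EG1": ["ENC3", "ENC4", "ENC5"]}

   def allowed_encodings(capture_id):
       return GROUP_TO_ENCODINGS[CAPTURE_TO_GROUP[capture_id]]

   assert allowed_encodings("VC3") == ["ENC3", "ENC4", "ENC5"]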
1235 9. Consumer's Choice of Streams to Receive from the Provider

1237 After receiving the Provider's Advertisement message (which
1238 includes media captures and associated constraints), the Consumer
1239 composes its reply to the Provider in the form of a Configure
1240 message. The Consumer is free to use the information in the
1241 Advertisement as it chooses, but there are a few obviously
1242 sensible design choices, which are outlined below.

1243 If multiple Providers connect to the same Consumer (i.e., in an
1244 MCU-less multiparty call), it is the responsibility of the
1245 Consumer to compose Configures for each Provider that fulfill both
1246 that Provider's constraints, as expressed in the Advertisement,
1247 and the Consumer's own capabilities.

1249 In an MCU-based multiparty call, the MCU can logically terminate
1250 the Advertisement/Configure negotiation in that it can hide the
1251 characteristics of the receiving endpoint and rely on its own
1252 capabilities (transcoding/transrating/...) to create Media Streams
1253 that can be decoded at the Endpoint Consumers. The timing of an
1254 MCU's sending of Advertisements (for its outgoing ports) and
1255 Configures (for its incoming ports, in response to Advertisements
1256 received there) is up to the MCU and implementation dependent.

1258 As a general outline, a Consumer can choose, based on the
1259 Advertisement it has received, which Captures it wishes to
1260 receive, and which Individual Encodings it wants the Provider to
1261 use to encode the Captures. Each Capture has an Encoding Group ID
1262 attribute which specifies which Individual Encodings are available
1263 to be used for that Capture.

1265 A Configure Message includes a list of Capture Encodings. These
1266 are the Capture Encodings the Consumer wishes to receive from the
1267 Provider. Each Capture Encoding refers to one Media Capture and
1268 one Individual Encoding, and includes the encoding parameter
1269 values. For each Media Capture in the message, the Consumer may
1270 also specify the value of any attributes for which the Provider
1271 has offered a choice, for example the value for the
1272 Scene-switch-policy attribute. A Configure Message does not
1273 include references to Capture Scenes or Capture Scene Entries.

1275 For each Capture the Consumer wants to receive, it configures one
1276 or more of the encodings in that Capture's Encoding Group. The
1277 Consumer does this by telling the Provider, in its Configure
1278 Message, parameters such as the resolution, frame rate, bandwidth,
1279 etc., for each Capture Encoding of its chosen Captures. Upon
1280 receipt of this Configure from the Consumer, common knowledge is
1281 established between Provider and Consumer regarding sensible
1282 choices for the media streams and their parameters. The setup of
1283 the actual media channels, at least in the simplest case, is left
1284 to a following offer-answer exchange. Optimized implementations
1285 may speed up the reaction to the offer-answer exchange by
1286 reserving the resources at the time of finalization of the CLUE
1287 handshake. Even more advanced devices may choose to establish
1288 media streams without an offer-answer exchange, for example by
1289 overloading existing 5-tuple connections with the negotiated media.

1291 The Consumer must have received at least one Advertisement from
1292 the Provider to be able to create and send a Configure.

1294 In addition, the Consumer can send a Configure at any time during
1295 the call. The Configure must be valid according to the most
1296 recently received Advertisement. The Consumer can send a
1297 Configure either in response to a new Advertisement from the
1298 Provider or on its own, for example because of a local change in
1299 conditions (people leaving the room, connectivity changes,
1300 multipoint-related considerations).

1302 Edt. note: The editors solicit input from the working group as to
1303 whether or not a Consumer must respond to every Advertisement with
1304 a new Configure message. We expect this to be decided in the
1305 context of the signaling document; once decided, it should be
1306 mentioned here.

1308 When choosing which Media Streams to receive from the Provider,
1309 and the encoding characteristics of those Media Streams, the
1310 Consumer advantageously takes several things into account: its
1311 local preference, simultaneity restrictions, and encoding limits.
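To make the shape of this message concrete, the following
non-normative sketch (in Python; the field names are illustrative
assumptions, not the normative CLUE syntax) shows the essential
content of a Configure listing a single Capture Encoding:

   # Non-normative sketch of the essential content of a Configure
   # message: a list of Capture Encodings, each naming one Media
   # Capture, one Individual Encoding, and the chosen parameters.
   configure = {
       "captureEncodings": [
           {"captureID": "VC0",      # which Media Capture
            "encodingID": "ENC0",    # which Individual Encoding
            "width": 1920, "height": 1080,
            "frameRate": 30, "bitrate": 4000000},
       ]
   }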
1313 9.1. Local preference

1315 A variety of local factors influence the Consumer's choice of
1316 Media Streams to be received from the Provider:

1318 o if the Consumer is an Endpoint, it is likely that it would
1319 choose, where possible, to receive video and audio Captures that
1320 match the number of display devices and the audio system it has;

1322 o if the Consumer is a middle box such as an MCU, it may choose to
1323 receive loudest-speaker streams (in order to perform its own
1324 media composition) and avoid pre-composed video Captures;

1326 o user choice (for instance, selection of a new layout) may result
1327 in a different set of Captures, or different encoding
1328 characteristics, being required by the Consumer.

1330 9.2. Physical simultaneity restrictions

1332 There may be physical simultaneity constraints imposed by the
1333 Provider that affect the Provider's ability to simultaneously send
1334 all of the Captures the Consumer would wish to receive. For
1335 instance, a middle box such as an MCU, when connected to a multi-
1336 camera room system, might prefer to receive both individual video
1337 streams of the people present in the room and an overall view of
1338 the room from a single camera. Some Endpoint systems might be
1339 able to provide both of these sets of streams simultaneously,
1340 whereas others may not (if the overall room view were produced by
1341 changing the optical zoom level on the center camera, for
1342 instance).

1344 9.3. Encoding and encoding group limits

1346 Each of the Provider's Encoding Groups has limits on bandwidth and
1347 computational complexity, and the constituent potential encodings
1348 have limits on the bandwidth, computational complexity, video
1349 frame rate, and resolution that can be provided. When choosing
1350 the Captures to be received from a Provider, a Consumer device
1351 must ensure that the encoding characteristics requested for each
1352 individual Capture fit within the capability of the encoding it
1353 is being configured to use, as well as ensuring that the combined
1354 encoding characteristics for all its Captures fit within the
1355 capabilities of their associated Encoding Groups. In some cases,
1356 this could cause an otherwise "preferred" choice of Capture
1357 Encodings to be passed over in favour of different Capture
1358 Encodings - for instance, if a set of three Captures could only be
1359 provided at a low resolution, then a three-screen device could
1360 switch to favoring a single, higher-quality Capture Encoding.
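The following non-normative sketch (in Python; the threshold and
the entry structure are illustrative assumptions, not part of
CLUE) captures that trade-off in its simplest form:

   # Non-normative sketch: prefer a multi-Capture entry only if the
   # encoding limits still allow an acceptable resolution per Capture.
   MIN_ACCEPTABLE_HEIGHT = 720   # hypothetical local policy

   def pick_entry(entries):
       # entries: list of (captures, achievable_height) pairs, most
       # preferred first, as derived from the encoding group limits
       for captures, achievable_height in entries:
           if achievable_height >= MIN_ACCEPTABLE_HEIGHT:
               return captures
       return entries[-1][0]   # fall back to the least preferred entry

   # Three captures only achievable at 544 lines: pick the single
   # higher-quality capture instead.
   assert pick_entry([(["VC0", "VC1", "VC2"], 544),
                      (["VC5"], 1080)]) == ["VC5"]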
1362 10. Extensibility

1364 One of the most important characteristics of the Framework is its
1365 extensibility. Telepresence is a relatively new industry, and
1366 while we can foresee certain directions, we do not know everything
1367 about how it will develop. The standard for interoperability and
1368 handling multiple streams must be future-proof. The framework
1369 itself is inherently extensible through expanding the data model
1370 types. For example:

1372 o Adding more types of media, such as telemetry, can be done by
1373 defining additional types of Captures in addition to audio and
1374 video.

1376 o Adding new functionalities, such as 3-D video, may require
1377 additional attributes describing the Captures.

1379 o Adding new codecs, such as H.265, can be accomplished by
1380 defining new encoding variables.

1382 The infrastructure is designed to be extended rather than
1383 requiring new infrastructure elements. Extension comes through
1384 adding to defined types.

1386 11. Examples - Using the Framework

1388 Edt. note: these examples are currently out of date with respect
1389 to H264Mbps codepoints; this will be fixed in the next release,
1390 once agreement on codec computational complexity has been reached.
1391 Other than that, the examples are still valid.

1393 Edt. note: remove syntax-like details in these examples, and focus
1394 on concepts for this document. Syntax examples with XML should be
1395 in the data model document or a dedicated example document.

1397 This section gives some examples, first from the point of view of
1398 the Provider, then the Consumer.

1400 11.1. Provider Behavior

1402 This section shows, in more detail, some examples of how a
1403 Provider can use the framework to represent a typical case for
1404 telepresence rooms. First an Endpoint is illustrated, then an MCU
1405 case is shown.

1407 11.1.1. Three screen Endpoint Provider

1409 Consider an Endpoint with the following description:

1411 3 cameras, 3 displays, a 6-person table

1413 o Each camera can provide one Capture for each 1/3 section of the
1414 table

1416 o A single Capture representing the active speaker can be provided
1417 (voice-activity-based camera selection to a given encoder input
1418 port, implemented locally in the Endpoint)

1420 o A single Capture representing the active speaker with the other
1421 2 Captures shown picture-in-picture within the stream can be
1422 provided (again, implemented inside the Endpoint)

1424 o A Capture showing a zoomed-out view of all 6 seats in the room
1425 can be provided

1427 The audio and video Captures for this Endpoint can be described as
1428 follows.

1430 Video Captures:

1432 o VC0 - (the camera-left camera stream), encoding group=EG0,
1433 switched=false, view=table

1435 o VC1 - (the center camera stream), encoding group=EG1,
1436 switched=false, view=table

1438 o VC2 - (the camera-right camera stream), encoding group=EG2,
1439 switched=false, view=table

1441 o VC3 - (the loudest panel stream), encoding group=EG1,
1442 switched=true, view=table

1444 o VC4 - (the loudest panel stream with PiPs), encoding group=EG1,
1445 composed=true, switched=true, view=room

1447 o VC5 - (the zoomed-out view of all people in the room), encoding
1448 group=EG1, composed=false, switched=false, view=room

1450 o VC6 - (presentation stream), encoding group=EG1, presentation,
1451 switched=false

1453 The following diagram is a top view of the room with 3 cameras, 3
1454 displays, and 6 seats. Each camera captures 2 people. The six
1455 seats are not all in a straight line.

1457  ,-. d
1458 ( )`--.__                 +---+
1459  `-'      / `--.__        |   |
1460  ,-. |           `-.._    |_-+Camera 2 (VC2)
1461 ( ).'        ___..-+-''`+-+
1462  `-' |_...---''           |   |
1463  ,-.c+-..__               +---+
1464 ( )|        ``--..__      |   |
1465  `-' |              ``+-..|_-+Camera 1 (VC1)
1466  ,-. |       __..--'|+-+
1467 ( )|    __..--'           |   |
1468  `-'b|..--'               +---+
1469  ,-. |``---..___          |   |
1470 ( )\             ```--..._|_-+Camera 0 (VC0)
1471  `-' \               _..-''`-+
1472  ,-.  \      __.--''      |   |
1473 ( )    |..-''             +---+
1474  `-' a

1476 The two points labeled b and c are intended to be at the midpoints
1477 between the seating positions, where the fields of view of the
1478 cameras intersect.

1480 The plane of interest for VC0 is a vertical plane that intersects
1481 points 'a' and 'b'.

1483 The plane of interest for VC1 intersects points 'b' and 'c'. The
1484 plane of interest for VC2 intersects points 'c' and 'd'.

1486 This example uses an area scale of millimeters.
1488 Areas of capture:

1490       bottom left       bottom right      top left          top right
1491 VC0 (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)  (-673,3000,757)
1492 VC1 ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)  ( 673,3000,757)
1493 VC2 (  673,3000,0)   (2011,2850,0)    (  673,3000,757)  (2011,3000,757)
1494 VC3 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1495 VC4 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1496 VC5 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1497 VC6 none

1499 Points of capture:
1500 VC0 (-1678,0,800)
1501 VC1 (0,0,800)
1502 VC2 (1678,0,800)
1503 VC3 none
1504 VC4 none
1505 VC5 (0,0,800)
1506 VC6 none

1508 In this example, the right edge of the VC0 area lines up with the
1509 left edge of the VC1 area, but it doesn't have to be this way -
1510 there could be a gap or an overlap. One additional thing to note
1511 for this example is that the distance from a to b is equal to the
1512 distance from b to c and to the distance from c to d. All these
1513 distances are 1346 mm; this is the planar width of each area of
1514 capture for VC0, VC1, and VC2.
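This can be checked directly from the advertised coordinates. The
following non-normative sketch (in Python) computes the planar
width of each area of capture from its bottom edge:

   # Non-normative sketch: computing the planar width of each area of
   # capture from the advertised bottom-edge coordinates (in mm).
   import math

   def planar_width(bottom_left, bottom_right):
       dx = bottom_right[0] - bottom_left[0]
       dy = bottom_right[1] - bottom_left[1]
       return math.hypot(dx, dy)

   print(planar_width((-2011, 2850, 0), (-673, 3000, 0)))  # VC0: ~1346.4
   print(planar_width((-673, 3000, 0), (673, 3000, 0)))    # VC1: 1346.0
   print(planar_width((673, 3000, 0), (2011, 2850, 0)))    # VC2: ~1346.4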
1516 Note that the text in parentheses (e.g., "the camera-left camera
1517 stream") is not explicitly part of the model; it is just
1518 explanatory text for this example and is not included in the
1519 model with the media captures and attributes. Also, the
1520 "composed" boolean attribute doesn't say anything about how a
1521 capture is composed, so the Media Consumer cannot tell, based on
1522 this attribute, that VC4 is composed of a "loudest panel with
1523 PiPs".

1525 Audio Captures:

1527 o AC0 (camera-left), encoding group=EG3, content=main, channel
1528 format=mono

1530 o AC1 (camera-right), encoding group=EG3, content=main, channel
1531 format=mono

1533 o AC2 (center), encoding group=EG3, content=main, channel
1534 format=mono

1536 o AC3, a simple pre-mixed audio stream from the room (mono),
1537 encoding group=EG3, content=main, channel format=mono

1539 o AC4, the audio stream associated with the presentation video
1540 (mono), encoding group=EG3, content=slides, channel format=mono

1542 Areas of capture:

1544       bottom left       bottom right      top left          top right
1546 AC0 (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)  (-673,3000,757)
1547 AC1 (  673,3000,0)   (2011,2850,0)    (  673,3000,757)  (2011,3000,757)
1548 AC2 ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)  ( 673,3000,757)
1549 AC3 (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)  (2011,3000,757)
1550 AC4 none

1552 The physical simultaneity information is:

1554 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1556 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1558 This constraint indicates that it is not possible to use all the
1559 VCs at the same time: VC5 cannot be used at the same time as VC1,
1560 VC3, or VC4. Also, using every member of a set simultaneously may
1561 not make sense - for example, VC3 (loudest) and VC4 (loudest with
1562 PiPs). (In addition, there are encoding constraints that make
1563 choosing all of the VCs in a set impossible: VC1, VC3, VC4, VC5,
1564 and VC6 all use EG1, and EG1 has only 3 ENCs. This constraint
1565 shows up in the encoding groups, not in the simultaneous
1566 transmission sets.)

1568 In this example there are no restrictions on which audio captures
1569 can be sent simultaneously.

1571 Encoding Groups:

1573 This example has three encoding groups associated with the video
1574 captures. Each group can have 3 encodings, but with each
1575 potential encoding having a progressively lower specification. In
1576 this example, 1080p60 transmission is possible (as ENC0 has a
1577 maxPps value compatible with that) as long as it is the only
1578 active encoding in the group (as maxGroupPps for the entire
1579 encoding group is also 124416000). Significantly, as up to 3
1580 encodings are available per group, it is possible to transmit
1581 some video captures simultaneously that are not in the same entry
1582 in the capture scene - for example, VC1 and VC3 at the same time.

1584 It is also possible to transmit multiple capture encodings of a
1585 single video capture: for example, VC0 can be encoded using ENC0
1586 and ENC1 at the same time, as long as the encoding parameters
1587 satisfy the constraints of ENC0, ENC1, and EG0 - such as one
1588 encoding at 1080p30 and the other at 720p30.

1590 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1591   encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1592     maxPps=124416000, maxBandwidth=4000000
1593   encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1594     maxPps=27648000, maxBandwidth=4000000
1595   encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1596     maxPps=15552000, maxBandwidth=4000000
1597 encodeGroupID=EG1, maxGroupPps=124416000, maxGroupBandwidth=6000000
1598   encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1599     maxPps=124416000, maxBandwidth=4000000
1600   encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1601     maxPps=27648000, maxBandwidth=4000000
1602   encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1603     maxPps=15552000, maxBandwidth=4000000
1604 encodeGroupID=EG2, maxGroupPps=124416000, maxGroupBandwidth=6000000
1605   encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1606     maxPps=124416000, maxBandwidth=4000000
1607   encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1608     maxPps=27648000, maxBandwidth=4000000
1609   encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1610     maxPps=15552000, maxBandwidth=4000000

1612 Figure 2: Example Encoding Groups for Video

1614 For audio, there are five potential encodings available, so all
1615 five audio captures can be encoded at the same time.

1617 encodeGroupID=EG3, maxGroupPps=0, maxGroupBandwidth=320000
1618   encodeID=ENC9, maxBandwidth=64000
1619   encodeID=ENC10, maxBandwidth=64000
1620   encodeID=ENC11, maxBandwidth=64000
1621   encodeID=ENC12, maxBandwidth=64000
1622   encodeID=ENC13, maxBandwidth=64000

1624 Figure 3: Example Encoding Group for Audio

1626 Capture Scenes:

1628 The following table represents the Capture Scenes for this
1629 Provider. Recall that a Capture Scene is composed of alternative
1630 Capture Scene Entries covering the same spatial region. Capture
1631 Scene #1 is for the main people captures, and Capture Scene #2 is
1632 for presentation.

1634 Each row in the table is a separate Capture Scene Entry.

1635 +------------------+
1636 | Capture Scene #1 |
1637 +------------------+
1638 | VC0, VC1, VC2    |
1639 | VC3              |
1640 | VC4              |
1641 | VC5              |
1642 | AC0, AC1, AC2    |
1643 | AC3              |
1644 +------------------+

1646 +------------------+
1647 | Capture Scene #2 |
1648 +------------------+
1649 | VC6              |
1650 | AC4              |
1651 +------------------+

1653 Different Capture Scenes are unique to each other and non-
1654 overlapping. A Consumer can choose an entry from each Capture
1655 Scene. In this case the three captures VC0, VC1, and VC2 are one
1656 way of representing the video from the Endpoint; these three
1657 captures should appear adjacent to each other.
1658 Alternatively, another way of representing the Capture Scene is
1659 with the capture VC3, which automatically shows the person who is
1660 talking. Similarly for the VC4 and VC5 alternatives.

1662 As in the video case, the different entries of audio in Capture
1663 Scene #1 represent the "same thing", in that one way to receive
1664 the audio is with the 3 audio captures (AC0, AC1, AC2), and
1665 another way is with the mixed AC3. The Media Consumer can choose
1666 an audio capture entry it is capable of receiving.

1668 The spatial ordering is understood from the media capture
1669 attributes Area of Capture and Point of Capture.

1671 A Media Consumer would likely want to choose a Capture Scene Entry
1672 to receive based in part on how many streams it can simultaneously
1673 receive. A Consumer that can receive three people streams would
1674 probably prefer to receive the first entry of Capture Scene #1
1675 (VC0, VC1, VC2) and not receive the other entries. A Consumer
1676 that can receive only one people stream would probably choose one
1677 of the other entries.

1679 If the Consumer can receive a presentation stream too, it would
1680 also choose to receive the only entry from Capture Scene #2 (VC6).

1682 11.1.2. Encoding Group Example

1684 This is an example of an encoding group, to illustrate how it can
1685 express dependencies between encodings.

1687 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1688   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1689     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1690   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1691     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1692   encodeID=AUDENC0, maxBandwidth=96000
1693   encodeID=AUDENC1, maxBandwidth=96000
1694   encodeID=AUDENC2, maxBandwidth=96000

1696 Here, the encoding group is EG0. It can transmit up to two
1697 1080p30 capture encodings (the Pps value for 1080p30 is
1698 62208000). Each video encoding advertises a maxFrameRate of 60
1699 frames per second (fps), but to achieve the maximum resolution
1700 (1920 x 1088) the frame rate is limited to 30 fps; 60 fps can be
1701 achieved at a lower resolution if required by the Consumer.
1702 Although the encoding group is capable of transmitting up to
1703 6 Mbit/s, no individual video encoding can exceed 4 Mbit/s.

1705 This encoding group also allows up to 3 audio encodings,
1706 AUDENC<0-2>. It is not required that audio and video encodings
1707 reside within the same encoding group, but if they do, then the
1708 group's overall maxBandwidth value is a limit on the sum of all
1709 audio and video encodings configured by the Consumer. A system
1710 that does not wish or need to combine bandwidth limitations in
1711 this way should instead use separate encoding groups for audio
1712 and video, so that the bandwidth limitations on audio and video
1713 do not interact.

1714 Audio and video can be expressed in separate encoding groups, as
1715 in this illustration:

1717 encodeGroupID=EG0, maxGroupPps=124416000, maxGroupBandwidth=6000000
1718   encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1719     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1720   encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1721     maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1722 encodeGroupID=EG1, maxGroupPps=0, maxGroupBandwidth=500000
1723   encodeID=AUDENC0, maxBandwidth=96000
1724   encodeID=AUDENC1, maxBandwidth=96000
1725   encodeID=AUDENC2, maxBandwidth=96000
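The arithmetic behind "up to two 1080p30 capture encodings" can be
made explicit. The following non-normative sketch (in Python,
using 1920 x 1080 for the pixel count, as the example's Pps
figures do) verifies it against the values above:

   # Non-normative sketch: two 1080p30 encodings exactly fill EG0's
   # maxGroupPps budget, whereas a single 1080p60 encoding would
   # exceed the per-encoding maxPps of VIDENC0/VIDENC1.
   PPS_1080P30 = 1920 * 1080 * 30   # 62208000 pixels per second
   MAX_ENC_PPS = 62208000           # maxPps of VIDENC0 and VIDENC1
   MAX_GROUP_PPS = 124416000        # maxGroupPps of EG0

   assert 2 * PPS_1080P30 <= MAX_GROUP_PPS   # two 1080p30 streams fit
   assert 2 * PPS_1080P30 > MAX_ENC_PPS      # 1080p60 (twice the pixel
                                             # rate) exceeds one encoding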
1727 11.1.3. The MCU Case

1729 This section shows how an MCU might express its Capture Scenes,
1730 intending to offer different choices for Consumers that can handle
1731 different numbers of streams. A single audio capture stream is
1732 provided for all single- and multi-screen configurations; it can
1733 be associated (e.g., lip-synced) with any combination of video
1734 captures at the Consumer.

1736 +--------------------+----------------------------------------------+
1737 | Capture Scene #1   | note                                         |
1739 +--------------------+----------------------------------------------+
1740 | VC0                | video capture for single screen consumer    |
1742 | VC1, VC2           | video capture for 2 screen consumer         |
1744 | VC3, VC4, VC5      | video capture for 3 screen consumer         |
1746 | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
1748 | AC0                | audio capture representing all participants |
1750 +--------------------+----------------------------------------------+

1752 If/when a presentation stream becomes active within the
1753 conference, the MCU might re-advertise the available media as:

1755 +------------------+--------------------------------------+
1756 | Capture Scene #2 | note                                 |
1757 +------------------+--------------------------------------+
1758 | VC10             | video capture for presentation       |
1759 | AC1              | presentation audio to accompany VC10 |
1760 +------------------+--------------------------------------+

1762 11.2. Media Consumer Behavior

1764 This section gives an example of how a Media Consumer might behave
1765 when deciding how to request streams from the three-screen
1766 Endpoint described in the previous section.

1768 The receive side of a call needs to balance its requirements
1769 (based on its number of screens and speakers), its decoding
1770 capabilities, and the available bandwidth against the Provider's
1771 capabilities in order to optimally configure the Provider's
1772 streams. Typically it would want to receive and decode media from
1773 each Capture Scene advertised by the Provider.

1775 A sane, basic algorithm might be for the Consumer to go through
1776 each Capture Scene in turn and find the collection of Video
1777 Captures that best matches the number of screens it has (this
1778 might include consideration of screens dedicated to presentation
1779 video display rather than "people" video), and then decide
1780 between alternative entries in the video Capture Scenes based
1781 either on hard-coded preferences or user choice. Once this choice
1782 has been made, the Consumer would then decide how to configure the
1783 Provider's encoding groups in order to make best use of the
1784 available network bandwidth and its own decoding capabilities.

1786 11.2.1. One screen Media Consumer

1788 VC3, VC4, and VC5 are each in a different entry by themselves, not
1789 grouped together in a single entry, so the receiving device should
1790 choose one of them. The choice would come down to whether to see
1791 the greatest number of participants simultaneously at roughly
1792 equal precedence (VC5), a switched view of just the loudest region
1793 (VC3), or a switched view with PiPs (VC4). An endpoint device
1794 with even a small amount of knowledge of these differences could
1795 offer a dynamic choice of these options, in-call, to the user.

1798 11.2.2. Two screen Media Consumer configuring the example

1800 Mixing systems with an even number of screens, "2n", and those
1801 with "2n+1" cameras (and vice versa) is always likely to be the
1802 problematic case.
In this instance, the behavior is likely to be
1803 determined by whether a "2 screen" system is really a "2 decoder"
1804 system, i.e., whether only one received stream can be displayed
1805 per screen, or whether more than 2 streams can be received and
1806 spread across the available screen area. To enumerate 3 possible
1807 behaviors here for the 2 screen system when it learns that the far
1808 end is "ideally" expressed via 3 capture streams:

1810 1. Fall back to receiving just a single stream (VC3, VC4, or VC5,
1811 as per the 1 screen consumer case above) and either leave one
1812 screen blank or use it for presentation if/when a presentation
1813 becomes active.

1815 2. Receive 3 streams (VC0, VC1, and VC2) and display them across 2
1816 screens, either with each capture being scaled to 2/3 of a
1817 screen and the center capture being split across the 2 screens,
1818 or - as would be necessary if there were large bezels on the
1819 screens - with each stream being scaled to 1/2 the screen width
1820 and height and there being a 4th "blank" panel. This 4th panel
1821 could potentially be used for any presentation that became
1822 active during the call.

1824 3. Receive 3 streams, decode all 3, and use control information
1825 indicating which was the most active to switch between showing
1826 the left and center streams (one per screen) and the center and
1827 right streams.

1829 For an endpoint capable of all 3 methods of working described
1830 above, it might again be appropriate to offer the user the choice
1831 of display mode.

1833 11.2.3. Three screen Media Consumer configuring the example

1835 This is the most straightforward case: the Media Consumer would
1836 look to identify a set of streams to receive that best matches its
1837 available screens, so VC0 plus VC1 plus VC2 would match optimally.
1838 The spatial ordering would give sufficient information for the
1839 correct video capture to be shown on the correct screen, and the
1840 Consumer would need either to divide a single encoding group's
1841 capability by 3 to determine what resolution and frame rate to
1842 configure the Provider with, or to configure the individual video
1843 captures' encoding groups with what makes most sense (taking into
1844 account the receive-side decode capabilities, overall call
1845 bandwidth, the resolution of the screens, plus any user
1846 preferences such as motion vs. sharpness).

1848 12. Acknowledgements

1850 Allyn Romanow and Brian Baldino were authors of early versions.
1851 Mark Gorzyinski contributed much to the approach. We want to
1852 thank Stephen Botzko for helpful discussions on audio.

1854 13. IANA Considerations

1856 None.

1858 14. Security Considerations

1860 TBD

1862 15. Changes Since Last Version

1864 NOTE TO THE RFC-Editor: Please remove this section prior to
1865 publication as an RFC.

1867 Changes from 10 to 11:

1869 1. Add description attribute to Media Capture and Capture Scene
1870 Entry.

1872 2. Remove contradiction and change the note about the open issue
1873 regarding always responding to an Advertisement with a Configure
1874 message.

1876 3. Update example section to clean up formatting and make the
1877 media capture attributes and encoding parameters consistent
1878 with the rest of the document.

1880 Changes from 09 to 10:

1882 1. Several minor clarifications, such as about SDP usage, Media
1883 Captures, and the Configure message.

1885 2. Simultaneous Sets can be expressed in terms of Capture Scenes
1886 and Capture Scene Entries.

1888 3. Removed Area of Scene attribute.
1890 4. Add attributes from draft-groves-clue-capture-attr-01.

1892 5. Move some of the Media Capture attribute descriptions back
1893 into this document, but try to leave detailed syntax to the
1894 data model. Remove the OUTSOURCE sections, which are already
1895 incorporated into the data model document.

1897 Changes from 08 to 09:

1899 1. Use "document" instead of "memo".

1901 2. Add basic call flow sequence diagram to introduction.

1903 3. Add definitions for Advertisement and Configure messages.

1905 4. Add definitions for Capture and Provider.

1907 5. Update definition of Capture Scene.

1909 6. Update definition of Individual Encoding.

1911 7. Shorten definition of Media Capture and add key points in the
1912 Media Captures section.

1914 8. Reword a bit about capture scenes in overview.

1916 9. Reword about labeling Media Captures.

1918 10. Remove the Consumer Capability message.

1920 11. New example section heading for media provider behavior.

1922 12. Clarifications in the Capture Scene section.

1924 13. Clarifications in the Simultaneous Transmission Set section.

1926 14. Capitalize defined terms.

1928 15. Move call flow example from introduction to overview section.

1930 16. General editorial cleanup.

1932 17. Add some editors' notes requesting input on issues.

1934 18. Summarize some sections, and propose details be outsourced
1935 to other documents.

1937 Changes from 06 to 07:

1939 1. Ticket #9. Rename Axis of Capture Point attribute to Point
1940 on Line of Capture. Clarify the description of this
1941 attribute.

1943 2. Ticket #17. Add "capture encoding" definition. Use this new
1944 term throughout document as appropriate, replacing some usage
1945 of the terms "stream" and "encoding".

1947 3. Ticket #18. Add Max Capture Encodings media capture
1948 attribute.

1950 4. Add clarification that different capture scene entries are
1951 not necessarily mutually exclusive.

1953 Changes from 05 to 06:

1955 1. Capture scene description attribute is a list of text strings,
1956 each in a different language, rather than just a single string.

1958 2. Add new Axis of Capture Point attribute.

1960 3. Remove appendices A.1 through A.6.

1962 4. Clarify that the provider must use the same coordinate system,
1963 with the same scale and origin, for all coordinates within the
1964 same capture scene.

1966 Changes from 04 to 05:

1968 1. Clarify limitations of "composed" attribute.

1970 2. Add new section "capture scene entry attributes" and add the
1971 attribute "scene-switch-policy".

1973 3. Add capture scene description attribute and description
1974 language attribute.

1976 4. Editorial changes to examples section for consistency with the
1977 rest of the document.

1979 Changes from 03 to 04:

1981 1. Remove sentence from overview - "This constitutes a significant
1982 change ..."

1984 2. Clarify a consumer can choose a subset of captures from a
1985 capture scene entry or a simultaneous set (in sections "capture
1986 scene" and "consumer's choice...").

1988 3. Reword first paragraph of Media Capture Attributes section.

1990 4. Clarify a stereo audio capture is different from two mono audio
1991 captures (description of audio channel format attribute).

1993 5. Clarify what it means when coordinate information is not
1994 specified for area of capture, point of capture, area of scene.

1996 6. Change the term "producer" to "provider" to be consistent (it
1997 was just in two places).

1999 7. Change name of "purpose" attribute to "content" and refer to
2000 RFC4796 for values.
2002 8. Clarify simultaneous sets are part of a provider advertisement,
2003 and apply across all capture scenes in the advertisement.

2005 9. Remove sentence about lip sync between all media captures in a
2006 capture scene.

2008 10. Combine the concepts of "capture scene" and "capture set"
2009 into a single concept, using the term "capture scene" to
2010 replace the previous term "capture set", and eliminating the
2011 original separate capture scene concept.

2013 Informative References

2015 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
2016 Requirement Levels", BCP 14, RFC 2119, March 1997.

2018 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G.,
2019 Johnston, A., Peterson, J., Sparks, R., Handley, M.,
2020 and E. Schooler, "SIP: Session Initiation Protocol",
2021 RFC 3261, June 2002.

2024 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
2025 Jacobson, "RTP: A Transport Protocol for Real-Time
2026 Applications", STD 64, RFC 3550, July 2003.

2028 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the
2029 Session Initiation Protocol (SIP)", RFC 4353,
2030 February 2006.

2032 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies",
2033 RFC 5117, January 2008.

2036 16. Authors' Addresses

2038 Mark Duckworth (editor)
2039 Polycom
2040 Andover, MA 01810
2041 USA

2043 Email: mark.duckworth@polycom.com

2045 Andrew Pepperell
2046 Acano
2047 Uxbridge, England
2048 UK

2050 Email: apeppere@gmail.com

2052 Stephan Wenger
2053 Vidyo, Inc.
2054 433 Hackensack Ave.
2055 Hackensack, N.J. 07601
2056 USA

2058 Email: stewe@stewe.org