2 CLUE WG M. Duckworth, Ed. 3 Internet Draft Polycom 4 Intended status: Informational A. Pepperell 5 Expires: August 21, 2013 Silverflare 6 S. Wenger 7 Vidyo 8 Feb 21, 2013 10 Framework for Telepresence Multi-Streams 11 draft-ietf-clue-framework-09.txt
13 Abstract 15 This document offers a framework for a protocol that enables 16 devices in a telepresence conference to interoperate by specifying 17 the relationships between multiple media streams.
19 Status of this Memo 21 This Internet-Draft is submitted in full conformance with the 22 provisions of BCP 78 and BCP 79. 24 Internet-Drafts are working documents of the Internet Engineering 25 Task Force (IETF). Note that other groups may also distribute 26 working documents as Internet-Drafts. The list of current 27 Internet-Drafts is at http://datatracker.ietf.org/drafts/current/. 29 Internet-Drafts are draft documents valid for a maximum of six 30 months and may be updated, replaced, or obsoleted by other 31 documents at any time. It is inappropriate to use Internet-Drafts 32 as reference material or to cite them other than as "work in 33 progress." 35 This Internet-Draft will expire on August 21, 2013.
37 Copyright Notice 39 Copyright (c) 2013 IETF Trust and the persons identified as the 40 document authors. All rights reserved. 42 This document is subject to BCP 78 and the IETF Trust's Legal 43 Provisions Relating to IETF Documents 44 (http://trustee.ietf.org/license-info) in effect on the date of 45 publication of this document. Please review these documents 46 carefully, as they describe your rights and restrictions with 47 respect to this document. Code Components extracted from this 48 document must include Simplified BSD License text as described in 49 Section 4.e of the Trust Legal Provisions and are provided without 50 warranty as described in the Simplified BSD License.
52 Table of Contents 54 1. Introduction...................................................3 55 2. Terminology....................................................5 56 3. Definitions....................................................5 57 4.
Overview of the Framework/Model................................8 58 5. Spatial Relationships.........................................13 59 6. Media Captures and Capture Scenes.............................14 60 6.1. Media Captures...........................................14 61 6.1.1. Media Capture Attributes............................15 62 6.2. Capture Scene............................................18 63 6.2.1. Capture scene attributes............................21 64 6.2.2. Capture scene entry attributes......................22 65 6.3. Simultaneous Transmission Set Constraints................24 66 7. Encodings.....................................................25 67 7.1. Individual Encodings.....................................25 68 7.2. Encoding Group...........................................28 69 8. Associating Media Captures with Encoding Groups...............31 70 9. Consumer's Choice of Streams to Receive from the Provider.....31 71 9.1. Local preference.........................................33 72 9.2. Physical simultaneity restrictions.......................33 73 9.3. Encoding and encoding group limits.......................34 74 9.4. Message Flow...................Error! Bookmark not defined. 75 10. Extensibility................................................34 76 11. Examples - Using the Framework...............................34 77 11.1. Media Provider Behavior.................................35 78 11.1.1. Three screen endpoint media provider...............35 79 11.1.2. Encoding Group Example.............................42 80 11.1.3. The MCU Case.......................................43 81 11.2. Media Consumer Behavior.................................44 82 11.2.1. One screen consumer................................44 83 11.2.2. Two screen consumer configuring the example........45 84 11.2.3. Three screen consumer configuring the example......45 85 12. Acknowledgements.............................................46 86 13. IANA Considerations..........................................46 87 14. Security Considerations......................................46 88 15. Changes Since Last Version...................................46 89 16. Authors' Addresses...........................................50 91 1. Introduction 93 Current telepresence systems, though based on open standards such 94 as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with 95 each other. A major factor limiting the interoperability of 96 telepresence systems is the lack of a standardized way to describe 97 and negotiate the use of the multiple streams of audio and video 98 comprising the media flows. This draft provides a framework for a 99 protocol to enable interoperability by handling multiple streams in 100 a standardized way. It is intended to support the use cases 101 described in draft-ietf-clue-telepresence-use-cases and to meet the 102 requirements in draft-ietf-clue-telepresence-requirements. 104 Conceptually distinguished are Media Providers and Media Consumers. 105 A Media Provider provides Media in the form of RTP packets, a Media 106 Consumer consumes those RTP packets. Media Providers and Media 107 Consumers can reside in Endpoints or in middleboxes such as 108 Multipoint Control Units (MCUs). A Media Provider in an Endpoint 109 is usually associated with the generation of media for Media 110 Captures; these Media Captures are typically sourced from cameras, 111 microphones, and the like. 
Similarly, the Media Consumer in an 112 Endpoint is usually associated with Renderers, such as screens and 113 loudspeakers. In middleboxes, Media Providers and Consumers can 114 have the form of outputs and inputs, respectively, of RTP mixers, 115 RTP translators, and similar devices. Typically, telepresence 116 devices such as Endpoints and middleboxes would perform as both 117 Media Providers and Media Consumers, the former being concerned 118 with those devices' transmitted media and the latter with those 119 devices' received media. In a few circumstances, a CLUE Endpoint or 120 middlebox may include only Consumer or Provider functionality, such 121 as recorder-type Consumers or webcam-type Providers.
123 Motivations for this document (and, in fact, for the existence of 124 the CLUE protocol) include:
126 (1) Endpoints according to this document can, and usually do, have 127 multiple Media Captures and Media Renderers, for example, 128 multiple cameras and screens. While previous system designs were 129 able to set up calls that would light up all screens and cameras 130 (or equivalent), what was missing was a mechanism that can 131 associate the Media Captures with each other in space and time.
133 (2) The mere fact that there are multiple capture and rendering 134 devices, each of which may be configurable in aspects such as zoom, 135 leads to the difficulty that a variable number of such devices can 136 be used to capture different aspects of a region. The Capture 137 Scene concept allows for the description of multiple setups for 138 those multiple capture devices that could represent sensible 139 operation points of the physical capture devices in a room, chosen 140 by the operator. A Consumer can pick and choose from those 141 configurations based on its rendering abilities and inform the 142 Provider about its choices. Details are provided in section 6.
144 (3) In some cases, physical limitations disallow the concurrent use 145 of a device in more than one setup. For example, the center camera 146 in a typical three-camera conference room can set its zoom 147 objective either to capture only the middle few seats, or all seats 148 of a room, but not both concurrently. The simultaneous capture set 149 concept allows a Provider to signal such limitations. Simultaneous 150 capture sets are part of the Capture Scene description, and are 151 discussed in section 6.3.
153 (4) Often, the devices in a room do not have the computational 154 complexity or connectivity to deal with multiple encoding options 155 simultaneously, even if each of these options may be sensible in 156 certain environments, and even if the simultaneous transmission may 157 also be sensible (e.g. in the case of multicast media distribution to 158 multiple endpoints). Such constraints can be expressed by the 159 Provider using the Encoding Group concept, described in section 7.
161 (5) Due to the potentially large number of RTP flows required for a 162 Multimedia Conference involving potentially many Endpoints, each of 163 which can have many Media Captures and Media Renderers, a sensible 164 system design is to multiplex multiple RTP media flows onto the 165 same transport address, so as to avoid using the port number as a 166 multiplexing point and the associated shortcomings such as 167 NAT/firewall traversal issues.
While the actual mapping of those RTP 168 flows to the header fields of the RTP packets is not subject of 169 this specification, the large number of possible permutations of 170 sensible options a Media Provider may make available to a Media 171 Consumer makes a mechanism desirable that allows to narrow down the 172 number of possible options that a SIP offer-answer exchange has to 173 consider. Such information is made available using protocol 174 mechanisms specified in this document and companion documents, 175 although it should be stressed that its use in an implementation is 176 optional. Also, there are aspects of the control of both Endpoints 177 and middleboxes/MCUs that dynamically change during the progress of 178 a call, such as audio-level based screen switching, layout changes, 179 and so on, which need to be conveyed. Note that these control 180 aspects are complementary to those specified in traditional SIP 181 based conference management such as BFCP. An exemplary call flow 182 can be found in section 4. 184 Finally, all this information needs to be conveyed, and the notion 185 of support for it needs to be established. This is done by the 186 negotiation of a "CLUE channel", a data channel negotiated early 187 during the initiation of a call. An Endpoint or MCU that rejects 188 the establishment of this data channel, by definition, is not 189 supporting CLUE based mechanisms, whereas an Endpoint or MCU that 190 accepts it is required to use it to the extent specified in this 191 document and its companion documents. 193 Edt. note: Certain sections in the document are marked with "BEGIN 194 OUTSOURCE" and "END OUTSOURCE". Text between those markers should 195 be removed from the framework in an upcoming version and replaced 196 by a paragraph or two that describe informatively and concisely the 197 removed text, including references to the normative definitions of 198 the text. This is mostly an alignment issue with the data model 199 draft. 201 2. Terminology 203 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 204 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 205 this document are to be interpreted as described in RFC 2119 206 [RFC2119]. 208 3. Definitions 210 The terms defined below are used throughout this document and 211 companion documents and they are normative. In order to easily 212 identify the use of a defined term, those terms are capitalized. 214 Advertisement: a CLUE message a Media Provider sends to a Media 215 Consumer describing specific aspects of the content of the media, 216 the formatting of the media streams it can send, and any 217 restrictions it has in terms of being able to provide certain 218 Streams simultaneously. 220 Audio Capture: Media Capture for audio. Denoted as ACn in the 221 example cases in this document. 223 Camera-Left and Right: For Media Captures, camera-left and camera- 224 right are from the point of view of a person observing the rendered 225 media. They are the opposite of Stage-Left and Stage-Right. 227 Capture: Same as Media Capture. 229 Capture Device: A device that converts audio and video input into 230 an electrical signal, in most cases to be fed into a media encoder. 232 Capture Encoding: A specific encoding of a Media Capture, to be 233 sent by a Media Provider to a Media Consumer via RTP. 235 Capture Scene: a structure representing a spatial region containing 236 one or more Capture Devices, each capturing media representing a 237 portion of the region. 
The spatial region represented by a Capture 238 Scene may or may not correspond to a real region in physical space, 239 such as a room. A Capture Scene includes attributes and one or 240 more Capture Scene Entries, with each entry including one or more 241 Media Captures. 243 Capture Scene Entry: a list of Media Captures of the same media 244 type that together form one way to represent the entire Capture 245 Scene. 247 Conference: used as defined in [RFC4353], A Framework for 248 Conferencing within the Session Initiation Protocol (SIP). 250 Configure message: A CLUE message a Media Consumer sends to a Media 251 Provider specifying which content and media streams it wants to 252 receive, based on the information in a corresponding Advertisement 253 message. 255 Consumer: short for Media Consumer. 257 Encoding or Individual Encoding: a set of parameters representing a 258 way to encode a Media Capture to become a Capture Encoding. 260 Encoding Group: A set of encoding parameters representing a total 261 media encoding capability to be sub-divided across potentially 262 multiple Individual Encodings. 264 Endpoint: The logical point of final termination through receiving, 265 decoding and rendering, and/or initiation through capturing, 266 encoding, and sending of media streams. An endpoint consists of 267 one or more physical devices which source and sink media streams, 268 and exactly one [RFC4353] Participant (which, in turn, includes 269 exactly one SIP User Agent). Endpoints can be anything from 270 multiscreen/multicamera rooms to handheld devices. 272 Front: the portion of the room closest to the cameras. In going 273 towards back you move away from the cameras. 275 MCU: Multipoint Control Unit (MCU) - a device that connects two or 276 more endpoints together into one single multimedia conference 277 [RFC5117]. An MCU includes an [RFC4353] like Mixer, without the 278 [RFC4353] requirement to send media to each participant. 280 Media: Any data that, after suitable encoding, can be conveyed over 281 RTP, including audio, video or timed text. 283 Media Capture: a source of Media, such as from one or more Capture 284 Devices or constructed from other Media streams. 286 Media Consumer: an Endpoint or middle box that receives Media 287 streams 289 Media Provider: an Endpoint or middle box that sends Media streams 291 Model: a set of assumptions a telepresence system of a given vendor 292 adheres to and expects the remote telepresence system(s) also to 293 adhere to. 295 Plane of Interest: The spatial plane containing the most relevant 296 subject matter. 298 Provider: Same as Media Provider. 300 Render: the process of generating a representation from a media, 301 such as displayed motion video or sound emitted from loudspeakers. 303 Simultaneous Transmission Set: a set of Media Captures that can be 304 transmitted simultaneously from a Media Provider. 306 Spatial Relation: The arrangement in space of two objects, in 307 contrast to relation in time or other relationships. See also 308 Camera-Left and Right. 310 Stage-Left and Right: For Media Captures, Stage-left and Stage- 311 right are the opposite of Camera-left and Camera-right. For the 312 case of a person facing (and captured by) a camera, Stage-left and 313 Stage-right are from the point of view of that person. 315 Stream: a Capture Encoding sent from a Media Provider to a Media 316 Consumer via RTP [RFC3550]. 
318 Stream Characteristics: the media stream attributes commonly used 319 in non-CLUE SIP/SDP environments (such as: media codec, bit rate, 320 resolution, profile/level etc.) as well as CLUE specific 321 attributes, such as the Capture ID or a spatial location. 323 Video Capture: Media Capture for video. Denoted as VCn in the 324 example cases in this document. 326 Video Composite: A single image that is formed, normally by an RTP 327 mixer inside an MCU, by combining visual elements from separate 328 sources. 330 4. Overview of the Framework/Model 332 The CLUE framework specifies how multiple media streams are to be 333 handled in a telepresence conference. 335 A Media Provider (transmitting Endpoint or MCU) describes specific 336 aspects of the content of the media and the formatting of the media 337 streams it can send in an Advertisement; and the Media Consumer 338 responds to the Media Provider by specifying which content and 339 media streams it wants to receive in a Configure message. The 340 Provider then transmits the asked-for content in the specified 341 streams. 343 This Advertisement and Configure occurs as a minimum during call 344 initiation but may also happen at any time throughout the call, 345 whenever there is a change in what the Consumer wants to receive or 346 (perhaps less common) the Provider can send. 348 An Endpoint or MCU typically act as both Provider and Consumer at 349 the same time, sending Advertisements and sending Configurations in 350 response to receiving Advertisements. (It is possible to be just 351 one or the other.) 353 The data model is based around two main concepts: a Capture and an 354 Encoding. A Media Capture (MC), such as audio or video, describes 355 the content a Provider can send. Media Captures are described in 356 terms of CLUE-defined attributes, such as spatial relationships and 357 purpose of the capture. Providers tell Consumers which Media 358 Captures they can provide, described in terms of the Media Capture 359 attributes. 361 A Provider organizes its Media Captures into one or more Capture 362 Scenes, each representing a spatial region, such as a room. A 363 Consumer chooses which Media Captures it wants to receive from each 364 Capture Scene. 366 In addition, the Provider can send the Consumer a description of 367 the Individual Encodings it can send in terms of the media 368 attributes of the Encodings, in particular, audio and video 369 parameters such as bandwidth, frame rate, macroblocks per second. 370 Note that this is optional, and intended to minimize the number of 371 options a later SDP offer-answer would require to include in the 372 SDP in case of complex setups, as should become clearer shortly 373 when discussing an outline of the call flow. 375 The Provider can also specify constraints on its ability to provide 376 Media, and a sensible design choice for a Consumer is to take these 377 into account when choosing the content and Capture Encodings it 378 requests in the later offer-answer exchange. Some constraints are 379 due to the physical limitations of devices - for example, a camera 380 may not be able to provide zoom and non-zoom views simultaneously. 381 Other constraints are system based constraints, such as maximum 382 bandwidth and maximum macroblocks/second. 384 A very brief outline of the call flow used by a simple system (two 385 Endpoints) in compliance with this document can be described as 386 follows, and as shown in the following figure. 
388 +-----------+ +-----------+ 389 | Endpoint1 | | Endpoint2 | 390 +----+------+ +-----+-----+ 391 | INVITE (BASIC SDP+CLUECHANNEL) | 392 |--------------------------------->| 393 | 200 0K (BASIC SDP+CLUECHANNEL)| 394 |<---------------------------------| 395 | ACK | 396 |--------------------------------->| 397 | | 398 |<################################>| 399 | BASIC SDP MEDIA SESSION | 400 |<################################>| 401 | | 402 | CONNECT (CLUE CTRL CHANNEL) | 403 |=================================>| 404 | ... | 405 |<================================>| 406 | CLUE CTRL CHANNEL ESTABLISHED | 407 |<================================>| 408 | | 409 | ADVERTISEMENT 1 | 410 |*********************************>| 411 | ADVERTISEMENT 2 | 412 |<*********************************| 413 | | 414 | CONFIGURE 1 | 415 |<*********************************| 416 | CONFIGURE 2 | 417 |*********************************>| 418 | | 419 | REINVITE (UPDATED SDP) | 420 |--------------------------------->| 421 | 200 0K (UPDATED SDP)| 422 |<---------------------------------| 423 | ACK | 424 |--------------------------------->| 425 | | 426 |<################################>| 427 | UPDATED SDP MEDIA SESSION | 428 |<################################>| 429 | | 430 v v 432 An initial offer/answer exchange establishes a basic media session, 433 for example audio-only, and a CLUE channel between two Endpoints. 434 With the establishment of that channel, the endpoints have 435 consented to use the CLUE protocol mechanisms and have to adhere to 436 them. 438 Over this CLUE channel, the Provider in each Endpoint conveys its 439 characteristics and capabilities by sending an Advertisement as 440 specified herein (which will typically not be sufficient to set up 441 all media). The Consumer in the Endpoint receives the information 442 provided by the Provider, and can use it for two purposes. First, 443 it constructs and sends a CLUE Configure message to tell the 444 Provider what the Consumer wishes to receive. Second, it can, but 445 is not necessarily required to, use the information provided to 446 tailor the SDP it is going to send during the following SIP 447 offer/answer exchange, and its reaction to SDP it receives in that 448 step. It is often a sensible implementation choice to do so, as 449 the representation of the media information conveyed over the CLUE 450 channel can dramatically cut down on the size of SDP messages used 451 in the O/A exchange that follows. Spatial relationships associated 452 with the Media can be included in the Advertisement, and it is 453 often sensible for the Media Consumer to take those spatial 454 relationships into account when tailoring the SDP. 456 This CLUE exchange is followed by an SDP offer answer exchange that 457 not only establishes those aspects of the media that have not been 458 "negotiated" over CLUE, but has also the side effect of setting up 459 the media transmission itself, involving potentially security 460 exchanges, ICE, and whatnot. This step is plain vanilla SIP, with 461 the exception that the SDP used herein, in most cases can (but not 462 necessarily must) be considerably smaller than the SDP a system 463 would typically need to exchange if there were no pre-established 464 knowledge about the Provider and Consumer characteristics. 
(The 465 need for cutting down SDP size may not be obvious for a point-to- 466 point call involving simple endpoints; however, when considering a 467 large multipoint conference involving many multi-screen/multi- 468 camera endpoints, each of which can operate using multiple codecs 469 for each camera and microphone, it becomes perhaps somewhat more 470 intuitive.)
472 During the lifetime of a call, further exchanges can occur over the 473 CLUE channel. In some cases, those further exchanges can lead to a 474 modified system behavior of Provider or Consumer (or both) without 475 any other protocol activity such as further offer/answer exchanges. 476 For example, voice-activated screen switching, signaled over the 477 CLUE channel, ought not to lead to heavy-handed mechanisms like SIP 478 re-invites. However, in other cases, after the CLUE negotiation an 479 additional offer/answer exchange may become necessary. For 480 example, if both sides decide to upgrade the call from a single 481 screen to a multi-screen call and more bandwidth is required for 482 the additional video channels, that could require a new O/A 483 exchange.
485 Numerous optimizations may be possible, and are the implementer's 486 choice. For example, it may be sensible to establish one or more 487 initial media channels during the initial offer/answer exchange, 488 which would allow, for example, for a fast startup of audio. 489 Depending on the system design, it may be possible to re-use this 490 established channel for more advanced media negotiated only by CLUE 491 mechanisms, thereby avoiding further offer/answer exchanges.
493 Edt. note: The editors are not sure whether the mentioned 494 overloading of established RTP channels using only CLUE messages is 495 possible, or desired by the WG. If it were, there would certainly be 496 a need for specification work. One possible issue: a Provider that 497 thinks it can switch, say, an audio codec algorithm by CLUE 498 alone talks to a Consumer that thinks it has to faithfully 499 answer the Provider's Advertisement through a Configure, but does 500 not dare set up its internal resources until it has 501 completed an authoritative O/A exchange. Working group input is 502 solicited.
504 One aspect of the protocol outlined herein and specified in 505 normative detail in companion documents is that it makes available 506 information regarding the Provider's capabilities to deliver Media, 507 and attributes related to that Media, such as spatial 508 relationships, to the Consumer. The operation of the Renderer 509 inside the Consumer is unspecified in that it can choose to ignore 510 some information provided by the Provider, and/or not render media 511 streams available from the Provider (although it has to follow the 512 CLUE protocol and, therefore, has to gracefully receive and respond 513 (through a Configure) to the Provider's information). All CLUE 514 protocol mechanisms are optional in the Consumer in the sense that, 515 while the Consumer must be able to receive (and, potentially, 516 gracefully acknowledge) CLUE messages, it is free to ignore the 517 information provided therein. Obviously, ignoring all of this 518 information is not a particularly sensible design choice.
520 Legacy devices are defined herein as those Endpoints and MCUs that 521 do not support the setup and use of the CLUE channel.
The notion 522 of a device being a legacy device is established during the initial 523 offer/answer exchange, in which the legacy device will not 524 understand the offer for the CLUE channel and, therefore, reject 525 it. This is the indication for the CLUE-implementing Endpoint or 526 MCU that the other side of the communication is not compliant with 527 CLUE, and to fall back to whatever mechanism was used before the 528 introduction of CLUE. 530 As for the media, Provider and Consumer have an end-to-end 531 communication relationship with respect to (RTP transported) media; 532 and the mechanisms described herein and in companion documents do 533 not change the aspects of setting up those RTP flows and sessions. 534 However, it should be noted that forms of RTP multiplexing of 535 multiple RTP flows onto the same transport address are developed 536 concurrently with the CLUE suite of specifications, and it is 537 widely expected that most, if not all, Endpoints or MCUs supporting 538 CLUE will also support those mechanisms. Some design choices made 539 in this document reflect this coincidence in spec development 540 timing. 542 5. Spatial Relationships 544 In order for a Consumer to perform a proper rendering, it is often 545 necessary or at least helpful for the Consumer to have received 546 spatial information about the streams it is receiving. CLUE 547 defines a coordinate system that allows Media Providers to describe 548 the spatial relationships of their Media Captures to enable proper 549 scaling and spatially sensible rendering of their streams. The 550 coordinate system is based on a few principles: 552 o Simple systems which do not have multiple Media Captures to 553 associate spatially need not use the coordinate model. 555 o Coordinates can either be in real, physical units (millimeters), 556 have an unknown scale or have no physical scale. Systems which 557 know their physical dimensions (for example professionally 558 installed Telepresence room systems) should always provide those 559 real-world measurements. Systems which don't know specific 560 physical dimensions but still know relative distances should use 561 'unknown scale'. 'No scale' is intended to be used where Media 562 Captures from different devices (with potentially different 563 scales) will be forwarded alongside one another (e.g. in the 564 case of a middle box). 566 * "millimeters" means the scale is in millimeters 567 * "Unknown" means the scale is not necessarily millimeters, but 568 the scale is the same for every Capture in the Capture Scene. 570 * "No Scale" means the scale could be different for each 571 capture- an MCU provider that advertises two adjacent 572 captures and picks sources (which can change quickly) from 573 different endpoints might use this value; the scale could be 574 different and changing for each capture. But the areas of 575 capture still represent a spatial relation between captures. 577 o The coordinate system is Cartesian X, Y, Z with the origin at a 578 spatial location of the provider's choosing. The Provider must 579 use the same coordinate system with same scale and origin for 580 all coordinates within the same Capture Scene. 582 The direction of increasing coordinate values is: 583 X increases from Camera-Left to Camera-Right 584 Y increases from Front to back 585 Z increases from low to high 587 6. Media Captures and Capture Scenes 589 This section describes how Providers can describe the content of 590 media to Consumers. 592 6.1. 
Media Captures 594 Media Captures are the fundamental representations of streams that 595 a device can transmit. What a Media Capture actually represents is 596 flexible: 598 o It can represent the immediate output of a physical source (e.g. 599 camera, microphone) or 'synthetic' source (e.g. laptop computer, 600 DVD player). 602 o It can represent the output of an audio mixer or video composer 604 o It can represent a concept such as 'the loudest speaker' 606 o It can represent a conceptual position such as 'the leftmost 607 stream' 609 To identify and distinguish between multiple instances, video and 610 audio captures are labeled. For instance: VC1, VC2 and AC1, AC2, 611 where VC1 and VC2 refer to two different video captures and AC1 612 and AC2 refer to two different audio captures. 614 Some key points about Media Captures: 616 . A Media Capture is of a single media type (e.g. audio or 617 video) 618 . A Media Capture is associated with exactly one Capture Scene 619 . A Media Capture has exactly one set of spatial information 620 . A Media Capture may be the source of one or more Capture 621 Encodings 623 Each Media Capture can be associated with attributes to describe 624 what it represents. 626 6.1.1. Media Capture Attributes 628 Media Capture Attributes describe static information about the 629 Captures. A Provider can use the Media Capture Attributes to 630 describe the Captures for the benefit of the Consumer in the 631 Advertisement message. Media Capture Attributes include 633 . spatial information, such as point of capture, point on line 634 of capture, and area of capture, all of which, in combination 635 define the capture field of, for example, a camera; 636 . Capture multiplexing information (composed/switched video, 637 mono/stereo audio, maximum number of simultaneous encodings 638 per Capture and so on); and 639 . Control information for use inside the CLUE protocol suite. 641 BEGIN OUTSOURCE (to datamodel) 643 Media Capture Attributes describe static information about the 644 captures. A provider uses the media capture attributes to describe 645 the media captures to the consumer. The consumer will select the 646 captures it wants to receive. Attributes are defined by a variable 647 and its value. The currently defined attributes and their values 648 are: 650 Content: {slides, speaker, sl, main, alt} 652 A field with enumerated values which describes the role of the 653 media capture and can be applied to any media type. The enumerated 654 values are defined by [RFC4796]. The values for this attribute are 655 the same as the mediacnt values for the content attribute in 656 [RFC4796]. This attribute can have multiple values, for example 657 content={main, speaker}. 659 Composed: {true, false} 661 A field with a Boolean value which indicates whether or not the 662 Media Capture is a mix (audio) or composition (video) of streams. 664 This attribute is useful for a media consumer to avoid nesting a 665 composed video capture into another composed capture or rendering. 666 This attribute is not intended to describe the layout a media 667 provider uses when composing video streams. 669 Audio Channel Format: {mono, stereo} A field with enumerated values 670 which describes the method of encoding used for audio. 672 A value of 'mono' means the Audio Capture has one channel. 674 A value of 'stereo' means the Audio Capture has two audio channels, 675 left and right. 677 This attribute applies only to Audio Captures. 
A single stereo 678 capture is different from two mono captures that have a left-right 679 spatial relationship. A stereo capture maps to a single RTP 680 stream, while each mono audio capture maps to a separate RTP 681 stream. 683 Switched: {true, false} 685 A field with a Boolean value which indicates whether or not the 686 Media Capture represents the (dynamic) most appropriate subset of a 687 'whole'. What is 'most appropriate' is up to the provider and 688 could be the active speaker, a lecturer or a VIP. 690 Point of Capture: {(X, Y, Z)} 692 A field with a single Cartesian (X, Y, Z) point value which 693 describes the spatial location, virtual or physical, of the 694 capturing device (such as camera). 696 When the Point of Capture attribute is specified, it must include 697 X, Y and Z coordinates. If the point of capture is not specified, 698 it means the consumer should not assume anything about the spatial 699 location of the capturing device. Even if the provider specifies 700 an area of capture attribute, it does not need to specify the point 701 of capture. 703 Point on Line of Capture: {(X,Y,Z)} 705 A field with a single Cartesian (X, Y, Z) point value (virtual or 706 physical) which describes a position in space of a second point on 707 the axis of the capturing device; the first point being the Point 708 of Capture (see above). This point MUST lie between the Point of 709 Capture and the Area of Capture. 711 The Point on Line of Capture MUST be ignored if the Point of 712 Capture is not present for this capture device. When the Point on 713 Line of Capture attribute is specified, it must include X, Y and Z 714 coordinates. These coordinates MUST NOT be identical to the Point 715 of Capture coordinates. If the Point on Line of Capture is not 716 specified, no assumptions are made about the axis of the capturing 717 device. 719 Area of Capture: 721 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, 722 Y3, Z3), top right(X4, Y4, Z4)} 724 A field with a set of four (X, Y, Z) points as a value which 725 describe the spatial location of what is being "captured". By 726 comparing the Area of Capture for different Media Captures within 727 the same Capture Scene a consumer can determine the spatial 728 relationships between them and render them correctly. 730 The four points should be co-planar. The four points form a 731 quadrilateral, not necessarily a rectangle. 733 The quadrilateral described by the four (X, Y, Z) points defines 734 the plane of interest for the particular media capture. 736 If the area of capture attribute is specified, it must include X, Y 737 and Z coordinates for all four points. If the area of capture is 738 not specified, it means the Media Capture is not spatially related 739 to any other Media Capture (but this can change in a subsequent 740 provider Advertisement). 742 For a switched capture that switches between different sections 743 within a larger area, the area of capture should use coordinates 744 for the larger potential area. 746 EncodingGroup: {} 748 A field with a value equal to the encodeGroupID of the encoding 749 group associated with the media capture. 751 Max Capture Encodings: {unsigned integer} 753 An optional attribute indicating the maximum number of capture 754 encodings that can be simultaneously active for the media capture. 755 If absent, this parameter defaults to 1. The minimum value for 756 this attribute is 1. 
The number of simultaneous capture encodings 757 is also limited by the restrictions of the encoding group for the 758 media capture. 760 END OUTSOURCE 762 6.2. Capture Scene 764 In order for a Provider's individual Captures to be used 765 effectively by a Consumer, the provider organizes the Captures into 766 one or more Capture Scenes, with the structure and contents of 767 these Capture Scenes being sent from the Provider to the Consumer 768 in the Advertisement. 770 A Capture Scene is a structure representing a spatial region 771 containing one or more Capture Devices, each capturing media 772 representing a portion of the region. A Capture Scene includes one 773 or more Capture Scene entries, with each entry including one or 774 more Media Captures. A Capture Scene represents, for example, the 775 video image of a group of people seated next to each other, along 776 with the sound of their voices, which could be represented by some 777 number of VCs and ACs in the Capture Scene Entries. A middle box 778 may also express Capture Scenes that it constructs from media 779 Streams it receives. 781 A Provider may advertise multiple Capture Scenes or just a single 782 Capture Scene. What constitutes an entire Capture Scene is up to 783 the Provider. A Provider might typically use one Capture Scene for 784 participant media (live video from the room cameras) and another 785 Capture Scene for a computer generated presentation. In more 786 complex systems, the use of additional Capture Scenes is also 787 sensible. For example, a three camera room may advertise two 788 Capture Scenes involving live video, one including only the center 789 camera (and associated audio), the other involving all three 790 cameras (and associated audio). 792 A Capture Scene may (and typically will) include more than one type 793 of media. For example, a Capture Scene can include several Capture 794 Scene Entries for Video Captures, and several Capture Scene Entries 795 for Audio Captures. A particular Capture may be included in more 796 than one Capture Scene Entry. 798 A provider can express spatial relationships between Captures that 799 are included in the same Capture Scene. However, there is not 800 necessarily the same spatial relationship between Media Captures 801 that are in different Capture Scenes. In other words, Capture 802 Scenes can use their own spatial measurement system as outlined 803 above in section 5. 805 A Provider arranges Captures in a Capture Scene to help the 806 Consumer choose which captures it wants. The Capture Scene Entries 807 in a Capture Scene are different alternatives the provider is 808 suggesting for representing the Capture Scene. The order of 809 Capture Scene Entries within a Capture Scene has no significance. 810 The Media Consumer can choose to receive all Media Captures from 811 one Capture Scene Entry for each media type (e.g. audio and video), 812 or it can pick and choose Media Captures regardless of how the 813 Provider arranges them in Capture Scene Entries. Different Capture 814 Scene Entries of the same media type are not necessarily mutually 815 exclusive alternatives. Also note that the presence of multiple 816 Capture Scene Entries (with potentially multiple encoding options 817 in each entry) in a given Capture Scene does not necessarily imply 818 that a Provider is able to serve all the associated media 819 simultaneously (although the construction of such an over-rich 820 Capture Scene is probably not sensible in many cases). 
What a 821 Provider can send simultaneously is determined through the 822 Simultaneous Transmission Set mechanism, described in section 6.3. 824 Captures within the same Capture Scene entry must be of the same 825 media type - it is not possible to mix audio and video captures in 826 the same Capture Scene Entry, for instance. The Provider must be 827 capable of encoding and sending all Captures in a single Capture 828 Scene Entry simultaneously. The order of Captures within a Capture 829 Scene Entry has no significance. A Consumer may decide to receive 830 all the Captures in a single Capture Scene Entry, but a Consumer 831 could also decide to receive just a subset of those captures. A 832 Consumer can also decide to receive Captures from different Capture 833 Scene Entries, all subject to the constraints set by Simultaneous 834 Transmission Sets, as discussed in section 6.3. 836 When a Provider advertises a Capture Scene with multiple entries, 837 it is essentially signaling that there are multiple representations 838 of the same Capture Scene available. In some cases, these multiple 839 representations would typically be used simultaneously (for 840 instance a "video entry" and an "audio entry"). In some cases the 841 entries would conceptually be alternatives (for instance an entry 842 consisting of three Video Captures covering the whole room versus 843 an entry consisting of just a single Video Capture covering only 844 the center if a room). In this latter example, one sensible choice 845 for a Consumer would be to indicate (through its Configure and 846 possibly through an additional offer/answer exchange) the Captures 847 of that Capture Scene Entry that most closely matched the 848 Consumer's number of display devices or screen layout. 850 The following is an example of 4 potential Capture Scene Entries 851 for an endpoint-style Provider: 853 1. (VC0, VC1, VC2) - left, center and right camera Video Captures 855 2. (VC3) - Video Capture associated with loudest room segment 857 3. (VC4) - Video Capture zoomed out view of all people in the room 859 4. (AC0) - main audio 861 The first entry in this Capture Scene example is a list of Video 862 Captures which have a spatial relationship to each other. 863 Determination of the order of these captures (VC0, VC1 and VC2) for 864 rendering purposes is accomplished through use of their Area of 865 Capture attributes. The second entry (VC3) and the third entry 866 (VC4) are alternative representations of the same room's video, 867 which might be better suited to some Consumers' rendering 868 capabilities. The inclusion of the Audio Capture in the same 869 Capture Scene indicates that AC0 is associated with all of those 870 Video Captures, meaning it comes from the same spatial region. 871 Therefore, if audio were to be rendered at all, this audio would be 872 the correct choice irrespective of which Video Captures were 873 chosen. 875 6.2.1. Capture Scene attributes 877 Capture Scene Attributes can be applied to Capture Scenes as well 878 as to individual media captures. Attributes specified at this 879 level apply to all constituent Captures. Capture Scene attributes 880 include 882 . Human-readable description of the Capture Scene; 883 . Area of Scene, describing the spatial area of all Captures of 884 a Capture Scene (in contrast to a field of a Capture in 885 isolation); and 886 . Scale information (millimeters, unknown, no scale). 
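To make this structure more concrete, the following is a minimal, non-normative sketch (in Python) of the example Capture Scene just described, including the Capture Scene attributes listed above. The class and field names, the encoding group identifiers (such as "EG0") and all coordinate values are illustrative assumptions only; the normative representation of an Advertisement is defined in the CLUE data model and protocol documents.

   # Non-normative sketch; names and values are illustrative only.
   from dataclasses import dataclass, field
   from typing import List, Tuple

   Point = Tuple[float, float, float]   # (X, Y, Z) in the scene's scale

   @dataclass
   class MediaCapture:
       capture_id: str                  # e.g. "VC0", "AC0"
       media_type: str                  # "video" or "audio"
       encoding_group: str              # encodeGroupID this Capture maps to
       area_of_capture: List[Point] = field(default_factory=list)

   @dataclass
   class CaptureSceneEntry:
       media_type: str
       captures: List[str]              # Capture IDs, all of the same type

   @dataclass
   class CaptureScene:
       description: str
       scale: str                       # "millimeters", "unknown" or "no scale"
       captures: List[MediaCapture]
       entries: List[CaptureSceneEntry]

   # Three-camera room, dimensions in millimeters (made-up values).
   # Each area is bottom left, bottom right, top left, top right.
   left   = [(-3000, 2500, 0), (-1000, 2500, 0), (-3000, 2500, 1200), (-1000, 2500, 1200)]
   center = [(-1000, 2500, 0), ( 1000, 2500, 0), (-1000, 2500, 1200), ( 1000, 2500, 1200)]
   right  = [( 1000, 2500, 0), ( 3000, 2500, 0), ( 1000, 2500, 1200), ( 3000, 2500, 1200)]
   whole  = [(-3000, 2500, 0), ( 3000, 2500, 0), (-3000, 2500, 1200), ( 3000, 2500, 1200)]

   scene = CaptureScene(
       description="main conference room",
       scale="millimeters",
       captures=[
           MediaCapture("VC0", "video", "EG0", left),
           MediaCapture("VC1", "video", "EG1", center),
           MediaCapture("VC2", "video", "EG2", right),
           MediaCapture("VC3", "video", "EG1", whole),  # switched capture: larger area
           MediaCapture("VC4", "video", "EG1", whole),  # zoomed-out view
           MediaCapture("AC0", "audio", "EG3", whole),
       ],
       entries=[
           CaptureSceneEntry("video", ["VC0", "VC1", "VC2"]),
           CaptureSceneEntry("video", ["VC3"]),
           CaptureSceneEntry("video", ["VC4"]),
           CaptureSceneEntry("audio", ["AC0"]),
       ],
   )

A Consumer receiving such a scene can order VC0, VC1 and VC2 from camera-left to camera-right purely from their area of capture coordinates, and knows from the shared Capture Scene that AC0 is the matching audio regardless of which video entry it selects.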
888 OUTSOURCE TO data model 890 Description attribute - list of {description text, language tag}
893 The optional description attribute is a list of human-readable text 894 strings which describe the capture scene. If there is more than 895 one string in the list, then each string in the list should contain 896 the same description, but in a different language. A provider that 897 advertises multiple capture scenes can provide descriptions for 898 each of them. This attribute can contain text in any number of 899 languages.
901 The language tag identifies the language of the corresponding 902 description text. The possible values for a language tag are the 903 values of the 'Subtag' column for the "Type: language" entries in 904 the "Language Subtag Registry" at [IANA-Lan], originally defined in 905 [RFC5646]. A particular language tag value MUST NOT be used more 906 than once in the description attribute list.
908 Area of Scene attribute 910 The area of scene attribute for a capture scene has the same format 911 as the area of capture attribute for a media capture. The area of 912 scene is for the entire scene, which is captured by the one or more 913 media captures in the capture scene entries. If the provider does 914 not specify the area of scene, but does specify areas of capture, 915 then the consumer may assume the area of scene is greater than or 916 equal to the outer extents of the individual areas of capture.
918 Scale attribute 920 An optional attribute indicating whether the numbers used for area of 921 scene, area of capture and point of capture are in terms of 922 millimeters, an unknown scale factor, or no scale, as described 923 in Section 5. If any media captures have an area of capture 924 attribute or point of capture attribute, then this scale attribute 925 must also be defined. The possible values for this attribute are: 927 "millimeters" 929 "unknown" 931 "no scale"
933 END OUTSOURCE
935 6.2.2. Capture Scene Entry attributes 937 A Capture Scene can include one or more Capture Scene Entries in 938 addition to the Capture Scene wide attributes described above. 939 Capture Scene Entry attributes apply to the Capture Scene Entry as 940 a whole, i.e. to all Captures that are part of the Capture Scene 941 Entry, but only if the Capture is invoked through this Capture 942 Scene.
944 Capture Scene Entry attributes include: 946 . Scene-switch-policy: {site-switch, segment-switch}
948 BEGIN OUTSOURCE to data model 949 A media provider uses this scene-switch-policy attribute to 950 indicate its support for different switching policies. In the 951 provider's Advertisement, this attribute can have multiple values, 952 which means the provider supports each of the indicated policies. 953 The consumer, when it requests media captures from this Capture 954 Scene Entry, should also include this attribute but with only the 955 single value (from among the values indicated by the provider) 956 indicating the Consumer's choice for which policy it wants the 957 provider to use. If the provider does not support any of these 958 policies, it should omit this attribute.
960 The "site-switch" policy means all captures are switched at the 961 same time to keep captures from the same endpoint site together. 962 Let's say the speaker is at site A and everyone else is at a 963 "remote" site. 965 When the room at site A is shown, all the camera images from site A 966 are forwarded to the remote sites. Therefore at each receiving 967 remote site, all the screens display camera images from site A.
968 This can be used to preserve full size image display, and also 969 provide full visual context of the displayed far end, site A. In 970 site switching, there is a fixed relation between the cameras in 971 each room and the displays in remote rooms. The room or 972 participants being shown is switched from time to time based on who 973 is speaking or by manual control. 975 The "segment-switch" policy means different captures can switch at 976 different times, and can be coming from different endpoints. Still 977 using site A as where the speaker is, and "remote" to refer to all 978 the other sites, in segment switching, rather than sending all the 979 images from site A, only the image containing the speaker at site A 980 is shown. The camera images of the current speaker and previous 981 speakers (if any) are forwarded to the other sites in the 982 conference. 984 Therefore the screens in each site are usually displaying images 985 from different remote sites - the current speaker at site A and the 986 previous ones. This strategy can be used to preserve full size 987 image display, and also capture the non-verbal communication 988 between the speakers. In segment switching, the display depends on 989 the activity in the remote rooms - generally, but not necessarily 990 based on audio / speech detection. 992 END OUTSOURCE 994 6.3. Simultaneous Transmission Set Constraints 996 The Provider may have constraints or limitations on its ability to 997 send Captures. One type is caused by the physical limitations of 998 capture mechanisms; these constraints are represented by a 999 simultaneous transmission set. The second type of limitation 1000 reflects the encoding resources available - bandwidth and 1001 macroblocks/second. This type of constraint is captured by 1002 encoding groups, discussed below. 1004 Some Endpoints or MCUs can send multiple Captures simultaneously, 1005 however sometimes there are constraints that limit which Captures 1006 can be sent simultaneously with other Captures. A device may not 1007 be able to be used in different ways at the same time. Provider 1008 Advertisements are made so that the Consumer can choose one of 1009 several possible mutually exclusive usages of the device. This 1010 type of constraint is expressed in a Simultaneous Transmission Set, 1011 which lists all the Captures of a particular media type (e.g. 1012 audio, video, text) that can be sent at the same time. There are 1013 different Simultaneous Transmission Sets for each media type in the 1014 Advertisement. This is easier to show in an example. 1016 Consider the example of a room system where there are three cameras 1017 each of which can send a separate capture covering two persons 1018 each- VC0, VC1, VC2. The middle camera can also zoom out (using an 1019 optical zoom lens) and show all six persons, VC3. But the middle 1020 camera cannot be used in both modes at the same time - it has to 1021 either show the space where two participants sit or the whole six 1022 seats, but not both at the same time. 1024 Simultaneous transmission sets are expressed as sets of the Media 1025 Captures that could physically be transmitted at the same time, 1026 (though it may not make sense to do so). In this example the two 1027 simultaneous sets are shown in Table 1. The Consumer must make 1028 sure that it chooses one and not more of the mutually exclusive 1029 sets. 
A Consumer may choose any subset of the Captures in a 1030 simultaneous set; it does not have to choose all the Captures in a 1031 simultaneous set if it does not want to receive all of them.
1033 +-------------------+ 1034 | Simultaneous Sets | 1035 +-------------------+ 1036 | {VC0, VC1, VC2} | 1037 | {VC0, VC3, VC2} | 1038 +-------------------+ 1040 Table 1: Two Simultaneous Transmission Sets
1042 A Provider can optionally include the simultaneous sets in its 1043 provider Advertisement. These simultaneous set constraints apply 1044 across all the Capture Scenes in the Advertisement. It is a syntax 1045 conformance requirement that the simultaneous transmission sets 1046 must allow all the media captures in any particular Capture Scene 1047 Entry to be used simultaneously.
1049 If an Advertisement does not include Simultaneous Transmission 1050 Sets, then all Capture Scenes can be provided simultaneously. If 1051 multiple Capture Scene Entries are in a Capture Scene then the 1052 Consumer chooses at most one Capture Scene Entry per Capture Scene 1053 for each media type.
1055 If an Advertisement includes multiple Capture Scene Entries in a 1056 Capture Scene then the Consumer should choose one Capture Scene 1057 Entry for each media type, but may choose individual Captures based 1058 on the Simultaneous Transmission Sets.
1060 7. Encodings
1062 Individual encodings and encoding groups are CLUE's mechanisms 1063 allowing a Provider to signal its limitations for sending Captures, 1064 or combinations of Captures, to a Consumer. Consumers can map the 1065 Captures they want to receive onto the Encodings, with the encoding 1066 parameters they want. As for the relationship between the CLUE- 1067 specified mechanisms based on Encodings and the SIP Offer-Answer 1068 exchange, please refer to section 4.
1070 7.1. Individual Encodings
1072 An Individual Encoding represents a way to encode a Media Capture 1073 to become a Capture Encoding, to be sent as an encoded media stream 1074 from the Provider to the Consumer. An Individual Encoding has a 1075 set of parameters characterizing how the media is encoded. 1077 Different media types have different parameters, and different 1078 encoding algorithms may have different parameters. An Individual 1079 Encoding can be assigned to at most one Capture Encoding at any 1080 given time.
1082 The parameters of an Individual Encoding represent the maximum 1083 values for certain aspects of the encoding. A particular 1084 instantiation into a Capture Encoding might use lower values than 1085 these maximums.
1087 In general, the parameters of an Individual Encoding have been 1088 chosen to represent those negotiable parameters of media codecs of 1089 the media type that greatly influence computational complexity, 1090 while abstracting from details of particular media codecs used. 1091 The parameters have been chosen with those media codecs in mind 1092 that have seen wide deployment in the video conferencing and 1093 Telepresence industry.
1095 For video codecs (using H.26x compression technologies), those 1096 parameters include: 1098 . Maximum bitrate; 1099 . Maximum picture size in pixels; 1100 . Maximum number of pixels to be processed per second; and 1101 . CLUE-protocol internal information.
1103 For audio codecs, so far only one parameter has been identified: 1105 . Maximum bitrate.
1107 Edt. note: the maximum number of pixels per second is currently 1108 expressed as maxH264Mbps.
1110 Edt.
note: it would be desirable to make the computational 1111 complexity mechanism codec independent so to allow for expressing 1112 that, say, H.264 codecs are less complex than H.265 codecs, and, 1113 therefore, the same hardware can process higher pixel rates for 1114 H.264 than for H.265. To be discussed in the WG. 1116 BEGIN OUTSOURCE to data model 1117 The following tables show the variables for audio and video 1118 encoding. 1120 +--------------+--------------------------------------------------- 1121 -+ 1122 | Name | Description 1123 | 1124 +--------------+--------------------------------------------------- 1125 -+ 1126 | encodeID | A unique identifier for the individual encoding 1127 | 1128 | maxBandwidth | Maximum number of bits per second 1129 | 1130 | maxH264Mbps | Maximum number of macroblocks per second: ((width 1131 | 1132 | | + 15) / 16) * ((height + 15) / 16) * 1133 | 1134 | | framesPerSecond 1135 | 1136 | maxWidth | Video resolution's maximum supported width, 1137 | 1138 | | expressed in pixels 1139 | 1140 | maxHeight | Video resolution's maximum supported height, 1141 | 1142 | | expressed in pixels 1143 | 1144 | maxFrameRate | Maximum supported frame rate 1145 | 1146 +--------------+--------------------------------------------------- 1147 -+ 1149 Table 2: Individual Video Encoding Parameters 1151 +--------------+-----------------------------------+ 1152 | Name | Description | 1153 +--------------+-----------------------------------+ 1154 | maxBandwidth | Maximum number of bits per second | 1155 +--------------+-----------------------------------+ 1157 Table 3: Individual Audio Encoding Parameters 1159 END OUTSOURCE 1161 7.2. Encoding Group 1163 An Encoding Group includes a set of one or more Individual 1164 Encodings, and parameters that apply to the group as a whole. By 1165 grouping multiple individual Encodings together, an Encoding Group 1166 describes additional constraints on bandwidth and other parameters 1167 for the group. 1169 The Encoding Group data structure contains: 1171 . Maximum bitrate for all encodings in the group combined; 1172 . Maximum number of pixels per second for all video encodings of 1173 the group combined. 1174 . A list of identifiers for audio and video encodings, 1175 respectively, belonging to the group. 1177 BEGIN OUTSOURCE to data model 1179 Table 4 shows the parameters and individual encoding sets that are 1180 part of an encoding group. 1182 +-------------------+---------------------------------------------- 1183 -+ 1184 | Name | Description 1185 | 1186 +-------------------+---------------------------------------------- 1187 -+ 1188 | encodeGroupID | A unique identifier for the encoding group 1189 | 1190 | maxGroupBandwidth | Maximum number of bits per second relating to 1191 | 1192 | | all encodings combined 1193 | 1194 | maxGroupH264Mbps | Maximum number of macroblocks per second 1195 | 1196 | | relating to all video encodings combined 1197 | 1198 | videoEncodings[] | Set of potential encodings (list of 1199 | 1200 | | encodeIDs) 1201 | 1202 | audioEncodings[] | Set of potential encodings (list of 1203 | 1204 | | encodeIDs) 1205 | 1206 +-------------------+---------------------------------------------- 1207 -+ 1209 Table 4: Encoding Group 1211 END OUTSOURCE 1213 When the Individual Encodings in a group are instantiated into 1214 Capture Encodings, each Capture Encoding has a bandwidth that must 1215 be less than or equal to the maxBandwidth for the particular 1216 individual encoding. 
The maxGroupBandwidth parameter gives the 1217 additional restriction that the sum of all the individual capture 1218 encoding bandwidths must be less than or equal to the 1219 maxGroupBandwidth value. 1221 Likewise, the sum of the macroblocks per second of each 1222 instantiated encoding in the group must not exceed the 1223 maxGroupH264Mbps value. 1225 The following diagram illustrates one example of the structure of a 1226 media provider's Encoding Groups and their contents. 1228 ,-------------------------------------------------. 1229 | Media Provider | 1230 | | 1231 | ,--------------------------------------. | 1232 | | ,--------------------------------------. | 1233 | | | ,--------------------------------------. | 1234 | | | | Encoding Group | | 1235 | | | | ,-----------. | | 1236 | | | | | | ,---------. | | 1237 | | | | | | | | ,---------.| | 1238 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1239 | `.| | | | | | `---------'| | 1240 | `.| `-----------' `---------' | | 1241 | `--------------------------------------' | 1242 `-------------------------------------------------' 1244 Figure 1: Encoding Group Structure 1246 A Provider advertises one or more Encoding Groups. Each Encoding 1247 Group includes one or more Individual Encodings. Each Individual 1248 Encoding can represent a different way of encoding media. For 1249 example one Individual Encoding may be 1080p60 video, another could 1250 be 720p30, with a third being CIF, all in, for example, H.264 1251 format. 1253 While a typical three codec/display system might have one Encoding 1254 Group per "codec box" (physical codec, connected to one camera and 1255 one screen), there are many possibilities for the number of 1256 Encoding Groups a Provider may be able to offer and for the 1257 encoding values in each Encoding Group. 1259 There is no requirement for all Encodings within an Encoding Group 1260 to be instantiated at the same time. 1262 8. Associating Captures with Encoding Groups 1264 Every Capture is associated with an Encoding Group, which is used 1265 to instantiate that Capture into one or more Capture Encodings. 1266 Each Capture has an Encoding Group attribute. The value of this 1267 attribute is the encodeGroupID for the Encoding Group with which it 1268 is associated. More than one Capture may use the same Encoding 1269 Group. 1271 The maximum number of streams that can result from a particular 1272 Encoding Group constraint is equal to the number of individual 1273 Encodings in the group. The actual number of Capture Encodings 1274 used at any time may be less than this maximum. Any of the 1275 Captures that use a particular Encoding Group can be encoded 1276 according to any of the Individual Encodings in the group. If 1277 there are multiple Individual Encodings in the group, then the 1278 Consumer can configure the Provider, via a Configure message, to 1279 encode a single Media Capture into multiple different Capture 1280 Encodings at the same time, subject to the Max Capture Encodings 1281 constraint, with each capture encoding following the constraints of 1282 a different Individual Encoding. 1284 It is a protocol conformance requirement that the Encoding Groups 1285 must allow all the Captures in a particular Capture Scene Entry to 1286 be used simultaneously. 1288 9. 
Consumer's Choice of Streams to Receive from the Provider

After receiving the Provider's Advertisement message (which
includes media captures and associated constraints), the Consumer
composes its reply to the Provider in the form of a Configure
message.  The Consumer is free to use the information in the
Advertisement as it chooses, but there are a few obviously sensible
design choices, which are outlined below.

If multiple Providers connect to the same Consumer (i.e., in an
MCU-less multiparty call), it is the responsibility of the Consumer
to compose Configures for each Provider that fulfill both that
Provider's constraints, as expressed in the Advertisement, and the
Consumer's own capabilities.

In an MCU-based multiparty call, the MCU can logically terminate
the Advertisement/Configure negotiation in that it can hide the
characteristics of the receiving endpoint and rely on its own
capabilities (transcoding/transrating/...) to create Media Streams
that can be decoded at the Endpoint Consumers.  The timing of an
MCU's sending of Advertisements (for its outgoing ports) and
Configures (for its incoming ports, in response to Advertisements
received there) is up to the MCU and implementation dependent.

As a general outline, a Consumer can choose, based on the
Advertisement it has received, which Captures it wishes to receive,
and which Individual Encodings it wants the Provider to use to
encode the Captures.  Each Capture has an Encoding Group ID
attribute which specifies which Individual Encodings are available
to be used for that Capture.

For each Capture the Consumer wants to receive, it configures one
or more of the encodings in that Capture's Encoding Group.  The
Consumer does this by telling the Provider, in its Configure
message, parameters such as the resolution, frame rate, bandwidth,
etc. for each Capture Encoding of its chosen Captures.  Upon
receipt of this Configure from the Consumer, common knowledge is
established between Provider and Consumer regarding sensible
choices for the media streams and their parameters.  The setup of
the actual media channels, at least in the simplest case, is left
to a following offer-answer exchange.  Optimized implementations
may speed up the reaction to the offer-answer exchange by reserving
the resources at the time of finalization of the CLUE handshake.
Even more advanced devices may choose to establish media streams
without an offer-answer exchange, for example by overloading
existing 5-tuple connections with the negotiated media.

The Consumer must have received at least one Advertisement from the
Provider to be able to create and send a Configure.  Each
Advertisement is acknowledged by a corresponding Configure.

In addition, the Consumer can send a Configure at any time during
the call.  The Configure must be valid according to the most
recently received Advertisement.  The Consumer can send a Configure
either in response to a new Advertisement from the Provider or on
its own, for example because of a local change in conditions
(people leaving the room, connectivity changes, multipoint related
considerations).
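The concrete syntax of the Configure message is not defined in this
document.  Purely as a non-normative illustration (Python; the field
names are invented here, and the capture and encoding identifiers
are borrowed from the examples in Section 11), a Consumer's choice
can be thought of as a mapping from each chosen Capture to one
Individual Encoding from that Capture's Encoding Group, together
with the requested parameters, each at or below the advertised
maximums:

   # Illustrative sketch only; this is not the CLUE message syntax.
   configure = {
       "VC0": {"encodeID": "ENC0", "width": 1920, "height": 1088,
               "frameRate": 30, "bandwidth": 3000000},
       "VC1": {"encodeID": "ENC3", "width": 1280, "height": 720,
               "frameRate": 30, "bandwidth": 2000000},
       "AC3": {"encodeID": "ENC9", "bandwidth": 64000},
   }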
The Consumer need not send a new Configure message to the Provider
when it receives a new Advertisement from the Provider unless the
contents of the new Advertisement cause the Consumer's current
Configure message to become invalid.

Edt. Note: The editors solicit input from the working group as to
whether or not a Consumer must respond to every Advertisement with
a new Configure message.

When choosing which Media Streams to receive from the Provider, and
the encoding characteristics of those Media Streams, the Consumer
is well advised to take several things into account: its local
preference, simultaneity restrictions, and encoding limits.

9.1. Local preference

A variety of local factors influence the Consumer's choice of
Media Streams to be received from the Provider:

o if the Consumer is an Endpoint, it is likely that it would
  choose, where possible, to receive video and audio Captures that
  match the number of display devices and the audio system it has

o if the Consumer is a middle box such as an MCU, it may choose to
  receive loudest speaker streams (in order to perform its own
  media composition) and avoid pre-composed video Captures

o user choice (for instance, selection of a new layout) may result
  in a different set of Captures, or different encoding
  characteristics, being required by the Consumer

9.2. Physical simultaneity restrictions

There may be physical simultaneity constraints imposed by the
Provider that affect the Provider's ability to simultaneously send
all of the Captures the Consumer would wish to receive.  For
instance, a middle box such as an MCU, when connected to a multi-
camera room system, might prefer to receive both individual video
streams of the people present in the room and an overall view of
the room from a single camera.  Some Endpoint systems might be
able to provide both of these sets of streams simultaneously,
whereas others may not (if the overall room view were produced by
changing the optical zoom level on the center camera, for
instance).

9.3. Encoding and encoding group limits

Each of the Provider's Encoding Groups has limits on bandwidth and
computational complexity, and the constituent potential encodings
have limits on the bandwidth, computational complexity, video
frame rate, and resolution that can be provided.  When choosing
the Captures to be received from a Provider, a Consumer device
must ensure that the encoding characteristics requested for each
individual Capture fit within the capability of the encoding it is
being configured to use, as well as ensuring that the combined
encoding characteristics for Captures fit within the capabilities
of their associated Encoding Groups.  In some cases, this could
cause an otherwise "preferred" choice of Capture Encodings to be
passed over in favor of different Capture Encodings - for
instance, if a set of three Captures could only be provided at a
low resolution then a three screen device could switch to favoring
a single, higher quality, Capture Encoding.

10. Extensibility

One of the most important characteristics of the Framework is its
extensibility.  Telepresence is a relatively new industry and
while we can foresee certain directions, we also do not know
everything about how it will develop.
The standard for interoperability and handling multiple streams
must be future-proof.  The framework itself is inherently
extensible through expanding the data model types.  For example:

o Adding more types of media, such as telemetry, can be done by
  defining additional types of Captures in addition to audio and
  video.

o Adding new functionality, such as 3-D video, may require
  additional attributes describing the Captures.

o Adding a new codec, such as H.265, can be accomplished by
  defining new encoding variables.

The infrastructure is designed to be extended rather than
requiring new infrastructure elements.  Extension comes through
adding to defined types.

11. Examples - Using the Framework

Edt. Note: these examples are currently out of date with respect
to H264Mbps codepoints, which will be fixed in the next release
once an agreement about codec computational complexity has been
reached.  Other than that, the examples are still valid.

Suggestion: outsource all examples to the data model document or a
dedicated example document, or rewrite the examples in XML.
Meeting session question.

This section gives some examples, first from the point of view of
the Provider, then the Consumer.

11.1. Provider Behavior

This section shows some examples in more detail of how a Provider
can use the framework to represent a typical case for telepresence
rooms.  First an endpoint is illustrated, then an MCU case is
shown.

11.1.1. Three screen Endpoint Provider

Consider an Endpoint with the following description:

   3 cameras, 3 displays, a 6 person table

o Each camera can provide one Capture for each 1/3 section of the
  table

o A single Capture representing the active speaker can be provided
  (voice activity based camera selection to a given encoder input
  port implemented locally in the Endpoint)

o A single Capture representing the active speaker with the other
  2 Captures shown picture in picture within the stream can be
  provided (again, implemented inside the endpoint)

o A Capture showing a zoomed out view of all 6 seats in the room
  can be provided

The audio and video Captures for this Endpoint can be described as
follows.

Video Captures:

o VC0- (the camera-left camera stream), encoding group=EG0,
  content=main, switched=false

o VC1- (the center camera stream), encoding group=EG1,
  content=main, switched=false

o VC2- (the camera-right camera stream), encoding group=EG2,
  content=main, switched=false

o VC3- (the loudest panel stream), encoding group=EG1,
  content=main, switched=true

o VC4- (the loudest panel stream with PiPs), encoding group=EG1,
  content=main, composed=true, switched=true

o VC5- (the zoomed out view of all people in the room), encoding
  group=EG1, content=main, composed=false, switched=false

o VC6- (presentation stream), encoding group=EG1, content=slides,
  switched=false

The following diagram is a top view of the room with 3 cameras, 3
displays, and 6 seats.  Each camera is capturing 2 people.  The
six seats are not all in a straight line.

1504 ,-. D 1505 ( )`--.__ +---+ 1506 `-' / `--.__ | | 1507 ,-. | `-.._ |_-+Camera 2 (VC2) 1508 ( ).' ___..-+-''`+-+ 1509 `-' |_...---'' | | 1510 ,-.c+-..__ +---+ 1511 ( )| ``--..__ | | 1512 `-' | ``+-..|_-+Camera 1 (VC1) 1513 ,-.
| __..--'|+-+ 1514 ( )| __..--' | | 1515 `-'b|..--' +---+ 1516 ,-. |``---..___ | | 1517 ( )\ ```--..._|_-+Camera 0 (VC0) 1518 `-' \ _..-''`-+ 1519 ,-. \ __.--'' | | 1520 ( ) |..-'' +---+ 1521 `-' a 1523 The two points labeled b and c are intended to be at the midpoint 1524 between the seating positions, and where the fields of view of the 1525 cameras intersect. 1527 The plane of interest for VC0 is a vertical plane that intersects 1528 points 'a' and 'b'. 1530 The plane of interest for VC1 intersects points 'b' and 'c'. The 1531 plane of interest for VC2 intersects points 'c' and 'd'. 1533 This example uses an area scale of millimeters. 1535 Areas of capture: 1537 bottom left bottom right top left top right 1538 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1539 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1540 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1541 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1542 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1543 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1544 VC6 none 1546 Points of capture: 1547 VC0 (-1678,0,800) 1548 VC1 (0,0,800) 1549 VC2 (1678,0,800) 1550 VC3 none 1551 VC4 none 1552 VC5 (0,0,800) 1553 VC6 none 1555 In this example, the right edge of the VC0 area lines up with the 1556 left edge of the VC1 area. It doesn't have to be this way. There 1557 could be a gap or an overlap. One additional thing to note for 1558 this example is the distance from a to b is equal to the distance 1559 from b to c and the distance from c to d. All these distances are 1560 1346 mm. This is the planar width of each area of capture for VC0, 1561 VC1, and VC2. 1563 Note the text in parentheses (e.g. "the camera-left camera 1564 stream") is not explicitly part of the model, it is just 1565 explanatory text for this example, and is not included in the 1566 model with the media captures and attributes. Also, the 1567 "composed" boolean attribute doesn't say anything about how a 1568 capture is composed, so the media consumer can't tell based on 1569 this attribute that VC4 is composed of a "loudest panel with 1570 PiPs". 1572 Audio Captures: 1574 o AC0 (camera-left), encoding group=EG3, content=main, channel 1575 format=mono 1577 o AC1 (camera-right), encoding group=EG3, content=main, channel 1578 format=mono 1580 o AC2 (center) encoding group=EG3, content=main, channel 1581 format=mono 1583 o AC3 being a simple pre-mixed audio stream from the room (mono), 1584 encoding group=EG3, content=main, channel format=mono 1586 o AC4 audio stream associated with the presentation video (mono) 1587 encoding group=EG3, content=slides, channel format=mono 1589 Areas of capture: 1591 bottom left bottom right top left top right 1593 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1594 AC1 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1595 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1596 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1597 AC4 none 1599 The physical simultaneity information is: 1601 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6} 1603 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6} 1605 This constraint indicates it is not possible to use all the VCs at 1606 the same time. VC5 can not be used at the same time as VC1 or VC3 1607 or VC4. 
Also, using every member in the set simultaneously may not make
sense - for example VC3 (loudest) and VC4 (loudest with PiPs).  (In
addition, there are encoding constraints that make choosing all of
the VCs in a set impossible.  VC1, VC3, VC4, VC5, VC6 all use EG1
and EG1 has only 3 ENCs.  This constraint shows up in the encoding
groups, not in the simultaneous transmission sets.)

In this example there are no restrictions on which audio captures
can be sent simultaneously.

Encoding Groups:

This example has three encoding groups associated with the video
captures.  Each group can have 3 encodings, but with each
potential encoding having a progressively lower specification.  In
this example, 1080p60 transmission is possible (as ENC0 has a
maxH264Mbps value compatible with that) as long as it is the only
active encoding in the group (as the maxGroupH264Mbps for the
entire encoding group is also 489600).  Significantly, as up to 3
encodings are available per group, it is possible to transmit some
video captures simultaneously that are not in the same entry in
the capture scene, for example VC1 and VC3 at the same time.

It is also possible to transmit multiple capture encodings of a
single video capture.  For example VC0 can be encoded using ENC0
and ENC1 at the same time, as long as the encoding parameters
satisfy the constraints of ENC0, ENC1, and EG0, such as one at
1080p30 and one at 720p30.

   encodeGroupID=EG0, maxGroupH264Mbps=489600,
                      maxGroupBandwidth=6000000
      encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupH264Mbps=489600,
                      maxGroupBandwidth=6000000
      encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000
   encodeGroupID=EG2, maxGroupH264Mbps=489600,
                      maxGroupBandwidth=6000000
      encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                     maxH264Mbps=489600, maxBandwidth=4000000
      encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
                     maxH264Mbps=108000, maxBandwidth=4000000
      encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
                     maxH264Mbps=61200, maxBandwidth=4000000

   Figure 2: Example Encoding Groups for Video

For audio, there are five potential encodings available, so all
five audio captures can be encoded at the same time.

   encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
      encodeID=ENC9, maxBandwidth=64000
      encodeID=ENC10, maxBandwidth=64000
      encodeID=ENC11, maxBandwidth=64000
      encodeID=ENC12, maxBandwidth=64000
      encodeID=ENC13, maxBandwidth=64000

   Figure 3: Example Encoding Group for Audio
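As a non-normative check of the statements above (Python; the
helper names are invented, and the attribute names follow Tables 2
and 4), the following sketch verifies that sending VC0 through ENC0
at 1080p30 and through ENC1 at 720p30 simultaneously stays within
the per-encoding and per-group limits of EG0 in Figure 2; the
bandwidth figures chosen here are merely example values:

   # Illustrative only; not part of the CLUE protocol.
   def mbps(width, height, fps):
       # Macroblocks per second, as in Table 2.
       return ((width + 15) // 16) * ((height + 15) // 16) * fps

   EG0 = {"maxGroupH264Mbps": 489600, "maxGroupBandwidth": 6000000,
          "encodings": {  # ENC2 omitted for brevity
              "ENC0": {"maxH264Mbps": 489600, "maxBandwidth": 4000000},
              "ENC1": {"maxH264Mbps": 108000, "maxBandwidth": 4000000}}}

   # Two capture encodings of VC0: ENC0 at 1080p30, ENC1 at 720p30.
   proposed = [("ENC0", mbps(1920, 1088, 30), 3000000),
               ("ENC1", mbps(1280, 720, 30), 2000000)]

   def fits(group, capture_encodings):
       # Each capture encoding must respect its individual encoding's
       # limits, and the sums must respect the group's limits.
       for enc_id, rate, bw in capture_encodings:
           enc = group["encodings"][enc_id]
           if rate > enc["maxH264Mbps"] or bw > enc["maxBandwidth"]:
               return False
       total_rate = sum(r for _, r, _ in capture_encodings)
       total_bw = sum(b for _, _, b in capture_encodings)
       return (total_rate <= group["maxGroupH264Mbps"] and
               total_bw <= group["maxGroupBandwidth"])

   assert fits(EG0, proposed)  # 244800 + 108000 <= 489600; 5 <= 6 Mbit/s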
Capture Scenes:

The following table represents the capture scenes for this
provider.  Recall that a capture scene is composed of alternative
capture scene entries covering the same spatial region.  Capture
Scene #1 is for the main people captures, and Capture Scene #2 is
for presentation.

Each row in the table is a separate entry in the capture scene.

   +------------------+
   | Capture Scene #1 |
   +------------------+
   | VC0, VC1, VC2    |
   | VC3              |
   | VC4              |
   | VC5              |
   | AC0, AC1, AC2    |
   | AC3              |
   +------------------+

   +------------------+
   | Capture Scene #2 |
   +------------------+
   | VC6              |
   | AC4              |
   +------------------+

Different capture scenes are distinct from each other and non-
overlapping.  A consumer can choose an entry from each capture
scene.  In this case the three captures VC0, VC1, and VC2 are one
way of representing the video from the endpoint.  These three
captures should appear adjacent to each other.  Alternatively,
another way of representing the Capture Scene is with the capture
VC3, which automatically shows the person who is talking.
Similarly for the VC4 and VC5 alternatives.

As in the video case, the different entries of audio in Capture
Scene #1 represent the "same thing", in that one way to receive
the audio is with the 3 audio captures (AC0, AC1, AC2), and
another way is with the mixed AC3.  The Media Consumer can choose
an audio capture entry it is capable of receiving.

The spatial ordering is understood from the media capture
attributes area of capture and point of capture.

A Media Consumer would likely want to choose a capture scene entry
to receive based in part on how many streams it can simultaneously
receive.  A consumer that can receive three people streams would
probably prefer to receive the first entry of Capture Scene #1
(VC0, VC1, VC2) and not receive the other entries.  A consumer
that can receive only one people stream would probably choose one
of the other entries.

If the consumer can receive a presentation stream too, it would
also choose to receive the only entry from Capture Scene #2 (VC6).

11.1.2. Encoding Group Example

This is an example of an encoding group to illustrate how it can
express dependencies between encodings.

   encodeGroupID=EG0, maxGroupH264Mbps=489600,
                      maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
                        maxFrameRate=60, maxH264Mbps=244800,
                        maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
                        maxFrameRate=60, maxH264Mbps=244800,
                        maxBandwidth=4000000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

Here, the encoding group is EG0.  It can transmit up to two
1080p30 capture encodings (the maxH264Mbps value for 1080p30 is
244800), but it is capable of transmitting a maxFrameRate of 60
frames per second (fps).  To achieve the maximum resolution (1920
x 1088) the frame rate is limited to 30 fps.  However 60 fps can
be achieved at a lower resolution if required by the consumer.
Although the encoding group is capable of transmitting up to
6 Mbit/s, no individual video encoding can exceed 4 Mbit/s.

This encoding group also allows up to 3 audio encodings,
AUDENC<0-2>.  It is not required that audio and video encodings
reside within the same encoding group, but if they do then the
group's maxGroupBandwidth value is a limit on the sum of all audio
and video encodings configured by the consumer.
A system that does not wish or need to combine bandwidth
limitations in this way should instead use separate encoding
groups for audio and video, so that the bandwidth limitations on
audio and video do not interact.

Audio and video can be expressed in separate encoding groups, as
in this illustration.

   encodeGroupID=EG0, maxGroupH264Mbps=489600,
                      maxGroupBandwidth=6000000
      encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
                        maxFrameRate=60, maxH264Mbps=244800,
                        maxBandwidth=4000000
      encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
                        maxFrameRate=60, maxH264Mbps=244800,
                        maxBandwidth=4000000
   encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000
      encodeID=AUDENC0, maxBandwidth=96000
      encodeID=AUDENC1, maxBandwidth=96000
      encodeID=AUDENC2, maxBandwidth=96000

11.1.3. The MCU Case

This section shows how an MCU might express its Capture Scenes,
intending to offer different choices for consumers that can handle
different numbers of streams.  A single audio capture stream is
provided for all single and multi-screen configurations that can
be associated (e.g., lip-synced) with any combination of video
captures at the consumer.

   +--------------------+---------------------------------------------+
   | Capture Scene #1   | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen consumer   |
   | VC1, VC2           | video capture for 2 screen consumer        |
   | VC3, VC4, VC5      | video capture for 3 screen consumer        |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer        |
   | AC0                | audio capture representing all participants|
   +--------------------+---------------------------------------------+

If/when a presentation stream becomes active within the conference
the MCU might re-advertise the available media as:

   +------------------+--------------------------------------+
   | Capture Scene #2 | note                                 |
   +------------------+--------------------------------------+
   | VC10             | video capture for presentation       |
   | AC1              | presentation audio to accompany VC10 |
   +------------------+--------------------------------------+

11.2. Media Consumer Behavior

This section gives an example of how a Media Consumer might behave
when deciding how to request streams from the three screen
endpoint described in the previous section.

The receive side of a call needs to balance its requirements,
based on number of screens and speakers, its decoding capabilities
and available bandwidth, and the provider's capabilities in order
to optimally configure the provider's streams.  Typically it would
want to receive and decode media from each Capture Scene
advertised by the Provider.

A sane, basic algorithm might be for the consumer to go through
each Capture Scene in turn and find the collection of Video
Captures that best matches the number of screens it has (this
might include consideration of screens dedicated to presentation
video display rather than "people" video) and then decide between
alternative entries in the video Capture Scenes based either on
hard-coded preferences or user choice.  Once this choice has been
made, the consumer would then decide how to configure the
provider's encoding groups in order to make best use of the
available network bandwidth and its own decoding capabilities.
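As a purely illustrative sketch of the selection step just
described (Python; the function name and data layout are invented
for this example and are not part of CLUE), a consumer might pick,
per capture scene, the video entry whose number of captures is
closest to its number of screens, leaving ties to hard-coded
preference or user choice:

   # Illustrative only.  Entries are lists of video capture IDs, as
   # in Capture Scene #1 of the three screen endpoint example.
   def pick_video_entry(video_entries, num_screens):
       return min(video_entries,
                  key=lambda entry: abs(len(entry) - num_screens))

   scene1 = [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"]]
   assert pick_video_entry(scene1, 3) == ["VC0", "VC1", "VC2"]
   # For one screen, the single-capture entries tie; min() returns
   # the first, and a real consumer would apply its own preference.
   assert pick_video_entry(scene1, 1) == ["VC3"]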
11.2.1. One screen Media Consumer

VC3, VC4 and VC5 are all different entries by themselves, not
grouped together in a single entry, so the receiving device should
choose one of those.  The choice would come down to whether to see
the greatest number of participants simultaneously at roughly
equal precedence (VC5), a switched view of just the loudest region
(VC3) or a switched view with PiPs (VC4).  An endpoint device with
a small amount of knowledge of these differences could offer a
dynamic choice of these options, in-call, to the user.

11.2.2. Two screen Media Consumer configuring the example

Mixing a system that has an even number of screens ("2n") with one
that has an odd number of cameras ("2n+1"), and vice versa, is
always likely to be the problematic case.  In this instance, the
behavior is likely to be determined by whether a "2 screen" system
is really a "2 decoder" system, i.e., whether only one received
stream can be displayed per screen or whether more than 2 streams
can be received and spread across the available screen area.  To
enumerate 3 possible behaviors here for the 2 screen system when
it learns that the far end is "ideally" expressed via 3 capture
streams:

1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as
   per the 1 screen consumer case above) and either leave one
   screen blank or use it for presentation if/when a presentation
   becomes active.

2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2
   screens, either with each capture being scaled to 2/3 of a
   screen and the center capture being split across the 2 screens,
   or, as would be necessary if there were large bezels on the
   screens, with each stream being scaled to 1/2 the screen width
   and height and there being a 4th "blank" panel.  This 4th panel
   could potentially be used for any presentation that became
   active during the call.

3. Receive 3 streams, decode all 3, and use control information
   indicating which was the most active to switch between showing
   the left and center streams (one per screen) and the center and
   right streams.

For an endpoint capable of all 3 methods of working described
above, it might again be appropriate to offer the user the choice
of display mode.

11.2.3. Three screen Media Consumer configuring the example

This is the most straightforward case - the Media Consumer would
look to identify a set of streams to receive that best matched its
available screens, and so VC0 plus VC1 plus VC2 should match
optimally.  The spatial ordering would give sufficient information
for the correct video capture to be shown on the correct screen.
The consumer would either need to divide a single encoding group's
capability by 3 to determine what resolution and frame rate to
configure the provider with, or configure the individual video
captures' encoding groups with what makes most sense (taking into
account the receive side decode capabilities, overall call
bandwidth, the resolution of the screens plus any user preferences
such as motion vs sharpness).

12.
Acknowledgements 1910 Allyn Romanow and Brian Baldino were authors of early versions. 1911 Mark Gorzyinski contributed much to the approach. We want to 1912 thank Stephen Botzko for helpful discussions on audio. 1914 13. IANA Considerations 1916 TBD 1918 14. Security Considerations 1920 TBD 1922 15. Changes Since Last Version 1924 NOTE TO THE RFC-Editor: Please remove this section prior to 1925 publication as an RFC. 1927 Changes from 08 to 09: 1929 1. Use "document" instead of "memo". 1931 2. Add basic call flow sequence diagram to introduction. 1933 3. Add definitions for Advertisement and Configure messages. 1935 4. Add definitions for Capture and Provider. 1937 5. Update definition of Capture Scene. 1939 6. Update definition of Individual Encoding. 1941 7. Shorten definition of Media Capture and add key points in the 1942 Media Captures section. 1944 8. Reword a bit about capture scenes in overview. 1946 9. Reword about labeling Media Captures. 1948 10. Remove the Consumer Capability message. 1950 11. New example section heading for media provider behavior 1952 12. Clarifications in the Capture Scene section. 1954 13. Clarifications in the Simultaneous Transmission Set section. 1956 14. Capitalize defined terms. 1958 15. Move call flow example from introduction to overview section 1960 16. General editorial cleanup 1962 17. Add some editors' notes requesting input on issues 1964 18. Summarize some sections, and propose details be outsourced 1965 to other documents. 1967 Changes from 06 to 07: 1969 1. Ticket #9. Rename Axis of Capture Point attribute to Point 1970 on Line of Capture. Clarify the description of this 1971 attribute. 1973 2. Ticket #17. Add "capture encoding" definition. Use this new 1974 term throughout document as appropriate, replacing some usage 1975 of the terms "stream" and "encoding". 1977 3. Ticket #18. Add Max Capture Encodings media capture 1978 attribute. 1980 4. Add clarification that different capture scene entries are 1981 not necessarily mutually exclusive. 1983 Changes from 05 to 06: 1985 1. Capture scene description attribute is a list of text strings, 1986 each in a different language, rather than just a single string. 1988 2. Add new Axis of Capture Point attribute. 1990 3. Remove appendices A.1 through A.6. 1992 4. Clarify that the provider must use the same coordinate system 1993 with same scale and origin for all coordinates within the same 1994 capture scene. 1996 Changes from 04 to 05: 1998 1. Clarify limitations of "composed" attribute. 2000 2. Add new section "capture scene entry attributes" and add the 2001 attribute "scene-switch-policy". 2003 3. Add capture scene description attribute and description 2004 language attribute. 2006 4. Editorial changes to examples section for consistency with the 2007 rest of the document. 2009 Changes from 03 to 04: 2011 1. Remove sentence from overview - "This constitutes a significant 2012 change ..." 2014 2. Clarify a consumer can choose a subset of captures from a 2015 capture scene entry or a simultaneous set (in section "capture 2016 scene" and "consumer's choice..."). 2018 3. Reword first paragraph of Media Capture Attributes section. 2020 4. Clarify a stereo audio capture is different from two mono audio 2021 captures (description of audio channel format attribute). 2023 5. Clarify what it means when coordinate information is not 2024 specified for area of capture, point of capture, area of scene. 2026 6. Change the term "producer" to "provider" to be consistent (it 2027 was just in two places). 2029 7. 
Change name of "purpose" attribute to "content" and refer to 2030 RFC4796 for values. 2032 8. Clarify simultaneous sets are part of a provider advertisement, 2033 and apply across all capture scenes in the advertisement. 2035 9. Remove sentence about lip-sync between all media captures in a 2036 capture scene. 2038 10. Combine the concepts of "capture scene" and "capture set" 2039 into a single concept, using the term "capture scene" to 2040 replace the previous term "capture set", and eliminating the 2041 original separate capture scene concept. 2043 Informative References 2045 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2046 Requirement Levels", BCP 14, RFC 2119, March 1997. 2048 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., 2049 Johnston, 2050 A., Peterson, J., Sparks, R., Handley, M., and E. 2051 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 2052 June 2002. 2054 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 2055 Jacobson, "RTP: A Transport Protocol for Real-Time 2056 Applications", STD 64, RFC 3550, July 2003. 2058 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the 2059 Session Initiation Protocol (SIP)", RFC 4353, 2060 February 2006. 2062 [RFC4796] Hautakorpi, J. and G. Camarillo, "The Session 2063 Description 2064 Protocol (SDP) Content Attribute", RFC 4796, 2065 February 2007. 2067 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 2068 5117, 2069 January 2008. 2071 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying 2072 Languages", BCP 47, RFC 5646, September 2009. 2074 [IANA-Lan] 2075 IANA, "Language Subtag Registry", 2076 . 2079 16. Authors' Addresses 2081 Mark Duckworth (editor) 2082 Polycom 2083 Andover, MA 01810 2084 USA 2086 Email: mark.duckworth@polycom.com 2088 Andrew Pepperell 2089 Silverflare 2090 Uxbridge, England 2091 UK 2093 Email: apeppere@gmail.com 2095 Stephan Wenger 2096 Vidyo, Inc. 2097 433 Hackensack Ave. 2098 Hackensack, N.J. 07601 2099 USA 2101 Email: stewe@stewe.org