CLUE WG                                                M. Duckworth, Ed.
Internet Draft                                                   Polycom
Intended status: Standards Track                            A. Pepperell
Expires: April 19, 2014                                            Acano
                                                               S. Wenger
                                                                   Vidyo
                                                        October 19, 2013

                Framework for Telepresence Multi-Streams
                    draft-ietf-clue-framework-12.txt

Abstract

   This document defines a framework for a protocol to enable devices
   in a telepresence conference to interoperate. The protocol enables
   communication of information about multiple media streams so a
   sending system and receiving system can make reasonable decisions
   about transmitting, selecting and rendering the media streams.
   This protocol is used in addition to SIP signaling for setting up a
   telepresence session.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on April 19, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Definitions
   4. Overview & Motivation
   5. Overview of the Framework/Model
   6. Spatial Relationships
   7. Media Captures and Capture Scenes
      7.1. Media Captures
         7.1.1. Media Capture Attributes
      7.2. Capture Scene
         7.2.1. Capture Scene attributes
         7.2.2. Capture Scene Entry attributes
      7.3. Simultaneous Transmission Set Constraints
   8. Encodings
      8.1. Individual Encodings
      8.2. Encoding Group
   9. Associating Captures with Encoding Groups
   10. Consumer's Choice of Streams to Receive from the Provider
      10.1. Local preference
      10.2. Physical simultaneity restrictions
      10.3. Encoding and encoding group limits
   11. Extensibility
   12. Examples - Using the Framework (Informative)
      12.1. Provider Behavior
         12.1.1. Three screen Endpoint Provider
         12.1.2. Encoding Group Example
         12.1.3. The MCU Case
      12.2. Media Consumer Behavior
         12.2.1. One screen Media Consumer
         12.2.2. Two screen Media Consumer configuring the example
         12.2.3. Three screen Media Consumer configuring the example
   13. Acknowledgements
   14. IANA Considerations
   15. Security Considerations
   16. Changes Since Last Version
   17. Authors' Addresses

1. Introduction

   Current telepresence systems, though based on open standards such
   as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate with
   each other. A major factor limiting the interoperability of
   telepresence systems is the lack of a standardized way to describe
   and negotiate the use of the multiple streams of audio and video
   comprising the media flows. This document provides a framework for
   protocols to enable interoperability by handling multiple streams
   in a standardized way. The framework is intended to support the
   use cases described in draft-ietf-clue-telepresence-use-cases and
   to meet the requirements in
   draft-ietf-clue-telepresence-requirements.

   The basic session setup for the use cases is based on SIP [RFC3261]
   and SDP offer/answer [RFC3264]. In addition to basic SIP & SDP
   offer/answer, CLUE specific signaling is required to exchange the
   information describing the multiple media streams. The motivation
   for this framework, an overview of the signaling, and information
   required to be exchanged is described in subsequent sections of
   this document. The signaling details and data model are provided
   in subsequent documents.

2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3. Definitions

   The terms defined below are used throughout this document and
   companion documents and they are normative. In order to easily
   identify the use of a defined term, those terms are capitalized.

   Advertisement: a CLUE message a Media Provider sends to a Media
   Consumer describing specific aspects of the content of the media,
   the formatting of the media streams it can send, and any
   restrictions it has in terms of being able to provide certain
   Streams simultaneously.

   Audio Capture: Media Capture for audio. Denoted as ACn in the
   example cases in this document.

   Camera-Left and Right: For Media Captures, camera-left and
   camera-right are from the point of view of a person observing the
   rendered media. They are the opposite of Stage-Left and
   Stage-Right.

   Capture: Same as Media Capture.

   Capture Device: A device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.

   Capture Encoding: A specific encoding of a Media Capture, to be
   sent by a Media Provider to a Media Consumer via RTP.

   Capture Scene: a structure representing a spatial region containing
   one or more Capture Devices, each capturing media representing a
   portion of the region. The spatial region represented by a Capture
   Scene might or might not correspond to a real region in physical
   space, such as a room. A Capture Scene includes attributes and one
   or more Capture Scene Entries, with each entry including one or
   more Media Captures.

   Capture Scene Entry: a list of Media Captures of the same media
   type that together form one way to represent the entire Capture
   Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Configure Message: A CLUE message a Media Consumer sends to a Media
   Provider specifying which content and media streams it wants to
   receive, based on the information in a corresponding Advertisement
   message.

   Consumer: short for Media Consumer.

   Encoding or Individual Encoding: a set of parameters representing a
   way to encode a Media Capture to become a Capture Encoding.

   Encoding Group: A set of encoding parameters representing a total
   media encoding capability to be sub-divided across potentially
   multiple Individual Encodings.

   Endpoint: The logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams. An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent). Endpoints can be anything from
   multiscreen/multicamera rooms to handheld devices.

   Front: the portion of the room closest to the cameras. In going
   towards back you move away from the cameras.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117]. An MCU includes an [RFC4353]-like Mixer, without the
   [RFC4353] requirement to send media to each participant.

   Media: Any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   Media Capture: a source of Media, such as from one or more Capture
   Devices or constructed from other Media streams.

   Media Consumer: an Endpoint or middle box that receives Media
   streams.

   Media Provider: an Endpoint or middle box that sends Media streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Plane of Interest: The spatial plane containing the most relevant
   subject matter.

   Provider: Same as Media Provider.

   Render: the process of generating a representation from media,
   such as displayed motion video or sound emitted from loudspeakers.

   Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Provider.

   Spatial Relation: The arrangement in space of two objects, in
   contrast to relation in time or other relationships. See also
   Camera-Left and Right.

   Stage-Left and Right: For Media Captures, Stage-left and
   Stage-right are the opposite of Camera-left and Camera-right. For
   the case of a person facing (and captured by) a camera, Stage-left
   and Stage-right are from the point of view of that person.

   Stream: a Capture Encoding sent from a Media Provider to a Media
   Consumer via RTP [RFC3550].

   Stream Characteristics: the media stream attributes commonly used
   in non-CLUE SIP/SDP environments (such as: media codec, bit rate,
   resolution, profile/level etc.) as well as CLUE specific
   attributes, such as the Capture ID or a spatial location.

   Video Capture: Media Capture for video. Denoted as VCn in the
   example cases in this document.

   Video Composite: A single image that is formed, normally by an RTP
   mixer inside an MCU, by combining visual elements from separate
   sources.

4. Overview & Motivation

   This section provides an overview of the functional elements
   defined in this document to represent a telepresence system. The
   motivations for the framework described in this document are also
   provided.

   Two key concepts introduced in this document are the terms "Media
   Provider" and "Media Consumer". A Media Provider represents the
   entity that is sending the media and a Media Consumer represents
   the entity that is receiving the media. A Media Provider provides
   Media in the form of RTP packets; a Media Consumer consumes those
   RTP packets. Media Providers and Media Consumers can reside in
   Endpoints or in middleboxes such as Multipoint Control Units
   (MCUs). A Media Provider in an Endpoint is usually associated
   with the generation of media for Media Captures; these Media
   Captures are typically sourced from cameras, microphones, and the
   like. Similarly, the Media Consumer in an Endpoint is usually
   associated with renderers, such as screens and loudspeakers. In
   middleboxes, Media Providers and Consumers can have the form of
   outputs and inputs, respectively, of RTP mixers, RTP translators,
   and similar devices. Typically, telepresence devices such as
   Endpoints and middleboxes would perform as both Media Providers
   and Media Consumers, the former being concerned with those
   devices' transmitted media and the latter with those devices'
   received media. In a few circumstances, a CLUE Endpoint or
   middlebox includes only Consumer or Provider functionality, such
   as recorder-type Consumers or webcam-type Providers.

   The motivations for the framework outlined in this document
   include the following:

   (1) Endpoints in telepresence systems typically have multiple Media
   Capture and Media Render devices, e.g., multiple cameras and
   screens. While previous system designs were able to set up calls
   that would, for example, capture media using all cameras and
   display media on all screens, there was no mechanism to associate
   these Media Captures with each other in space and time.

   (2) The mere fact that there are multiple capture and rendering
   devices, each of which may be configurable in aspects such as zoom,
   leads to the difficulty that a variable number of such devices can
   be used to capture different aspects of a region. The Capture
   Scene concept allows for the description of multiple setups for
   those multiple capture devices that could represent sensible
   operation points of the physical capture devices in a room, chosen
   by the operator. A Consumer can pick and choose from those
   configurations based on its rendering abilities and inform the
   Provider about its choices. Details are provided in section 7.

   (3) In some cases, physical limitations or other reasons disallow
   the concurrent use of a device in more than one setup. For
   example, the center camera in a typical three-camera conference
   room can set its zoom objective either to capture only the middle
   few seats, or all seats of a room, but not both concurrently. The
   Simultaneous Transmission Set concept allows a Provider to signal
   such limitations. Simultaneous Transmission Sets are part of the
   Capture Scene description, and discussed in section 7.3.

   (4) Often, the devices in a room do not have the computational
   capacity or connectivity to deal with multiple encoding options
   simultaneously, even if each of these options is sensible in
   certain scenarios, and even if the simultaneous transmission is
   also sensible (e.g., in the case of multicast media distribution
   to multiple endpoints). Such constraints can be expressed by the
   Provider using the Encoding Group concept, described in section 8.

   (5) Due to the potentially large number of RTP flows required for a
   Multimedia Conference involving potentially many Endpoints, each of
   which can have many Media Captures and media renderers, it has
   become common to multiplex multiple RTP media flows onto the same
   transport address, so as to avoid using the port number as a
   multiplexing point and the associated shortcomings such as
   NAT/firewall traversal. While the actual mapping of those RTP
   flows to the header fields of the RTP packets is not the subject of
   this specification, the large number of possible permutations of
   sensible options a Media Provider can make available to a Media
   Consumer makes it desirable to have a mechanism that narrows down
   the number of possible options that a SIP offer/answer exchange has
   to consider. Such information is made available using protocol
   mechanisms specified in this document and companion documents,
   although it should be stressed that its use in an implementation is
   OPTIONAL. Also, there are aspects of the control of both Endpoints
   and middleboxes/MCUs that dynamically change during the progress of
   a call, such as audio-level based screen switching, layout changes,
   and so on, which need to be conveyed. Note that these control
   aspects are complementary to those specified in traditional SIP
   based conference management such as BFCP. An exemplary call flow
   can be found in section 5.

   Finally, all this information needs to be conveyed, and the notion
   of support for it needs to be established. This is done by the
   negotiation of a "CLUE channel", a data channel negotiated early
   during the initiation of a call. An Endpoint or MCU that rejects
   the establishment of this data channel, by definition, is not
   supporting CLUE based mechanisms, whereas an Endpoint or MCU that
   accepts it is REQUIRED to use it to the extent specified in this
   document and its companion documents.

5. Overview of the Framework/Model

   The CLUE framework specifies how multiple media streams are to be
   handled in a telepresence conference.

   A Media Provider (transmitting Endpoint or MCU) describes specific
   aspects of the content of the media and the formatting of the media
   streams it can send in an Advertisement; and the Media Consumer
   responds to the Media Provider by specifying which content and
   media streams it wants to receive in a Configure message. The
   Provider then transmits the asked-for content in the specified
   streams.

   This Advertisement and Configure MUST occur during call initiation
   but MAY also happen at any time throughout the call, whenever there
   is a change in what the Consumer wants to receive or (perhaps less
   common) the Provider can send.

   An Endpoint or MCU typically acts as both Provider and Consumer at
   the same time, sending Advertisements and sending Configure
   messages in response to receiving Advertisements. (It is possible
   to be just one or the other.)

   The data model is based around two main concepts: a Capture and an
   Encoding. A Media Capture (MC), such as audio or video, describes
   the content a Provider can send. Media Captures are described in
   terms of CLUE-defined attributes, such as spatial relationships and
   purpose of the capture. Providers tell Consumers which Media
   Captures they can provide, described in terms of the Media Capture
   attributes.

   A Provider organizes its Media Captures into one or more Capture
   Scenes, each representing a spatial region, such as a room. A
   Consumer chooses which Media Captures it wants to receive from each
   Capture Scene.

   In addition, the Provider can send the Consumer a description of
   the Individual Encodings it can send in terms of the media
   attributes of the Encodings, in particular, audio and video
   parameters such as bandwidth, frame rate, macroblocks per second.
   Note that this is OPTIONAL, and intended to minimize the number of
   options a later SDP offer-answer would have to include in the SDP
   in case of complex setups, as should become clearer shortly when
   discussing an outline of the call flow.

   The Provider can also specify constraints on its ability to provide
   Media, and a sensible design choice for a Consumer is to take these
   into account when choosing the content and Capture Encodings it
   requests in the later offer-answer exchange. Some constraints are
   due to the physical limitations of devices--for example, a camera
   may not be able to provide zoom and non-zoom views simultaneously.
   Other constraints are system based, such as maximum bandwidth and
   maximum video coding performance measured in macroblocks/second.

   The following diagram illustrates the information contained in an
   Advertisement.

   ...................................................................
   . Provider Advertisement                                          .
   .                                                                 .
   .       +------------------------+      +--------------------+    .
   .       | Capture Scene N        |      |    Simultaneous    |    .
   .     +-+----------------------+ |      +--------------------+    .
   .     | Capture Scene 2        | |                                .
   .   +-+----------------------+ | |      +----------------------+  .
   .   | Capture Scene 1        | | |      | Encoding Group N     |  .
   .   | +---------------+      | | |    +-+--------------------+ |  .
   .   | | Attributes    |      | | |    | Encoding Group 2     | |  .
   .   | +---------------+      | | |  +-+--------------------+ | |  .
   .   |                        | | |  | Encoding Group 1     | | |  .
   .   |    +----------------+  | | |  | parameters           | | |  .
   .   |    | E n t r i e s  |  | | |  |                      | | |  .
   .   |    | +---------+    |  | | |  | +-------------------+| | |  .
   .   |    | |Attribute|    |  | | |  | | V i d e o         || | |  .
   .   |    | +---------+    |  | | |  | | E n c o d i n g s || | |  .
   .   |    |                |  | | |  | | Encoding 1        || | |  .
   .   |    | Entry 1        |  | | |  | | (parameters)      || | |  .
   .   |    | (list of MCs)  |  | |-+  | +-------------------+| | |  .
   .   |    +----|-|--|------+  |-+    |                      | | |  .
   .   +---------|-|--|---------+      | +-------------------+| | |  .
   .             | |  |                | | A u d i o         || | |  .
   .             | |  |                | | E n c o d i n g s || | |  .
   .             v |  |                | | Encoding 1        || | |  .
   .     +---------|--|--------+       | | (ID,maxBandwidth) || | |  .
   .     | Media Capture N     |------>| +-------------------+| | |  .
   .   +-+---------v--|------+ |       |                      | | |  .
   .   | Media Capture 2     | |       |                      | |-+  .
   . +-+--------------v----+ |-------->|                      | |    .
   . | Media Capture 1     | | |       |                      |-+    .
   . | +----------------+  |---------->|                      |      .
   . | | Attributes     |  | |_+       +----------------------+      .
   . | +----------------+  |_+                                       .
   . +---------------------+                                         .
   .                                                                 .
   ...................................................................

   A very brief outline of the call flow used by a simple system (two
   Endpoints) in compliance with this document can be described as
   follows, and as shown in the following figure.

   +-----------+                     +-----------+
   | Endpoint1 |                     | Endpoint2 |
   +----+------+                     +-----+-----+
        | INVITE (BASIC SDP+CLUECHANNEL)   |
        |--------------------------------->|
        |    200 OK (BASIC SDP+CLUECHANNEL)|
        |<---------------------------------|
        | ACK                              |
        |--------------------------------->|
        |                                  |
        |<################################>|
        |     BASIC SDP MEDIA SESSION      |
        |<################################>|
        |                                  |
        |    CONNECT (CLUE CTRL CHANNEL)   |
        |=================================>|
        |            ...                   |
        |<================================>|
        |   CLUE CTRL CHANNEL ESTABLISHED  |
        |<================================>|
        |                                  |
        | ADVERTISEMENT 1                  |
        |*********************************>|
        |                  ADVERTISEMENT 2 |
        |<*********************************|
        |                                  |
        |                      CONFIGURE 1 |
        |<*********************************|
        | CONFIGURE 2                      |
        |*********************************>|
        |                                  |
        | REINVITE (UPDATED SDP)           |
        |--------------------------------->|
        |              200 OK (UPDATED SDP)|
        |<---------------------------------|
        | ACK                              |
        |--------------------------------->|
        |                                  |
        |<################################>|
        |     UPDATED SDP MEDIA SESSION    |
        |<################################>|
        |                                  |
        v                                  v

   An initial offer/answer exchange establishes a basic media session,
   for example audio-only, and a CLUE channel between two Endpoints.
   With the establishment of that channel, the endpoints have
   consented to use the CLUE protocol mechanisms and, therefore, MUST
   adhere to the CLUE protocol suite as outlined herein.

   Over this CLUE channel, the Provider in each Endpoint conveys its
   characteristics and capabilities by sending an Advertisement as
   specified herein. The Advertisement is typically not sufficient to
   set up all media. The Consumer in the Endpoint receives the
   information provided by the Provider, and can use it for two
   purposes. First, it MUST construct and send a CLUE Configure
   message to tell the Provider what the Consumer wishes to receive.
   Second, it MAY, but is not necessarily REQUIRED to, use the
   information provided to tailor the SDP it is going to send during
   the following SIP offer/answer exchange, and its reaction to SDP it
   receives in that step.
   It is often a sensible implementation
   choice to do so, as the representation of the media information
   conveyed over the CLUE channel can dramatically cut down on the
   size of SDP messages used in the O/A exchange that follows.
   Spatial relationships associated with the Media can be included in
   the Advertisement, and it is often sensible for the Media Consumer
   to take those spatial relationships into account when tailoring the
   SDP.

   This CLUE exchange MUST be followed by an SDP offer/answer exchange
   that not only establishes those aspects of the media that have not
   been "negotiated" over CLUE, but also has the side effect of
   setting up the media transmission itself, potentially involving
   security exchanges, ICE, and so on. This step is plain vanilla
   SIP, with the exception that the SDP used herein, in most (but not
   necessarily all) cases can be considerably smaller than the SDP a
   system would typically need to exchange if there were no
   pre-established knowledge about the Provider and Consumer
   characteristics. (The need for cutting down SDP size is not quite
   obvious for a point-to-point call involving simple endpoints;
   however, when considering a large multipoint conference involving
   many multi-screen/multi-camera endpoints, each of which can operate
   using multiple codecs for each camera and microphone, it becomes
   perhaps somewhat more intuitive.)

   During the lifetime of a call, further exchanges MAY occur over the
   CLUE channel. In some cases, those further exchanges lead to a
   modified system behavior of Provider or Consumer (or both) without
   any other protocol activity such as further offer/answer exchanges.
   For example, voice-activated screen switching, signaled over the
   CLUE channel, ought not to lead to heavy-handed mechanisms like SIP
   re-invites. However, in other cases, after the CLUE negotiation an
   additional offer/answer exchange becomes necessary. For example,
   if both sides decide to upgrade the call from a single screen to a
   multi-screen call and more bandwidth is required for the additional
   video channels compared to what was previously negotiated using
   offer/answer, a new O/A exchange is REQUIRED.

   Numerous optimizations are possible, and are the implementer's
   choice. For example, it can be sensible to establish one or more
   initial media channels during the initial offer/answer exchange,
   which would allow, for example, for a fast startup of audio.
   Depending on the system design, it can be possible to re-use this
   established channel for more advanced media negotiated only by CLUE
   mechanisms, thereby avoiding further offer/answer exchanges.

   Edt. note: The editors are not sure whether the mentioned
   overloading of established RTP channels using only CLUE messages is
   possible, or desired by the WG. If it were, certainly there is
   need for specification work. One possible issue: a Provider which
   thinks that it can switch, say, an audio codec algorithm by CLUE
   only, talks to a Consumer which thinks that it has to faithfully
   answer the Provider's Advertisement through a Configure, but does
   not dare to set up its internal resources until it has received
   its authoritative O/A exchange. Working group input is solicited.

   One aspect of the protocol outlined herein and specified in more
   detail in companion documents is that it makes available
   information regarding the Provider's capabilities to deliver Media,
   and attributes related to that Media such as their spatial
   relationship, to the Consumer. The operation of the renderer
   inside the Consumer is unspecified in that it can choose to ignore
   some information provided by the Provider, and/or not render media
   streams available from the Provider (although it MUST follow the
   CLUE protocol and, therefore, MUST gracefully receive and respond
   (through a Configure) to the Provider's information). All CLUE
   protocol mechanisms are OPTIONAL in the Consumer in the sense that,
   while the Consumer MUST be able to receive (and, potentially,
   gracefully acknowledge) CLUE messages, it is free to ignore the
   information provided therein. Obviously, this is not a
   particularly sensible design choice in almost all conceivable
   cases.

   A CLUE-implementing device interoperates with a device that does
   not support CLUE, because the non-CLUE device does, by definition,
   not understand the offer of a CLUE channel in the initial
   offer/answer exchange and, therefore, will reject it. This
   rejection MUST be used as the indication to the CLUE-implementing
   device that the other side of the communication is not compliant
   with CLUE, and to fall back to behavior that does not require CLUE.

   As for the media, Provider and Consumer have an end-to-end
   communication relationship with respect to (RTP transported) media;
   and the mechanisms described herein and in companion documents do
   not change the aspects of setting up those RTP flows and sessions.
   In other words, the RTP media sessions conform to the negotiated
   SDP whether or not CLUE is used.

   Edt. note (StW): what's written below is likely correct, but is not
   the result of the introduction of CLUE, but rather the result of a
   generational overhaul of RTP usage that would have happened with or
   without CLUE. Suggest deleting the sentences below, up to the
   beginning of section 6. Is having a CLUE RTP Mapping document
   still the plan? If yes, we should have a real draft and a real
   reference.

   However, some form of RTP multiplexing is likely to be used by CLUE
   devices. More information about relating RTP flows to CLUE
   entities is in the CLUE RTP Mapping document.

6. Spatial Relationships

   In order for a Consumer to perform a proper rendering, it is often
   necessary or at least helpful for the Consumer to have received
   spatial information about the streams it is receiving. CLUE
   defines a coordinate system that allows Media Providers to describe
   the spatial relationships of their Media Captures to enable proper
   scaling and spatially sensible rendering of their streams. The
   coordinate system is based on a few principles:

   o Simple systems which do not have multiple Media Captures to
      associate spatially need not use the coordinate model.

   o Coordinates can either be in real, physical units (millimeters),
      have an unknown scale or have no physical scale. Systems which
      know their physical dimensions (for example professionally
      installed Telepresence room systems) MUST always provide those
      real-world measurements. Systems which don't know specific
      physical dimensions but still know relative distances MUST use
      'unknown scale'. 'No scale' is intended to be used where Media
      Captures from different devices (with potentially different
      scales) will be forwarded alongside one another (e.g. in the
      case of a middle box).

      * "millimeters" means the scale is in millimeters.

      * "Unknown" means the scale is not necessarily millimeters, but
        the scale is the same for every Capture in the Capture Scene.

      * "No Scale" means the scale could be different for each
        capture. An MCU Provider that advertises two adjacent
        captures and picks sources (which can change quickly) from
        different endpoints might use this value; the scale could be
        different and changing for each capture. But the areas of
        capture still represent a spatial relation between captures.
642 o The coordinate system is Cartesian X, Y, Z with the origin at a 643 spatial location of the provider's choosing. The Provider MUST 644 use the same coordinate system with same scale and origin for 645 all coordinates within the same Capture Scene. 647 The direction of increasing coordinate values is: 648 X increases from Camera-Left to Camera-Right 649 Y increases from Front to back 650 Z increases from low to high (i.e. floor to ceiling) 652 7. Media Captures and Capture Scenes 654 This section describes how Providers can describe the content of 655 media to Consumers. 657 7.1. Media Captures 659 Media Captures are the fundamental representations of streams that 660 a device can transmit. What a Media Capture actually represents is 661 flexible: 663 o It can represent the immediate output of a physical source (e.g. 664 camera, microphone) or 'synthetic' source (e.g. laptop computer, 665 DVD player). 667 o It can represent the output of an audio mixer or video composer 669 o It can represent a concept such as 'the loudest speaker' 671 o It can represent a conceptual position such as 'the leftmost 672 stream' 674 To identify and distinguish between multiple instances, video and 675 audio captures are labeled. For instance: VC1, VC2 and AC1, AC2, 676 where VC1 and VC2 refer to two different video captures and AC1 677 and AC2 refer to two different audio captures. 679 Some key points about Media Captures: 681 . A Media Capture is of a single media type (e.g. audio or 682 video) 683 . A Media Capture is associated with exactly one Capture Scene 684 . A Media Capture is associated with one or more Capture Scene 685 Entries 686 . A Media Capture has exactly one set of spatial information 687 . A Media Capture can be the source of one or more Capture 688 Encodings 690 Each Media Capture can be associated with attributes to describe 691 what it represents. 693 7.1.1. Media Capture Attributes 695 Media Capture Attributes describe information about the Captures. 
696 A Provider can use the Media Capture Attributes to describe the 697 Captures for the benefit of the Consumer in the Advertisement 698 message. Media Capture Attributes include: 700 . spatial information, such as point of capture, point on line 701 of capture, and area of capture, all of which, in combination 702 define the capture field of, for example, a camera; 703 . Capture multiplexing information (composed/switched video, 704 mono/stereo audio, maximum number of simultaneous encodings 705 per Capture and so on); and 707 . Other descriptive information to help the Consumer choose 708 between captures (description, presentation, view, priority, 709 language, role). 710 . Control information for use inside the CLUE protocol suite. 712 Point of Capture: 714 A field with a single Cartesian (X, Y, Z) point value which 715 describes the spatial location of the capturing device (such as 716 camera). 718 Point on Line of Capture: 720 A field with a single Cartesian (X, Y, Z) point value which 721 describes a position in space of a second point on the axis of the 722 capturing device; the first point being the Point of Capture (see 723 above). 725 Together, the Point of Capture and Point on Line of Capture define 726 an axis of the capturing device, for example the optical axis of a 727 camera. The Media Consumer can use this information to adjust how 728 it renders the received media if it so chooses. 730 Area of Capture: 732 A field with a set of four (X, Y, Z) points as a value which 733 describe the spatial location of what is being "captured". By 734 comparing the Area of Capture for different Media Captures within 735 the same Capture Scene a consumer can determine the spatial 736 relationships between them and render them correctly. 738 The four points MUST be co-planar, forming a quadrilateral, which 739 defines the Plane of Interest for the particular media capture. 
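The co-planarity requirement on the Area of Capture can be checked mechanically. The following is a minimal sketch, not part of the CLUE specification: it tests four (X, Y, Z) points with a scalar triple product. The function name, the tuple representation of a point, and the absolute tolerance are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # (X, Y, Z); hypothetical representation


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a: Point, b: Point) -> Point:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_coplanar(points: Sequence[Point], tol: float = 1e-6) -> bool:
    """Return True if the four Area of Capture points lie in one plane.

    The scalar triple product of the three edge vectors from the first
    point is zero exactly when all four points are co-planar. `tol` is an
    assumed absolute tolerance, not a value taken from the draft.
    """
    if len(points) != 4:
        raise ValueError("an Area of Capture has exactly four points")
    p0, p1, p2, p3 = points
    triple = _dot(_cross(_sub(p1, p0), _sub(p2, p0)), _sub(p3, p0))
    return abs(triple) <= tol


# Hypothetical area: a vertical plane 2000 mm in front of the origin.
FLAT_AREA = [(-1500, 2000, 0), (1500, 2000, 0),
             (-1500, 2000, 1200), (1500, 2000, 1200)]
# Same area with one corner pushed 500 mm further back.
BENT_AREA = FLAT_AREA[:3] + [(1500, 2500, 1200)]
```

With millimeter coordinates the triple product grows with the cube of the dimensions, so a real implementation would probably scale the tolerance to the size of the area.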
741 If the Area of Capture is not specified, it means the Media Capture 742 is not spatially related to any other Media Capture. 744 For a switched capture that switches between different sections 745 within a larger area, the area of capture MUST use coordinates for 746 the larger potential area. 748 Mobility of Capture: 750 This attribute indicates whether or not the point of capture, point 751 on line of capture, and area of capture values stay the same over 752 time, or are expected to change (potentially frequently). Possible 753 values are static, dynamic, and highly dynamic. 755 An example of "dynamic" is a camera mounted on a stand which is 756 occasionally hand-carried and placed at different positions in 757 order to provide the best angle to capture a work task. A camera 758 worn by a participant who moves around the room is an example of 759 "highly dynamic". In either case, the effect is that the capture 760 point, capture axis and area of capture change with time. 762 The capture point of a static capture MUST NOT move for the life of 763 the conference. The capture point of dynamic captures is 764 characterized by a change in position followed by a reasonable period 765 of stability, on the order of minutes. Highly dynamic 766 captures are characterized by a capture point that is constantly 767 moving. If the "area of capture", "capture point" and "line of 768 capture" attributes are included with dynamic or highly dynamic 769 captures, they indicate spatial information at the time of the 770 Advertisement. 772 Composed: 774 A boolean field which indicates whether or not the Media Capture is 775 a mix (audio) or composition (video) of streams. 777 This attribute is useful for a media consumer to avoid nesting a 778 composed video capture into another composed capture or rendering. 779 This attribute is not intended to describe the layout a media 780 provider uses when composing video streams.
782 Switched: 784 A boolean field which indicates whether or not the Media Capture 785 represents the (dynamic) most appropriate subset of a 'whole'. 786 What is 'most appropriate' is up to the provider and could be the 787 active speaker, a lecturer or a VIP. 789 Audio Channel Format: 791 A field with enumerated values which describes the method of 792 encoding used for audio. A value of 'mono' means the Audio Capture 793 has one channel. 'stereo' means the Audio Capture has two audio 794 channels, left and right. 796 This attribute applies only to Audio Captures. A single stereo 797 capture is different from two mono captures that have a left-right 798 spatial relationship. A stereo capture maps to a single Capture 799 Encoding, while each mono audio capture maps to a separate Capture 800 Encoding. 802 Max Capture Encodings: 804 An optional attribute indicating the maximum number of Capture 805 Encodings that can be simultaneously active for the Media Capture. 806 The number of simultaneous Capture Encodings is also limited by the 807 restrictions of the Encoding Group for the Media Capture. 809 Description: 811 Human-readable description of the Capture, which could be in 812 multiple languages. 814 Presentation: 816 This attribute indicates that the capture originates from a 817 presentation device, that is one that provides supplementary 818 information to a conference through slides, video, still images, 819 data etc. Where more information is known about the capture it MAY 820 be expanded hierarchically to indicate the different types of 821 presentation media, e.g. presentation.slides, presentation.image 822 etc. 824 Note: It is expected that a number of keywords will be defined that 825 provide more detail on the type of presentation. 827 View: 829 A field with enumerated values, indicating what type of view the 830 capture relates to. The Consumer can use this information to help 831 choose which Media Captures it wishes to receive. 
The value MUST 832 be one of: 834 Room - Captures the entire scene 836 Table - Captures the conference table with seated participants 838 Individual - Captures an individual participant 839 Lectern - Captures the region of the lectern including the 840 presenter, for example in a classroom style conference room 842 Audience - Captures a region showing the audience in a classroom 843 style conference room 845 Language: 847 This attribute indicates one or more languages used in the content 848 of the media capture. Captures MAY be offered in different 849 languages in case of multilingual and/or accessible conferences. A 850 Consumer can use this attribute to differentiate between them and 851 pick the appropriate one. 853 Note that the Language attribute is defined and meaningful both 854 for audio and video captures. In the case of audio captures, the 855 meaning is obvious. For a video capture, "Language" could, for 856 example, be sign interpretation or text. 858 Role: 860 Edt. Note -- this is a placeholder for a role attribute, as 861 discussed in draft-groves-clue-capture-attr. We expect to continue 862 discussing the role attribute in the context of that draft, and 863 follow-on drafts, before adding it to this framework document. 865 Priority: 867 This attribute indicates a relative priority between different 868 Media Captures. The Provider sets this priority, and the Consumer 869 MAY use the priority to help decide which captures it wishes to 870 receive. 872 The "priority" attribute is an integer which indicates a relative 873 priority between captures. For example, it is possible to assign a 874 priority between two presentation captures that would allow a 875 remote endpoint to determine which presentation is more important. 876 Priority is assigned at the individual capture level. It represents 877 the Provider's view of the relative priority between captures with 878 a priority. The same priority number MAY be used across multiple 879 captures.
It indicates that they are equally important. If no priority 880 is assigned, no assumptions regarding the relative importance of the 881 capture can be made. 883 Embedded Text: 885 This attribute indicates that a capture provides embedded textual 886 information. For example, the video capture MAY contain speech-to- 887 text information composed with the video image. This attribute is 888 only applicable to video captures and presentation streams with 889 visual information. 891 Related To: 893 This attribute indicates the capture contains additional 894 complementary information related to another capture. The value 895 indicates the other capture to which this capture is providing 896 additional information. 898 For example, a conference can utilize translators or facilitators 899 that provide an additional audio stream (i.e. a translation or 900 description or commentary of the conference). Where multiple 901 captures are available, it may be advantageous for a Consumer to 902 select a complementary capture instead of or in addition to a 903 capture it relates to. 905 7.2. Capture Scene 907 In order for a Provider's individual Captures to be used 908 effectively by a Consumer, the Provider organizes the Captures into 909 one or more Capture Scenes, with the structure and contents of 910 these Capture Scenes being sent from the Provider to the Consumer 911 in the Advertisement. 913 A Capture Scene is a structure representing a spatial region 914 containing one or more Capture Devices, each capturing media 915 representing a portion of the region. A Capture Scene includes one 916 or more Capture Scene Entries, with each entry including one or 917 more Media Captures. A Capture Scene represents, for example, the 918 video image of a group of people seated next to each other, along 919 with the sound of their voices, which could be represented by some 920 number of VCs and ACs in the Capture Scene Entries.
A middle box 921 can also describe in Capture Scenes what it constructs from media 922 Streams it receives. 924 A Provider MAY advertise one or more Capture Scenes . What 925 constitutes an entire Capture Scene is up to the Provider. A 926 simple Provider might typically use one Capture Scene for 927 participant media (live video from the room cameras) and another 928 Capture Scene for a computer generated presentation. In more 929 complex systems, the use of additional Capture Scenes is also 930 sensible. For example, a classroom may advertise two Capture 931 Scenes involving live video, one including only the camera 932 capturing the instructor (and associated audio), the other 933 including camera(s) capturing students (and associated audio). 935 A Capture Scene MAY (and typically will) include more than one type 936 of media. For example, a Capture Scene can include several Capture 937 Scene Entries for Video Captures, and several Capture Scene Entries 938 for Audio Captures. A particular Capture MAY be included in more 939 than one Capture Scene Entry. 941 A provider MAY express spatial relationships between Captures that 942 are included in the same Capture Scene. However, there is not 943 necessarily the same spatial relationship between Media Captures 944 that are in different Capture Scenes. In other words, Capture 945 Scenes can use their own spatial measurement system as outlined 946 above in section 6. 948 A Provider arranges Captures in a Capture Scene to help the 949 Consumer choose which captures it wants to render. The Capture 950 Scene Entries in a Capture Scene are different alternatives the 951 Provider is suggesting for representing the Capture Scene. The 952 order of Capture Scene Entries within a Capture Scene has no 953 significance. The Media Consumer can choose to receive all Media 954 Captures from one Capture Scene Entry for each media type (e.g. 
955 audio and video), or it can pick and choose Media Captures 956 regardless of how the Provider arranges them in Capture Scene 957 Entries. Different Capture Scene Entries of the same media type 958 are not necessarily mutually exclusive alternatives. Also note 959 that the presence of multiple Capture Scene Entries (with 960 potentially multiple encoding options in each entry) in a given 961 Capture Scene does not necessarily imply that a Provider is able to 962 serve all the associated media simultaneously (although the 963 construction of such an over-rich Capture Scene is probably not 964 sensible in many cases). What a Provider can send simultaneously 965 is determined through the Simultaneous Transmission Set mechanism, 966 described in section 7.3. 968 Captures within the same Capture Scene entry MUST be of the same 969 media type - it is not possible to mix audio and video captures in 970 the same Capture Scene Entry, for instance. The Provider MUST be 971 capable of encoding and sending all Captures in a single Capture 972 Scene Entry simultaneously. The order of Captures within a Capture 973 Scene Entry has no significance. A Consumer can decide to receive 974 all the Captures in a single Capture Scene Entry, but a Consumer 975 could also decide to receive just a subset of those captures. A 976 Consumer can also decide to receive Captures from different Capture 977 Scene Entries, all subject to the constraints set by Simultaneous 978 Transmission Sets, as discussed in section 7.3. 980 When a Provider advertises a Capture Scene with multiple entries, 981 it is essentially signaling that there are multiple representations 982 of the same Capture Scene available. In some cases, these multiple 983 representations would typically be used simultaneously (for 984 instance a "video entry" and an "audio entry"). 
In some cases the 985 entries would conceptually be alternatives (for instance an entry 986 consisting of three Video Captures covering the whole room versus 987 an entry consisting of just a single Video Capture covering only 988 the center of a room). In this latter example, one sensible choice 989 for a Consumer would be to indicate (through its Configure and 990 possibly through an additional offer/answer exchange) the Captures 991 of that Capture Scene Entry that most closely matches the 992 Consumer's number of display devices or screen layout. 994 The following is an example of 4 potential Capture Scene Entries 995 for an endpoint-style Provider: 997 1. (VC0, VC1, VC2) - left, center and right camera Video Captures 999 2. (VC3) - Video Capture associated with loudest room segment 1001 3. (VC4) - Video Capture zoomed out view of all people in the room 1003 4. (AC0) - main audio 1005 The first entry in this Capture Scene example is a list of Video 1006 Captures which have a spatial relationship to each other. 1007 Determination of the order of these captures (VC0, VC1 and VC2) for 1008 rendering purposes is accomplished through use of their Area of 1009 Capture attributes. The second entry (VC3) and the third entry 1010 (VC4) are alternative representations of the same room's video, 1011 which might be better suited to some Consumers' rendering 1012 capabilities. The inclusion of the Audio Capture in the same 1013 Capture Scene indicates that AC0 is associated with all of those 1014 Video Captures, meaning it comes from the same spatial region. 1015 Therefore, if audio were to be rendered at all, this audio would be 1016 the correct choice irrespective of which Video Captures were 1017 chosen. 1019 7.2.1. Capture Scene attributes 1021 Capture Scene Attributes can be applied to Capture Scenes as well 1022 as to individual media captures. Attributes specified at this 1023 level apply to all constituent Captures. Capture Scene attributes 1024 include 1026 .
Human-readable description of the Capture Scene, which could 1027 be in multiple languages; 1028 . Scale information (millimeters, unknown, no scale), as 1029 described in Section 6. 1031 7.2.2. Capture Scene Entry attributes 1033 A Capture Scene can include one or more Capture Scene Entries in 1034 addition to the Capture Scene-wide attributes described above. 1035 Capture Scene Entry attributes apply to the Capture Scene Entry as 1036 a whole, i.e. to all Captures that are part of the Capture Scene 1037 Entry. 1039 Capture Scene Entry attributes include: 1041 . Human-readable description of the Capture Scene Entry, which 1042 could be in multiple languages; 1043 . Scene-switch-policy: {site-switch, segment-switch} 1045 A media provider uses this scene-switch-policy attribute to 1046 indicate its support for different switching policies. If a 1047 provider supports both policies, it MAY advertise separate Capture 1048 Scene Entries containing separate Captures, each entry with a 1049 separate scene-switch-policy value. If the provider does not 1050 support either of these policies, it MUST omit this attribute. 1052 The "site-switch" policy means all captures are switched at the 1053 same time to keep captures from the same endpoint site together. 1054 Let's say the speaker is at site A and everyone else is at a 1055 "remote" site. 1057 When the room at site A is shown, all the camera images from site A 1058 are forwarded to the remote sites. Therefore at each receiving 1059 remote site, all the screens display camera images from site A. 1061 This can be used to preserve full size image display, and also 1062 provide full visual context of the displayed far end, site A. In 1063 site switching, there is a fixed relation between the cameras in 1064 each room and the displays in remote rooms. The room or 1065 participants being shown can be switched from time to time based 1066 on, for example, who is speaking or by manual control.
1068 The "segment-switch" policy means different captures can switch at 1069 different times, and can be coming from different endpoints. Still 1070 using site A as where the speaker is, and "remote" to refer to all 1071 the other sites, in segment switching, rather than sending all the 1072 images from site A, only the image containing the speaker at site A 1073 is shown. The camera images of the current speaker and previous 1074 speakers (if any) are forwarded to the other sites in the 1075 conference. 1077 Therefore the screens in each site are usually displaying images 1078 from different remote sites - the current speaker at site A and the 1079 previous ones. This strategy can be used to preserve full size 1080 image display, and also capture the non-verbal communication 1081 between the speakers. In segment switching, the display depends on 1082 the activity in the remote rooms - generally, but not necessarily 1083 based on audio / speech detection. 1085 7.3. Simultaneous Transmission Set Constraints 1087 In many practical cases, a Provider has constraints or limitations 1088 on its ability to send Captures simultaneously. One type of 1089 limitation is caused by the physical limitations of capture 1090 mechanisms; these constraints are represented by a simultaneous 1091 transmission set. The second type of limitation reflects the 1092 encoding resources available, such as bandwidth or video encoding 1093 throughput (macroblocks/second). This type of constraint is 1094 captured by encoding groups, discussed below. 1096 Some Endpoints or MCUs can send multiple Captures simultaneously, 1097 however sometimes there are constraints that limit which Captures 1098 can be sent simultaneously with other Captures. A device may not 1099 be able to be used in different ways at the same time. Provider 1100 Advertisements are made so that the Consumer can choose one of 1101 several possible mutually exclusive usages of the device. 
This 1102 type of constraint is expressed in a Simultaneous Transmission Set, 1103 which lists all the Captures of a particular media type (e.g. 1104 audio, video, text) that can be sent at the same time. There are 1105 different Simultaneous Transmission Sets for each media type in the 1106 Advertisement. This is easier to show in an example. 1108 Consider the example of a room system where there are three cameras 1109 each of which can send a separate capture covering two persons 1110 each- VC0, VC1, VC2. The middle camera can also zoom out (using an 1111 optical zoom lens) and show all six persons, VC3. But the middle 1112 camera cannot be used in both modes at the same time - it has to 1113 either show the space where two participants sit or the whole six 1114 seats, but not both at the same time. As a result, VC1 and VC3 1115 cannot be sent simultaneously. 1117 Simultaneous transmission sets are expressed as sets of the Media 1118 Captures that the Provider could transmit at the same time (though, 1119 in some cases, it is not intuitive to do so). In this example the 1120 two simultaneous sets are shown in Table 1. If a Provider 1121 advertises one or more mutually exclusive Simultaneous Transmission 1122 Sets, then for each media type the Consumer MUST ensure that it 1123 chooses Media Captures that lie wholly within one of those 1124 Simultaneous Transmission Sets.
1126   +-------------------+
1127   | Simultaneous Sets |
1128   +-------------------+
1129   | {VC0, VC1, VC2}   |
1130   | {VC0, VC3, VC2}   |
1131   +-------------------+
1133   Table 1: Two Simultaneous Transmission Sets
1135 A Provider OPTIONALLY can include the simultaneous sets in its 1136 provider Advertisement. These simultaneous set constraints apply 1137 across all the Capture Scenes in the Advertisement. It is a syntax 1138 conformance requirement that the simultaneous transmission sets 1139 MUST allow all the media captures in any particular Capture Scene 1140 Entry to be used simultaneously.
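The Simultaneous Transmission Set rules above reduce to subset tests. The sketch below is an illustration under stated assumptions, not a normative algorithm: captures are modeled as plain string identifiers, one media type is handled at a time, and the function names are hypothetical. It uses the two sets from Table 1.

```python
from typing import FrozenSet, Iterable, Sequence


def choice_is_allowed(chosen: Iterable[str],
                      simultaneous_sets: Sequence[FrozenSet[str]]) -> bool:
    """Consumer-side check for one media type.

    The chosen Media Captures must lie wholly within at least one of the
    advertised Simultaneous Transmission Sets.
    """
    wanted = frozenset(chosen)
    return any(wanted <= allowed for allowed in simultaneous_sets)


def entries_are_consistent(entries: Iterable[Iterable[str]],
                           simultaneous_sets: Sequence[FrozenSet[str]]) -> bool:
    """Provider-side check: every Capture Scene Entry must be sendable
    as a whole, i.e. fit inside some Simultaneous Transmission Set."""
    return all(choice_is_allowed(entry, simultaneous_sets) for entry in entries)


# The two video sets from Table 1.
TABLE_1 = [frozenset({"VC0", "VC1", "VC2"}),
           frozenset({"VC0", "VC3", "VC2"})]

# Hypothetical video entries for the three-camera room in the example.
VIDEO_ENTRIES = [("VC0", "VC1", "VC2"), ("VC3",)]
```

Here `choice_is_allowed({"VC1", "VC3"}, TABLE_1)` is false because the middle camera cannot produce both captures at once, while any subset of either row of Table 1 passes.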
1142 For shorthand convenience, a Provider MAY describe a Simultaneous 1143 Transmission Set in terms of Capture Scene Entries and Capture 1144 Scenes. If a Capture Scene Entry is included in a Simultaneous 1145 Transmission Set, then all Media Captures in the Capture Scene 1146 Entry are included in the Simultaneous Transmission Set. If a 1147 Capture Scene is included in a Simultaneous Transmission Set, then 1148 all its Capture Scene Entries (of the corresponding media type) are 1149 included in the Simultaneous Transmission Set. The end result 1150 reduces to a set of Media Captures in either case. 1152 If an Advertisement does not include Simultaneous Transmission 1153 Sets, then the Provider MUST be able to provide all Capture Scenes 1154 simultaneously. If multiple capture Scene Entries are in a Capture 1155 Scene then the Consumer chooses at most one Capture Scene Entry per 1156 Capture Scene for each media type. 1158 If an Advertisement includes multiple Capture Scene Entries in a 1159 Capture Scene then the Consumer MAY choose one Capture Scene Entry 1160 for each media type, or MAY choose individual Captures based on the 1161 Simultaneous Transmission Sets. 1163 8. Encodings 1165 Individual encodings and encoding groups are CLUE's mechanisms 1166 allowing a Provider to signal its limitations for sending Captures, 1167 or combinations of Captures, to a Consumer. Consumers can map the 1168 Captures they want to receive onto the Encodings, with encoding 1169 parameters they want. As for the relationship between the CLUE- 1170 specified mechanisms based on Encodings and the SIP Offer-Answer 1171 exchange, please refer to section 4. 1173 8.1. Individual Encodings 1175 An Individual Encoding represents a way to encode a Media Capture 1176 to become a Capture Encoding, to be sent as an encoded media stream 1177 from the Provider to the Consumer. An Individual Encoding has a 1178 set of parameters characterizing how the media is encoded. 
1180 Different media types have different parameters, and different 1181 encoding algorithms may have different parameters. An Individual 1182 Encoding can be assigned to at most one Capture Encoding at any 1183 given time. 1185 The parameters of an Individual Encoding represent the maximum 1186 values for certain aspects of the encoding. A particular 1187 instantiation into a Capture Encoding MAY use lower values than 1188 these maximums if that is applicable for the media in question. 1189 For example, most video codec specifications require a conformant 1190 decoder to decode resolutions and frame rates lower than what has 1191 been negotiated as a maximum, so downgrading the CLUE maximum 1192 values for macroblocks/second is appropriate. On the other hand, 1193 downgrading the sample rate of G.711 audio below 8 kHz is not 1194 specified in G.711 and therefore not applicable in the sense 1195 described here. 1197 Individual Encoding parameters are represented in SDP [RFC4566], 1198 not in CLUE messages. For example, for a video encoding using 1199 H.26x compression technologies, this can include parameters such 1200 as: 1202 . Maximum bandwidth; 1203 . Maximum picture size in pixels; 1204 . Maximum number of pixels to be processed per second. 1206 The bandwidth parameter is the only one that specifically relates 1207 to a CLUE Advertisement, as it can be further constrained by the 1208 maximum group bandwidth in an Encoding Group. 1210 8.2. Encoding Group 1212 An Encoding Group includes a set of one or more Individual 1213 Encodings, and parameters that apply to the group as a whole. By 1214 grouping multiple Individual Encodings together, an Encoding Group 1215 describes additional constraints on bandwidth for the group. 1217 The Encoding Group data structure contains: 1219 . Maximum bitrate for all encodings in the group combined; 1220 . A list of identifiers for audio and video encodings, 1221 respectively, belonging to the group.
1223 When the Individual Encodings in a group are instantiated into 1224 Capture Encodings, each Capture Encoding has a bitrate that MUST be 1225 less than or equal to the maximum bitrate for the particular Individual 1226 Encoding. The "maximum bitrate for all encodings in the group" 1227 parameter gives the additional restriction that the sum of all the 1228 individual Capture Encoding bitrates MUST be less than or equal to 1229 this group value. 1231 The following diagram illustrates one example of the structure of a 1232 media provider's Encoding Groups and their contents.
1234   ,-------------------------------------------------.
1235   |             Media Provider                      |
1236   |                                                 |
1237   |  ,--------------------------------------.       |
1238   |  | ,--------------------------------------.     |
1239   |  | | ,--------------------------------------.   |
1240   |  | | |          Encoding Group              |   |
1241   |  | | | ,-----------.                        |   |
1242   |  | | | |           | ,---------.            |   |
1243   |  | | | |           | |         | ,---------.|   |
1244   |  | | | | Encoding1 | |Encoding2| |Encoding3||   |
1245   |  `.| | |           | |         | `---------'|   |
1246   |    `.| `-----------' `---------'            |   |
1247   |      `--------------------------------------'   |
1248   `-------------------------------------------------'
1250   Figure 1: Encoding Group Structure
1252 A Provider advertises one or more Encoding Groups. Each Encoding 1253 Group includes one or more Individual Encodings. Each Individual 1254 Encoding can represent a different way of encoding media. For 1255 example, one Individual Encoding may be 1080p60 video, another could 1256 be 720p30, with a third being CIF, all in, for example, H.264 1257 format. 1259 While a typical three codec/display system might have one Encoding 1260 Group per "codec box" (physical codec, connected to one camera and 1261 one screen), there are many possibilities for the number of 1262 Encoding Groups a Provider may be able to offer and for the 1263 encoding values in each Encoding Group.
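The two bitrate rules for an Encoding Group (a per-encoding maximum and a group-wide maximum on the sum) can be sketched as follows. This is an illustrative model only: the class, field and function names, the encoding identifiers, and all bitrate figures are assumptions, and in CLUE the per-encoding limits would come from SDP rather than from a structure like this.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EncodingGroup:
    """Hypothetical in-memory model of an advertised Encoding Group."""
    max_group_bitrate: int                   # bits per second, all encodings combined
    max_bitrate_by_encoding: Dict[str, int]  # Individual Encoding id -> max bits per second


def bitrates_fit(group: EncodingGroup, requested: Dict[str, int]) -> bool:
    """Check requested Capture Encoding bitrates against the group.

    `requested` maps an Individual Encoding id to the bitrate the
    Consumer wants for the Capture Encoding instantiated from it.
    """
    for encoding_id, bitrate in requested.items():
        limit = group.max_bitrate_by_encoding.get(encoding_id)
        if limit is None or bitrate > limit:
            return False  # unknown encoding, or above its individual maximum
    # The sum over all instantiated encodings must not exceed the group value.
    return sum(requested.values()) <= group.max_group_bitrate


# Example values are invented for illustration.
GROUP = EncodingGroup(
    max_group_bitrate=6_000_000,
    max_bitrate_by_encoding={"ENC1": 4_000_000, "ENC2": 4_000_000, "ENC3": 1_000_000},
)
```

Note that not every encoding in the group has to be instantiated, so `requested` may name only a subset of the group's Individual Encodings.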
1265 There is no requirement for all Encodings within an Encoding Group 1266 to be instantiated at the same time. 1268 9. Associating Captures with Encoding Groups 1270 Every Capture MUST be associated with at least one Encoding Group, 1271 which is used to instantiate that Capture into one or more Capture 1272 Encodings. More than one Capture MAY use the same Encoding Group. 1274 The maximum number of streams that can result from a particular 1275 Encoding Group constraint is equal to the number of individual 1276 Encodings in the group. The actual number of Capture Encodings 1277 used at any time MAY be less than this maximum. Any of the 1278 Captures that use a particular Encoding Group can be encoded 1279 according to any of the Individual Encodings in the group. If 1280 there are multiple Individual Encodings in the group, then the 1281 Consumer can configure the Provider, via a Configure message, to 1282 encode a single Media Capture into multiple different Capture 1283 Encodings at the same time, subject to the Max Capture Encodings 1284 constraint, with each capture encoding following the constraints of 1285 a different Individual Encoding. 1287 It is a protocol conformance requirement that the Encoding Groups 1288 MUST allow all the Captures in a particular Capture Scene Entry to 1289 be used simultaneously. 1291 10. Consumer's Choice of Streams to Receive from the Provider 1293 After receiving the Provider's Advertisement message (that includes 1294 media captures and associated constraints), the Consumer composes 1295 its reply to the Provider in the form of a Configure message. The 1296 Consumer is free to use the information in the Advertisement as it 1297 chooses, but there are a few obviously sensible design choices, 1298 which are outlined below. 1300 If multiple Providers connect to the same Consumer (i.e. 
in an 1301 MCU-less multiparty call), it is the responsibility of the Consumer 1302 to compose Configures for each Provider that fulfill both each 1303 Provider's constraints, as expressed in its Advertisement, and 1304 the Consumer's own capabilities. 1306 In an MCU-based multiparty call, the MCU can logically terminate 1307 the Advertisement/Configure negotiation in that it can hide the 1308 characteristics of the receiving endpoint and rely on its own 1309 capabilities (transcoding/transrating/...) to create Media Streams 1310 that can be decoded at the Endpoint Consumers. The timing of an 1311 MCU's sending of Advertisements (for its outgoing ports) and 1312 Configures (for its incoming ports, in response to Advertisements 1313 received there) is up to the MCU and implementation dependent. 1315 As a general outline, a Consumer can choose, based on the 1316 Advertisement it has received, which Captures it wishes to receive, 1317 and which Individual Encodings it wants the Provider to use to 1318 encode the Captures. Each Capture has an Encoding Group ID 1319 attribute which specifies which Individual Encodings are available 1320 to be used for that Capture. 1322 A Configure Message includes a list of Capture Encodings. These 1323 are the Capture Encodings the Consumer wishes to receive from the 1324 Provider. Each Capture Encoding refers to one Media Capture and one 1325 Individual Encoding, and includes the encoding parameter values. A 1326 Configure Message does not include references to Capture Scenes or 1327 Capture Scene Entries. 1329 For each Capture the Consumer wants to receive, it configures one 1330 or more of the encodings in that capture's encoding group. The 1331 Consumer does this by telling the Provider, in its Configure 1332 Message, parameters such as the resolution, frame rate, bandwidth, 1333 etc. for each Capture Encoding for its chosen Captures.
Upon 1334 receipt of this Configure from the Consumer, common knowledge is 1335 established between Provider and Consumer regarding sensible 1336 choices for the media streams and their parameters. The setup of 1337 the actual media channels, at least in the simplest case, is left 1338 to a following offer-answer exchange. Optimized implementations 1339 MAY speed up the reaction to the offer-answer exchange by reserving 1340 the resources at the time of finalization of the CLUE handshake. 1342 Edt. Note (StW): is the sentence below still correct? 1344 Even more advanced devices MAY choose to establish media streams 1345 without an offer-answer exchange, for example by overloading 1346 existing 5 tuple connections with the negotiated media. 1348 In order to meaningfully create and send an initial Configure, the 1349 Consumer needs to have received at least one Advertisement from the 1350 Provider. 1352 In addition, the Consumer can send a Configure at any time during 1353 the call. The Configure MUST be valid according to the most 1354 recently received Advertisement. The Consumer can send a Configure 1355 either in response to a new Advertisement from the Provider or on 1356 its own, for example because of a local change in conditions 1357 (people leaving the room, connectivity changes, multipoint related 1358 considerations). 1360 When choosing which Media Streams to receive from the Provider, and 1361 the encoding characteristics of those Media Streams, the Consumer 1362 advantageously takes several things into account: its local 1363 preference, simultaneity restrictions, and encoding limits. 1365 10.1. 
Local preference 1367 A variety of local factors influence the Consumer's choice of 1368 Media Streams to be received from the Provider: 1370 o if the Consumer is an Endpoint, it is likely that it would 1371 choose, where possible, to receive video and audio Captures that 1372 match the number of display devices and audio system it has 1374 o if the Consumer is a middle box such as an MCU, it MAY choose to 1375 receive loudest speaker streams (in order to perform its own 1376 media composition) and avoid pre-composed video Captures 1378 o user choice (for instance, selection of a new layout) MAY result 1379 in a different set of Captures, or different encoding 1380 characteristics, being required by the Consumer 1382 10.2. Physical simultaneity restrictions 1384 Often there are physical simultaneity constraints of the Provider 1385 that affect the Provider's ability to simultaneously send all of 1386 the captures the Consumer would wish to receive. For instance, a 1387 middle box such as an MCU, when connected to a multi-camera room 1388 system, might prefer to receive both individual video streams of 1389 the people present in the room and an overall view of the room 1390 from a single camera. Some Endpoint systems might be able to 1391 provide both of these sets of streams simultaneously, whereas 1392 others might not (if the overall room view were produced by 1393 changing the optical zoom level on the center camera, for 1394 instance). 1396 10.3. Encoding and encoding group limits 1398 Each of the Provider's encoding groups has limits on bandwidth and 1399 computational complexity, and the constituent potential encodings 1400 have limits on the bandwidth, computational complexity, video 1401 frame rate, and resolution that can be provided. 
When choosing 1402 the Captures to be received from a Provider, a Consumer device 1403 MUST ensure that the encoding characteristics requested for each 1404 individual Capture fit within the capability of the encoding it 1405 is being configured to use, and that the combined 1406 encoding characteristics for Captures fit within the capabilities 1407 of their associated encoding groups. In some cases, this could 1408 cause an otherwise "preferred" choice of capture encodings to be 1409 passed over in favor of different Capture Encodings; for instance, 1410 if a set of three Captures could only be provided at a low 1411 resolution then a three screen device could switch to favoring a 1412 single, higher quality, Capture Encoding. 1414 11. Extensibility 1416 One important characteristic of the Framework is its 1417 extensibility. Telepresence is a relatively new industry and 1418 while we can foresee certain directions, we also do not know 1419 everything about how it will develop. The standard for 1420 interoperability and handling multiple streams must be 1421 future-proof. The framework itself is inherently extensible through 1422 expanding the data model types. For example: 1424 o Adding more types of media, such as telemetry, can be done by 1425 defining additional types of Captures in addition to audio and 1426 video. 1428 o Adding new functionalities, such as 3-D, may require 1429 additional attributes describing the Captures. 1431 o Adding new codecs, such as H.265, can be accomplished by 1432 defining new encoding variables. 1434 The infrastructure is designed to be extended rather than 1435 requiring new infrastructure elements. Extension comes through 1436 adding to defined types. 1438 12. Examples - Using the Framework (Informative) 1440 This section gives some examples, first from the point of view of 1441 the Provider, then the Consumer. 1443 12.1.
Provider Behavior 1445 This section shows some examples in more detail of how a Provider 1446 can use the framework to represent a typical case for telepresence 1447 rooms. First an endpoint is illustrated, then an MCU case is 1448 shown. 1450 12.1.1. Three screen Endpoint Provider 1452 Consider an Endpoint with the following description: 1454 3 cameras, 3 displays, a 6 person table 1456 o Each camera can provide one Capture for each 1/3 section of the 1457 table 1459 o A single Capture representing the active speaker can be provided 1460 (voice activity based camera selection to a given encoder input 1461 port implemented locally in the Endpoint) 1463 o A single Capture representing the active speaker with the other 1464 2 Captures shown picture in picture within the stream can be 1465 provided (again, implemented inside the endpoint) 1467 o A Capture showing a zoomed out view of all 6 seats in the room 1468 can be provided 1470 The audio and video Captures for this Endpoint can be described as 1471 follows. 1473 Video Captures: 1475 o VC0- (the camera-left camera stream), encoding group=EG0, 1476 switched=false, view=table 1478 o VC1- (the center camera stream), encoding group=EG1, 1479 switched=false, view=table 1481 o VC2- (the camera-right camera stream), encoding group=EG2, 1482 switched=false, view=table 1484 o VC3- (the loudest panel stream), encoding group=EG1, 1485 switched=true, view=table 1487 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1488 composed=true, switched=true, view=room 1490 o VC5- (the zoomed out view of all people in the room), encoding 1491 group=EG1, composed=false, switched=false, view=room 1493 o VC6- (presentation stream), encoding group=EG1, presentation, 1494 switched=false 1496 The following diagram is a top view of the room with 3 cameras, 3 1497 displays, and 6 seats. Each camera is capturing 2 people. The 1498 six seats are not all in a straight line. 1500 ,-. 
d 1501 ( )`--.__ +---+ 1502 `-' / `--.__ | | 1503 ,-. | `-.._ |_-+Camera 2 (VC2) 1504 ( ).' ___..-+-''`+-+ 1505 `-' |_...---'' | | 1506 ,-.c+-..__ +---+ 1507 ( )| ``--..__ | | 1508 `-' | ``+-..|_-+Camera 1 (VC1) 1509 ,-. | __..--'|+-+ 1510 ( )| __..--' | | 1511 `-'b|..--' +---+ 1512 ,-. |``---..___ | | 1513 ( )\ ```--..._|_-+Camera 0 (VC0) 1514 `-' \ _..-''`-+ 1515 ,-. \ __.--'' | | 1516 ( ) |..-'' +---+ 1517 `-' a 1519 The two points labeled b and c are intended to be at the midpoint 1520 between the seating positions, and where the fields of view of the 1521 cameras intersect. 1523 The plane of interest for VC0 is a vertical plane that intersects 1524 points 'a' and 'b'. 1526 The plane of interest for VC1 intersects points 'b' and 'c'. The 1527 plane of interest for VC2 intersects points 'c' and 'd'. 1529 This example uses an area scale of millimeters. 1531 Areas of capture: 1533 bottom left bottom right top left top right 1534 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1535 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1536 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1537 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1538 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1539 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1540 VC6 none 1542 Points of capture: 1543 VC0 (-1678,0,800) 1544 VC1 (0,0,800) 1545 VC2 (1678,0,800) 1546 VC3 none 1547 VC4 none 1548 VC5 (0,0,800) 1549 VC6 none 1551 In this example, the right edge of the VC0 area lines up with the 1552 left edge of the VC1 area. It doesn't have to be this way. There 1553 could be a gap or an overlap. One additional thing to note for 1554 this example is the distance from a to b is equal to the distance 1555 from b to c and the distance from c to d. All these distances are 1556 1346 mm. This is the planar width of each area of capture for VC0, 1557 VC1, and VC2. 
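As an informal check of the geometry described above (not part of the framework), the planar width of each area of capture can be recomputed from the bottom left and bottom right coordinates in the table. The sketch below assumes the coordinates are (x, y, z) in millimeters and that the planar width is the Euclidean distance in the x-y plane; the names AREAS, planar_width, edges_line_up, and WIDTHS are hypothetical and exist only for this illustration.

```python
import math

# Bottom-left and bottom-right corners (x, y) of the areas of capture for
# VC0, VC1 and VC2, in millimeters, copied from the table in this example.
# The z coordinate is 0 for all of these points and is omitted.
AREAS = {
    "VC0": ((-2011, 2850), (-673, 3000)),
    "VC1": ((-673, 3000), (673, 3000)),
    "VC2": ((673, 3000), (2011, 2850)),
}


def planar_width(bottom_left, bottom_right):
    """Distance between the two bottom corners in the x-y plane."""
    dx = bottom_right[0] - bottom_left[0]
    dy = bottom_right[1] - bottom_left[1]
    return math.hypot(dx, dy)


def edges_line_up(left_area, right_area):
    """True if the right edge of one area is the left edge of the next."""
    return left_area[1] == right_area[0]


# Widths rounded to the nearest millimeter.
WIDTHS = {name: round(planar_width(*corners)) for name, corners in AREAS.items()}

if __name__ == "__main__":
    for name, width in WIDTHS.items():
        print(name, width, "mm")
    print("VC0/VC1 adjacent:", edges_line_up(AREAS["VC0"], AREAS["VC1"]))
    print("VC1/VC2 adjacent:", edges_line_up(AREAS["VC1"], AREAS["VC2"]))
```

Under these assumptions the outer two areas come out fractionally wider than the center one before rounding, because their planes are angled relative to the x axis; all three round to the 1346 mm stated in the text.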
1559 Note the text in parentheses (e.g. "the camera-left camera 1560 stream") is not explicitly part of the model; it is just 1561 explanatory text for this example, and is not included in the 1562 model with the media captures and attributes. Also, the 1563 "composed" boolean attribute doesn't say anything about how a 1564 capture is composed, so the media consumer can't tell based on 1565 this attribute that VC4 is composed of a "loudest panel with 1566 PiPs". 1568 Audio Captures: 1570 o AC0 (camera-left), encoding group=EG3, content=main, channel 1571 format=mono 1573 o AC1 (camera-right), encoding group=EG3, content=main, channel 1574 format=mono 1576 o AC2 (center), encoding group=EG3, content=main, channel 1577 format=mono 1579 o AC3 being a simple pre-mixed audio stream from the room (mono), 1580 encoding group=EG3, content=main, channel format=mono 1582 o AC4 audio stream associated with the presentation video (mono), 1583 encoding group=EG3, content=slides, channel format=mono
1585 Areas of capture:
1587     bottom left    bottom right  top left         top right
1589 AC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757)
1590 AC1 (  673,3000,0) (2011,2850,0) (  673,3000,757) (2011,3000,757)
1591 AC2 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757)
1592 AC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757)
1593 AC4 none
1595 The physical simultaneity information is:
1597 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}
1599 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}
1601 This constraint indicates it is not possible to use all the VCs at 1602 the same time. VC5 cannot be used at the same time as VC1 or VC3 1603 or VC4. Also, using every member in the set simultaneously may 1604 not make sense - for example VC3 (loudest) and VC4 (loudest with 1605 PiPs). (In addition, there are encoding constraints that make 1606 choosing all of the VCs in a set impossible.
VC1, VC3, VC4, VC5, 1607 VC6 all use EG1 and EG1 has only 3 ENCs. This constraint shows up 1608 in the encoding groups, not in the simultaneous transmission 1609 sets.) 1611 In this example there are no restrictions on which audio captures 1612 can be sent simultaneously. 1614 Encoding Groups: 1616 This example has three encoding groups associated with the video 1617 captures. Each group can have 3 encodings, but with each 1618 potential encoding having a progressively lower specification. In 1619 this example, 1080p60 transmission is possible (as ENC0 has a 1620 maxPps value compatible with that). Significantly, as up to 3 1621 encodings are available per group, it is possible to transmit some 1622 video captures simultaneously that are not in the same entry in 1623 the capture scene. For example VC1 and VC3 at the same time. 1625 It is also possible to transmit multiple capture encodings of a 1626 single video capture. For example VC0 can be encoded using ENC0 1627 and ENC1 at the same time, as long as the encoding parameters 1628 satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1629 4000000 bps and one at 2000000 bps. 
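As an informal illustration of the check described above (not a normative algorithm), the following sketch tests whether a set of requested capture encodings fits within the individual encodings and the group bandwidth of EG0. The limits are copied from the EG0 entries of Figure 2. The class and function names are hypothetical, the pixel rate is assumed to be width x height x frame rate, and it is assumed that each individual encoding is used for at most one capture encoding at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndividualEncoding:
    max_width: int
    max_height: int
    max_frame_rate: int
    max_pps: int        # maximum pixels per second
    max_bandwidth: int  # bits per second


@dataclass(frozen=True)
class CaptureEncoding:
    capture_id: str
    encode_id: str
    width: int
    height: int
    frame_rate: int
    bandwidth: int      # bits per second


# EG0 from Figure 2 of this example.
EG0_MAX_GROUP_BANDWIDTH = 6000000
EG0 = {
    "ENC0": IndividualEncoding(1920, 1088, 60, 124416000, 4000000),
    "ENC1": IndividualEncoding(1280, 720, 30, 27648000, 4000000),
    "ENC2": IndividualEncoding(960, 544, 30, 15552000, 4000000),
}


def fits_individual(req, enc):
    """Check one requested capture encoding against one individual encoding."""
    # Assumption: pixel rate is width * height * frame rate.
    pixels_per_second = req.width * req.height * req.frame_rate
    return (req.width <= enc.max_width
            and req.height <= enc.max_height
            and req.frame_rate <= enc.max_frame_rate
            and pixels_per_second <= enc.max_pps
            and req.bandwidth <= enc.max_bandwidth)


def fits_group(requests, encodings, max_group_bandwidth):
    """Check a set of requested capture encodings against one encoding group."""
    used = [r.encode_id for r in requests]
    # Assumption: an individual encoding is used at most once at a time.
    if len(used) != len(set(used)):
        return False
    if any(encode_id not in encodings for encode_id in used):
        return False
    if not all(fits_individual(r, encodings[r.encode_id]) for r in requests):
        return False
    return sum(r.bandwidth for r in requests) <= max_group_bandwidth
```

With these assumptions, encoding VC0 as 1920x1080 at 60 frames per second and 4000000 bps on ENC0 together with 1280x720 at 30 frames per second and 2000000 bps on ENC1 passes the check, whereas requesting 4000000 bps on both exceeds the group limit of 6000000 bps and is rejected.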
1631 encodeGroupID=EG0, maxGroupBandwidth=6000000
1632     encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1633         maxPps=124416000, maxBandwidth=4000000
1635     encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1636         maxPps=27648000, maxBandwidth=4000000
1637     encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
1638         maxPps=15552000, maxBandwidth=4000000
1639 encodeGroupID=EG1, maxGroupBandwidth=6000000
1640     encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1641         maxPps=124416000, maxBandwidth=4000000
1642     encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1643         maxPps=27648000, maxBandwidth=4000000
1644     encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
1645         maxPps=15552000, maxBandwidth=4000000
1646 encodeGroupID=EG2, maxGroupBandwidth=6000000
1647     encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
1648         maxPps=124416000, maxBandwidth=4000000
1649     encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
1650         maxPps=27648000, maxBandwidth=4000000
1651     encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
1652         maxPps=15552000, maxBandwidth=4000000
1654 Figure 2: Example Encoding Groups for Video
1656 For audio, there are five potential encodings available, so all 1657 five audio captures can be encoded at the same time.
1659 encodeGroupID=EG3, maxGroupBandwidth=320000
1660     encodeID=ENC9, maxBandwidth=64000
1661     encodeID=ENC10, maxBandwidth=64000
1662     encodeID=ENC11, maxBandwidth=64000
1663     encodeID=ENC12, maxBandwidth=64000
1664     encodeID=ENC13, maxBandwidth=64000
1666 Figure 3: Example Encoding Group for Audio
1668 Capture Scenes: 1670 The following table represents the capture scenes for this 1671 provider. Recall that a capture scene is composed of alternative 1672 capture scene entries covering the same spatial region. Capture 1673 Scene #1 is for the main people captures, and Capture Scene #2 is 1674 for presentation.
1676 Each row in the table is a separate Capture Scene Entry.
1678 +------------------+
1679 | Capture Scene #1 |
1680 +------------------+
1681 | VC0, VC1, VC2    |
1682 | VC3              |
1683 | VC4              |
1684 | VC5              |
1685 | AC0, AC1, AC2    |
1686 | AC3              |
1687 +------------------+
1689 +------------------+
1690 | Capture Scene #2 |
1691 +------------------+
1692 | VC6              |
1693 | AC4              |
1694 +------------------+
1696 Different capture scenes are distinct from each other and do not 1697 overlap. A consumer can choose an entry from each capture 1698 scene. In this case the three captures VC0, VC1, and VC2 are one 1699 way of representing the video from the endpoint. These three 1700 captures should appear adjacent to each other. 1701 Another way of representing the Capture Scene is 1702 with the capture VC3, which automatically shows the person who is 1703 talking. The same applies to the VC4 and VC5 alternatives. 1705 As in the video case, the different entries of audio in Capture 1706 Scene #1 represent the "same thing", in that one way to receive 1707 the audio is with the 3 audio captures (AC0, AC1, AC2), and 1708 another way is with the mixed AC3. The Media Consumer can choose 1709 an audio capture entry it is capable of receiving. 1711 The spatial ordering is conveyed by the media capture attributes 1712 Area of Capture and Point of Capture. 1714 A Media Consumer would likely want to choose a capture scene entry 1715 to receive based in part on how many streams it can simultaneously 1716 receive. A consumer that can receive three people streams would 1717 probably prefer to receive the first entry of Capture Scene #1 1718 (VC0, VC1, VC2) and not receive the other entries. A consumer 1719 that can receive only one people stream would probably choose one 1720 of the other entries. 1722 If the consumer can receive a presentation stream too, it would 1723 also choose to receive the only entry from Capture Scene #2 (VC6). 1725 12.1.2.
Encoding Group Example 1727 This is an example of an encoding group to illustrate how it can 1728 express dependencies between encodings.
1730 encodeGroupID=EG0, maxGroupBandwidth=6000000
1731     encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1732         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1733     encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1734         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1735     encodeID=AUDENC0, maxBandwidth=96000
1736     encodeID=AUDENC1, maxBandwidth=96000
1737     encodeID=AUDENC2, maxBandwidth=96000
1739 Here, the encoding group is EG0. Although the encoding group is 1740 capable of transmitting up to 6Mbit/s, no individual video 1741 encoding can exceed 4Mbit/s. 1743 This encoding group also allows up to 3 audio encodings, 1744 AUDENC<0-2>. It is not required that audio and video encodings reside 1745 within the same encoding group, but if they do, then the group's overall 1746 maxGroupBandwidth value is a limit on the sum of all audio and video 1747 encodings configured by the consumer. A system that does not wish 1748 or need to combine bandwidth limitations in this way should 1749 instead use separate encoding groups for audio and video so that 1750 the bandwidth limitations on audio and video do not interact. 1752 Audio and video can be expressed in separate encoding groups, as 1753 in this illustration.
1755 encodeGroupID=EG0, maxGroupBandwidth=6000000
1756     encodeID=VIDENC0, maxWidth=1920, maxHeight=1088,
1757         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1758     encodeID=VIDENC1, maxWidth=1920, maxHeight=1088,
1759         maxFrameRate=60, maxPps=62208000, maxBandwidth=4000000
1760 encodeGroupID=EG1, maxGroupBandwidth=500000
1761     encodeID=AUDENC0, maxBandwidth=96000
1762     encodeID=AUDENC1, maxBandwidth=96000
1763     encodeID=AUDENC2, maxBandwidth=96000
1765 12.1.3.
The MCU Case 1767 This section shows how an MCU might express its Capture Scenes, 1768 intending to offer different choices for consumers that can handle 1769 different numbers of streams. A single audio capture stream is 1770 provided for all single and multi-screen configurations that can 1771 be associated (e.g. lip-synced) with any combination of video 1772 captures at the consumer. 1774 +--------------------+-------------------------------------------- 1775 | Capture Scene #1 | note 1776 | 1777 +--------------------+-------------------------------------------- 1778 | VC0 | video capture for single screen consumer 1779 | 1780 | VC1, VC2 | video capture for 2 screen consumer 1781 | 1782 | VC3, VC4, VC5 | video capture for 3 screen consumer 1783 | 1784 | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer 1785 | 1786 | AC0 | audio capture representing all participants 1787 | 1788 +--------------------+-------------------------------------------- 1790 If / when a presentation stream becomes active within the 1791 conference the MCU might re-advertise the available media as: 1793 +------------------+--------------------------------------+ 1794 | Capture Scene #2 | note | 1795 +------------------+--------------------------------------+ 1796 | VC10 | video capture for presentation | 1797 | AC1 | presentation audio to accompany VC10 | 1798 +------------------+--------------------------------------+ 1800 12.2. Media Consumer Behavior 1802 This section gives an example of how a Media Consumer might behave 1803 when deciding how to request streams from the three screen 1804 endpoint described in the previous section. 1806 The receive side of a call needs to balance its requirements, 1807 based on number of screens and speakers, its decoding capabilities 1808 and available bandwidth, and the provider's capabilities in order 1809 to optimally configure the provider's streams. 
Typically it would 1810 want to receive and decode media from each Capture Scene 1811 advertised by the Provider. 1813 A sane, basic algorithm might be for the consumer to go through 1814 each Capture Scene in turn and find the collection of Video 1815 Captures that best matches the number of screens it has (this 1816 might include consideration of screens dedicated to presentation 1817 video display rather than "people" video) and then decide between 1818 alternative entries in the video Capture Scenes based either on 1819 hard-coded preferences or user choice. Once this choice has been 1820 made, the consumer would then decide how to configure the 1821 provider's encoding groups in order to make best use of the 1822 available network bandwidth and its own decoding capabilities. 1824 12.2.1. One screen Media Consumer 1826 VC3, VC4 and VC5 are each separate entries, not 1827 grouped together in a single entry, so the receiving device should 1828 choose one of them. The choice would come down to 1829 whether to see the greatest number of participants simultaneously 1830 at roughly equal precedence (VC5), a switched view of just the 1831 loudest region (VC3) or a switched view with PiPs (VC4). An 1832 endpoint device with a small amount of knowledge of these 1833 differences could offer a dynamic choice of these options, 1834 in-call, to the user. 1836 12.2.2. Two screen Media Consumer configuring the example 1838 Mixing systems with an even number of screens, "2n", and those 1839 with "2n+1" cameras (and vice versa) is likely to be the 1840 problematic case. In this instance, the behavior is likely to be 1841 determined by whether a "2 screen" system is really a "2 decoder" 1842 system, i.e., whether only one received stream can be displayed 1843 per screen or whether more than 2 streams can be received and 1844 spread across the available screen area.
Three possible 1845 behaviors for the 2 screen system, when it learns that the far 1846 end is "ideally" expressed via 3 capture streams, are: 1848 1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as 1849 per the 1 screen consumer case above) and either leave one 1850 screen blank or use it for presentation if / when a 1851 presentation becomes active. 1853 2. Receive 3 streams (VC0, VC1 and VC2) and display across 2 1854 screens, either with each capture being scaled to 2/3 of a 1855 screen and the center capture being split across 2 screens or, 1856 as would be necessary if there were large bezels on the 1857 screens, with each stream being scaled to 1/2 the screen width 1858 and height and there being a 4th "blank" panel. This 4th panel 1859 could potentially be used for any presentation that became 1860 active during the call. 1862 3. Receive 3 streams, decode all 3, and use control information 1863 indicating which was the most active to switch between showing 1864 the left and center streams (one per screen) and the center and 1865 right streams. 1867 For an endpoint capable of all 3 methods of working described 1868 above, again it might be appropriate to offer the user the choice 1869 of display mode. 1871 12.2.3. Three screen Media Consumer configuring the example 1873 This is the most straightforward case - the Media Consumer would 1874 look to identify a set of streams to receive that best matched its 1875 available screens, so VC0 plus VC1 plus VC2 should match 1876 optimally.
The spatial ordering would give sufficient information 1877 for the correct video capture to be shown on the correct screen, 1878 and the consumer would either need to divide a single encoding 1879 group's capability by 3 to determine what resolution and frame 1880 rate to configure the provider with or to configure the individual 1881 video captures' encoding groups with what makes most sense (taking 1882 into account the receive side decode capabilities, overall call 1883 bandwidth, the resolution of the screens plus any user preferences 1884 such as motion vs sharpness). 1886 13. Acknowledgements 1888 Allyn Romanow and Brian Baldino were authors of early versions. 1889 Mark Gorzyinski contributed much to the approach. We want to 1890 thank Stephen Botzko for helpful discussions on audio. 1892 14. IANA Considerations 1894 None. 1896 15. Security Considerations 1898 TBD 1900 16. Changes Since Last Version 1902 NOTE TO THE RFC-Editor: Please remove this section prior to 1903 publication as an RFC. 1905 Changes from 11 to 12: 1907 1. Ticket #44. Remove note questioning about requiring a 1908 Consumer to send a Configure after receiving Advertisement. 1910 2. Ticket #43. Remove ability for consumer to choose value of 1911 attribute for scene-switch-policy. 1913 3. Ticket #36. Remove computational complexity parameter, 1914 MaxGroupPps, from Encoding Groups. 1916 4. Reword the Abstract and parts of sections 1 and 4 (now 5) 1917 based on Mary's suggestions as discussed on the list. Move 1918 part of the Introduction into a new section Overview & 1919 Motivation. 1921 5. Add diagram of an Advertisement, in the Overview of the 1922 Framework/Model section. 1924 6. Change Intended Status to Standards Track. 1926 7. Clean up RFC2119 keyword language. 1928 Changes from 10 to 11: 1930 1. Add description attribute to Media Capture and Capture Scene 1931 Entry. 1933 2. 
Remove contradiction and change the note about open issue 1934 regarding always responding to Advertisement with a Configure 1935 message. 1937 3. Update example section, to cleanup formatting and make the 1938 media capture attributes and encoding parameters consistent 1939 with the rest of the document. 1941 Changes from 09 to 10: 1943 1. Several minor clarifications such as about SDP usage, Media 1944 Captures, Configure message. 1946 2. Simultaneous Set can be expressed in terms of Capture Scene 1947 and Capture Scene Entry. 1949 3. Removed Area of Scene attribute. 1951 4. Add attributes from draft-groves-clue-capture-attr-01. 1953 5. Move some of the Media Capture attribute descriptions back 1954 into this document, but try to leave detailed syntax to the 1955 data model. Remove the OUTSOURCE sections, which are already 1956 incorporated into the data model document. 1958 Changes from 08 to 09: 1960 1. Use "document" instead of "memo". 1962 2. Add basic call flow sequence diagram to introduction. 1964 3. Add definitions for Advertisement and Configure messages. 1966 4. Add definitions for Capture and Provider. 1968 5. Update definition of Capture Scene. 1970 6. Update definition of Individual Encoding. 1972 7. Shorten definition of Media Capture and add key points in the 1973 Media Captures section. 1975 8. Reword a bit about capture scenes in overview. 1977 9. Reword about labeling Media Captures. 1979 10. Remove the Consumer Capability message. 1981 11. New example section heading for media provider behavior 1983 12. Clarifications in the Capture Scene section. 1985 13. Clarifications in the Simultaneous Transmission Set section. 1987 14. Capitalize defined terms. 1989 15. Move call flow example from introduction to overview section 1991 16. General editorial cleanup 1993 17. Add some editors' notes requesting input on issues 1995 18. Summarize some sections, and propose details be outsourced 1996 to other documents. 1998 Changes from 06 to 07: 2000 1. 
Ticket #9. Rename Axis of Capture Point attribute to Point 2001 on Line of Capture. Clarify the description of this 2002 attribute. 2004 2. Ticket #17. Add "capture encoding" definition. Use this new 2005 term throughout document as appropriate, replacing some usage 2006 of the terms "stream" and "encoding". 2008 3. Ticket #18. Add Max Capture Encodings media capture 2009 attribute. 2011 4. Add clarification that different capture scene entries are 2012 not necessarily mutually exclusive. 2014 Changes from 05 to 06: 2016 1. Capture scene description attribute is a list of text strings, 2017 each in a different language, rather than just a single string. 2019 2. Add new Axis of Capture Point attribute. 2021 3. Remove appendices A.1 through A.6. 2023 4. Clarify that the provider must use the same coordinate system 2024 with same scale and origin for all coordinates within the same 2025 capture scene. 2027 Changes from 04 to 05: 2029 1. Clarify limitations of "composed" attribute. 2031 2. Add new section "capture scene entry attributes" and add the 2032 attribute "scene-switch-policy". 2034 3. Add capture scene description attribute and description 2035 language attribute. 2037 4. Editorial changes to examples section for consistency with the 2038 rest of the document. 2040 Changes from 03 to 04: 2042 1. Remove sentence from overview - "This constitutes a significant 2043 change ..." 2045 2. Clarify a consumer can choose a subset of captures from a 2046 capture scene entry or a simultaneous set (in section "capture 2047 scene" and "consumer's choice..."). 2049 3. Reword first paragraph of Media Capture Attributes section. 2051 4. Clarify a stereo audio capture is different from two mono audio 2052 captures (description of audio channel format attribute). 2054 5. Clarify what it means when coordinate information is not 2055 specified for area of capture, point of capture, area of scene. 2057 6. 
Change the term "producer" to "provider" to be consistent (it 2058 was just in two places). 2060 7. Change name of "purpose" attribute to "content" and refer to 2061 RFC4796 for values. 2063 8. Clarify simultaneous sets are part of a provider advertisement, 2064 and apply across all capture scenes in the advertisement. 2066 9. Remove sentence about lip-sync between all media captures in a 2067 capture scene. 2069 10. Combine the concepts of "capture scene" and "capture set" 2070 into a single concept, using the term "capture scene" to 2071 replace the previous term "capture set", and eliminating the 2072 original separate capture scene concept. 2074 Informative References 2075 Edt. Note: Decide which of these really are Normative References. 2077 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2078 Requirement Levels", BCP 14, RFC 2119, March 1997. 2080 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., 2081 Johnston, 2082 A., Peterson, J., Sparks, R., Handley, M., and E. 2083 Schooler, "SIP: Session Initiation Protocol", RFC 3261, 2084 June 2002. 2086 [RFC3264] Rosenberg, J., Schulzrinne, H., "An Offer/Answer Model 2087 with the Session Description Protocol (SDP)", RFC 3264, 2088 June 2002. 2090 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 2091 Jacobson, "RTP: A Transport Protocol for Real-Time 2092 Applications", STD 64, RFC 3550, July 2003. 2094 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the 2095 Session Initiation Protocol (SIP)", RFC 4353, 2096 February 2006. 2098 [RFC4579] Johnston, A., Levin, O., "SIP Call Control - 2099 Conferencing for User Agents", RFC 4579, August 2006 2101 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 2102 5117, 2103 January 2008. 2105 17. 
Authors' Addresses 2107 Mark Duckworth (editor) 2108 Polycom 2109 Andover, MA 01810 2110 USA 2112 Email: mark.duckworth@polycom.com 2114 Andrew Pepperell 2115 Acano 2116 Uxbridge, England 2117 UK 2119 Email: apeppere@gmail.com 2121 Stephan Wenger 2122 Vidyo, Inc. 2123 433 Hackensack Ave. 2124 Hackensack, N.J. 07601 2125 USA 2127 Email: stewe@stewe.org