CLUE WG                                              M. Duckworth, Ed.
Internet Draft                                                  Polycom
Intended status: Informational                             A. Pepperell
Expires: June, 2013                                         Silverflare
                                                               S. Wenger
                                                                   Vidyo
                                                       December 24, 2012

             Framework for Telepresence Multi-Streams
                  draft-ietf-clue-framework-08.txt

Abstract

   This memo offers a framework for a protocol that enables devices
   in a telepresence conference to interoperate by specifying the
   relationships between multiple media streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on June 24, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction...................................................3
   2.
Terminology....................................................6 56 3. Definitions....................................................6 57 4. Overview of the Framework/Model................................9 58 5. Spatial Relationships.........................................11 59 6. Media Captures and Capture Scenes.............................12 60 6.1. Media Captures...........................................12 61 6.1.1. Media Capture Attributes............................12 62 6.2. Capture Scene............................................15 63 6.2.1. Capture scene attributes............................17 64 6.2.2. Capture scene entry attributes......................18 65 6.3. Simultaneous Transmission Set Constraints................19 66 7. Encodings.....................................................20 67 7.1. Individual Encodings.....................................21 68 7.2. Encoding Group...........................................22 69 8. Associating Media Captures with Encoding Groups...............24 70 9. Consumer's Choice of Streams to Receive from the Provider.....25 71 9.1. Local preference.........................................26 72 9.2. Physical simultaneity restrictions.......................26 73 9.3. Encoding and encoding group limits.......................26 74 9.4. Message Flow.............................................27 75 10. Extensibility................................................28 76 11. Examples - Using the Framework...............................28 77 11.1. Three screen endpoint media provider....................28 78 11.2. Encoding Group Example..................................35 79 11.3. The MCU Case............................................36 80 11.4. Media Consumer Behavior.................................37 81 11.4.1. One screen consumer................................37 82 11.4.2. Two screen consumer configuring the example........38 83 11.4.3. Three screen consumer configuring the example......38 84 12. Acknowledgements.............................................39 85 13. IANA Considerations..........................................39 86 14. Security Considerations......................................39 87 15. Changes Since Last Version...................................39 88 16. Authors' Addresses...........................................42 90 1. Introduction 92 Current telepresence systems, though based on open standards such 93 as RTP [RFC3550] and SIP [RFC3261], cannot easily interoperate 94 with each other. A major factor limiting the interoperability of 95 telepresence systems is the lack of a standardized way to describe 96 and negotiate the use of the multiple streams of audio and video 97 comprising the media flows. This draft provides a framework for a 98 protocol to enable interoperability by handling multiple streams 99 in a standardized way. It is intended to support the use cases 100 described in draft-ietf-clue-telepresence-use-cases-02 and to meet 101 the requirements in draft-ietf-clue-telepresence-requirements-01. 103 Conceptually distinguished are Media Providers and Media 104 Consumers. A Media Provider provides Media in the form of RTP 105 packets, a Media Consumer consumes those RTP packets. Media 106 Providers and Media Consumers can reside in Endpoints or in 107 middleboxes such as Multipoint Control Units (MCUs). A Media 108 Provider in an Endpoint is usually associated with the generation 109 of media for Media Captures; these Media Captures are typically 110 sourced from cameras, microphones, and the like. 
Similarly, the Media Consumer in an Endpoint is usually associated with Renderers, such as screens and loudspeakers. In middleboxes, Media Providers and Consumers can have the form of outputs and inputs, respectively, of RTP mixers, RTP translators, and similar devices. Typically, telepresence devices such as Endpoints and middleboxes would perform as both Media Providers and Media Consumers, the former being concerned with those devices' transmitted media and the latter with those devices' received media. In a few circumstances, a CLUE Endpoint or middlebox may include only Consumer or Provider functionality, such as recorder-type Consumers or webcam-type Providers.

One initial motivation for this memo and its companion documents has been that Endpoints according to this memo can, and usually do, have multiple Media Captures and Media Renderers. While previous system designs can deal with such a situation, what was missing was a mechanism that can associate the Media Captures with each other in space and time. Further, due to the potentially large number of RTP flows required for a Multimedia Conference involving potentially many Endpoints, each of which can have many Media Captures and Media Renderers, a sensible system design is to multiplex multiple RTP media flows onto the same transport address, so as to avoid using the port number as a multiplexing point and the shortcomings associated with that, such as NAT/firewall traversal issues.

While the actual mapping of those RTP flows to the header fields of the RTP packets is not the subject of this specification, the large number of possible permutations of sensible options a Media Provider may make available to a Media Consumer makes it desirable to have a mechanism that narrows down the number of possible options a SIP offer-answer exchange has to consider. Such information is made available using protocol mechanisms specified in this memo and companion documents, although it should be stressed that its use in an implementation is optional. Also, there are aspects of the control of both Endpoints and middleboxes/MCUs that dynamically change during the progress of a call, such as audio-level based screen switching, layout changes, and so on, which need to be conveyed. Note that these control aspects are complementary to those specified in traditional SIP-based conference management such as BFCP. Finally, all this information needs to be conveyed, and the notion of support for it needs to be established. This is done by the negotiation of a "CLUE channel", a data channel negotiated early during the initiation of a call. An Endpoint or MCU that rejects the establishment of this data channel, by definition, does not support CLUE-based mechanisms, whereas an Endpoint or MCU that accepts it is required to use it to the extent specified in this memo and its companion documents.

A very brief outline of the call flow used by a simple system in compliance with this memo can be described as follows.

An initial offer/answer exchange establishes a CLUE channel between two Endpoints. With the establishment of that channel, the endpoints have consented to use the CLUE protocol mechanisms and have to adhere to them.
168 Over this CLUE channel, the Provider in each Endpoint conveys its 169 characteristics and capabilities as specified herein (which will 170 typically not be sufficient to set up all media). The Consumer in 171 the Endpoint receives the information provided by the Provider, 172 and can use it for two purposes. First, it can, but is not 173 necessarily required to, use the information provided to tailor 174 the SDP it is going to send during the following SIP offer/answer 175 exchange, and its reaction to SDP it receives in that step. It is 176 often a sensible implementation choice to do so, as the 177 representation of the media information conveyed over the CLUE 178 channel can dramatically cut down on the size of SDP messages used 179 in the O/A exchange that follows. Second, it takes note of the 180 spatial relationship associated with the Media that are described. 182 It is often sensible to take that spatial relationship into 183 account when tailoring the SDP. 185 This CLUE exchange is followed by an SDP offer answer exchange 186 that not only establishes those aspects of the media that have not 187 been "negotiated" over CLUE, but has also the side effect of 188 setting up the media transmission itself, involving potentially 189 security exchanges, ICE, and whatnot. This step is plain vanilla 190 SIP, with the exception that the SDP used herein, in most cases 191 can (but not necessarily must) be considerably smaller than the 192 SDP a system would typically need to exchange if there were no 193 pre-established knowledge about the Provider and Consumer 194 characteristics. 196 During the lifetime of a call, further exchanges can occur over 197 the CLUE channel. In some cases, those further exchanges can be 198 dealt with by Provider or Consumer without any other protocol 199 activity. For example, voice-activated screen switching, signaled 200 over the CLUE channel, ought not to lead to heavy-handed 201 mechanisms like SIP re-invites. However, in other cases, after 202 the CLUE negotiation an additional offer/answer exchange may 203 become necessary. For example, if both sides decide to upgrade 204 the call from a single screen to a multi-screen call and more 205 bandwidth is required for the additional video channels, that 206 could require a new O/A exchange. 208 Numerous optimizations may be possible, and are the implementer's 209 choice. For example, it may be sensible to establish one or more 210 initial media channels during the initial offer/answer exchange, 211 which would allow, for example, for a fast startup of audio. 212 Depending on the system design, it may be possible to re-use this 213 established channel using only CLUE mechanisms, thereby avoiding 214 further offer/answer exchanges. 216 One aspect of the protocol outlined herein and specified in 217 normative detail in companion documents is that it makes available 218 information regarding the Provider's capabilities to deliver 219 Media, and attributes related to that media such as their spatial 220 relationship, to the Media Consumer. The operation of the 221 Renderer inside the Consumer is unspecified in that it can choose 222 to ignore some information provided by the Provider, and/or not 223 render media streams available from the Provider (although it has 224 to follow the CLUE protocol and, therefore, has to "accept" the 225 Provider's information). 
All CLUE protocol mechanisms are optional in the Consumer in the sense that, while the Consumer must be able to receive (and, potentially, gracefully acknowledge) CLUE messages, it is free to ignore the information provided therein. Obviously, ignoring all of that information is not a particularly sensible design choice.

Legacy devices are defined herein as those Endpoints and MCUs that do not support the setup and use of the CLUE channel. The notion of a device being a legacy device is established during the initial offer/answer exchange, in which the legacy device will not understand the offer for the CLUE channel and will, therefore, reject it. This indicates to the CLUE-implementing Endpoint or MCU that the other side of the communication is not compliant with CLUE, and that it should fall back to whatever mechanism was used before the introduction of CLUE.

As for the media, Provider and Consumer have an end-to-end communication relationship with respect to (RTP-transported) media, and the mechanisms described herein and in companion documents do not change the aspects of setting up those RTP flows and sessions. However, it should be noted that forms of RTP multiplexing of multiple RTP flows onto the same transport address are being developed concurrently with the CLUE suite of specifications, and it is widely expected that most, if not all, Endpoints or MCUs supporting CLUE will also support those mechanisms. Some design choices made in this memo reflect this coincidence in specification development timing.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

3. Definitions

The terms defined below are used throughout this memo and companion documents, and they are normative. In order to easily identify the use of a defined term, those terms are capitalized.

Audio Capture: Media Capture for audio. Denoted as ACn.

Camera-Left and Right: For media captures, camera-left and camera-right are from the point of view of a person observing the rendered media. They are the opposite of stage-left and stage-right.

Capture Device: A device that converts audio and video input into an electrical signal, in most cases to be fed into a media encoder. Cameras and microphones are examples of capture devices.

Capture Encoding: A specific encoding of a media capture, to be sent by a media provider to a media consumer via RTP.

Capture Scene: a structure representing the scene that is captured by a collection of capture devices. A capture scene includes attributes and one or more capture scene entries, with each entry including one or more media captures.

Capture Scene Entry: a list of media captures of the same media type that together form one way to represent the capture scene.

Conference: used as defined in [RFC4353], A Framework for Conferencing within the Session Initiation Protocol (SIP).

Individual Encoding: A variable with a set of attributes that describes the maximum values of a single audio or video capture encoding. The attributes include maximum bandwidth and, for video, maximum macroblocks per second (for H.264), maximum width, maximum height, and maximum frame rate.
300 Encoding Group: A set of encoding parameters representing a total 301 media encoding capability to be sub-divided across potentially 302 multiple Individual Encodings. 304 Endpoint: The logical point of final termination through 305 receiving, decoding and rendering, and/or initiation through 306 capturing, encoding, and sending of media streams. An endpoint 307 consists of one or more physical devices which source and sink 308 media streams, and exactly one [RFC4353] Participant (which, in 309 turn, includes exactly one SIP User Agent). In contrast to an 310 endpoint, an MCU may also send and receive media streams, but it 311 is not the initiator nor the final terminator in the sense that 312 Media is Captured or Rendered. Endpoints can be anything from 313 multiscreen/multicamera rooms to handheld devices. 315 Front: the portion of the room closest to the cameras. In going 316 towards back you move away from the cameras. 318 MCU: Multipoint Control Unit (MCU) - a device that connects two or 319 more endpoints together into one single multimedia conference 320 [RFC5117]. An MCU includes an [RFC4353] Mixer. [Edt. RFC4353 is 321 tardy in requiring that media from the mixer be sent to EACH 322 participant. I think we have practical use cases where this is 323 not the case. But the bug (if it is one) is in 4353 and not 324 herein.] 326 Media: Any data that, after suitable encoding, can be conveyed 327 over RTP, including audio, video or timed text. 329 Media Capture: a source of Media, such as from one or more Capture 330 Devices. A Media Capture (MC) may be the source of one or more 331 capture encodings. A Media Capture may also be constructed from 332 other Media streams. A middle box can express Media Captures that 333 it constructs from Media streams it receives. 335 Media Consumer: an Endpoint or middle box that receives media 336 streams 338 Media Provider: an Endpoint or middle box that sends Media streams 340 Model: a set of assumptions a telepresence system of a given 341 vendor adheres to and expects the remote telepresence system(s) 342 also to adhere to. 344 Plane of Interest: The spatial plane containing the most relevant 345 subject matter. 347 Render: the process of generating a representation from a media, 348 such as displayed motion video or sound emitted from loudspeakers. 350 Simultaneous Transmission Set: a set of media captures that can be 351 transmitted simultaneously from a Media Provider. 353 Spatial Relation: The arrangement in space of two objects, in 354 contrast to relation in time or other relationships. See also 355 Camera-Left and Right. 357 Stage-Left and Right: For media captures, stage-left and stage- 358 right are the opposite of camera-left and camera-right. For the 359 case of a person facing (and captured by) a camera, stage-left and 360 stage-right are from the point of view of that person. 362 Stream: a capture encoding sent from a media provider to a media 363 consumer via RTP [RFC3550]. 365 Stream Characteristics: the media stream attributes commonly used 366 in non-CLUE SIP/SDP environments (such as: media codec, bit rate, 367 resolution, profile/level etc.) as well as CLUE specific 368 attributes, such as the ID of a capture or a spatial location. 370 Telepresence: an environment that gives non co-located users or 371 user groups a feeling of (co-located) presence - the feeling that 372 a Local user is in the same room with other Local users and the 373 Remote parties. 
The inclusion of Remote parties is achieved 374 through multimedia communication including at least audio and 375 video signals of high fidelity. 377 Video Capture: Media Capture for video. Denoted as VCn. 379 Video composite: A single image that is formed from combining 380 visual elements from separate sources. 382 4. Overview of the Framework/Model 384 The CLUE framework specifies how multiple media streams are to be 385 handled in a telepresence conference. 387 The main goals include: 389 o Interoperability 391 o Extensibility 393 o Flexibility 395 Interoperability is achieved by the media provider describing the 396 relationships between media streams in constructs that are 397 understood by the consumer, who can then render the media. 398 Extensibility is achieved through abstractions and the generality 399 of the model, making it easy to add new parameters. Flexibility 400 is achieved largely by having the consumer choose what content and 401 format it wants to receive from what the provider is capable of 402 sending. 404 A transmitting endpoint or MCU describes specific aspects of the 405 content of the media and the formatting of the media streams it 406 can send (advertisement); and the receiving end responds to the 407 provider by specifying which content and media streams it wants to 408 receive (configuration). The provider then transmits the asked 409 for content in the specified streams. 411 This advertisement and configuration occurs at call initiation but 412 may also happen at any time throughout the conference, whenever 413 there is a change in what the consumer wants or the provider can 414 send. 416 An endpoint or MCU typically acts as both provider and consumer at 417 the same time, sending advertisements and sending configurations 418 in response to receiving advertisements. (It is possible to be 419 just one or the other.) 421 The data model is based around two main concepts: a capture and an 422 encoding. A media capture (MC), such as audio or video, describes 423 the content a provider can send. Media captures are described in 424 terms of CLUE-defined attributes, such as spatial relationships 425 and purpose of the capture. Providers tell consumers which media 426 captures they can provide, described in terms of the media capture 427 attributes. 429 A provider organizes its media captures that represent the same 430 scene into capture scenes. A consumer chooses which media 431 captures it wants to receive according to the capture scenes sent 432 by the provider. 434 In addition, the provider sends the consumer a description of the 435 individual encodings it can send in terms of the media attributes 436 of the encodings, in particular, well-known audio and video 437 parameters such as bandwidth, frame rate, macroblocks per second. 439 The provider also specifies constraints on its ability to provide 440 media, and the consumer must take these into account in choosing 441 the content and capture encodings it wants. Some constraints are 442 due to the physical limitations of devices - for example, a camera 443 may not be able to provide zoom and non-zoom views simultaneously. 444 Other constraints are system based constraints, such as maximum 445 bandwidth and maximum macroblocks/second. 447 The following sections discuss these constructs and processes in 448 detail, followed by use cases showing how the framework 449 specification can be used. 451 5. 
Spatial Relationships 453 In order for a consumer to perform a proper rendering, it is often 454 necessary to provide spatial information about the streams it is 455 receiving. CLUE defines a coordinate system that allows media 456 providers to describe the spatial relationships of their media 457 captures to enable proper scaling and spatial rendering of their 458 streams. The coordinate system is based on a few principles: 460 o Simple systems which do not have multiple Media Captures to 461 associate spatially need not use the coordinate model. 463 o Coordinates can either be in real, physical units 464 (millimeters), have an unknown scale or have no physical scale. 465 Systems which know their physical dimensions should always 466 provide those real-world measurements. Systems which don't 467 know specific physical dimensions but still know relative 468 distances should use 'unknown scale'. 'No scale' is intended 469 to be used where Media Captures from different devices (with 470 potentially different scales) will be forwarded alongside one 471 another (e.g. in the case of a middle box). 473 * "millimeters" means the scale is in millimeters 475 * "Unknown" means the scale is not necessarily millimeters, 476 but the scale is the same for every capture in the capture 477 scene. 479 * "No Scale" means the scale could be different for each 480 capture- an MCU provider that advertises two adjacent 481 captures and picks sources (which can change quickly) from 482 different endpoints might use this value; the scale could be 483 different and changing for each capture. But the areas of 484 capture still represent a spatial relation between captures. 486 o The coordinate system is Cartesian X, Y, Z with the origin at a 487 spot of the provider's choosing. The provider must use the 488 same coordinate system with same scale and origin for all 489 coordinates within the same capture scene. 491 The direction of increasing coordinate values is: 492 X increases from camera left to camera right 493 Y increases from front to back 494 Z increases from low to high 496 6. Media Captures and Capture Scenes 498 This section describes how media providers can describe the 499 content of media to consumers. 501 6.1. Media Captures 503 Media captures are the fundamental representations of streams that 504 a device can transmit. What a Media Capture actually represents 505 is flexible: 507 o It can represent the immediate output of a physical source 508 (e.g. camera, microphone) or 'synthetic' source (e.g. laptop 509 computer, DVD player). 511 o It can represent the output of an audio mixer or video composer 513 o It can represent a concept such as 'the loudest speaker' 515 o It can represent a conceptual position such as 'the leftmost 516 stream' 518 To distinguish between multiple instances, video and audio 519 captures are numbered such as: VC1, VC2 and AC1, AC2. VC1 and VC2 520 refer to two different video captures and AC1 and AC2 refer to two 521 different audio captures. 523 Each Media Capture can be associated with attributes to describe 524 what it represents. 526 6.1.1. Media Capture Attributes 528 Media Capture Attributes describe static information about the 529 captures. A provider uses the media capture attributes to 530 describe the media captures to the consumer. The consumer will 531 select the captures it wants to receive. Attributes are defined 532 by a variable and its value. 
The currently defined attributes and 533 their values are: 535 Content: {slides, speaker, sl, main, alt} 536 A field with enumerated values which describes the role of the 537 media capture and can be applied to any media type. The 538 enumerated values are defined by [RFC4796]. The values for this 539 attribute are the same as the mediacnt values for the content 540 attribute in [RFC4796]. This attribute can have multiple values, 541 for example content={main, speaker}. 543 Composed: {true, false} 545 A field with a Boolean value which indicates whether or not the 546 Media Capture is a mix (audio) or composition (video) of streams. 548 This attribute is useful for a media consumer to avoid nesting a 549 composed video capture into another composed capture or rendering. 550 This attribute is not intended to describe the layout a media 551 provider uses when composing video streams. 553 Audio Channel Format: {mono, stereo} A field with enumerated 554 values which describes the method of encoding used for audio. 556 A value of 'mono' means the Audio Capture has one channel. 558 A value of 'stereo' means the Audio Capture has two audio 559 channels, left and right. 561 This attribute applies only to Audio Captures. A single stereo 562 capture is different from two mono captures that have a left-right 563 spatial relationship. A stereo capture maps to a single RTP 564 stream, while each mono audio capture maps to a separate RTP 565 stream. 567 Switched: {true, false} 569 A field with a Boolean value which indicates whether or not the 570 Media Capture represents the (dynamic) most appropriate subset of 571 a 'whole'. What is 'most appropriate' is up to the provider and 572 could be the active speaker, a lecturer or a VIP. 574 Point of Capture: {(X, Y, Z)} 576 A field with a single Cartesian (X, Y, Z) point value which 577 describes the spatial location, virtual or physical, of the 578 capturing device (such as camera). 580 When the Point of Capture attribute is specified, it must include 581 X, Y and Z coordinates. If the point of capture is not specified, 582 it means the consumer should not assume anything about the spatial 583 location of the capturing device. Even if the provider specifies 584 an area of capture attribute, it does not need to specify the 585 point of capture. 587 Point on Line of Capture: {(X,Y,Z)} 589 A field with a single Cartesian (X, Y, Z) point value (virtual or 590 physical) which describes a position in space of a second point on 591 the axis of the capturing device; the first point being the Point 592 of Capture (see above). This point MUST lie between the Point of 593 Capture and the Area of Capture. 595 The Point on Line of Capture MUST be ignored if the Point of 596 Capture is not present for this capture device. When the Point on 597 Line of Capture attribute is specified, it must include X, Y and Z 598 coordinates. These coordinates MUST NOT be identical to the Point 599 of Capture coordinates. If the Point on Line of Capture is not 600 specified, no assumptions are made about the axis of the capturing 601 device. 603 Area of Capture: 605 {bottom left(X1, Y1, Z1), bottom right(X2, Y2, Z2), top left(X3, 606 Y3, Z3), top right(X4, Y4, Z4)} 608 A field with a set of four (X, Y, Z) points as a value which 609 describe the spatial location of what is being "captured". By 610 comparing the Area of Capture for different Media Captures within 611 the same capture scene a consumer can determine the spatial 612 relationships between them and render them correctly. 
The four points should be co-planar. The four points form a quadrilateral, not necessarily a rectangle.

The quadrilateral described by the four (X, Y, Z) points defines the plane of interest for the particular media capture.

If the area of capture attribute is specified, it must include X, Y and Z coordinates for all four points. If the area of capture is not specified, it means the media capture is not spatially related to any other media capture (but this can change in a subsequent provider advertisement).

For a switched capture that switches between different sections within a larger area, the area of capture should use coordinates for the larger potential area.

EncodingGroup: {encodeGroupID}

A field with a value equal to the encodeGroupID of the encoding group associated with the media capture.

Max Capture Encodings: {unsigned integer}

An optional attribute indicating the maximum number of capture encodings that can be simultaneously active for the media capture. If absent, this parameter defaults to 1. The minimum value for this attribute is 1. The number of simultaneous capture encodings is also limited by the restrictions of the encoding group for the media capture.

6.2. Capture Scene

In order for a provider's individual media captures to be used effectively by a consumer, the provider organizes the media captures into capture scenes, with the structure and contents of these capture scenes being sent from the provider to the consumer.

A capture scene is a structure representing the scene that is captured by a collection of capture devices. A capture scene includes one or more capture scene entries, with each entry including one or more media captures. A capture scene represents, for example, the video image of a group of people seated next to each other, along with the sound of their voices, which could be represented by some number of VCs and ACs in the capture scene entries. A middle box may also express capture scenes that it constructs from media streams it receives.

A provider may advertise multiple capture scenes or just a single capture scene. A media provider might typically use one capture scene for main participant media and another capture scene for a computer-generated presentation. A capture scene may include more than one type of media. For example, a capture scene can include several capture scene entries for video captures, and several capture scene entries for audio captures.

A provider can express spatial relationships between media captures that are included in the same capture scene, but there is no spatial relationship between media captures that are in different capture scenes.

A media provider arranges media captures in a capture scene to help the media consumer choose which captures it wants. The capture scene entries in a capture scene are different alternatives the provider is suggesting for representing the capture scene. The media consumer can choose to receive all media captures from one capture scene entry for each media type (e.g. audio and video), or it can pick and choose media captures regardless of how the provider arranges them in capture scene entries. Different capture scene entries of the same media type are not necessarily mutually exclusive alternatives.
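As a non-normative illustration of this structure, the following sketch (Python, with illustrative names only; the actual advertisement syntax is defined in companion documents) models a capture scene as a set of entries, each holding media captures with a few of the attributes defined above:

   from dataclasses import dataclass, field
   from typing import List, Optional, Tuple

   Point = Tuple[float, float, float]   # an (X, Y, Z) coordinate

   @dataclass
   class MediaCapture:
       capture_id: str                  # e.g. "VC0" or "AC0"
       media_type: str                  # "video" or "audio"
       encoding_group: str              # value of the EncodingGroup attribute
       content: List[str] = field(default_factory=lambda: ["main"])
       switched: bool = False
       composed: bool = False
       point_of_capture: Optional[Point] = None
       area_of_capture: Optional[List[Point]] = None   # four corner points

   @dataclass
   class CaptureSceneEntry:
       media_type: str                  # all captures in an entry share one type
       captures: List[MediaCapture]

   @dataclass
   class CaptureScene:
       description: Optional[str] = None
       scale: str = "unknown"           # "millimeters", "unknown" or "no scale"
       entries: List[CaptureSceneEntry] = field(default_factory=list)

   # Example: one scene with a three-camera video entry and a main audio entry.
   scene = CaptureScene(
       description="Main room",
       scale="millimeters",
       entries=[
           CaptureSceneEntry("video", [MediaCapture("VC0", "video", "EG0"),
                                       MediaCapture("VC1", "video", "EG1"),
                                       MediaCapture("VC2", "video", "EG2")]),
           CaptureSceneEntry("audio", [MediaCapture("AC0", "audio", "EG3")]),
       ])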
685 Media captures within the same capture scene entry must be of the 686 same media type - it is not possible to mix audio and video 687 captures in the same capture scene entry, for instance. The 688 provider must be capable of encoding and sending all media 689 captures in a single entry simultaneously. A consumer may decide 690 to receive all the media captures in a single capture scene entry, 691 but a consumer could also decide to receive just a subset of those 692 captures. A consumer can also decide to receive media captures 693 from different capture scene entries. 695 When a provider advertises a capture scene with multiple entries, 696 it is essentially signaling that there are multiple 697 representations of the same scene available. In some cases, these 698 multiple representations would typically be used simultaneously 699 (for instance a "video entry" and an "audio entry"). In some 700 cases the entries would conceptually be alternatives (for instance 701 an entry consisting of 3 video captures versus an entry consisting 702 of just a single video capture). In this latter example, the 703 provider would in the simple case end up providing to the consumer 704 the entry containing the number of video captures that most 705 closely matched the media consumer's number of display devices. 707 The following is an example of 4 potential capture scene entries 708 for an endpoint-style media provider: 710 1. (VC0, VC1, VC2) - left, center and right camera video captures 712 2. (VC3) - video capture associated with loudest room segment 714 3. (VC4) - video capture zoomed out view of all people in the 715 room 716 4. (AC0) - main audio 718 The first entry in this capture scene example is a list of video 719 captures with a spatial relationship to each other. Determination 720 of the order of these captures (VC0, VC1 and VC2) for rendering 721 purposes is accomplished through use of their Area of Capture 722 attributes. The second entry (VC3) and the third entry (VC4) are 723 additional alternatives of how to capture the same room in 724 different ways. The inclusion of the audio capture in the same 725 capture scene indicates that AC0 is associated with those video 726 captures, meaning it comes from the same scene. The audio should 727 be rendered in conjunction with any rendered video captures from 728 the same capture scene. 730 6.2.1. Capture scene attributes 732 Attributes can be applied to capture scenes as well as to 733 individual media captures. Attributes specified at this level 734 apply to all constituent media captures. 736 Description attribute - list of {, } 739 The optional description attribute is a list of human readable 740 text strings which describe the capture scene. If there is more 741 than one string in the list, then each string in the list should 742 contain the same description, but in a different language. A 743 provider that advertises multiple capture scenes can provide 744 descriptions for each of them. This attribute can contain text in 745 any number of languages. 747 The language tag identifies the language of the corresponding 748 description text. The possible values for a language tag are the 749 values of the 'Subtag' column for the "Type: language" entries in 750 the "Language Subtag Registry" at [IANA-Lan] originally defined in 751 [RFC5646]. A particular language tag value MUST NOT be used more 752 than once in the description attribute list. 
Area of Scene attribute

The area of scene attribute for a capture scene has the same format as the area of capture attribute for a media capture. The area of scene is for the entire scene, which is captured by the one or more media captures in the capture scene entries. If the provider does not specify the area of scene, but does specify areas of capture, then the consumer may assume the area of scene is greater than or equal to the outer extents of the individual areas of capture.

Scale attribute

An optional attribute indicating whether the numbers used for area of scene, area of capture and point of capture are in terms of millimeters, an unknown scale factor, or no scale at all, as described in Section 5. If any media captures have an area of capture attribute or point of capture attribute, then this scale attribute must also be defined. The possible values for this attribute are:

"millimeters"

"unknown"

"no scale"

6.2.2. Capture scene entry attributes

Attributes can be applied to capture scene entries. Attributes specified at this level apply to the capture scene entry as a whole.

Scene-switch-policy: {site-switch, segment-switch}

A media provider uses this scene-switch-policy attribute to indicate its support for different switching policies. In the provider's advertisement, this attribute can have multiple values, which means the provider supports each of the indicated policies. The consumer, when it requests media captures from this capture scene entry, should also include this attribute, but with only the single value (from among the values indicated by the provider) indicating the consumer's choice of which policy it wants the provider to use. If the provider does not support any of these policies, it should omit this attribute.

The "site-switch" policy means all captures are switched at the same time to keep captures from the same endpoint site together. Let's say the speaker is at site A and everyone else is at a "remote" site.

When the room at site A is shown, all the camera images from site A are forwarded to the remote sites. Therefore, at each receiving remote site, all the screens display camera images from site A. This can be used to preserve full-size image display, and also provide full visual context of the displayed far end, site A. In site switching, there is a fixed relation between the cameras in each room and the displays in remote rooms. The room or participants being shown is switched from time to time based on who is speaking or by manual control.

The "segment-switch" policy means different captures can switch at different times, and can be coming from different endpoints. Still using site A as the site where the speaker is, and "remote" to refer to all the other sites, in segment switching, rather than sending all the images from site A, only the image containing the speaker at site A is shown. The camera images of the current speaker and previous speakers (if any) are forwarded to the other sites in the conference.

Therefore, the screens at each site are usually displaying images from different remote sites - the current speaker at site A and the previous ones. This strategy can be used to preserve full-size image display, and also capture the non-verbal communication between the speakers.
In segment switching, the display depends 828 on the activity in the remote rooms - generally, but not 829 necessarily based on audio / speech detection. 831 6.3. Simultaneous Transmission Set Constraints 833 The provider may have constraints or limitations on its ability to 834 send media captures. One type is caused by the physical 835 limitations of capture mechanisms; these constraints are 836 represented by a simultaneous transmission set. The second type 837 of limitation reflects the encoding resources available - 838 bandwidth and macroblocks/second. This type of constraint is 839 captured by encoding groups, discussed below. 841 An endpoint or MCU can send multiple captures simultaneously, 842 however sometimes there are constraints that limit which captures 843 can be sent simultaneously with other captures. A device may not 844 be able to be used in different ways at the same time. Provider 845 advertisements are made so that the consumer will choose one of 846 several possible mutually exclusive usages of the device. This 847 type of constraint is expressed in a Simultaneous Transmission 848 Set, which lists all the media captures that can be sent at the 849 same time. This is easier to show in an example. 851 Consider the example of a room system where there are 3 cameras 852 each of which can send a separate capture covering 2 persons each- 853 VC0, VC1, VC2. The middle camera can also zoom out and show all 6 854 persons, VC3. But the middle camera cannot be used in both modes 855 at the same time - it has to either show the space where 2 856 participants sit or the whole 6 seats, but not both at the same 857 time. 859 Simultaneous transmission sets are expressed as sets of the MCs 860 that could physically be transmitted at the same time, (though it 861 may not make sense to do so). In this example the two 862 simultaneous sets are shown in Table 1. The consumer must make 863 sure that it chooses one and not more of the mutually exclusive 864 sets. A consumer may choose any subset of the media captures in a 865 simultaneous set, it does not have to choose all the captures in a 866 simultaneous set if it does not want to receive all of them. 868 +-------------------+ 869 | Simultaneous Sets | 870 +-------------------+ 871 | {VC0, VC1, VC2} | 872 | {VC0, VC3, VC2} | 873 +-------------------+ 875 Table 1: Two Simultaneous Transmission Sets 877 A media provider includes the simultaneous sets in its provider 878 advertisement. These simultaneous set constraints apply across 879 all the captures scenes in the advertisement. The simultaneous 880 transmission sets MUST allow all the media captures in a 881 particular capture scene entry to be used simultaneously. 883 7. Encodings 885 We have considered how providers can describe the content of media 886 to consumers. We will now consider how the providers communicate 887 information about their abilities to send streams. We introduce 888 two constructs - individual encodings and encoding groups. 889 Consumers will then map the media captures they want onto the 890 encodings with encoding parameters they want. This process is 891 then described. 893 7.1. Individual Encodings 895 An individual encoding represents a way to encode a media capture 896 to become a capture encoding, to be sent as an encoded media 897 stream from the media provider to the media consumer. An 898 individual encoding has a set of parameters characterizing how the 899 media is encoded. 
Different media types have different parameters, and different encoding algorithms may have different parameters. An individual encoding can be assigned to only one capture encoding at a time.

The parameters of an individual encoding represent the maximum values for certain aspects of the encoding. A particular instantiation into a capture encoding might use lower values than these maximums.

The following tables show the variables for audio and video encoding.

   +--------------+-------------------------------------------------+
   | Name         | Description                                     |
   +--------------+-------------------------------------------------+
   | encodeID     | A unique identifier for the individual encoding |
   | maxBandwidth | Maximum number of bits per second               |
   | maxH264Mbps  | Maximum number of macroblocks per second:       |
   |              | ((width + 15) / 16) * ((height + 15) / 16) *    |
   |              | framesPerSecond                                 |
   | maxWidth     | Video resolution's maximum supported width,     |
   |              | expressed in pixels                             |
   | maxHeight    | Video resolution's maximum supported height,    |
   |              | expressed in pixels                             |
   | maxFrameRate | Maximum supported frame rate                    |
   +--------------+-------------------------------------------------+

           Table 2: Individual Video Encoding Parameters

   +--------------+-----------------------------------+
   | Name         | Description                       |
   +--------------+-----------------------------------+
   | maxBandwidth | Maximum number of bits per second |
   +--------------+-----------------------------------+

           Table 3: Individual Audio Encoding Parameters

7.2. Encoding Group

An encoding group includes a set of one or more individual encodings, plus some parameters that apply to the group as a whole. By grouping multiple individual encodings together, an encoding group describes additional constraints on bandwidth and other parameters for the group. Table 4 shows the parameters and individual encoding sets that are part of an encoding group.

   +-------------------+---------------------------------------------+
   | Name              | Description                                 |
   +-------------------+---------------------------------------------+
   | encodeGroupID     | A unique identifier for the encoding group  |
   | maxGroupBandwidth | Maximum number of bits per second relating  |
   |                   | to all encodings combined                   |
   | maxGroupH264Mbps  | Maximum number of macroblocks per second    |
   |                   | relating to all video encodings combined    |
   | videoEncodings[]  | Set of potential encodings (list of         |
   |                   | encodeIDs)                                  |
   | audioEncodings[]  | Set of potential encodings (list of         |
   |                   | encodeIDs)                                  |
   +-------------------+---------------------------------------------+

                      Table 4: Encoding Group

When the individual encodings in a group are instantiated into capture encodings, each capture encoding has a bandwidth that must be less than or equal to the maxBandwidth for the particular individual encoding. The maxGroupBandwidth parameter gives the additional restriction that the sum of all the individual capture encoding bandwidths must be less than or equal to the maxGroupBandwidth value.

Likewise, the sum of the macroblocks per second of each instantiated encoding in the group must not exceed the maxGroupH264Mbps value.
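The following non-normative sketch (Python; the names mirror the parameters in Tables 2 and 4, and the function itself is purely illustrative) shows how a consumer might check a set of requested capture encodings against these individual and group limits:

   from dataclasses import dataclass
   from typing import Dict, List

   @dataclass
   class IndividualEncoding:
       encode_id: str
       max_bandwidth: int          # maxBandwidth, bits per second
       max_h264_mbps: int = 0      # maxH264Mbps, macroblocks/second (video)

   @dataclass
   class EncodingGroup:
       encode_group_id: str
       max_group_bandwidth: int    # maxGroupBandwidth
       max_group_h264_mbps: int    # maxGroupH264Mbps
       encodings: Dict[str, IndividualEncoding]

   def fits_group(group: EncodingGroup, requested: List[dict]) -> bool:
       """Each requested item describes one capture encoding, e.g.
       {"encode_id": "ENC0", "bandwidth": 4000000, "h264_mbps": 244800}."""
       # An individual encoding can be assigned to only one capture
       # encoding at a time.
       ids = [r["encode_id"] for r in requested]
       if len(ids) != len(set(ids)):
           return False
       # Each capture encoding must stay within its individual maximums.
       for r in requested:
           enc = group.encodings[r["encode_id"]]
           if r["bandwidth"] > enc.max_bandwidth:
               return False
           if r.get("h264_mbps", 0) > enc.max_h264_mbps:
               return False
       # The sums must stay within the group-wide maximums.
       if sum(r["bandwidth"] for r in requested) > group.max_group_bandwidth:
           return False
       return sum(r.get("h264_mbps", 0)
                  for r in requested) <= group.max_group_h264_mbps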
1002 The following diagram illustrates the structure of a media 1003 provider's Encoding Groups and their contents. 1005 ,-------------------------------------------------. 1006 | Media Provider | 1007 | | 1008 | ,--------------------------------------. | 1009 | | ,--------------------------------------. | 1010 | | | ,--------------------------------------. | 1011 | | | | Encoding Group | | 1012 | | | | ,-----------. | | 1013 | | | | | | ,---------. | | 1014 | | | | | | | | ,---------.| | 1015 | | | | | Encoding1 | |Encoding2| |Encoding3|| | 1016 | `.| | | | | | `---------'| | 1017 | `.| `-----------' `---------' | | 1018 | `--------------------------------------' | 1019 `-------------------------------------------------' 1021 Figure 1: Encoding Group Structure 1023 A media provider advertises one or more encoding groups. Each 1024 encoding group includes one or more individual encodings. Each 1025 individual encoding can represent a different way of encoding 1026 media. For example one individual encoding may be 1080p60 video, 1027 another could be 720p30, with a third being CIF. 1029 While a typical 3 codec/display system might have one encoding 1030 group per "codec box", there are many possibilities for the number 1031 of encoding groups a provider may be able to offer and for the 1032 encoding values in each encoding group. 1034 There is no requirement for all encodings within an encoding group 1035 to be instantiated at once. 1037 8. Associating Media Captures with Encoding Groups 1039 Every media capture is associated with an encoding group, which is 1040 used to instantiate that media capture into one or more capture 1041 encodings. Each media capture has an encoding group attribute. 1042 The value of this attribute is the encodeGroupID for the encoding 1043 group with which it is associated. More than one media capture 1044 may use the same encoding group. 1046 The maximum number of streams that can result from a particular 1047 encoding group constraint is equal to the number of individual 1048 encodings in the group. The actual number of capture encodings 1049 used at any time may be less than this maximum. Any of the media 1050 captures that use a particular encoding group can be encoded 1051 according to any of the individual encodings in the group. If 1052 there are multiple individual encodings in the group, then the 1053 media consumer can configure the media provider to encode a single 1054 media capture into multiple different capture encodings at the 1055 same time, subject to the Max Capture Encodings constraint, with 1056 each capture encoding following the constraints of a different 1057 individual encoding. 1059 The Encoding Groups MUST allow all the media captures in a 1060 particular capture scene entry to be used simultaneously. 1062 9. Consumer's Choice of Streams to Receive from the Provider 1064 After receiving the provider's advertised media captures and 1065 associated constraints, the consumer must choose which media 1066 captures it wishes to receive, and which individual encodings from 1067 the provider it wants to use to encode the captures. Each media 1068 capture has an encoding group ID attribute which specifies which 1069 individual encodings are available to be used for that media 1070 capture. 1072 For each media capture the consumer wants to receive, it 1073 configures one or more of the encodings in that capture's encoding 1074 group. The consumer does this by telling the provider the 1075 resolution, frame rate, bandwidth, etc. 
when asking for capture 1076 encodings for its chosen captures. Upon receipt of this 1077 configuration command from the consumer, the provider generates a 1078 stream for each such configured capture encoding and sends those 1079 streams to the consumer. 1081 The consumer must have received at least one capture advertisement 1082 from the provider to be able to configure the provider's 1083 generation of media streams. 1085 The consumer is able to change its configuration of the provider's 1086 encodings any number of times during the call, either in response 1087 to a new capture advertisement from the provider or autonomously. 1088 The consumer need not send a new configure message to the provider 1089 when it receives a new capture advertisement from the provider 1090 unless the contents of the new capture advertisement cause the 1091 consumer's current configure message to become invalid. 1093 When choosing which streams to receive from the provider, and the 1094 encoding characteristics of those streams, the consumer needs to 1095 take several things into account: its local preference, 1096 simultaneity restrictions, and encoding limits. 1098 9.1. Local preference 1100 A variety of local factors will influence the consumer's choice of 1101 streams to be received from the provider: 1103 o if the consumer is an endpoint, it is likely that it would 1104 choose, where possible, to receive video and audio captures 1105 that match the number of display devices and audio system it 1106 has 1108 o if the consumer is a middle box such as an MCU, it may choose 1109 to receive loudest speaker streams (in order to perform its own 1110 media composition) and avoid pre-composed video captures 1112 o user choice (for instance, selection of a new layout) may 1113 result in a different set of media captures, or different 1114 encoding characteristics, being required by the consumer 1116 9.2. Physical simultaneity restrictions 1118 There may be physical simultaneity constraints imposed by the 1119 provider that affect the provider's ability to simultaneously send 1120 all of the captures the consumer would wish to receive. For 1121 instance, a middle box such as an MCU, when connected to a multi- 1122 camera room system, might prefer to receive both individual camera 1123 streams of the people present in the room and an overall view of 1124 the room from a single camera. Some endpoint systems might be 1125 able to provide both of these sets of streams simultaneously, 1126 whereas others may not (if the overall room view were produced by 1127 changing the zoom level on the center camera, for instance). 1129 9.3. Encoding and encoding group limits 1131 Each of the provider's encoding groups has limits on bandwidth and 1132 macroblocks per second, and the constituent potential encodings 1133 have limits on the bandwidth, macroblocks per second, video frame 1134 rate, and resolution that can be provided. When choosing the 1135 media captures to be received from a provider, a consumer device 1136 must ensure that the encoding characteristics requested for each 1137 individual media capture fits within the capability of the 1138 encoding it is being configured to use, as well as ensuring that 1139 the combined encoding characteristics for media captures fit 1140 within the capabilities of their associated encoding groups. 
In some cases, this could cause an otherwise "preferred" choice of capture encodings to be passed over in favour of different capture encodings - for instance, if a set of 3 media captures could only be provided at a low resolution, then a 3-screen device could switch to favoring a single, higher quality, capture encoding.

9.4. Message Flow

The following diagram shows the basic flow of messages between a media provider and a media consumer. The usage of the "capture advertisement" and "configure encodings" messages is described above. The consumer also sends its own capability message to the provider, which may contain information about its own capabilities or restrictions.

Diagram for Message Flow

      Media Consumer                          Media Provider
      --------------                          --------------
            |                                       |
            |----- Consumer Capability ------------>|
            |                                       |
            |<---- Capture advertisement -----------|
            |                                       |
            |------ Configure encodings ----------->|
            |                                       |

In order for a maximally-capable provider to be able to advertise a manageable number of video captures to a consumer, it is potentially useful for the consumer, at the start of CLUE, to inform the provider of its capabilities. One example here would be the video capture attribute set - a consumer could tell the provider the complete set of video capture attributes it is able to understand, so that the provider can tailor the capture scene it advertises to the consumer.

TBD - the content of the consumer capability message needs to be better defined. The authors believe there is a need for this message, but have not worked out the details yet.

10. Extensibility

One of the most important characteristics of the Framework is its extensibility. Telepresence is a relatively new industry and while we can foresee certain directions, we also do not know everything about how it will develop. The standard for interoperability and handling multiple streams must be future-proof. The framework itself is inherently extensible through expanding the data model types. For example:

o Adding more types of media, such as telemetry, can be done by defining additional types of captures in addition to audio and video.

o Adding new functionality, such as 3-D, will require additional attributes describing the captures.

o Adding new codecs, such as H.265, can be accomplished by defining new encoding variables.

The infrastructure is designed to be extended rather than requiring new infrastructure elements. Extension comes through adding to defined types.

Assuming the implementation is in something like XML, adding data elements and attributes makes extensibility easy.

11. Examples - Using the Framework

This section shows in more detail some examples of how to use the framework to represent a typical case for telepresence rooms. First an endpoint is illustrated, then an MCU case is shown.

11.1.
Three screen endpoint media provider 1218 Consider an endpoint with the following description: 1220 3 cameras, 3 displays, a 6 person table 1222 o Each video device can provide one capture for each 1/3 section 1223 of the table 1225 o A single capture representing the active speaker can be 1226 provided 1228 o A single capture representing the active speaker with the other 1229 2 captures shown picture in picture within the stream can be 1230 provided 1232 o A capture showing a zoomed out view of all 6 seats in the room 1233 can be provided 1235 The audio and video captures for this endpoint can be described as 1236 follows. 1238 Video Captures: 1240 o VC0- (the camera-left camera stream), encoding group=EG0, 1241 content=main, switched=false 1243 o VC1- (the center camera stream), encoding group=EG1, 1244 content=main, switched=false 1246 o VC2- (the camera-right camera stream), encoding group=EG2, 1247 content=main, switched=false 1249 o VC3- (the loudest panel stream), encoding group=EG1, 1250 content=main, switched=true 1252 o VC4- (the loudest panel stream with PiPs), encoding group=EG1, 1253 content=main, composed=true, switched=true 1255 o VC5- (the zoomed out view of all people in the room), encoding 1256 group=EG1, content=main, composed=false, switched=false 1258 o VC6- (presentation stream), encoding group=EG1, content=slides, 1259 switched=false 1261 The following diagram is a top view of the room with 3 cameras, 3 1262 displays, and 6 seats. Each camera is capturing 2 people. The 1263 six seats are not all in a straight line. 1265 ,-. D 1266 ( )`--.__ +---+ 1267 `-' / `--.__ | | 1268 ,-. | `-.._ |_-+Camera 2 (VC2) 1269 ( ).' ___..-+-''`+-+ 1270 `-' |_...---'' | | 1271 ,-.c+-..__ +---+ 1272 ( )| ``--..__ | | 1273 `-' | ``+-..|_-+Camera 1 (VC1) 1274 ,-. | __..--'|+-+ 1275 ( )| __..--' | | 1276 `-'b|..--' +---+ 1277 ,-. |``---..___ | | 1278 ( )\ ```--..._|_-+Camera 0 (VC0) 1279 `-' \ _..-''`-+ 1280 ,-. \ __.--'' | | 1281 ( ) |..-'' +---+ 1282 `-' a 1284 The two points labeled b and c are intended to be at the midpoint 1285 between the seating positions, and where the fields of view of the 1286 cameras intersect. 1288 The plane of interest for VC0 is a vertical plane that intersects 1289 points 'a' and 'b'. 1291 The plane of interest for VC1 intersects points 'b' and 'c'. The 1292 plane of interest for VC2 intersects points 'c' and 'd'. 1294 This example uses an area scale of millimeters. 1296 Areas of capture: 1298 bottom left bottom right top left top right 1299 VC0 (-2011,2850,0) (-673,3000,0) (-2011,2850,757) (-673,3000,757) 1300 VC1 ( -673,3000,0) ( 673,3000,0) ( -673,3000,757) ( 673,3000,757) 1301 VC2 ( 673,3000,0) (2011,2850,0) ( 673,3000,757) (2011,3000,757) 1302 VC3 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1303 VC4 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1304 VC5 (-2011,2850,0) (2011,2850,0) (-2011,2850,757) (2011,3000,757) 1305 VC6 none 1307 Points of capture: 1308 VC0 (-1678,0,800) 1309 VC1 (0,0,800) 1310 VC2 (1678,0,800) 1311 VC3 none 1312 VC4 none 1313 VC5 (0,0,800) 1314 VC6 none 1316 In this example, the right edge of the VC0 area lines up with the 1317 left edge of the VC1 area. It doesn't have to be this way. There 1318 could be a gap or an overlap. One additional thing to note for 1319 this example is the distance from a to b is equal to the distance 1320 from b to c and the distance from c to d. All these distances are 1321 1346 mm. This is the planar width of each area of capture for VC0, 1322 VC1, and VC2. 
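The framework itself does not define a concrete syntax for conveying this information. Purely for illustration, and assuming a simple hypothetical in-memory representation rather than any defined encoding, a consumer might hold the main camera captures above as follows (coordinate values are those from the example, in millimeters):

      # Illustrative only; the framework does not define this representation.
      VC0 = {"capture_id": "VC0", "encoding_group": "EG0", "content": "main",
             "switched": False,
             "point_of_capture": (-1678, 0, 800),
             "area_of_capture": {"bottom_left":  (-2011, 2850, 0),
                                 "bottom_right": (-673, 3000, 0),
                                 "top_left":     (-2011, 2850, 757),
                                 "top_right":    (-673, 3000, 757)}}

      VC1 = {"capture_id": "VC1", "encoding_group": "EG1", "content": "main",
             "switched": False,
             "point_of_capture": (0, 0, 800),
             "area_of_capture": {"bottom_left":  (-673, 3000, 0),
                                 "bottom_right": (673, 3000, 0),
                                 "top_left":     (-673, 3000, 757),
                                 "top_right":    (673, 3000, 757)}}

      # VC2 is analogous, with encoding group EG2, point of capture
      # (1678, 0, 800), and an area of capture spanning X = 673 to 2011.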
1324 Note that the text in parentheses (e.g. "the camera-left camera 1325 stream") is not part of the model; it is just 1326 explanatory text for this example and is not included in the 1327 model with the media captures and attributes. Also, the 1328 "composed" boolean attribute doesn't say anything about how a 1329 capture is composed, so the media consumer can't tell based on 1330 this attribute that VC4 is composed of a "loudest panel with 1331 PiPs".

1333 Audio Captures:

1335 o AC0 (camera-left), encoding group=EG3, content=main, channel 1336 format=mono

1338 o AC1 (camera-right), encoding group=EG3, content=main, channel 1339 format=mono

1341 o AC2 (center), encoding group=EG3, content=main, channel 1342 format=mono

1344 o AC3 (a simple pre-mixed mono audio stream from the room), 1345 encoding group=EG3, content=main, channel format=mono

1347 o AC4 (the mono audio stream associated with the presentation video), 1348 encoding group=EG3, content=slides, channel format=mono

1350 Areas of capture:

           bottom left      bottom right     top left           top right
      AC0  (-2011,2850,0)   (-673,3000,0)    (-2011,2850,757)   (-673,3000,757)
      AC1  ( 673,3000,0)    (2011,2850,0)    ( 673,3000,757)    (2011,3000,757)
      AC2  ( -673,3000,0)   ( 673,3000,0)    ( -673,3000,757)   ( 673,3000,757)
      AC3  (-2011,2850,0)   (2011,2850,0)    (-2011,2850,757)   (2011,3000,757)
      AC4  none

1360 The physical simultaneity information is:

1362 Simultaneous transmission set #1 {VC0, VC1, VC2, VC3, VC4, VC6}

1364 Simultaneous transmission set #2 {VC0, VC2, VC5, VC6}

1366 This constraint indicates it is not possible to use all the VCs at 1367 the same time. VC5 cannot be used at the same time as VC1 or VC3 1368 or VC4. Also, using every member in a set simultaneously may 1369 not make sense - for example VC3 (loudest) and VC4 (loudest with 1370 PiPs). (In addition, there are encoding constraints that make 1371 choosing all of the VCs in a set impossible. VC1, VC3, VC4, VC5, and 1372 VC6 all use EG1, and EG1 has only 3 ENCs. This constraint shows up 1373 in the encoding groups, not in the simultaneous transmission 1374 sets.)

1376 In this example there are no restrictions on which audio captures 1377 can be sent simultaneously.

1379 Encoding Groups:

1381 This example has three encoding groups associated with the video 1382 captures. Each group can have 3 encodings, but with each 1383 potential encoding having a progressively lower specification. In 1384 this example, 1080p60 transmission is possible (as ENC0 has a 1385 maxMbps value compatible with that) as long as it is the only 1386 active encoding in the group (as maxMbps for the entire encoding 1387 group is also 489600). Significantly, as up to 3 encodings are 1388 available per group, it is possible to transmit some video 1389 captures simultaneously that are not in the same entry in the 1390 capture scene - for example, VC1 and VC3 at the same time.

1392 It is also possible to transmit multiple capture encodings of a 1393 single video capture. For example, VC0 can be encoded using ENC0 1394 and ENC1 at the same time, as long as the encoding parameters 1395 satisfy the constraints of ENC0, ENC1, and EG0, such as one at 1396 1080p30 and one at 720p30.
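The maxH264Mbps values used in this example follow from simple macroblock arithmetic (an H.264 macroblock covers 16x16 pixels). As an illustration only, the following sketch reproduces that arithmetic and checks the 1080p30 plus 720p30 combination mentioned above against the EG0 limits listed in Figure 2 below:

      # Illustrative arithmetic; not part of the framework.
      def h264_mbps(width, height, frame_rate):
          # Macroblocks per second for a given resolution and frame rate.
          return (width // 16) * (height // 16) * frame_rate

      assert h264_mbps(1920, 1088, 60) == 489600  # ENC0 and the EG0 group limit
      assert h264_mbps(1280, 720, 30) == 108000   # ENC1
      assert h264_mbps(960, 544, 30) == 61200     # ENC2

      # VC0 sent as two capture encodings on EG0: one 1080p30, one 720p30.
      combined = h264_mbps(1920, 1088, 30) + h264_mbps(1280, 720, 30)
      assert combined == 352800 and combined <= 489600  # within maxGroupH264Mbps
      # The configured bit rates must similarly sum to no more than
      # maxGroupBandwidth (6000000), for example 4000000 + 2000000.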
      encodeGroupID=EG0, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC0, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC1, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC2, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000
      encodeGroupID=EG1, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC3, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC4, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC5, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000
      encodeGroupID=EG2, maxGroupH264Mbps=489600,
          maxGroupBandwidth=6000000
        encodeID=ENC6, maxWidth=1920, maxHeight=1088, maxFrameRate=60,
            maxH264Mbps=489600, maxBandwidth=4000000
        encodeID=ENC7, maxWidth=1280, maxHeight=720, maxFrameRate=30,
            maxH264Mbps=108000, maxBandwidth=4000000
        encodeID=ENC8, maxWidth=960, maxHeight=544, maxFrameRate=30,
            maxH264Mbps=61200, maxBandwidth=4000000

1423 Figure 2: Example Encoding Groups for Video

1425 For audio, there are five potential encodings available, so all 1426 five audio captures can be encoded at the same time.

      encodeGroupID=EG3, maxGroupH264Mbps=0, maxGroupBandwidth=320000
        encodeID=ENC9, maxBandwidth=64000
        encodeID=ENC10, maxBandwidth=64000
        encodeID=ENC11, maxBandwidth=64000
        encodeID=ENC12, maxBandwidth=64000
        encodeID=ENC13, maxBandwidth=64000

1435 Figure 3: Example Encoding Group for Audio

1437 Capture Scenes:

1439 The following table represents the capture scenes for this 1440 provider. Recall that a capture scene is composed of alternative 1441 capture scene entries covering the same scene. Capture Scene #1 1442 is for the main people captures, and Capture Scene #2 is for 1443 presentation.

1445 Each row in the tables below is a separate entry in the capture scene.

      +------------------+
      | Capture Scene #1 |
      +------------------+
      | VC0, VC1, VC2    |
      | VC3              |
      | VC4              |
      | VC5              |
      | AC0, AC1, AC2    |
      | AC3              |
      +------------------+

      +------------------+
      | Capture Scene #2 |
      +------------------+
      | VC6              |
      | AC4              |
      +------------------+

1465 Different capture scenes are distinct from each other and non- 1466 overlapping. A consumer can choose an entry from each capture 1467 scene. In this case, the three captures VC0, VC1, and VC2 are one 1468 way of representing the video from the endpoint. These three 1469 captures should appear adjacent to each other. 1470 Another way of representing the capture scene is 1471 with the capture VC3, which automatically shows the person who is 1472 talking. Similarly for the VC4 and VC5 alternatives.

1474 As in the video case, the different entries of audio in Capture 1475 Scene #1 represent the "same thing", in that one way to receive 1476 the audio is with the 3 audio captures (AC0, AC1, AC2), and 1477 another way is with the mixed AC3. The Media Consumer can choose 1478 an audio capture entry it is capable of receiving.

1480 The spatial ordering is conveyed by the media capture attributes 1481 area of capture and point of capture.
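As an illustration only (the framework does not prescribe how a consumer implements this), the left-to-right arrangement of the first video entry can be recovered by ordering the captures along the X axis of their spatial attributes:

      # Points of capture from the example above (millimeter scale);
      # illustrative sketch only.
      point_of_capture = {"VC0": (-1678, 0, 800),
                          "VC1": (0, 0, 800),
                          "VC2": (1678, 0, 800)}

      # Sorting by the X coordinate reproduces the camera-left to
      # camera-right arrangement of the example: VC0, VC1, VC2.
      ordered = sorted(point_of_capture, key=lambda c: point_of_capture[c][0])
      print(ordered)   # ['VC0', 'VC1', 'VC2']

The same ordering can be derived from the area of capture coordinates, since in this example the right edge of each area lines up with the left edge of the next.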
1483 A Media Consumer would likely want to choose a capture scene entry 1484 to receive based in part on how many streams it can simultaneously 1485 receive. A consumer that can receive three people streams would 1486 probably prefer to receive the first entry of Capture Scene #1 1487 (VC0, VC1, VC2) and not receive the other entries. A consumer 1488 that can receive only one people stream would probably choose one 1489 of the other entries. 1491 If the consumer can receive a presentation stream too, it would 1492 also choose to receive the only entry from Capture Scene #2 (VC6). 1494 11.2. Encoding Group Example 1496 This is an example of an encoding group to illustrate how it can 1497 express dependencies between encodings. 1499 encodeGroupID=EG0, maxGroupH264Mbps=489600, 1500 maxGroupBandwidth=6000000 1501 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 1502 maxFrameRate=60, 1503 maxH264Mbps=244800, maxBandwidth=4000000 1504 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 1505 maxFrameRate=60, 1506 maxH264Mbps=244800, maxBandwidth=4000000 1507 encodeID=AUDENC0, maxBandwidth=96000 1508 encodeID=AUDENC1, maxBandwidth=96000 1509 encodeID=AUDENC2, maxBandwidth=96000 1511 Here, the encoding group is EG0. It can transmit up to two 1512 1080p30 capture encodings (Mbps for 1080p = 244800), but it is 1513 capable of transmitting a maxFrameRate of 60 frames per second 1514 (fps). To achieve the maximum resolution (1920 x 1088) the frame 1515 rate is limited to 30 fps. However 60 fps can be achieved at a 1516 lower resolution if required by the consumer. Although the 1517 encoding group is capable of transmitting up to 6Mbit/s, no 1518 individual video encoding can exceed 4Mbit/s. 1520 This encoding group also allows up to 3 audio encodings, AUDENC<0- 1521 2>. It is not required that audio and video encodings reside 1522 within the same encoding group, but if so then the group's overall 1523 maxBandwidth value is a limit on the sum of all audio and video 1524 encodings configured by the consumer. A system that does not wish 1525 or need to combine bandwidth limitations in this way should 1526 instead use separate encoding groups for audio and video in order 1527 for the bandwidth limitations on audio and video to not interact. 1529 Audio and video can be expressed in separate encoding groups, as 1530 in this illustration. 1532 encodeGroupID=EG0, maxGroupH264Mbps=489600, 1533 maxGroupBandwidth=6000000 1534 encodeID=VIDENC0, maxWidth=1920, maxHeight=1088, 1535 maxFrameRate=60, 1536 maxH264Mbps=244800, maxBandwidth=4000000 1537 encodeID=VIDENC1, maxWidth=1920, maxHeight=1088, 1538 maxFrameRate=60, 1539 maxH264Mbps=244800, maxBandwidth=4000000 1540 encodeGroupID=EG1, maxGroupH264Mbps=0, maxGroupBandwidth=500000 1541 encodeID=AUDENC0, maxBandwidth=96000 1542 encodeID=AUDENC1, maxBandwidth=96000 1543 encodeID=AUDENC2, maxBandwidth=96000 1545 11.3. The MCU Case 1547 This section shows how an MCU might express its Capture Scenes, 1548 intending to offer different choices for consumers that can handle 1549 different numbers of streams. A single audio capture stream is 1550 provided for all single and multi-screen configurations that can 1551 be associated (e.g. lip-synced) with any combination of video 1552 captures at the consumer. 
      +--------------------+---------------------------------------------+
      | Capture Scene #1   | note                                        |
      +--------------------+---------------------------------------------+
      | VC0                | video capture for single screen consumer    |
      | VC1, VC2           | video capture for 2 screen consumer         |
      | VC3, VC4, VC5      | video capture for 3 screen consumer         |
      | VC6, VC7, VC8, VC9 | video capture for 4 screen consumer         |
      | AC0                | audio capture representing all participants |
      +--------------------+---------------------------------------------+

1573 If or when a presentation stream becomes active within the 1574 conference, the MCU might re-advertise the available media as:

      +------------------+--------------------------------------+
      | Capture Scene #2 | note                                 |
      +------------------+--------------------------------------+
      | VC10             | video capture for presentation       |
      | AC1              | presentation audio to accompany VC10 |
      +------------------+--------------------------------------+

1583 11.4. Media Consumer Behavior

1585 This section gives an example of how a media consumer might behave 1586 when deciding how to request streams from the three screen 1587 endpoint described in the previous section.

1589 The receive side of a call needs to balance its requirements, 1590 based on the number of screens and speakers, its decoding capabilities 1591 and available bandwidth, and the provider's capabilities in order 1592 to optimally configure the provider's streams. Typically it would 1593 want to receive and decode media from each capture scene 1594 advertised by the provider.

1596 A sane, basic algorithm might be for the consumer to go through 1597 each capture scene in turn and find the collection of video 1598 captures that best matches the number of screens it has (this 1599 might include consideration of screens dedicated to presentation 1600 video display rather than "people" video) and then decide between 1601 alternative entries in the video capture scenes based either on 1602 hard-coded preferences or user choice. Once this choice has been 1603 made, the consumer would then decide how to configure the 1604 provider's encoding groups in order to make best use of the 1605 available network bandwidth and its own decoding capabilities.

1607 11.4.1. One screen consumer

1609 VC3, VC4 and VC5 are all different entries by themselves, not 1610 grouped together in a single entry, so the receiving device should 1611 choose one of those. The choice would come down to 1612 whether to see the greatest number of participants simultaneously 1613 at roughly equal precedence (VC5), a switched view of just the 1614 loudest region (VC3), or a switched view with PiPs (VC4). An 1615 endpoint device with a small amount of knowledge of these 1616 differences could offer a dynamic choice of these options, 1617 in-call, to the user.

1619 11.4.2. Two screen consumer configuring the example

1621 Mixing systems with an even number of screens, "2n", and those 1622 with "2n+1" cameras (and vice versa) is always likely to be the 1623 problematic case. In this instance, the behavior is likely to be 1624 determined by whether a "2 screen" system is really a "2 decoder" 1625 system, i.e., whether only one received stream can be displayed 1626 per screen or whether more than 2 streams can be received and 1627 spread across the available screen area.
To enumerate 3 possible 1628 behaviors here for the 2 screen system when it learns that the far 1629 end is "ideally" expressed via 3 capture streams:

1631 1. Fall back to receiving just a single stream (VC3, VC4 or VC5 as 1632 per the 1 screen consumer case above) and either leave one 1633 screen blank or use it for presentation if or when a 1634 presentation becomes active.

1636 2. Receive 3 streams (VC0, VC1 and VC2) and display them across 2 1637 screens, either with each capture scaled to 2/3 of a 1638 screen and the center capture split across 2 screens or, 1639 as would be necessary if there were large bezels on the 1640 screens, with each stream scaled to 1/2 the screen width 1641 and height and a 4th "blank" panel. This 4th panel 1642 could potentially be used for any presentation that became 1643 active during the call.

1645 3. Receive 3 streams, decode all 3, and use control information 1646 indicating which was the most active to switch between showing 1647 the left and center streams (one per screen) and the center and 1648 right streams.

1650 For an endpoint capable of all 3 methods of working described 1651 above, it might again be appropriate to offer the user the choice 1652 of display mode.

1654 11.4.3. Three screen consumer configuring the example

1656 This is the most straightforward case - the consumer would look to 1657 identify a set of streams to receive that best matches its 1658 available screens, and so VC0 plus VC1 plus VC2 should match 1659 optimally. The spatial ordering would give sufficient information 1660 for the correct video capture to be shown on the correct screen, 1661 and the consumer would need either to divide a single encoding 1662 group's capability by 3 to determine what resolution and frame 1663 rate to configure the provider with, or to configure the individual 1664 video captures' encoding groups with what makes most sense (taking 1665 into account the receive side decode capabilities, overall call 1666 bandwidth, the resolution of the screens, plus any user preferences 1667 such as motion vs. sharpness).

1669 12. Acknowledgements

1671 Mark Gorzynski contributed much to the approach. We want to 1672 thank Stephen Botzko for helpful discussions on audio.

1674 13. IANA Considerations

1676 TBD

1678 14. Security Considerations

1680 TBD

1682 15. Changes Since Last Version

1684 NOTE TO THE RFC-Editor: Please remove this section prior to 1685 publication as an RFC.

1687 Changes from 06 to 07:

1689 1. Ticket #9. Rename Axis of Capture Point attribute to Point on 1690 Line of Capture. Clarify the description of this attribute.

1692 2. Ticket #17. Add "capture encoding" definition. Use this new 1693 term throughout document as appropriate, replacing some usage 1694 of the terms "stream" and "encoding".

1696 3. Ticket #18. Add Max Capture Encodings media capture attribute.

1698 4. Add clarification that different capture scene entries are not 1699 necessarily mutually exclusive.

1701 Changes from 05 to 06:

1703 1. Capture scene description attribute is a list of text strings, 1704 each in a different language, rather than just a single string.

1706 2. Add new Axis of Capture Point attribute.

1708 3. Remove appendices A.1 through A.6.

1710 4. Clarify that the provider must use the same coordinate system 1711 with same scale and origin for all coordinates within the same 1712 capture scene.

1714 Changes from 04 to 05:

1716 1. Clarify limitations of "composed" attribute.

1718 2.
Add new section "capture scene entry attributes" and add the 1719 attribute "scene-switch-policy".

1721 3. Add capture scene description attribute and description 1722 language attribute.

1724 4. Editorial changes to examples section for consistency with the 1725 rest of the document.

1727 Changes from 03 to 04:

1729 1. Remove sentence from overview - "This constitutes a significant 1730 change ..."

1732 2. Clarify a consumer can choose a subset of captures from a 1733 capture scene entry or a simultaneous set (in section "capture 1734 scene" and "consumer's choice...").

1736 3. Reword first paragraph of Media Capture Attributes section.

1738 4. Clarify a stereo audio capture is different from two mono audio 1739 captures (description of audio channel format attribute).

1741 5. Clarify what it means when coordinate information is not 1742 specified for area of capture, point of capture, area of scene.

1744 6. Change the term "producer" to "provider" to be consistent (it 1745 was just in two places).

1747 7. Change name of "purpose" attribute to "content" and refer to 1748 RFC4796 for values.

1750 8. Clarify simultaneous sets are part of a provider advertisement, 1751 and apply across all capture scenes in the advertisement.

1753 9. Remove sentence about lip-sync between all media captures in a 1754 capture scene.

1756 10. Combine the concepts of "capture scene" and "capture set" 1757 into a single concept, using the term "capture scene" to 1758 replace the previous term "capture set", and eliminating the 1759 original separate capture scene concept.

1761 Informative References

1763 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

1766 [RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

1772 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

1776 [RFC4353] Rosenberg, J., "A Framework for Conferencing with the Session Initiation Protocol (SIP)", RFC 4353, February 2006.

1780 [RFC4796] Hautakorpi, J. and G. Camarillo, "The Session Description Protocol (SDP) Content Attribute", RFC 4796, February 2007.

1785 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, January 2008.

1789 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.

1792 [IANA-Lan] IANA, "Language Subtag Registry", <http://www.iana.org/assignments/language-subtag-registry>.

1797 16. Authors' Addresses

1799 Mark Duckworth (editor)
     Polycom
     Andover, MA 01810
     USA

1804 Email: mark.duckworth@polycom.com

1806 Andrew Pepperell
     Silverflare
     Uxbridge, England
     UK

1811 Email: apeppere@gmail.com

1813 Stephan Wenger
     Vidyo, Inc.
     433 Hackensack Ave.
     Hackensack, N.J. 07601
     USA

1819 Email: stewe@stewe.org