CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                              M. Duckworth
Expires: January 4, 2012                                         Polycom
                                                            A. Pepperell
                                                              B. Baldino
                                                           Cisco Systems
                                                            M. Gorzynski
                                                 HP Visual Collaboration
                                                            July 3, 2011

               Framework for Telepresence Multi-Streams
                  draft-romanow-clue-framework-00.txt

Abstract

   This memo offers a framework for a protocol that enables devices in
   a telepresence conference to interoperate by specifying the
   relationships between multiple RTP streams.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 4, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Two Necessary Functions
   5.  Protocol Features
   6.  Stream Content
       6.1.  Media capture
       6.2.  Attributes
       6.3.  Capture Set
   7.  Choosing Streams
       7.1.  Physical Simultaneity
       7.2.  Encoding Groups
           7.2.1.  Sample video encoding group specification #1
           7.2.2.  Sample video encoding group specification #2
   8.  Media provider behavior
   9.  Putting it together - using the Capture Set
   10. Media consumer behavior
       10.1.  One screen receiver configuring the example
              capture-side device above
       10.2.  Two screen receiver configuring the example
              capture-side device above
       10.3.  Three screen receiver configuring the example
              capture-side device above
       10.4.  Configuration of sender streams by a receiver
       10.5.  Advertisement of capabilities sent by receiver to
              sender
   11. Acknowledgements
   12. IANA Considerations
   13. Security Considerations
   14. Informative References
   Appendix A.  Attributes
       A.1.  Purpose
           A.1.1.  Main
           A.1.2.  Presentation
       A.2.  Audio mixed
       A.3.  Audio Channel Format
           A.3.1.  Linear Array
           A.3.2.  Stereo
           A.3.3.  Mono
       A.4.  Audio Linear Position
       A.5.  Video Scale
       A.6.  Video composed
       A.7.  Video Auto-switched
   Appendix B.  Spatial Relationship
       B.1.  Spatial relationship of audio with video
   Appendix C.  Capture sets for the MCU Case
   Authors' Addresses

1.  Introduction

   Current telepresence systems, though based on open standards such as
   RTP and SIP, cannot easily interoperate with each other.  A major
   factor limiting the interoperability of telepresence systems is the
   lack of a standardized way to describe and negotiate the use of the
   multiple streams of audio and video comprising the media flows.
   This draft provides a framework for a protocol to enable
   interoperability by handling multiple streams in a standardized
   way.  It is intended to support the use cases described in
   draft-ietf-clue-telepresence-use-cases-00 and to meet the
   requirements in draft-romanow-clue-requirements-xx.

   The solution described here is strongly focused on what is being
   done today, rather than on a vision of future conferencing.
   However, the highest priority has been given to creating an
   extensible framework, to make it easy to add new information needed
   to accommodate future conferencing functionality.

   The purpose of this effort is to make it possible to handle
   multiple streams of media in such a way that a satisfactory user
   experience is possible even when participants are on different
   vendor equipment and when they are using devices with different
   types of communication capabilities.  Information about the
   relationships between media streams must be communicated so that
   audio/video rendering can be done in the best possible manner.  In
   addition, it is necessary to choose which media streams are sent.

   This first draft of the CLUE framework introduces the basic
   approach.  The draft is deliberately as simple as possible, in
   order to focus discussion on the basic approach.  Some of the more
   descriptive material has been put into appendices in this version,
   in order to keep the framework material from being overwhelmed by
   detail.  In addition, only the basic mechanism is described here.
   In subsequent drafts, additional mechanisms consistent with the
   basic approach will be added to handle more use cases.

   Several important use cases require such additional mechanisms.
   Nonetheless, we feel that it is better to go step by step, and we
   are deferring that material until the next version of the model.
   It will provide a good illustration of how to use the extensibility
   of the framework to handle new use cases.

   If you look at this framework from the perspective of trying to
   catch it out and see where it breaks down in a special case, you
   will easily be able to succeed.  But we urge you to hold that
   perspective temporarily, in order to concentrate on how this model
   works in common cases and how it can be expanded to other use
   cases.

   [Edt.  Similarly, some of the wording is not as precise and
   accurate as it might be.  Although this is of course very
   important, it might be useful to postpone definition issues
   temporarily where possible, in order to concentrate on the
   framework.]

   After the following definitions, two short sections introduce key
   concepts.  The body of the text comprises three sections that deal
   in turn with stream content, choosing streams, and an
   implementation example.  Media provider and media consumer behavior
   are described in separate sections as well.  Several appendices
   describe further details for using the framework.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions marked with an "*" are new; all the others are from
   draft-wenger-clue-definitions-00-01.txt.

   *Audio Capture: Media Capture for audio.  Denoted as ACn.
   Capture Device: a device that converts audio and video input into
   an electrical signal, in most cases to be fed into a media encoder.
   Cameras and microphones are examples of capture devices.

   Capture Scene: the scene that is captured by a collection of
   Capture Devices.  A Capture Scene may be represented by more than
   one type of Media.  A Capture Scene may include more than one Media
   Capture of the same type.  An example of a Capture Scene is the
   video image of a group of people seated next to each other, along
   with the sound of their voices, which could be represented by some
   number of VCs and ACs.  A middle box may also express Capture
   Scenes that it constructs from Media streams it receives.

   Capture Set: a Capture Set includes Media Captures that all
   represent some aspect of the same Capture Scene.  The items (rows)
   in a Capture Set represent different alternatives for representing
   the same Capture Scene.

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   *Encoding Group: a set of encoding parameters representing one or
   more media encoders.  An Encoding Group describes constraints on
   encoding parameters used for mapping Media Captures to encoded
   Streams.

   Endpoint: the logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is neither the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, and the spatial location and mixing parameters of
   microphones.  Endpoint characteristics are not specific to
   individual media streams sent by the endpoint.

   Left: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].  (Edt. note: needs more clarification)

   MCU: Multipoint Control Unit - a device that connects two or more
   endpoints together into one single multimedia conference [RFC5117].
   An MCU includes an [RFC4353] Mixer.  (Edt. note: RFC 4353 is tardy
   in requiring that media from the mixer be sent to EACH participant.
   I think we have practical use cases where this is not the case.
   But the bug (if it is one) is in RFC 4353 and not herein.)

   Media: any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video or timed text.

   *Media Capture: a source of Media, such as from one or more Capture
   Devices.  A Media Capture may be the source of one or more Media
   streams.  A Media Capture may also be constructed from other Media
   streams.  A middle box can express Media Captures that it
   constructs from Media streams it receives.
   *Media Consumer: an Endpoint or middle box that receives Media
   streams.

   *Media Provider: an Endpoint or middle box that sends Media
   streams.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Right: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].  (Edt. note: needs more clarification)

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   *Simultaneous Transmission Set: a set of Media Captures that can be
   transmitted simultaneously from a Media Sender.

   Spatial Relation: the arrangement in space of two objects, in
   contrast to relation in time or other relationships.  See also Left
   and Right.

   *Stream: an RTP stream as in RFC 3550.

   Stream Characteristics: include media stream attributes commonly
   used in non-CLUE SIP/SDP environments (such as media codec, bit
   rate, resolution, profile/level, etc.) as well as CLUE specific
   attributes (which could include, for example, and depending on the
   solution found, the ID or spatial location of the capture device a
   stream originates from).

   Telepresence: an environment that gives non-co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

   *Video Capture: Media Capture for video.  Denoted as VCn.

   Video composite: a single image that is formed from combining
   visual elements from separate sources.

4.  Two Necessary Functions

   In simplified terms, here is a description of the functions in a
   telepresence conference:

   1.  Capture media

   2.  FIGURE OUT WHICH MEDIA STREAMS TO SEND (CHOOSING STREAMS)

   3.  Encode it

   4.  ADD SOME NOTES (STREAM CONTENT)

   5.  Package it

   6.  Send it

   7.  Unpack it

   8.  Decode it

   9.  Understand the notes

   10. Render the stream content according to the notes

   This gross oversimplification shows clearly that there are only two
   functions the CLUE protocol needs to accomplish: choose which
   streams the sender should send to the receiver, and add the right
   information to the streams that are sent.  The framework/model we
   are presenting can be understood as addressing these two issues.

5.  Protocol Features

   Central to the framework are media stream providers and media
   stream consumers.  The provider's job is to advertise its
   capabilities (as described here) to the consumer, whose job it is
   to configure the provider's encodings (described below).  Both
   providers and consumers can send and receive information; that is,
   we do not have one party exclusively as the sender and one as the
   receiver, but all parties have both sending and receiving parts to
   them.  Most devices function as both a media provider and a media
   consumer.  For two devices to communicate bidirectionally, with
   media flowing in both directions, both devices act as both a media
   provider and a media consumer.  The protocol exchange shown later
   in the "Choosing Streams" section, including hints, announcement
   and request messages, happens twice, independently, between the two
   bidirectional devices.
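   As a purely illustrative sketch, the three message types could be
   modeled as in the following Python fragment.  The data layout and
   all field names here are assumptions of this sketch, not part of
   the framework or of any wire format:

   from dataclasses import dataclass, field
   from typing import Dict, List, Set

   @dataclass
   class Hints:                      # optional, consumer -> provider
       num_screens: int = 1
       max_video_streams: int = 1

   @dataclass
   class Advertisement:              # provider -> consumer (announce)
       captures: Dict[str, dict]     # e.g. {"VC0": {attributes...}}
       capture_sets: List[list]      # rows of capture names
       simultaneous_sets: List[Set[str]]
       encoding_groups: Dict[str, dict]

   @dataclass
   class Configure:                  # consumer -> provider (request)
       # chosen encoder -> capture and encode parameters
       chosen: Dict[str, dict] = field(default_factory=dict)

   In a bidirectional call, each device constructs its own
   Advertisement and receives a Configure from the other side.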
   For brevity we will sometimes refer to the media stream provider as
   the "sender" and the media stream consumer as the "receiver".

   Both endpoints and MCUs, or more generally "middleboxes", can be
   media senders and receivers.

   The protocol resulting from the framework will be declarative
   rather than negotiated.  What this means here is that information
   is passed in either direction, but there is no formalized or
   explicit agreement between participants in the protocol.

6.  Stream Content

   This section describes the structure for communicating information
   between senders and receivers.  The figure below illustrates how
   the information to be communicated is organized.  Each construct is
   discussed in the sections below.  This diagram is for reference.

   Diagram for Stream Content

                         +---------------+
                         |               |
                         |  Capture Set  |
                         |               |
                         +-------+-------+
                     _..-'       |       ``-._
                 _.-'            |            ``-._
             _.-'                |                 ``-._
   +----------------+   +----------------+   +----------------+
   | Media Capture  |   | Media Capture  |   | Media Capture  |
   | Audio or Video |   | Audio or Video |   | Audio or Video |
   +----------------+   +----------------+   +----------------+
                           .'          `.
                         .'              `.
                     ,-----.          ,---------.
                    ,'Encode`.       ,'           `.
                   (  Group   )     (  Attributes   )
                    `.       ,'      `.            ,'
                     `-----'          `---------'

6.1.  Media capture

   A media capture (defined in Section 3) is a fundamental concept of
   the model.  Media can be captured in different ways, for example by
   various arrangements of cameras and microphones.  The model uses
   the terms "video capture" (VC) and "audio capture" (AC) to refer to
   sources of media streams.  To distinguish between multiple
   instances they are numbered; for example, VC1, VC2, and VC3 could
   refer to three different video captures that can be used
   simultaneously.

   Media captures are dynamic.  They can come and go in a conference,
   and their parameters can change.  A sender can advertise a new list
   of captures at any time.  Both the media sender and the media
   receiver can send their messages (i.e., capture set advertisements
   and stream configurations) any number of times during a call, and
   the other end is always required to act on any new information
   received (e.g., stopping streams it had previously configured that
   are no longer valid).

   A media capture can be a media source such as video from a specific
   camera, or it can be more conceptual, such as a composite image
   from several cameras, or an automatic, dynamically switched capture
   choosing from several cameras depending on who is talking or other
   factors.

   A media capture is described by attributes and associated with an
   encoding group.  Audio and video captures are aggregated into
   capture sets.

6.2.  Attributes

   Audio and video capture attributes carry the information about
   streams and their relationships that a sender or receiver wants to
   communicate.  [Edt: We do not mean to duplicate SDP; if an SDP
   description can be used, great.]

   The attributes of media streams refer to the current state of a
   stream, rather than to the capabilities of a video capture device,
   which are described in the encoding capabilities, as described
   below.

   The mechanism of attributes makes the framework extensible.
   Although we are defining some attributes now, based on the most
   common use cases, new attributes can be added for new use cases as
   they arise.
   If the model does not do something you want it to, chances are that
   defining an attribute will handle your case.

   We describe attributes as variables and their values.  The current
   attributes are listed below.  The variable is shown in parentheses,
   and the possible values follow the colon:

   o  (Purpose): main audio, main video, presentation

   o  (Audio mixed): true, false

   o  (Audio Channel Format): linear array, mono, stereo, tbd

   o  (Audio linear position): integer 0 to 100

   o  (Video scale): integer indicating scale

   o  (Video composed): true, false

   o  (Video auto-switched): true, false

   The attributes listed here are discussed in Appendix A, in order to
   keep the emphasis of this draft on the overall approach rather than
   on the more specific details.

6.3.  Capture Set

   A sender describes its ability to send alternative representations
   of media streams by defining capture sets.

   A capture set is a list of media captures expressed in rows.  Each
   row of the capture set consists of either a single capture or a
   group of captures.  A group means the individual captures in the
   group are spatially related, and the order of the captures within
   the group, along with attribute values, defines the spatial
   ordering of the captures.  Spatial relationships are discussed in
   detail in Appendix B.

   The items (rows) in a capture set represent different alternatives
   for representing the same Capture Scene.  For example, the
   following are alternative ways of capturing the same Capture Scene:
   two cameras each viewing half of a room, or one camera viewing the
   whole room, or one stream that automatically captures the person in
   the room who is currently speaking.  Each row of the capture set
   contains either a single media capture or one group of media
   captures.

   The following example shows a capture set for an endpoint media
   sender where:

   o  (VC0 - left camera capture, VC1 - center camera capture, VC2 -
      right camera capture)

   o  (VC3 - capture associated with the loudest speaker)

   o  (VC4 - zoomed out view of all people in the room)

   o  (AC0 - room audio)

   The first item in this capture set example is a group of video
   captures with a spatial relationship to each other: VC1 is to the
   left of VC2, and VC0 is to the left of VC1.  VC3 and VC4 are other
   alternatives for capturing the same room in different ways.  The
   audio capture is included in the same capture set to indicate that
   AC0 is associated with those video captures, meaning the audio
   should be rendered along with the video in the same set.

   The idea is to have sets of captures that represent the same
   information ("information" in this context might be a set of people
   and their associated audio/video streams, or might be a
   presentation supplied by a laptop, perhaps with accompanying audio
   commentary).  Spatial ordering of media captures is imposed here by
   the simplicity of a left-to-right ordering among media captures in
   a group in the set.

   A media receiver could choose one row of each media type (e.g.,
   audio and video) from a capture set.  For example, a three stream
   receiver could choose the first video row plus the audio row, while
   a single stream receiver could choose the second or third video row
   plus the audio row.  An MCU receiver might choose to receive
   multiple rows.
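   As an illustration only, such a capture set could be represented in
   a program as a list of rows, as in the Python sketch below.  The
   representation is an assumption of this sketch, not a normative
   encoding:

   # The capture set from the example above: each row is either a
   # spatially ordered (left-to-right) group or a single capture.
   capture_set = [
       ["VC0", "VC1", "VC2"],  # left, center, right camera captures
       ["VC3"],                # capture associated with the loudest
       ["VC4"],                # zoomed out view of the whole room
       ["AC0"],                # room audio, tied to the video above
   ]

   def video_rows(capture_set):
       """The alternative video rows; a consumer picks one of them."""
       return [row for row in capture_set if row[0].startswith("VC")]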
   The simultaneity groups and encoding groups discussed in the next
   section apply to the media captures listed in capture sets.  The
   simultaneity groups and encoding groups MUST allow all the Media
   Captures in a particular group to be used simultaneously.

7.  Choosing Streams

   The following diagram shows the flow of information messages
   between a media provider and a media consumer.  The provider sends
   information about its capabilities (as specified in this section),
   then the consumer chooses which streams it wants, which we refer to
   as "configure".  Optionally, the consumer may send hints to the
   provider about its own capabilities, in which case the provider
   might tailor its announcements to the consumer.

   Diagram for Choosing Streams

   Media Receiver                               Media Sender
   --------------                               ------------
         |                                            |
         |------------- Hints ----------------------->|
         |                                            |
         |                                            |
         |<---- Capabilities (announce) --------------|
         |                                            |
         |                                            |
         |------ Configure (request) ---------------->|
         |                                            |

   In order for appropriate streams to be sent from senders to
   receivers, certain characteristics of the multiple streams must be
   understood by both senders and receivers.  Two separate aspects of
   streams suffice to describe the necessary information to be shared:
   the first aspect we call "physical simultaneity", and the other we
   refer to as "encoding groups".  These are described in the
   following sections.

7.1.  Physical Simultaneity

   An endpoint or MCU can send multiple captures simultaneously.
   However, there may be constraints that limit which captures can be
   sent simultaneously with other captures.

   Physical or device simultaneity refers to the fact that a device
   may not be able to be used in different ways at the same time.
   This shapes the way that offers are made by the sender.  The offers
   are made so that the receiver will choose one of several possible
   usages of the device.  This is easier to show with an example.

   Consider the example of a room system where there are three
   cameras, each of which can send a separate capture covering two
   persons: VC0, VC1, VC2.  The middle camera can also zoom out and
   show all six persons, VC3.  But the middle camera cannot be used in
   both modes at the same time - it has to show either the space where
   two participants sit or the whole six seats.  We refer to this as a
   physical device simultaneity constraint.

   The following illustration shows the three cameras with four video
   streams.  The middle camera can be used as main video zoomed in on
   two people, or it can be used in zoomed out mode to capture the
   whole endpoint.  The point is that the middle camera cannot be used
   for both the zoomed in and the zoomed out captures simultaneously.
   This is a constraint imposed by the physical limitations of the
   devices.

   Diagram for Simultaneity

      `-. +--------+  VC2
      .-'|Camera 3|---------->
         +--------+
                        VC3
                    ---------->
      `-. +--------+ /
      .-'|Camera 2|<
         +--------+ \  VC1
                    ---------->

      `-. +--------+  VC0
      .-'|Camera 1|---------->
         +--------+

   VC0 - video zoomed in on 2 people  VC2 - video zoomed in on 2 people
   VC1 - video zoomed in on 2 people  VC3 - video zoomed out, 6 people

   Simultaneous transmission sets can be expressed as sets of the VCs
   that could physically be transmitted at the same time, though it
   may not always make sense to do so.  In this example the two
   simultaneous sets are:

   o  {VC0, VC1, VC2}

   o  {VC0, VC3, VC2}

   Either VC0, VC1 and VC2 can be sent, or VC0, VC3 and VC2; only one
   set can be transmitted at a time.  These are physical capabilities
   describing what can physically be sent at the same time, not what
   might make sense to send.  For example, in the second set both VC0
   and VC2 are redundant if VC3 is included.

   In describing its capabilities, the provider must take physical
   simultaneity into account and send the list of its simultaneity
   groups to the consumer.
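   A receiver can check a proposed combination of captures against
   these sets mechanically, as the following Python sketch shows; the
   representation is illustrative only:

   # A chosen combination of captures is transmittable only if it is
   # a subset of at least one simultaneous transmission set.
   SIMULTANEOUS_SETS = [{"VC0", "VC1", "VC2"}, {"VC0", "VC3", "VC2"}]

   def transmittable(chosen):
       return any(set(chosen) <= s for s in SIMULTANEOUS_SETS)

   # transmittable({"VC0", "VC2"}) -> True (fits either set)
   # transmittable({"VC1", "VC3"}) -> False: VC1 and VC3 would need
   #     the middle camera in two modes at once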
7.2.  Encoding Groups

   The second aspect of multiple streams that must be understood by
   senders and receivers in order to create the best experience
   possible, i.e., for the "right" or "best" streams to be sent, is
   the encoding characteristics of the possible streams that can be
   sent.  Just as constraints are imposed on the multiple streams by
   physical limitations, there are also constraints due to encoding
   limitations.  These are described in an encoding group as follows.

   An encoding group is an attribute of a video capture (VC), as
   discussed above.

   An encoding group has the variables shown in the following table.

   +--------------+---------------------------------------------------+
   | Name         | Description                                       |
   +--------------+---------------------------------------------------+
   | maxBandwidth | Maximum number of bits per second relating to a   |
   |              | single video encoding                             |
   | maxMbps      | Maximum number of macroblocks per second relating |
   |              | to a single video encoding: ((width + 15) / 16) * |
   |              | ((height + 15) / 16) * framesPerSecond            |
   | maxWidth     | Video resolution's maximum supported width,       |
   |              | expressed in pixels                               |
   | maxHeight    | Video resolution's maximum supported height,      |
   |              | expressed in pixels                               |
   | maxFrameRate | Maximum supported frame rate                      |
   +--------------+---------------------------------------------------+

   An encoding group is the basic method of describing encoding
   capability.  There may be multiple encoding groups per endpoint.
   For example, each video capture device might have an associated
   encoding group that describes the video streams that can result
   from that capture.

   An encoding group EG comprises one or more potential encodings
   (ENC).  For example:

   EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000

   Here, the encoding group is EG0.  It can transmit up to two 1080p30
   encodings (the macroblocks per second for 1080p is 244800), and it
   is capable of a maxFrameRate of 60 frames per second (fps).  To
   achieve the maximum resolution (1920 x 1088) the frame rate is
   limited to 30 fps, but 60 fps can be achieved at a lower resolution
   if required by the receiver.  Although the encoding group is
   capable of transmitting up to 6 Mbit/s, no individual video
   encoding can exceed 4 Mbit/s.

   This encoding group also allows up to three audio encodings,
   AUDIO_ENC<0-2>.
   It is not required that audio and video encodings reside within the
   same encoding group, but if they do, then the group's overall
   maxBandwidth value is a limit on the sum of all audio and video
   encodings configured by the receiver.  A system that does not wish
   or need to combine bandwidth limitations in this way should instead
   use separate encoding groups for audio and video, so that the
   bandwidth limitations on audio and video do not interact.

   Here is an example written with separate audio and video encoding
   groups:

   VIDEO_EG0: maxMbps=489600, maxBandwidth=6000000
       VIDEO_ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
       VIDEO_ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
                   maxMbps=244800, maxBandwidth=4000000
   AUDIO_EG0: maxBandwidth=500000
       AUDIO_ENC0: maxBandwidth=96000
       AUDIO_ENC1: maxBandwidth=96000
       AUDIO_ENC2: maxBandwidth=96000

   The following two sections describe further examples of encoding
   group specifications.

7.2.1.  Sample video encoding group specification #1

   An endpoint that has three similar video capture devices would
   advertise three encoding groups, each able to transmit up to two
   1080p30 encodings, as follows:

   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings
   such that:

   o  The parameter values configured for each active ENC do not cause
      that encoding's maxWidth, maxHeight or maxFrameRate to be
      exceeded

   o  The total bandwidth of the configured ENC encodings does not
      exceed the maxBandwidth of the encoding group

   o  The sum of the "macroblocks per second" values of the configured
      encodings does not exceed the maxMbps of the encoding group

   There is no requirement for all encodings within an encoding group
   to be activated when configured by the receiver.

   Depending on the sender's encoding methods, the receiver may be
   able to request fixed encode values or to choose encode values in a
   range below the maximum offered.  We discuss receiver behavior in
   more detail in a section below.
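   The constraint checks above are mechanical; the following Python
   sketch shows them applied to EG0 from specification #1.  The data
   layout and function names are illustrative assumptions, not part of
   the framework:

   def mbps(width, height, fps):
       # macroblocks per second, as defined for maxMbps above
       return ((width + 15) // 16) * ((height + 15) // 16) * fps

   EG0 = {"maxMbps": 489600, "maxBandwidth": 6000000,
          "encodings": {
              "ENC0": {"maxWidth": 1920, "maxHeight": 1088,
                       "maxFrameRate": 60, "maxMbps": 244800,
                       "maxBandwidth": 4000000},
              "ENC1": {"maxWidth": 1920, "maxHeight": 1088,
                       "maxFrameRate": 60, "maxMbps": 244800,
                       "maxBandwidth": 4000000}}}

   def valid_configuration(group, config):
       """config maps an ENC name to (width, height, fps, bandwidth)."""
       total_mbps = total_bw = 0
       for name, (w, h, fps, bw) in config.items():
           enc = group["encodings"][name]
           if (w > enc["maxWidth"] or h > enc["maxHeight"]
                   or fps > enc["maxFrameRate"]
                   or bw > enc["maxBandwidth"]
                   or mbps(w, h, fps) > enc["maxMbps"]):
               return False              # a per-encoding limit is hit
           total_mbps += mbps(w, h, fps)
           total_bw += bw
       return (total_mbps <= group["maxMbps"]
               and total_bw <= group["maxBandwidth"])

   Two 1080p30 encodings fit (2 * 244800 = 489600 macroblocks per
   second, exactly the group limit), while a single 1080p60 encoding
   does not, since 489600 exceeds the per-encoding maxMbps of 244800.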
7.2.2.  Sample video encoding group specification #2

   An endpoint that has three similar video capture devices would
   advertise three encoding groups that can each transmit up to two
   1080p30 encodings, as follows:

   EG0: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG1: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
   EG2: maxMbps=489600, maxBandwidth=6000000
       ENC0: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000
       ENC1: maxWidth=1920, maxHeight=1088, maxFrameRate=60,
             maxMbps=244800, maxBandwidth=4000000

   A remote receiver configures some or all of the specific encodings,
   subject to the same constraints as described in Section 7.2.1.

8.  Media provider behavior

   In summary, the sender's capabilities announce message includes:

   o  the list of captures and their attributes

   o  the list of capture sets

   o  the list of physical simultaneity groups

   o  the list of encoding groups

9.  Putting it together - using the Capture Set

   This section shows how to use the framework to represent a typical
   case for telepresence rooms.

   Appendix C includes an additional example showing the MCU case.
   [Edt.  It is in the appendix just to allow the body of the document
   to focus on the basic ideas.  It can be brought into the main text
   in a later draft.]

   Consider an endpoint with the following characteristics:

   o  3 cameras, 3 displays, a 6 person table

   o  Each video device can provide one capture for each 1/3 section
      of the table

   o  A single capture representing the active speaker can be provided

   o  A single capture representing the active speaker with the other
      2 captures shown picture in picture within the stream can be
      provided

   o  A capture showing a zoomed out view of all 6 seats in the room
      can be provided

   The audio and video captures for this endpoint can be described as
   follows.  The encoding group specifications can be found above in
   Section 7.2.2, Sample video encoding group specification #2.

   Video Captures:

   1.  VC0 - (the left camera stream), encoding group: EG0,
       attributes: purpose=main; auto-switched=false

   2.  VC1 - (the center camera stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=false

   3.  VC2 - (the right camera stream), encoding group: EG2,
       attributes: purpose=main; auto-switched=false
   4.  VC3 - (the loudest panel stream), encoding group: EG1,
       attributes: purpose=main; auto-switched=true

   5.  VC4 - (the loudest panel stream with PiPs), encoding group:
       EG1, attributes: purpose=main; composed=true; auto-switched=true

   6.  VC5 - (the zoomed out view of all people in the room), encoding
       group: EG1, attributes: purpose=main; auto-switched=false

   7.  VC6 - (presentation stream), encoding group: EG1, attributes:
       purpose=presentation; auto-switched=false

   Summary of video captures: three codecs, where the center one (EG1)
   is used for the center camera stream, the presentation stream, the
   auto-switched streams, and the zoomed out view.  [Edt.  It is
   arbitrary that for this example the alternative views are on EG1 -
   they could have been spread out; it was not a necessary choice.]

   Audio Captures:

   o  AC0 (left), attributes: purpose=main; channel format=linear
      array; linear position=0

   o  AC1 (right), attributes: purpose=main; channel format=linear
      array; linear position=100

   o  AC2 (center), attributes: purpose=main; channel format=linear
      array; linear position=50

   o  AC3, a simple pre-mixed audio stream from the room (mono),
      attributes: purpose=main; channel format=linear array; linear
      position=50; mixed=true

   o  AC4, the audio stream associated with the presentation video
      (mono), attributes: purpose=presentation; channel format=linear
      array; linear position=50

   The physical simultaneity information is:

      {VC0, VC1, VC2, VC3, VC4, VC6}

      {VC0, VC2, VC5, VC6}

   Any selection of captures within one set can physically be sent at
   the same time.  This is strictly what is possible from the devices.
   However, using every member of a set simultaneously may not make
   sense - for example VC3 (loudest) and VC4 (loudest with PiPs).  (In
   addition, there are encoding constraints that make choosing all of
   the VCs in a set impossible: VC1, VC3, VC4, VC5 and VC6 all use
   EG1, and EG1 has only three ENCs.  This constraint shows up in the
   capture list, not in the physical simultaneity list.)

   In this example there are no restrictions on which audio captures
   can be sent simultaneously.

   The following tables represent the capture sets for this sender.
   Recall that a capture set is composed of alternative captures
   covering the same scene.  Capture Set #1 is for the main people
   captures, and Capture Set #2 is for presentation.

                        +----------------+
                        | Capture Set #1 |
                        +----------------+
                        | VC0, VC1, VC2  |
                        | VC3            |
                        | VC4            |
                        | VC5            |
                        | AC0, AC1, AC2  |
                        | AC3            |
                        +----------------+

                        +----------------+
                        | Capture Set #2 |
                        +----------------+
                        | VC6            |
                        | AC4            |
                        +----------------+

   Different capture sets are unique to each other and non-
   overlapping.  A receiver chooses a capture row from each capture
   set.  In this case the three captures VC0, VC1, and VC2 are one way
   of representing the video from the endpoint; these three captures
   should appear adjacent to each other.  Alternatively, another way
   of representing the Capture Scene is with the capture VC3, which
   automatically shows the person who is talking.  Similarly for the
   VC4 and VC5 alternatives.

   As in the video case, the different rows of audio in Capture Set #1
   represent the "same thing", in that one way to receive the audio is
   with the three linear position audio captures (AC0, AC1, AC2), and
   another way is with the single channel monaural format AC3.  The
   Media Consumer would choose the one audio capture row it is capable
   of receiving.
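   Pulled together, the provider's announce message for this endpoint
   contains all of the pieces above.  The Python sketch below is
   illustrative only; the field names and layout are assumptions of
   this sketch, not a wire format:

   advertisement = {
       "captures": {
           "VC0": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG0"},
           "VC1": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG1"},
           "VC2": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG2"},
           "VC3": {"purpose": "main", "auto-switched": True,
                   "encoding group": "EG1"},
           "VC4": {"purpose": "main", "auto-switched": True,
                   "composed": True, "encoding group": "EG1"},
           "VC5": {"purpose": "main", "auto-switched": False,
                   "encoding group": "EG1"},
           "VC6": {"purpose": "presentation",
                   "encoding group": "EG1"},
           "AC0": {"purpose": "main", "linear position": 0},
           "AC1": {"purpose": "main", "linear position": 100},
           "AC2": {"purpose": "main", "linear position": 50},
           "AC3": {"purpose": "main", "linear position": 50,
                   "mixed": True},
           "AC4": {"purpose": "presentation", "linear position": 50},
       },
       "simultaneous_sets": [
           {"VC0", "VC1", "VC2", "VC3", "VC4", "VC6"},
           {"VC0", "VC2", "VC5", "VC6"},
       ],
       "capture_sets": [
           [["VC0", "VC1", "VC2"], ["VC3"], ["VC4"], ["VC5"],
            ["AC0", "AC1", "AC2"], ["AC3"]],    # set #1: people
           [["VC6"], ["AC4"]],                  # set #2: presentation
       ],
   }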
   The spatial ordering is understood from the left-to-right ordering
   among the VC<n>s on the same row of the table.

   The receiver finds a row in each capture set section of the table
   that it wants.  It configures the streams according to the encoding
   group for that row.

   A Media Receiver would likely want to choose a row to receive based
   in part on how many streams it can simultaneously receive.  A
   receiver that can receive three people streams would probably
   prefer to receive the first row of Capture Set #1 (VC0, VC1, VC2)
   and not receive the other rows.  A receiver that can receive only
   one people stream would probably choose one of the other rows.

   If the receiver can also receive a presentation stream, it would
   choose to receive the only row from Capture Set #2 (VC6).

10.  Media consumer behavior

   The receive side of a call needs to balance its requirements -
   based on its number of screens and speakers, its decoding
   capabilities and the available bandwidth - against the sender's
   capabilities in order to optimally configure the sender's streams.
   Typically it would want to receive and decode media from each
   capture set advertised by the sender.

   A sane, basic algorithm might be for the receiver to go through
   each capture set in turn and find the collection of video captures
   that best matches the number of screens it has (this might include
   consideration of screens dedicated to presentation video display
   rather than "people" video), and then to decide between alternative
   rows in the video capture sets based either on hard-coded
   preferences or user choice.  Once this choice has been made, the
   receiver would then decide how to configure the sender's encoding
   groups in order to make best use of the available network bandwidth
   and its own decoding capabilities.
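   A Python sketch of that basic row-selection step is shown below.
   The heuristic (largest row that fits the screen or audio capture
   count) is an assumption of this sketch; a real consumer might apply
   preferences or user choice instead:

   def choose_rows(capture_set, num_screens, max_audio_captures):
       video = [r for r in capture_set if r[0].startswith("VC")]
       audio = [r for r in capture_set if r[0].startswith("AC")]
       # largest video row that still fits the available screens
       video_choice = max(
           (r for r in video if len(r) <= num_screens),
           key=len, default=None)
       audio_choice = max(
           (r for r in audio if len(r) <= max_audio_captures),
           key=len, default=None)
       return video_choice, audio_choice

   With Capture Set #1 above, a three screen consumer obtains
   (VC0, VC1, VC2) and (AC0, AC1, AC2), while a single stream consumer
   obtains one of the single-capture rows and AC3 (the sketch simply
   returns the first of the equally sized rows VC3, VC4 and VC5; a
   real consumer might let preferences or the user decide).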
10.1.  One screen receiver configuring the example capture-side device
       above

   A single screen receiver would fall back to receiving just a single
   people stream - one of VC3, VC4 or VC5 from Capture Set #1 -
   together with an audio row it is capable of receiving, and would
   configure the chosen capture's encoding group accordingly.

10.2.  Two screen receiver configuring the example capture-side device
       above

   Mixing systems with an even number of screens, "2n", and those with
   "2n+1" cameras (and vice versa) is always likely to be the
   problematic case.  In this instance, the behavior is likely to be
   determined by whether a "2 screen" system is really a "2 decoder"
   system, i.e., whether only one received stream can be displayed per
   screen or whether more than 2 streams can be received and spread
   across the available screen area.  To enumerate 3 possible
   behaviors here for the 2 screen system when it learns that the far
   end is "ideally" expressed via 3 capture streams:

   1.  Fall back to receiving just a single stream (VC3, VC4 or VC5 as
       per the 1 screen receiver case above) and either leave one
       screen blank or use it for presentation if/when a presentation
       becomes active.

   2.  Receive 3 streams (VC0, VC1 and VC2) and display them across 2
       screens (either with each capture being scaled to 2/3 of a
       screen and the centre capture being split across 2 screens, or,
       as would be necessary if there were large bezels on the
       screens, with each stream being scaled to 1/2 the screen width
       and height and there being a 4th "blank" panel).  This 4th
       panel could potentially be used for any presentation that
       became active during the call.

   3.  Receive 3 streams, decode all 3, and use control information
       indicating which was the most active to switch between showing
       the left and centre streams (one per screen) and the centre and
       right streams.

   For an endpoint capable of all 3 methods of working described
   above, it might again be appropriate to offer the user the choice
   of display mode.

10.3.  Three screen receiver configuring the example capture-side
       device above

   This is the most straightforward case: the receiver would look to
   identify a set of streams to receive that best matches its
   available screens, and so VC0 plus VC1 plus VC2 would match
   optimally.  The spatial ordering would give sufficient information
   for the correct video capture to be shown on the correct screen,
   and the receiver would either need to divide a single encoding
   group's capability by 3 to determine what resolution and frame rate
   to configure the sender with, or to configure the individual video
   captures' encoding groups with what makes most sense (taking into
   account the receive side decode capabilities, the overall call
   bandwidth, the resolution of the screens, plus any user preferences
   such as motion vs. sharpness).

10.4.  Configuration of sender streams by a receiver

   After receiving a set of video capture information from a sender
   and making its choice of what media streams to receive - based on
   the receiver's own capabilities and any sender-side simultaneity
   restrictions - the receiver needs to essentially configure the
   sender to transmit the chosen set.

   The expectation is that this message will enumerate each of the
   encoding groups, and the potential encoders within those groups,
   that the receiver wishes to be active (this may well be a subset of
   the complete set available).  For each such encoder within an
   encoding group, the receiver would specify the video capture (i.e.,
   the VC<n>) to be encoded, along with the encode parameters to use.
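   As an illustration only, such a configure message might carry one
   entry per encoder the consumer wants active, as in the Python
   sketch below; the field names are assumptions of this sketch, not a
   normative format:

   # One entry per active encoder: the capture to encode and the
   # encode parameters, all within the limits advertised for that
   # encoder (see Section 7.2.1).
   configure = {
       "EG0.ENC0": {"capture": "VC0", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
       "EG1.ENC0": {"capture": "VC1", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
       "EG2.ENC0": {"capture": "VC2", "width": 1280, "height": 720,
                    "frameRate": 30, "bandwidth": 2000000},
   }
   # Encoders not listed (e.g. EG0.ENC1) simply remain inactive.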
Appendix A.  Attributes

   This section discusses the attributes and their values in more
   detail; many have additional details provided elsewhere in the
   draft.  In general, the way to extend the solution to handle new
   features is by adding attributes and/or values.

A.1.  Purpose

   A variable with enumerated values describing the purpose or role of
   the Media Capture.  It can be applied to any media type.  Possible
   values: main, presentation, others TBD.

A.1.1.  Main

   The audio or video capture is of one or more people participating
   in a conference (or of where they would be if they were there).  It
   is of part or all of the Capture Scene.

A.1.2.  Presentation

A.2.  Audio mixed

A.3.  Audio Channel Format

   The "channel format" attribute of an Audio Capture indicates how
   the meaning of the channels is determined.  It is an enumerated
   variable describing the type of audio channel or channels in the
   Audio Capture.  The possible values of the "channel format"
   attribute are:

   o  linear array (linear position)

   o  mono

   o  stereo

   o  TBD - other possible future values (to potentially include other
      things like 3.0, 3.1, 5.1 surround sound and binaural)

   All ACs in the same row of a Capture Set MUST have the same value
   of the "channel format" attribute.

A.3.1.  Linear Array

   An AC with channel format = "linear array" has exactly one audio
   channel.  For the "linear array" channel format there is another
   required attribute to specify position within the array: the
   "linear position" attribute, an integer value within the range 0 to
   100.  0 means leftmost and 100 means rightmost, with other values
   spaced equally between; a value of 50 means spatially in the
   center.  Any AC can have any value, and multiple ACs in a capture
   set row can even have the same value.  The 0-100 linear position is
   intentionally dimensionless, since we presume that receivers will
   use different sized video displays, and the audio spatial location
   can be adjusted at the receiving side to correspond to the
   displays.

   The linear position value is fixed until the receiver asks for a
   different AC from the capture set, which may be triggered by the
   provider sending an updated capture set.

   The streams being sent might be correlated (that is, someone
   talking might be heard in multiple captures from the same room).
   Echo cancellation and stream synchronization in receivers should
   take this into account.

   For example, with three audio channels representing left, center,
   and right:

   AC0 - channel format = linear array; linear position = 0

   AC1 - channel format = linear array; linear position = 50

   AC2 - channel format = linear array; linear position = 100
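   Since the linear position is dimensionless, the mapping onto a
   renderer's audio field is left to the receiver.  The Python sketch
   below shows one such mapping, onto a stereo pan range of -1.0 (full
   left) to +1.0 (full right); the mapping itself is an assumption of
   this sketch, not part of the framework:

   def pan_for(linear_position):
       # 0 -> -1.0 (leftmost), 50 -> 0.0 (center), 100 -> 1.0 (right)
       return (linear_position - 50) / 50.0

   A receiver would scale such values to whatever loudspeaker layout
   it has, so that the audio field covers roughly the same horizontal
   extent as the rendered video (see Appendix B).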
A.3.2.  Stereo

   An AC with channel format = "stereo" has exactly two audio
   channels, left and right, as part of the same AC.  [Edt: should we
   mention RFC 3551 here?  The channel format may be related to how
   Audio Captures are mapped to RTP streams.  This stereo is not the
   same as the effect produced from two mono ACs, one from the left
   and one from the right.]

A.3.3.  Mono

   An AC with channel format = "mono" has one audio channel.  This can
   be represented by an audio linear position with a single member at
   a single integer location.  [Edt: Mono can be represented as a
   particular case of linear array (n=1).]

A.4.  Audio Linear Position

   An integer valued variable from 0 to 100, where 0 signifies the
   left and 100 signifies the right.

A.5.  Video Scale

   An optional integer valued variable indicating the spatial scale of
   the video capture, for example in centimeters of horizontal image
   width.

A.6.  Video composed

   An optional Boolean variable indicating whether the VC is
   constructed by composing multiple other video captures together,
   i.e., whether the stream incorporates multiple composed panes.
   (This could indicate, for example, a continuous presence view of
   multiple images in a grid, or a large image with smaller
   picture-in-picture images in it.)

A.7.  Video Auto-switched

   A Boolean variable.  When true, the offered VC varies depending on
   some rule; it is auto-switched between possible VCs.  The most
   common example of this is sending the video capture associated with
   the "loudest" speaker according to an audio detection algorithm.

Appendix B.  Spatial Relationship

   Here is an example of a simple capture set with three video
   captures and three audio channels, each group in a separate row:

      (VC0, VC1, VC2)

      (AC0, AC1, AC2)

   The three ACs together in a row indicate those channels are
   spatially related to each other, and spatially related to the VCs
   in the same capture set.

   Multiple Media Captures of the same media type are often spatially
   related to each other.  Typically, multiple Video Captures should
   be rendered next to each other in a particular order, or multiple
   audio channels should be rendered to match different speakers in a
   particular way.  Also, media of different types are often
   associated with each other, for example a group of Video Captures
   can be associated with a group of Audio Captures, meaning they
   should be rendered together.

   Media Captures of the same media type are associated with each
   other by grouping them together in a single row of a Capture Set.
   Media Captures of different media types are associated with each
   other by putting them in different rows of the same Capture Set.

   For video, the spatial relationship is horizontal adjacency in one
   dimension, so Video Captures can be described as being adjacent to
   each other, in a horizontal row, ordered left to right.  When VCs
   are grouped together in a capture set row, it means they are
   horizontally adjacent to each other, such that when more than one
   of them is rendered together they should be rendered next to each
   other in the proper order.  The first VC in the group is the
   leftmost (from the point of view of a person looking at the
   rendered images), and so on towards the right.

   [Edt: Additional attributes can be added, such as the ability to
   handle a two dimensional array instead of just a one dimensional
   row of video images.]

   Audio Captures that are in the same Capture Set as Video Captures
   are spatially related to them, such that the multiple audio
   channels should be rendered so that the overall audio field covers
   roughly the same horizontal extent as the rendered video.  This
   gives a reasonable spatial correlation between audio and video.  A
   more exact relationship is out of scope of this framework.

B.1.  Spatial relationship of audio with video

   A row of audio is spatially related to a row of video in the same
   capture set.  The audio and video should be rendered such that they
   appear spatially coincident.  Audio with a linear position of 0
   corresponds to the leftmost side of the group of VCs in the same
   capture set, audio with a linear position of 50 corresponds to the
   center of the group of VCs, and audio with a linear position of 100
   corresponds to the rightmost side of the group of VCs.

   Likewise, for stereo audio, the spatial extent of the audio should
   be coincident with the spatial extent of the corresponding video.
Appendix C.  Capture sets for the MCU Case

   This section shows how an MCU might express its capture sets,
   intending to offer different choices for receivers that can handle
   different numbers of streams.  A single audio capture stream is
   provided for all single and multi-screen configurations; it can be
   associated (e.g., lip-synced) with any combination of video
   captures at the receiver.

   +--------------------+---------------------------------------------+
   | Capture Set #1     | note                                        |
   +--------------------+---------------------------------------------+
   | VC0                | video capture for single screen receiver    |
   | VC1, VC2           | video capture for 2 screen receiver         |
   | VC3, VC4, VC5      | video capture for 3 screen receiver         |
   | VC6, VC7, VC8, VC9 | video capture for 4 screen receiver         |
   | AC0                | audio capture representing all participants |
   +--------------------+---------------------------------------------+

   If/when a presentation stream becomes active within the conference,
   the MCU might re-advertise the available media as:

   +----------------+--------------------------------------+
   | Capture Set #2 | note                                 |
   +----------------+--------------------------------------+
   | VC10           | video capture for presentation       |
   | AC1            | presentation audio to accompany VC10 |
   +----------------+--------------------------------------+

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Mark Duckworth
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Andrew Pepperell
   Cisco Systems
   Langley, England
   UK

   Email: apeppere@cisco.com

   Brian Baldino
   Cisco Systems
   San Jose, CA 95134
   US

   Email: bbaldino@cisco.com

   Mark Gorzynski
   HP Visual Collaboration
   Corvallis, OR
   USA

   Email: mark.gorzynski@hp.com