CLUE WG                                                       A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: July 16, 2011                                           Polycom
                                                        January 12, 2011


           Problem Statement for Telepresence Multi-streams
          draft-romanow-clue-telepresence-prob-statement-00.txt

Abstract

   Telepresence systems create a "being there" conferencing experience.
   A number of issues need to be solved largely by manipulating
   multiple audio and video streams.  Different systems take different
   approaches, employ different techniques, and convey information by
   using different vocabularies, making interoperability extremely
   challenging.  This problem statement describes the typical issues
   that must be solved and uses examples to illustrate the kind of
   diversity that makes interworking problematic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 16, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Fundamental Issues for Telepresence
   4.  Manipulating Media Streams
   5.  Examples of Interworking Issues
     5.1.  Designating Roles and Positions for Transmitted Streams
     5.2.  Multipoint
     5.3.  Capability Negotiation
     5.4.  Differences in Media Characteristics
       5.4.1.  Aspect Ratio
       5.4.2.  Visual Scale
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Informative References
   Authors' Addresses

1.  Introduction

   In a telepresence conference, the idea is to create a feeling of
   presence - that you are in the same room with the remote parties.
   In order to create the "being there" or telepresence experience, a
   number of technical issues need to be solved.  These issues are
   addressed by manipulating multiple media streams, video and audio -
   by describing them, controlling them, and signaling about them.
   The fundamental features of telepresence require handling multiple
   streams of media, and considering additional characteristics of
   those streams beyond those normally specified in existing
   videoconferencing standards.

   Different telepresence systems approach solving the basic issues
   differently.  They use disparate techniques, and they describe,
   control and signal media in dissimilar fashions.  Such diversity
   creates an interoperability problem.  The same issues are solved in
   different ways by different systems, so that they are not directly
   interoperable.  This makes interworking difficult at best and
   sometimes impossible.

   Some degree of interworking is possible through transcoding and
   translation.  This requires additional devices, which are expensive
   and not entirely automatic.  Specialized knowledge is required to
   operate a telepresence conference where the endpoints use different
   equipment and a transcoding and translating device is employed for
   interoperability.  Often such conferences are interrupted by
   difficulties that arise.

   The general problem that needs to be solved is this.  The
   transmitting side sends audio and video streams based upon a model
   for rendering a realistic depiction from this information.  If the
   receiving side belongs to the same vendor, it works with the same
   model and renders the information according to that shared model.
   However, if the receiver and the sender are from different vendors,
   the models they each have for rendering presence differ.

   It is as if Alice and Bob are at different sites.
   Alice needs to
   tell Bob information about what her camera and sound equipment see
   at her site so that Bob's receiver can create a display that will
   capture the important characteristics of her site.  Alice and Bob
   need to agree on what the salient characteristics are, as well as
   how to represent and communicate them.  The telepresence
   multi-stream work seeks to describe the sender's situation in a way
   that allows the receiver to render it realistically even though it
   may have a different rendering model than the sender.

   This problem statement identifies the fundamental issues that need
   to be addressed to provide telepresence in typical use case
   scenarios.  We show how different approaches to solving the problems
   and different techniques for handling multiple media create a
   challenge for interoperability.

   This document describes some of the problems that arise.  It is not
   a complete list; it is illustrative rather than exhaustive.
   Requirements, use cases and solutions are discussed in other
   documents.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Fundamental Issues for Telepresence

   The fundamental issues that must be handled to produce a typical
   telepresence conference, either point to point or multipoint,
   include:

   1.  Participant display

       A.  Placement of video
       B.  Size
       C.  Angle
       D.  Overlap
       E.  Display technology

   2.  Audio

       A.  Placement, emanating from the right place
       B.  Type of audio

   3.  Different number of screens on sender and receiver sides

   4.  Participant display for multipoint

       A.  Placement of video
       B.  Continuous presence
       C.  Control of display - how does it change? (automatic, user)

   5.  Maintaining eye contact and gaze connection

   6.  Panoramic view for site switching

   7.  Mismatches in media characteristics between sender and
       receiver, such as:

       A.  Aspect ratio
       B.  Format
       C.  Frame rate
       D.  Resolution

   8.  Presentation

       A.  What methodology?

   9.  Security

       A.  SRTP?
       B.  Key methodology

4.  Manipulating Media Streams

   In addressing the fundamental issues, multiple media streams are
   handled in the following ways:

   1.  Sender and receiver understand each other's capabilities (see
       the sketch after this list)

       A.  Number of video, audio and presentation streams that can be
           sent/received simultaneously
       B.  Which media signaling protocol is being used (SDP,
           proprietary, etc.)

   2.  Streaming control

   3.  Feedback mechanisms

   4.  Signaling about RTP payload

   5.  Media control signaling

       A.  Video refresh
       B.  Flow control

   6.  Signaling media formats and media capabilities

   7.  Signaling content type

   8.  Signaling device type

   9.  Signaling network characteristics per stream

   10. Floor control signaling
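   As a purely illustrative sketch - not a proposal, and not drawn
   from any existing protocol or data model - the per-stream
   information in item 1 above might be collected into a structure
   such as the following; all names here are hypothetical:

      # Purely illustrative sketch; the field names are hypothetical
      # and are not taken from any standard or existing system.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class StreamCapability:
          media_type: str         # "video", "audio", "presentation"
          max_simultaneous: int   # streams sendable/receivable at once
          signaling: str          # e.g. "SDP" or "proprietary"
          formats: List[str] = field(default_factory=list)

      @dataclass
      class EndpointCapabilities:
          streams: List[StreamCapability]
          floor_control: bool = False

      # Example: a three-camera endpoint advertising its capabilities.
      example = EndpointCapabilities(streams=[
          StreamCapability("video", 3, "SDP", ["H.264"]),
          StreamCapability("audio", 3, "SDP", ["AAC-LD"]),
      ])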
5.  Examples of Interworking Issues

   This section describes several examples that illustrate the kinds
   of incompatibilities that arise when different systems take
   different approaches to an issue.

5.1.  Designating Roles and Positions for Transmitted Streams

   Senders and receivers need to have the same vocabulary and
   understanding of stream roles and positions in order to place them
   appropriately.  For example, one system may define roles as:
   center, left, right, legacy center, legacy right, legacy left,
   auxiliary 1/5 fps, and auxiliary 30 fps positions.  These roles as
   defined are a combination of "input devices" + "codec type/format"
   for transmission positions, and a combination of "stream
   decoders/output devices" + "codec type/format" for receive
   positions.  Another system will not have the exact same vocabulary
   and meaning, though it still has to accomplish the same placement
   task.

   How the cameras and encoders are wired determines how the local
   scene is displayed on the remote screen.  In many systems right and
   left need to be exchanged to be seen properly, but this depends on
   the way the equipment is wired.

   In describing how to display the local scene, the language can be
   misleading if there is no agreed upon reference for right and left.
   [for example, more]

   Although often the video is displayed on separate monitors, it is
   also possible to use projectors to create a video wall.  In this
   case, there may be an overlap region between cameras which allows
   for projector blending.  Also, although cameras are generally
   arranged to create a seamless panoramic view of the participants,
   it is also possible for there to be gaps between cameras (and
   corresponding gaps between displays).

   There is also no reference for image size.  Some rooms use
   proportionally larger displays, and set the camera field of view to
   show participants either standing or sitting at life size.  Others
   use smaller displays, and set the field of view for sitting
   participants (cropping off heads when people stand).  In order to
   preserve full size display when these systems interoperate, both
   systems must rescale their video.
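   A minimal sketch of this scale mismatch, assuming hypothetical room
   measurements (none of these numbers come from a real system):

      # Hypothetical scale check; the widths are illustrative numbers,
      # not measurements from any actual system.
      def fitted_scale(sender_scene_width_m: float,
                       receiver_display_width_m: float) -> float:
          """Magnification when the full camera scene fills the display.

          A value of 1.0 means participants appear at life size; below
          1.0 they appear shrunken unless the receiver crops instead.
          """
          return receiver_display_width_m / sender_scene_width_m

      # A 3.0 m wide camera view fitted onto a 1.5 m wide display shows
      # people at half size; preserving full size requires cropping to
      # the central 1.5 m of the scene instead of scaling down.
      print(fitted_scale(3.0, 1.5))   # 0.5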
5.2.  Multipoint

   Multipoint conferences, where there are more than two endpoints,
   create a wealth of technical issues to be solved.  The primary one
   is which participants to display on each screen at each site.  If
   the number of sites is greater than can be shown on the number of
   displays at a site, this adds to the complexity.  There are, of
   course, almost unlimited ways this can be handled.  We discuss the
   common approaches and how they differ.

   The local screens can show all the camera images from a particular
   remote site (site switching); or each local screen can show a
   participant or two from each of the remote sites (segment
   switching); or local displays can show a composite of remote camera
   shots (continuous presence).

   The choice of whom to display on a screen can be determined
   statically, by users, or automatically according to some policy,
   such as voice activity level.

   [Add user-controlled personal telepresence scenario.]

   Policies are created and implemented in many ways.  They tend to be
   based on some combination of what H.323 defines as centralized and
   decentralized.  One of the challenges is that the endpoints in the
   conference may have different numbers of cameras and displays from
   each other, so a common understanding of the number of streams and
   their priority is required.  Also, the various endpoints might have
   different bandwidth constraints and support different codec
   profiles.

   A centralized multipoint conference is one in which all
   participating endpoints communicate in a point-to-point fashion
   with an MCU.  The endpoints transmit their control, audio, video,
   and/or data streams to the MCU.  The MCU centrally manages the
   conference, processes the audio, video and/or data streams, and
   returns the processed streams to each endpoint.  In this mode, the
   MCU will mix the audio streams; and if using centralized video, it
   will either use voice-activated video switching, where everyone
   sees the active speaker and the speaker sees the previous speaker,
   or continuous presence mode, where the MCU creates a video stream
   with sub-windows for each of the participants.  MCUs can support
   multiple video layouts, and these can be created automatically
   based on the number of participants or by a conference management
   application.

   There are three methods commonly used for video stream distribution
   in centralized multipoint conferences.  The three conference
   policies above can be implemented using any of these technologies.

   Simple video switching (forwarding) has the advantage of low
   latency and low complexity.  It can be used if all systems are
   capable of receiving the encodings used by the sending endpoints
   (including both the video codec and the image resolution/aspect
   ratio).  In some situations it can be wasteful of bandwidth.

   Full video transcoding usually has higher latency than switching.
   It does not require systems to be capable of receiving identical
   encodings, and different sites can connect with different
   bandwidths.

   Layered video encoding combines some of the benefits of video
   switching and video transcoding.  It is more complex than video
   switching, but less complex than video transcoding.  Bandwidth and
   resolution can be reduced for each site.  Since this is done by
   filtering out layers of the original encoding, the available
   bandwidths and resolutions are not as fine-grained as with full
   video transcoding.

   In decentralized mode, or full mesh mode, each endpoint composes
   its own display.  This requires each endpoint to receive multiple
   streams and send its video and audio to all participants, using
   multicast or unicast.

   In practice, multicast is not currently used in commercial systems,
   so the size of a strictly decentralized multipoint conference is
   limited.

   There are analogous issues for audio.  Like the video, the audio
   can be reversed, so there is no clarity on the meaning of left and
   right.  Since the number of streams, microphones, and speakers are
   not matched, the systems need to re-process the received audio in
   order to create the correct sound field for their respective rooms.

   There are two ways in which the audio might be handled in this use
   case:

   o  A single stereo audio stream is sent to the remote site, just as
      in standard videoconferencing.

   o  Three monaural audio streams are sent to the remote site, with
      proprietary signaling to associate each audio stream with a
      video stream.

   Microphone and speaker positions vary, and there is no agreed upon
   way to describe their placement.  There is no agreed upon reference
   for audio level.  In addition, audio may be sent as an independent
   stream from each microphone or as a multi-channel stream.
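   The following sketch illustrates the kind of re-processing
   described above, assuming three received mono streams and a
   receiving room with only two loudspeakers; the gain values are
   illustrative, not a recommended design:

      import math
      from typing import Dict, Tuple

      # Illustrative downmix of three received mono streams onto two
      # loudspeakers; the gains below are a sketch, not a standard.
      DOWNMIX: Dict[str, Tuple[float, float]] = {
          # stream: (gain to left speaker, gain to right speaker)
          "left":   (1.0, 0.0),
          "center": (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)),
          "right":  (0.0, 1.0),
      }

      def mix(samples: Dict[str, float]) -> Tuple[float, float]:
          """Produce one (left, right) output sample from mono inputs."""
          left = sum(DOWNMIX[s][0] * v for s, v in samples.items())
          right = sum(DOWNMIX[s][1] * v for s, v in samples.items())
          return (left, right)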
5.3.  Capability Negotiation

   Call setup for the telepresence conference will start with a single
   call establishing one video media stream.  After the connection is
   established, a proprietary capability negotiation takes place that
   enables both sides to identify that they are telepresence
   applications capable of having two or more video sessions, and
   provides the connectivity information.  The result is that two or
   more video sessions are established.  The system may use two new
   SIP call legs or just add the two new video streams to the existing
   dialog.

   [more to be added]
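   One plausible model of the outcome of such a negotiation - an
   assumption for illustration, not the actual proprietary exchange -
   is that each side advertises its maximum number of simultaneous
   video streams and the call uses the smaller value:

      # Sketch of one possible negotiation outcome; the function name
      # and the min() model are hypothetical.
      def negotiated_video_streams(local_max: int,
                                   remote_max: int) -> int:
          """Both sides must handle every stream, so take the minimum."""
          return min(local_max, remote_max)

      # e.g. a 3-screen system calling a 2-screen system:
      assert negotiated_video_streams(3, 2) == 2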
5.4.  Differences in Media Characteristics

   Media characteristics such as video format, aspect ratio, and
   visual scale can be handled differently at different sites,
   creating incompatibility.  To interwork, an adaptive strategy is
   necessary.  Although differences in media characteristics must also
   be handled in a typical video conference, the problem is made more
   complex in telepresence due to the multiple screens, cameras and
   streams.

   Two examples - aspect ratio and visual scale - are described here.

5.4.1.  Aspect Ratio

   If the aspect ratios in different sites are not the same, some
   technique needs to be applied to adjust for the difference.
   Although the same situation arises in normal video conferencing,
   the multiple streams in telepresence conferencing cause more
   difficulties.

   For simplicity let us assume a point to point case - two conference
   rooms on a point to point call.  Both rooms have 3 screens and 3
   cameras, as in Section 5.1 above.  Both rooms have identical visual
   scale - the display width and the distance between the participants
   and the displays are identical in both rooms.  However, the
   equipment - cameras and displays - in each room has a different
   aspect ratio: 16:9 in one room and 4:3 in the other.

   Although 4:3 is usually associated with standard definition TV and
   16:9 with HDTV, telepresence systems may choose the aspect ratio to
   obtain a particular field of view.  Projecting images in the 16:9
   aspect ratio offers a wider presentation angle that shows fine
   details well (the pixel density is greater than a 4:3 system of the
   same resolution and scale).  In the room with 16:9 media
   characteristics, people are shown at full size when they are
   seated.  However, when they stand up, the height of the display
   results in their image being cropped so that their heads are not
   shown.  The other room uses projectors to display HD images with
   4:3 aspect ratios.  This results in an increased image height - the
   vertical field of view is 33% greater than the 16:9 system.  The
   increased height allows most of the population to be shown full
   size whether they are standing or sitting.

   Some strategy is necessary to deal with the case of the two sites
   having a point to point call.  In order to convert formats of
   unequal ratios, a variety of techniques can be used, such as:
   zooming (enlarging) and cropping (removing), letterboxing (adding
   horizontal bars), pillarboxing (adding vertical bars) to retain the
   original format's aspect ratio, or scaling (which distorts) in a
   variety of ways.

   For the video sent from the 4:3 room to the 16:9 room, several
   techniques can be used:

   1.  The 16:9 system might simply crop the top 1/4 of each 4:3
       image.  This will result in full size display, eye contact, and
       gaze awareness for the individuals who are seated.  However,
       the standing presenter's head will be cropped.

   2.  The 16:9 system might stretch each of the 4:3 images to fully
       fit the 16:9 display.  This would reduce image height (creating
       geometric distortion) and create eye-contact error.  Continuity
       of the panoramic image would be preserved.

   3.  The 16:9 system could pillarbox each of the 4:3 images, placing
       vertical borders on the three displays.  This results in
       reducing the image size to less than full size.  It also
       destroys the continuity of the panoramic image, and introduces
       additional error in eye contact and gaze awareness.

   4.  The 16:9 system could pillarbox only the center display.  This
       reduces the size of the presenter, who is the focus of the
       meeting.

   5.  The 16:9 system could also crop the bottom of the center
       display.  Visually this reduces the height of the presenter,
       but maintains full size.  There is a vertical discontinuity in
       the panoramic image.  Whether this is objectionable or not
       depends on the room layout.

   Strategies 4 and 5 could be accomplished in response to a user
   command or automatically.  The details will be discussed in future
   documents.

   For the video sent from the 16:9 room to the 4:3 room, the
   receiving system simply letterboxes the video.  Since the scales
   are identical, the image is displayed at full size in the 4:3 room.
   When letterboxing, the common techniques for placing the border
   are:

   1.  The 4:3 system places the border above the image.  This
       maintains eye contact for those who are seated, but cannot
       maintain eye contact for the presenter.

   2.  The 4:3 system places the border below the images.  If the 16:9
       system crops the bottom of the center display, then this will
       maintain eye contact for the presenter and the remote site.

   3.  The 4:3 system centers the images.  Eye contact suffers for
       everyone, but the worst case eye contact error is better
       controlled.

   In this use case, negotiation between the systems is not strictly
   necessary, no matter which scheme is used.  However, the best user
   experience is obtained if both systems have knowledge about the
   aspect ratios being used and which participants are standing and
   which are sitting, so they can adjust optimally.
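   The fractions used in the techniques above follow directly from the
   two aspect ratios; the following short check of the arithmetic is
   illustrative only:

      # Worked arithmetic for the conversions described above; the
      # figures follow directly from the two aspect ratios.
      from fractions import Fraction

      ar_43 = Fraction(4, 3)
      ar_169 = Fraction(16, 9)

      # Vertical field of view of 4:3 relative to 16:9 at equal width:
      extra_height = ar_169 / ar_43 - 1
      print(f"4:3 shows {float(extra_height):.0%} more height")  # 33%

      # Cropping 4:3 down to 16:9 at equal width removes this fraction
      # of the image height (the "top 1/4" in technique 1 above):
      cropped = 1 - ar_43 / ar_169
      print(f"crop {float(cropped):.0%} of the image height")    # 25%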
5.4.2.  Visual Scale

   The visual scale of displays may differ between sites.  Again, let
   us use the point to point case as a simple example.  Assume two
   conference rooms in a point to point call.  One room is designed
   for 6 participants, and has three 16:9 screens and 3 cameras.  This
   room is designed to show participants at their normal size when
   seated (2 participants per camera/display).  It does not have
   adequate display height to capture those who are standing.  The
   second room is also designed for 6 participants, but shows 3
   participants per camera/display, also at their full size.
   Therefore, it only needs two 16:9 camera/display pairs.  Since the
   field of view in both the vertical and horizontal is increased by
   50%, it also shows those who are standing without cropping.

   For the video sent from the 2 screen (larger scale) room to the 3
   screen (smaller scale) room, two approaches can be used:

   1.  The 3 screen system might simply show the participants on two
       of its displays.  Participants will be shown at 67% of their
       full size.  Eye contact and gaze awareness will be lost.

   2.  The 3 screen system might construct and display a vertically
       cropped 3-screen view, showing 2 participants on each screen.
       Participants will be shown at full size, with preservation of
       eye contact and gaze awareness.

   For the video sent from the 3 screen to the 2 screen room, there
   are two analogous approaches:

   1.  The 2 screen system selects 2 streams and simply shows them on
       its displays.  Participants will be shown at 150% of their
       normal size.  Eye contact and gaze awareness will be lost, and
       some of the remote site is lost.

   2.  The 2 screen system might construct and display a 2 screen view
       (with a horizontal border at the top) which shows 3
       participants on each screen.  Participants will be shown at
       full size, with preservation of eye contact and gaze awareness.

   Although there is no need for negotiation between the systems, the
   best user experience is obtained if both systems have knowledge of
   the visual scale and where individuals are seated, and can then
   choose the best manner of display.

6.  IANA Considerations

   This document contains no IANA considerations.

7.  Security Considerations

   While there are likely to be security considerations for any
   solution for telepresence interoperability, this document has no
   security considerations.

8.  Acknowledgements

   The draft has benefitted from input from a number of people
   including Roni Even, Jim Cole, Nermeen Ismail, and Nathan Buckles.

9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA  95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA  01810
   US

   Email: stephen.botzko@polycom.com