DISPATCH WG                                                   A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: January 13, 2011                                        Polycom
                                                           July 12, 2010

           Problem Statement for Telepresence Multi-streams
        draft-romanow-dispatch-telepresence-prob-statement-01.txt

Abstract

   Telepresence systems create a "being there" conferencing experience.
   A number of issues need to be solved, largely by manipulating
   multiple audio and video streams.  Different systems take different
   approaches, employ different techniques, and convey information
   using different vocabularies, making interoperability extremely
   challenging.  This problem statement describes the typical issues
   that must be solved and uses examples to illustrate the kind of
   diversity that makes interworking problematic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Fundamental Issues for Telepresence
   4.  Manipulating Media Streams
   5.  Examples of Interworking Issues
     5.1.  Designating Roles and Positions for Transmitted Streams
     5.2.  Multipoint
     5.3.  Capability Negotiation
     5.4.  Differences in Media Characteristics
       5.4.1.  Aspect Ratio
       5.4.2.  Visual Scale
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Informative References
   Authors' Addresses

1.  Introduction

   In a telepresence conference, the idea is to create a feeling of
   presence - that you are in the same room with the remote parties.
   In order to create the "being there" or telepresence experience, a
   number of technical issues need to be solved.  These issues are
   addressed by manipulating multiple media streams, video and audio -
   by describing them, controlling them, and signaling about them.
   The fundamental features of telepresence require handling multiple
   streams of media and considering additional characteristics of
   those streams beyond those normally specified in existing
   videoconferencing standards.

   Different telepresence systems approach the basic issues
   differently.  They use disparate techniques, and they describe,
   control, and signal media in dissimilar fashions.  Such diversity
   creates an interoperability problem: because the same issues are
   solved in different ways by different systems, the systems are not
   directly interoperable.  This makes interworking difficult at best
   and sometimes impossible.

   Some degree of interworking is possible through transcoding and
   translation.  This requires additional devices, which are expensive
   and not entirely automatic.  Specialized knowledge is required to
   operate a telepresence conference in which the endpoints use
   different equipment and a transcoding and translating device is
   employed for interoperability.  Often such conferences are
   interrupted by difficulties that arise.

   The general problem that needs to be solved is this: the
   transmitting side sends audio and video streams based upon a model
   for rendering a realistic depiction from this information.  If the
   receiving side comes from the same vendor, it works with the same
   model and renders the information according to that shared model.
   However, if the receiver and the sender are from different vendors,
   the models they each have for rendering presence differ.
   It is as if Alice and Bob are at different sites.  Alice needs to
   tell Bob what her cameras and sound equipment capture at her site
   so that Bob's receiver can create a display that reproduces the
   important characteristics of her site.  Alice and Bob need to agree
   on what the salient characteristics are, as well as on how to
   represent and communicate them.  The telepresence multi-stream work
   seeks to describe the sender's situation in a way that allows the
   receiver to render it realistically, even though the receiver may
   have a different rendering model than the sender.

   This problem statement identifies the fundamental issues that need
   to be addressed to provide telepresence in typical use case
   scenarios.  We show how different approaches to solving the
   problems and different techniques for handling multiple media
   create a challenge for interoperability.

   This document describes some of the problems that arise; it is not
   a complete list, being illustrative rather than exhaustive.
   Requirements, use cases, and solutions are discussed in other
   documents.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Fundamental Issues for Telepresence

   The fundamental issues that must be handled to produce a typical
   telepresence conference, either point-to-point or multipoint,
   include:

   1.  Participant display

       A.  Placement of video

       B.  Size

       C.  Angle

       D.  Overlap

       E.  Display technology

   2.  Audio

       A.  Placement, so sound emanates from the right place

       B.  Type of audio

   3.  Different numbers of screens on the sender and receiver sides

   4.  Participant display for multipoint

       A.  Placement of video

       B.  Continuous presence

       C.  Control of display: how does it change?  Automatic or
           user-controlled

   5.  Maintaining eye contact and gaze connection

   6.  Panoramic view for site switching

   7.  Mismatches in media characteristics between sender and
       receiver, such as:

       A.  Aspect ratio

       B.  Format

       C.  Frame rate

       D.  Resolution

   8.  Presentation

       A.  What methodology?

   9.  Security

       A.  SRTP?

       B.  Key methodology

4.  Manipulating Media Streams

   In addressing the fundamental issues, multiple media streams are
   handled in the following ways (a sketch of how such per-stream
   information might be represented follows the list):

   1.   Sender and receiver understand each other's capabilities

        A.  Number of video, audio, and presentation streams that can
            be sent/received simultaneously

        B.  Which media signaling protocol is being used (SDP,
            proprietary, etc.)

   2.   Streaming control

   3.   Feedback mechanisms

   4.   Signaling about RTP payload

   5.   Media control signaling

        A.  Video refresh

        B.  Flow control

   6.   Signaling media formats and media capabilities

   7.   Signaling content type

   8.   Signaling device type

   9.   Signaling network characteristics per stream

   10.  Floor control signaling
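   As an illustration of the preceding list, the following Python
   sketch shows the kind of per-stream description and endpoint
   capability set that the items above imply.  It is a minimal sketch
   for exposition only; all class and field names are hypothetical and
   are not taken from any standard or product.

      from dataclasses import dataclass, field

      @dataclass
      class StreamDescription:
          media_type: str      # "video", "audio", or "presentation"
          codec: str           # media format, e.g. "H.264"
          content_type: str    # e.g. "participants", "presentation"
          role: str            # placement role, e.g. "center"
          resolution: tuple    # (width, height) in pixels
          frame_rate: float    # frames per second
          bandwidth_kbps: int  # per-stream network constraint

      @dataclass
      class EndpointCapabilities:
          max_video_streams: int
          max_audio_streams: int
          max_presentation_streams: int
          signaling_protocol: str          # "SDP", proprietary, etc.
          streams: list = field(default_factory=list)

   Today no such common representation exists; each vendor captures
   roughly this information in its own form, which is the root of the
   interoperability problem illustrated next.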
5.  Examples of Interworking Issues

   This section describes several examples that illustrate the kinds
   of incompatibilities that arise when different systems take
   different approaches to an issue.

5.1.  Designating Roles and Positions for Transmitted Streams

   Senders and receivers need to have the same vocabulary and
   understanding of stream roles and positions in order to place the
   streams appropriately.  For example, one system may define roles
   as: center, left, right, legacy center, legacy right, legacy left,
   auxiliary 1/5 fps, and auxiliary 30 fps positions.  These roles as
   defined are a combination of "input devices" + "codec type/format"
   for transmission positions, and a combination of "stream decoders/
   output devices" + "codec type/format" for receive positions.
   Another system will not have the exact same vocabulary and meaning,
   though it still has to accomplish the same placement task; a sketch
   of the resulting mapping problem appears at the end of this
   section.

   How the cameras and encoders are wired determines how the local
   scene is displayed on the remote screen.  In many systems right and
   left need to be exchanged to be seen properly, but this depends on
   the way the equipment is wired.

   In describing how to display the local scene, the language can be
   misleading if there is no agreed upon reference for right and left.
   [for example, more]

   Although the video is often displayed on separate monitors, it is
   also possible to use projectors to create a video wall.  In this
   case, there may be an overlap region between cameras that allows
   for projector blending.  Also, although cameras are generally
   arranged to create a seamless panoramic view of the participants,
   it is also possible for there to be gaps between cameras (and
   corresponding gaps between displays).

   There is also no reference for image size.  Some rooms use
   proportionally larger displays and set the camera field of view to
   show participants at life size, whether standing or sitting.
   Others use smaller displays and set the field of view for sitting
   participants (cropping off heads when people stand).  In order to
   preserve full size display when these systems interoperate, both
   systems must rescale their video.
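   The following sketch makes the vocabulary mismatch concrete.  The
   role names for "vendor B" are invented for illustration; in
   practice neither the names nor the mapping are published, and a
   mapping like this would be needed for every pair of vendors.

      # Hypothetical mapping from one vendor's role vocabulary to
      # another's.  Neither vocabulary is taken from a real product.
      VENDOR_A_TO_B = {
          "center": "main",
          "left": "wide_1",
          "right": "wide_2",
          "auxiliary 30 fps": "content_video",
          "auxiliary 1/5 fps": "content_still",
      }

      def translate_role(role_a: str) -> str:
          """Map vendor A's stream role to vendor B's nearest
          equivalent."""
          try:
              return VENDOR_A_TO_B[role_a]
          except KeyError:
              # Roles such as "legacy center" may have no counterpart
              # at all, so placement information is simply lost.
              raise ValueError(f"no equivalent for role {role_a!r}")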
5.2.  Multipoint

   Multipoint conferences, in which there are more than two endpoints,
   create a wealth of technical issues to be solved.  The primary one
   is which participants to display on each screen at each site.  If
   the number of sites is greater than can be shown on the number of
   displays at a site, this adds to the complexity.  There are, of
   course, almost unlimited ways this can be handled.  We discuss the
   common approaches and how they differ.

   The local screens can show all the camera images from a particular
   remote site (site switching); or each local screen can show a
   participant or two from each of the remote sites (segment
   switching); or the local displays can show a composite of remote
   camera shots (continuous presence).  The choice of whom to display
   on a screen can be made by users or, more often, automated
   according to voice activity level.

   [Add user-controlled personal telepresence scenario.]

   Policies are created and implemented in many ways.  They tend to be
   based on some combination of what H.323 defines as centralized and
   decentralized conferencing.  One of the challenges is that the
   endpoints in the conference may have different numbers of cameras
   and displays from each other, so agreement on a common mode
   covering the number of streams and their priority is required.
   Also, the various endpoints might have different bandwidth
   constraints and support different codec profiles.

   A centralized multipoint conference is one in which all
   participating endpoints communicate in a point-to-point fashion
   with an MCU.  The endpoints transmit their control, audio, video,
   and/or data streams to the MCU.  The MCU centrally manages the
   conference, processes the audio, video, and/or data streams, and
   returns the processed streams to each endpoint.  In this mode, the
   MCU mixes the audio streams; if using centralized video, it will
   either use voice-activated video switching, in which everyone sees
   the active speaker and the speaker sees the previous speaker, or
   continuous presence mode, in which the MCU creates a video stream
   with sub-windows for each of the participants.  MCUs can support
   multiple video layouts, which can be created automatically based on
   the number of participants or by a conference management
   application.

   There are three methods commonly used for video stream distribution
   in centralized multipoint conferences.  The three conference
   policies above can be implemented using any of these technologies.

   Simple video switching (forwarding) has the advantage of low
   latency and low complexity.  It can be used if all systems are
   capable of receiving the encodings used by the sending endpoints
   (including both the video codec and the image resolution/aspect
   ratio).  In some situations it can be wasteful of bandwidth.

   Full video transcoding usually has higher latency than switching.
   It does not require systems to be capable of receiving identical
   encodings, and different sites can connect with different
   bandwidths.

   Layered video encoding combines some of the benefits of video
   switching and video transcoding.  It is more complex than video
   switching, but less complex than video transcoding.  Bandwidth and
   resolution can be reduced for each site.  Since this is done by
   filtering out layers of the original encoding, the available
   bandwidths and resolutions are not as fine-grained as with full
   video transcoding.

   In decentralized (full mesh) mode, each endpoint composes its own
   display.  This requires each endpoint to receive multiple streams
   and to send its video and audio to all participants, using
   multicast or unicast.

   In practice, multicast is not currently used in commercial systems,
   so the size of a strictly decentralized multipoint conference is
   limited.

   There are analogous issues for audio.  As with video, right and
   left may need to be exchanged, so there is no clarity on the
   meaning of left and right.  Since the numbers of streams,
   microphones, and speakers are not matched, the systems need to
   re-process the received audio in order to create the correct sound
   field for their respective rooms.

   There are two ways in which the audio might be handled in this use
   case:

   o  A single stereo audio stream is sent to the remote site, just as
      in standard videoconferencing.

   o  Three monaural audio streams are sent to the remote site, with
      proprietary signaling to associate each audio stream with a
      video stream.

   Microphone and speaker positions vary, and there is no agreed upon
   way to describe their placement.  There is no agreed upon reference
   for audio level.  In addition, audio may be sent as an independent
   stream from each microphone or as a multi-channel stream.
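   The second bullet above hides a pairing problem: the receiver must
   know which audio stream belongs with which video stream.  The
   following sketch shows the association step under the assumption
   that both sides share position labels; today those labels, and the
   signaling that carries them, are proprietary, so the assumption
   does not hold between vendors.

      # Position labels ("left", "center", "right") are assumed to be
      # shared by both sides; in practice they are vendor-specific.
      audio_streams = {"a1": "left", "a2": "center", "a3": "right"}
      video_streams = {"v1": "left", "v2": "center", "v3": "right"}

      def pair_streams(audio, video):
          """Pair each audio stream with the video stream at the same
          position, so the sound field matches the image."""
          pairs = []
          for a_id, a_pos in audio.items():
              match = next((v_id for v_id, v_pos in video.items()
                            if v_pos == a_pos), None)
              if match is None:
                  # With no shared reference for left and right, an
                  # interworking receiver cannot make this match.
                  continue
              pairs.append((a_id, match))
          return pairs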
5.3.  Capability Negotiation

   Call setup for the telepresence conference starts with a single
   call establishing one video media stream.  After the connection is
   established, a proprietary capability negotiation takes place that
   enables both sides to identify that they are telepresence
   applications capable of carrying two or more video sessions, and to
   provide the connectivity information.  The result is that two or
   more video sessions are established.  The system may use new SIP
   call legs or simply add the new video streams to the existing
   dialog.

   [more to be added]

5.4.  Differences in Media Characteristics

   Media characteristics such as video format, aspect ratio, and
   visual scale can be handled differently at different sites,
   creating incompatibility.  To interwork, an adaptive strategy is
   necessary.  Although differences in media characteristics must also
   be handled in a typical video conference, the problem is made more
   complex in telepresence by the multiple screens, cameras, and
   streams.

   Two examples - aspect ratio and visual scale - are described here.

5.4.1.  Aspect Ratio

   If the aspect ratios at different sites are not the same, some
   technique needs to be applied to adjust for the difference.
   Although the same situation arises in normal video conferencing,
   the multiple streams in telepresence conferencing cause more
   difficulties.

   For simplicity, let us assume a point-to-point case - two
   conference rooms in a point-to-point call.  Both rooms have 3
   screens and 3 cameras, as in the examples above.  Both rooms have
   identical visual scale - the display width and the distance between
   the participants and the displays are identical in both rooms.
   However, the equipment - cameras and displays - in each room has a
   different aspect ratio: 16:9 in one room and 4:3 in the other.

   Although 4:3 is usually associated with standard definition TV and
   16:9 with HDTV, telepresence systems may choose the aspect ratio to
   obtain a particular field of view.  Projecting images in the 16:9
   aspect ratio offers a wider presentation angle that shows fine
   details well (the pixel density is greater than in a 4:3 system of
   the same resolution and scale).  In the room with 16:9 equipment,
   people are shown at full size when they are seated.  However, when
   they stand up, the height of the display results in their image
   being cropped so that their heads are not shown.  The other room
   uses projectors to display HD images with 4:3 aspect ratios.  This
   results in an increased image height - the vertical field of view
   is 33% greater than in the 16:9 system.  The increased height
   allows most of the population to be shown full size whether they
   are standing or sitting.

   Some strategy is necessary to deal with the case of these two sites
   having a point-to-point call.  In order to convert formats of
   unequal ratios, a variety of techniques can be used, such as:
   zooming (enlarging) and cropping (removing), letterboxing (adding
   horizontal bars) or pillarboxing (adding vertical bars) to retain
   the original format's aspect ratio, or scaling (which distorts) in
   a variety of ways.
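   The fractions involved in these techniques follow directly from the
   two aspect ratios.  The following sketch, with illustrative
   function names only, computes the cropped or bar fraction for the
   4:3 and 16:9 rooms; it is a worked example, not part of any
   specified behavior.

      from fractions import Fraction

      def ar(w, h):
          """Aspect ratio expressed as width/height."""
          return Fraction(w, h)

      def crop_fraction(src, dst):
          """Fraction of source height removed when cropping a taller
          source (e.g. 4:3) to fill a wider target (e.g. 16:9)."""
          return 1 - src / dst

      def bar_fraction(a, b):
          """Fraction of the target occupied by bars when one ratio is
          fitted inside the other: horizontal bars (letterbox) if the
          source is wider, vertical bars (pillarbox) if narrower."""
          return 1 - min(a, b) / max(a, b)

      print(crop_fraction(ar(4, 3), ar(16, 9)))  # 1/4 of the height
      print(bar_fraction(ar(16, 9), ar(4, 3)))   # 1/4 letterbox bars
      print(bar_fraction(ar(4, 3), ar(16, 9)))   # 1/4 pillarbox bars

   These numbers explain the "crop the top 1/4" figure in the first
   technique below.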
   For the video sent from the 4:3 room to the 16:9 room, several
   techniques can be used:

   1.  The 16:9 system might simply crop the top 1/4 of each 4:3
       image.  This will result in full size display, eye contact, and
       gaze awareness for the individuals who are seated.  However,
       the standing presenter's head will be cropped.

   2.  The 16:9 system might stretch each of the 4:3 images to fully
       fit the 16:9 display.  This would reduce image height (creating
       geometric distortion) and create eye-contact error.  Continuity
       of the panoramic image would be preserved.

   3.  The 16:9 system could pillarbox each of the 4:3 images, placing
       vertical borders on the three displays.  This reduces the image
       size to less than full size.  It also destroys the continuity
       of the panoramic image and introduces additional error in eye
       contact and gaze awareness.

   4.  The 16:9 system could pillarbox only the center display.  This
       reduces the size of the presenter, who is the focus of the
       meeting.

   5.  The 16:9 system could also crop the bottom of the center
       display.  Visually this reduces the height of the presenter but
       maintains full size.  There is a vertical discontinuity in the
       panoramic image.  Whether this is objectionable or not depends
       on the room layout.

   Strategies 4 and 5 could be applied in response to a user command
   or automatically.  The details will be discussed in future
   documents.

   For the video sent from the 16:9 room to the 4:3 room, the
   receiving system simply letterboxes the video.  Since the scales
   are identical, the image is displayed at full size in the 4:3 room.
   The common techniques for placing the letterbox border are:

   1.  The 4:3 system places the border above the image.  This
       maintains eye contact for those who are seated, but cannot
       maintain eye contact for the presenter.

   2.  The 4:3 system places the border below the images.  If the 16:9
       system crops the bottom of the center display, then this will
       maintain eye contact for the presenter and the remote site.

   3.  The 4:3 system centers the images.  Eye contact suffers for
       everyone, but the worst case eye contact error is better
       controlled.

   In this use case, negotiation between the systems is not strictly
   necessary, no matter which scheme is used.  However, the best user
   experience is obtained if both systems have knowledge of the aspect
   ratios being used and of which participants are standing and which
   are sitting, so they can adjust optimally.

5.4.2.  Visual Scale

   The visual scale of displays may differ between sites.  Again, let
   us use the point-to-point case as a simple example.  Assume two
   conference rooms in a point-to-point call.  One room is designed
   for 6 participants and has three 16:9 screens and 3 cameras.  This
   room is designed to show participants at their normal size when
   seated (2 participants per camera/display).  It does not have
   adequate display height to capture those who are standing.  The
   second room is also designed for 6 participants, but shows 3
   participants per camera/display, also at their full size.
   Therefore, it only needs two 16:9 camera/display pairs.  Since the
   field of view in both the vertical and the horizontal is increased
   by 50%, it also shows those who are standing, without cropping.
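   The size figures in the approaches below follow from the ratio of
   how many participants a stream was framed for to how many the
   receiving screen was scaled to show at life size.  A minimal worked
   example, with illustrative names only:

      def displayed_size(people_in_stream, screen_capacity):
          """Relative size at which participants appear when a stream
          framed for people_in_stream participants fills a display
          whose visual scale shows screen_capacity participants at
          life size."""
          return screen_capacity / people_in_stream

      # 2-screen room's streams (3 people each) shown unadapted in
      # the 3-screen room (screens scaled for 2 people): 2/3 = 67%.
      print(displayed_size(3, 2))

      # 3-screen room's streams (2 people each) shown unadapted in
      # the 2-screen room (screens scaled for 3 people): 3/2 = 150%.
      print(displayed_size(2, 3))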
   For the video sent from the 2 screen (larger scale) room to the 3
   screen (smaller scale) room, two approaches can be used:

   1.  The 3 screen system might simply show the participants on two
       of its displays.  Participants will be shown at 67% of their
       full size.  Eye contact and gaze awareness will be lost.

   2.  The 3 screen system might construct and display a vertically
       cropped 3-screen view, showing 2 participants on each screen.
       Participants will be shown at full size, with preservation of
       eye contact and gaze awareness.

   For the video sent from the 3 screen room to the 2 screen room,
   there are two analogous approaches:

   1.  The 2 screen system selects 2 of the streams and simply shows
       them on its displays.  Participants will be shown at 150% of
       their normal size.  Eye contact and gaze awareness will be
       lost, and part of the remote site is not shown.

   2.  The 2 screen system might construct and display a 2 screen view
       (with a horizontal border at the top) that shows 3 participants
       on each screen.  Participants will be shown at full size, with
       preservation of eye contact and gaze awareness.

   Although there is no need for negotiation between the systems, the
   best user experience is obtained if both systems have knowledge of
   the visual scale and of where individuals are seated, and can then
   choose the best manner of display.

6.  IANA Considerations

   This document contains no IANA considerations.

7.  Security Considerations

   While there are likely to be security considerations for any
   solution for telepresence interoperability, this document has no
   security considerations.

8.  Acknowledgements

   The draft has benefited from input from a number of people,
   including Roni Even, Jim Cole, Nermeen Ismail, and Nathan Buckles.

9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA  95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA  01810
   US

   Email: stephen.botzko@polycom.com