CLUE WG                                                       A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: March 11, 2014                                     M. Duckworth
                                                                 Polycom
                                                            R. Even, Ed.
                                                     Huawei Technologies
                                                      September 07, 2013

                Use Cases for Telepresence Multi-streams
               draft-ietf-clue-telepresence-use-cases-07.txt

Abstract

   Telepresence conferencing systems seek to create an environment that gives non-co-located users or user groups a feeling of co-located presence through multimedia communication that includes at least high-fidelity audio and video signals.  A number of techniques for handling audio and video streams are used to create this experience.  When these techniques are not similar, interoperability between different systems is difficult at best, and often not possible.  Conveying information about the relationships between multiple streams of media would enable senders and receivers to make choices that allow telepresence systems to interwork.  This memo describes the most typical and important use cases for sending multiple streams in a telepresence conference.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 11, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Telepresence Scenarios Overview
   3.  Use Case Scenarios
       3.1.  Point-to-point meeting: symmetric
       3.2.  Point-to-point meeting: asymmetric
       3.3.  Multipoint meeting
       3.4.  Presentation
       3.5.  Heterogeneous Systems
       3.6.  Multipoint Education Usage
       3.7.  Multipoint Multiview (Virtual space)
       3.8.  Multiple presentation streams - Telemedicine
   4.  Acknowledgements
   5.  IANA Considerations
   6.  Security Considerations
   7.  Informative References
   Authors' Addresses

1.  Introduction

   Telepresence applications try to provide a "being there" experience for conversational video conferencing.  Often this telepresence application is described as "immersive telepresence" in order to distinguish it from traditional video conferencing and from other forms of remote presence not related to conversational video conferencing, such as avatars and robots.  The salient characteristics of telepresence are often described as: actual-sized, immersive video, preserving interpersonal interaction, and allowing non-verbal communication.

   Although telepresence systems are based on open standards such as RTP [RFC3550], SIP [RFC3261], H.264, and the H.323 [ITU.H323] suite of protocols, they cannot easily interoperate with each other without operator assistance and expensive additional equipment that translates from one vendor's protocol to another.

   The basic features that give telepresence its distinctive characteristics are implemented in disparate ways in different systems.  Currently, telepresence systems from diverse vendors interoperate to some extent, but this is not supported in a standards-based fashion.  Interworking requires that translation and transcoding devices be included in the architecture.  Such devices increase latency, reducing the quality of interpersonal interaction.  Use of these devices is often not automatic; it frequently requires substantial manual configuration and a detailed understanding of the nature of the underlying audio and video streams.  This state of affairs is not acceptable for the continued growth of telepresence: telepresence systems should have the same ease of interoperability as telephones do.  Thus, a standard way of describing the multiple streams constituting the media flows and the fundamental aspects of their behavior would allow telepresence systems to interwork.

   This document presents a set of use cases describing typical scenarios.
   Requirements will be derived from these use cases in a separate document.  The use cases are described from the viewpoint of the users.  They are illustrative of the user experience that needs to be supported.  It is possible to implement these use cases in a variety of different ways.

   Many different scenarios need to be supported.  This document describes in detail the most common and basic use cases.  These will cover most of the requirements.  There may be additional scenarios that bring new features and requirements, which can be used to extend the initial work.

   Point-to-point and multipoint telepresence conferences are considered.  In some use cases the number of screens is the same at all sites; in others the number of screens differs from site to site.  Both variations are considered.  Also included is a use case describing display of presentation material or content.

   The document structure is as follows: Section 2 gives an overview of scenarios, and Section 3 describes use cases.

2.  Telepresence Scenarios Overview

   This section describes the general characteristics of the use cases and what the scenarios are intended to show.  The typical setting is a business conference, which was the initial focus of telepresence.  Recently, consumer products have also been developed.  We specifically do not include in our scenarios the physical infrastructure aspects of telepresence, such as room construction, layout, and decoration.

   Telepresence systems are typically composed of one or more video cameras and encoders and one or more display screens of large size (diagonal around 60").  Microphones pick up sound, and audio codec(s) produce one or more audio streams.  The cameras used to capture the telepresence users are referred to as participant cameras (and likewise for screens).  There may also be other cameras, such as for document display.  These will be referred to as presentation or content cameras; they generally have different formats, aspect ratios, and frame rates from the participant cameras.  The presentation streams may be shown on participant screens or on auxiliary display screens.  A user's computer may also serve as a virtual content camera, generating an animation or playing a video for display to the remote participants.

   We describe such a telepresence system as sending one or more video streams, audio streams, and presentation streams to the remote system(s).  (Note that the numbers of audio, video, and presentation streams are generally not identical.)

   The fundamental parameters describing today's typical telepresence scenarios include the following (see the sketch after this list):

   1.   The number of participating sites

   2.   The number of visible seats at a site

   3.   The number of cameras

   4.   The number and type of microphones

   5.   The number of audio channels

   6.   The screen size

   7.   The screen capabilities, such as resolution, frame rate, and aspect ratio

   8.   The arrangement of the screens in relation to each other

   9.   The number of primary screens at each site

   10.  The type and number of presentation screens

   11.  Multipoint conference display strategies; for example, the camera-to-screen mappings may be static or dynamic

   12.  The cameras' points of capture

   13.  The cameras' fields of view and how they spatially relate to each other
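   As a purely illustrative aid (not part of any protocol, and not the CLUE data model), the following Python sketch shows how the per-site subset of these parameters might be captured in a data structure.  All class and field names are hypothetical.

      from dataclasses import dataclass, field
      from typing import List, Tuple

      @dataclass
      class Screen:
          diagonal_inches: float           # e.g., 60" participant screens
          resolution: Tuple[int, int]      # pixels, e.g., (1920, 1080)
          frame_rate: float                # frames per second
          aspect_ratio: str                # e.g., "16:9"
          role: str                        # "participant" or "presentation"

      @dataclass
      class Camera:
          point_of_capture: Tuple[float, float, float]  # room coordinates
          field_of_view_deg: float         # horizontal field of view
          role: str                        # "participant" or "content"

      @dataclass
      class SiteDescription:
          visible_seats: int
          cameras: List[Camera] = field(default_factory=list)
          microphone_count: int = 0
          microphone_type: str = "tabletop"
          audio_channels: int = 2          # e.g., stereo
          screens: List[Screen] = field(default_factory=list)
          screen_order: List[int] = field(default_factory=list)  # left to right
          display_strategy: str = "static" # camera-to-screen mapping:
                                           # "static" or "dynamic"

   The number of participating sites (parameter 1) is a property of the conference rather than of a single site, so it would live outside such a per-site description.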
   There is no agreed-upon way to adequately describe the semantics of how streams of various media types relate to each other.  Without a standard for stream semantics to describe the particular roles and activities of each stream in the conference, interoperability is cumbersome at best.

   In a multiple-screen conference, the video and audio streams sent from remote participants must be understood by receivers so that they can be presented in a coherent and life-like manner.  This includes the ability to present remote participants at their actual size for their apparent distance, while maintaining correct eye contact and gesticular cues, and simultaneously providing a spatial audio sound stage that is consistent with the displayed video.

   The receiving device that decides how to render incoming information needs to understand a number of variables, such as the spatial position of the speaker, the fields of view of the cameras, the camera zoom, and which media stream is related to each of the screens.  It is not simply that individual streams must be adequately described (to a large extent this already exists), but rather that the semantics of the relationships between the streams must be communicated.  Note that all of this is still required even if the basic aspects of the streams, such as the bit rate, frame rate, and aspect ratio, are known.  Thus, this problem has aspects considerably beyond those encountered in interoperation of single-camera/screen video conferencing systems.
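   To make the notion of "relationships between streams" concrete, here is a hypothetical Python sketch of the kind of per-stream metadata a sender might convey.  It is an illustration only; it does not represent the CLUE data model or any existing protocol, and all names are invented.

      from dataclasses import dataclass
      from typing import Optional, Tuple

      @dataclass
      class StreamSemantics:
          stream_id: str
          media_type: str                      # "audio" or "video"
          role: str                            # "participant" or "presentation"
          capture_extent: Tuple[float, float]  # horizontal span of the room
                                               # scene covered, 0.0 (left)
                                               # to 1.0 (right)
          camera_fov_deg: Optional[float] = None   # for video streams
          audio_channel: Optional[str] = None      # "left", "center", "right"

      def screen_index(sem: StreamSemantics, num_screens: int) -> int:
          """A receiver could place each video stream on a screen chosen by
          the horizontal midpoint of the scene area it captures, and use
          audio_channel to build a matching sound stage."""
          midpoint = (sem.capture_extent[0] + sem.capture_extent[1]) / 2.0
          return min(int(midpoint * num_screens), num_screens - 1)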
3.  Use Case Scenarios

   The use case scenarios focus on typical implementations.  There are a number of possible variants for these use cases; for example, the audio supported may differ at the endpoints (such as mono or stereo versus surround sound).

   Many of these systems offer a full conference room solution, where local participants sit on one side of a table and remote participants are displayed as if they are sitting on the other side of the table.  The cameras and screens are typically arranged to provide a panoramic (left to right from the local user's viewpoint) view of the remote room.

   The sense of immersion and non-verbal communication is fostered by a number of technical features, such as:

   1.  Good eye contact, which is achieved by careful placement of participants, cameras, and screens.

   2.  Camera fields of view and screen sizes are matched so that the images of the remote room appear to be full size.

   3.  The left side of each room is presented on the right screen at the far end; similarly, the right side of the room is presented on the left screen.  The effect of this is that participants at each site appear to be sitting across the table from each other.  If two participants at the same site glance at each other, all participants can observe it.  Likewise, if a participant at one site gestures to a participant at the other site, all participants observe the gesture itself and the participants it includes.

3.1.  Point-to-point meeting: symmetric

   In this case each of the two sites has an identical number of screens, with cameras having fixed fields of view, and one camera for each screen.  The sound type is the same at each end.  As an example, there could be 3 cameras and 3 screens in each room, with stereo sound being sent and received at each end.

   The important thing here is that each of the two sites has the same number of screens.  Each screen is paired with a corresponding camera.  Each camera/screen pair is typically connected to a separate codec, producing an encoded video stream for transmission to the remote site and receiving a similarly encoded stream from the remote site.

   Each system has one or multiple microphones for capturing audio.  In some cases, stereophonic microphones are employed.  In other systems, a microphone may be placed in front of each participant (or pair of participants).  In typical systems all the microphones are connected to a single codec that sends and receives the audio streams as either stereo or surround sound.  The number of microphones and the number of audio channels are often not the same as the number of cameras.  Also, the number of microphones is often not the same as the number of loudspeakers.

   The audio may be transmitted as multi-channel (stereo/surround sound) or as distinct and separate monophonic streams.  Audio levels should be matched, so that the sound levels at both sites are identical.  Loudspeaker and microphone placements are chosen so that the sound "stage" (orientation of apparent audio sources) is coordinated with the video.  That is, if a participant at one site speaks, the participants at the remote site perceive her voice as originating from her visual image.  In order to accomplish this, the audio needs to be mapped at the receiving site in the same fashion as the video.  That is, audio received from the right side of the room needs to be output from loudspeaker(s) on the left side at the remote site, and vice versa.
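   A minimal Python sketch of this left/right mapping, assuming audio arrives as named channels (the channel names and the function are illustrative, not from any specification):

      def render_sound_stage(received_channels: dict) -> dict:
          """Mirror the sound stage so a voice appears to come from the
          speaker's visual image: audio captured at the sender's right is
          played from the receiver's left loudspeaker, and vice versa."""
          mirror = {"left": "right", "right": "left", "center": "center"}
          return {mirror.get(name, name): samples
                  for name, samples in received_channels.items()}

      # Example: the remote right-side microphone feeds the local left
      # loudspeaker.
      print(render_sound_stage({"left": "L", "right": "R", "center": "C"}))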
3.2.  Point-to-point meeting: asymmetric

   In this case, each site has a different number of screens and cameras than the other site.  The important characteristic of this scenario is that the number of screens is different between the two sites.  This creates challenges that are handled differently by different telepresence systems.

   This use case builds on the basic scenario of 3 screens to 3 screens.  Here, we use the common case of 3 screens and 3 cameras at one site, and 1 screen and 1 camera at the other site, connected by a point-to-point call.  The screen sizes and camera fields of view at both sites are basically similar, such that each camera view is designed to show two people sitting side by side.  Thus, the 1-screen room has up to 2 people seated at the table, while the 3-screen room may have up to 6 people at the table.

   The basic considerations of defining left and right and indicating relative placement of the multiple audio and video streams are the same as in the 3-3 use case.  However, handling the mismatch between the two sites in the number of screens and cameras requires more complicated maneuvers.

   For the video sent from the 1-camera room to the 3-screen room, the usual approach is to simply use 1 of the 3 screens and keep the second and third screens inactive or, for example, put up the current date.  This maintains the "full size" image of the remote side.

   For the other direction, the 3-camera room sending video to the 1-screen room, there are more complicated variations to consider.  Here are several possible ways in which the video streams can be handled.

   1.  The 1-screen system might simply show only 1 of the 3 camera images, since the receiving side has only 1 screen.  Two people are seen at full size, but 4 people are not seen at all.  The choice of which 1 of the 3 streams to display could be fixed, or could be selected by the users.  It could also be made automatically based on who is speaking in the 3-screen room, such that the people in the 1-screen room always see the person who is speaking.  If the automatic selection is done at the sender, the transmission of streams that are not displayed could be suppressed, which would avoid wasting bandwidth.

   2.  The 1-screen system might be capable of receiving and decoding all 3 streams from all 3 cameras.  The 1-screen system could then compose the 3 streams into 1 local image for display on the single screen.  All six people would be seen, but smaller than full size.  This could be done in conjunction with reducing the image resolution of the streams, such that encode/decode resources and bandwidth are not wasted on streams that will be downsized for display anyway.

   3.  The 3-screen system might be capable of including all 6 people in a single stream to send to the 1-screen system.  For example, it could use PTZ (Pan Tilt Zoom) cameras to physically adjust the cameras such that 1 camera captures the whole room of six people.  Or it could recompose the 3 camera images into 1 encoded stream to send to the remote site.  These variations also show all six people, but at a reduced size.

   4.  Or, there could be a combination of these approaches, such as simultaneously showing the speaker at full size with a composite of all 6 participants at a smaller size.

   The receiving telepresence system needs to have information about the content of the streams it receives to make any of these decisions.  If the systems are capable of supporting more than one strategy, there needs to be some negotiation between the two sites to figure out which of the possible variations they will use in a specific point-to-point call.
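   The following Python fragment sketches one conceivable form such a negotiation could take: each option above is given an illustrative label, and the receiver's preference order is matched against what the sender supports.  The labels and logic are hypothetical, not part of any protocol.

      # Illustrative labels for the four options above.
      STRATEGIES = [
          "switched-single",         # 1: one camera, possibly speaker-switched
          "receiver-compose",        # 2: receiver decodes all 3, composes locally
          "sender-compose",          # 3: sender delivers one whole-room stream
          "speaker-plus-composite",  # 4: combination
      ]

      def choose_strategy(sender_supports, receiver_preference):
          """Pick the first strategy the receiver prefers that the sender
          also supports; fall back to a single switched stream."""
          for strategy in receiver_preference:
              if strategy in sender_supports:
                  return strategy
          return "switched-single"

      # Example: a 1-screen endpoint that would rather compose locally.
      chosen = choose_strategy(
          sender_supports={"switched-single", "sender-compose"},
          receiver_preference=["receiver-compose", "sender-compose"])
      # chosen == "sender-compose"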
3.3.  Multipoint meeting

   In a multipoint telepresence conference, there are more than two sites participating.  Additional complexity is required to enable media streams from each participant to show up on the screens of the other participants.

   Clearly, there are a great number of topologies that can be used to display the streams from multiple sites participating in a conference.

   One major objective for telepresence is to be able to preserve the "being there" user experience.  However, in multi-site conferences it is often (in fact, usually) not possible to simultaneously provide full-size video, eye contact, and common perception of gestures and gaze by all participants.  Several policies can be used for stream distribution and display: all provide good results, but they all make different compromises.

   One common policy is called site switching.  Let's say the speaker is at site A and everyone else is at a "remote" site.  When the room at site A is shown, all the camera images from site A are forwarded to the remote sites.  Therefore, at each receiving remote site, all the screens display camera images from site A.  This can be used to preserve full-size image display and also provide full visual context of the displayed far end, site A.  In site switching, there is a fixed relation between the cameras in each room and the screens in remote rooms.  The room or participants being shown is switched from time to time based on who is speaking or by manual control, e.g., from site A to site B.

   Segment switching is another policy choice.  Still using site A as the speaker's site, and "remote" to refer to all the other sites, in segment switching, rather than sending all the images from site A, only the speaker at site A is shown.  The camera images of the current speaker and previous speakers (if any) are forwarded to the other sites in the conference.  Therefore, the screens at each site are usually displaying images from different remote sites: the current speaker at site A and the previous ones.  This strategy can be used to preserve full-size image display and also capture the non-verbal communication between the speakers.  In segment switching, the display depends on the activity in the remote rooms (generally, but not necessarily, based on audio/speech detection).

   A third possibility is to reduce the image size so that multiple camera views can be composited onto one or more screens.  This does not preserve full-size image display, but provides the most visual context (since more sites or segments can be seen).  Typically in this case the display mapping is static, i.e., each part of each room is shown in the same location on the display screens throughout the conference.

   Other policies and combinations are also possible.  For example, there can be a static display of all screens from all remote rooms, with part or all of one screen being used to show the current speaker at full size.
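   To contrast the first two policies, here is a hypothetical Python sketch of the forwarding decision each one implies.  The Site type and its fields are invented for illustration.

      from dataclasses import dataclass
      from typing import Dict, List

      @dataclass
      class Site:
          camera_streams: List[str]  # camera stream ids, left to right
          speaker_segment: str       # id of the camera covering the talker

      def site_switching(sites: Dict[str, Site],
                         active: str) -> Dict[str, List[str]]:
          """Every screen at every receiver shows the active site,
          preserving the full visual context of that one room."""
          return {name: sites[active].camera_streams
                  for name in sites if name != active}

      def segment_switching(sites: Dict[str, Site],
                            recent_speaker_sites: List[str],
                            screens: int) -> Dict[str, List[str]]:
          """Each receiver's screens show the current and previous
          speakers, who may come from different rooms; a room is never
          shown to itself."""
          segments = [sites[s].speaker_segment for s in recent_speaker_sites]
          return {name: [seg for seg in segments
                         if seg not in sites[name].camera_streams][:screens]
                  for name in sites}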
3.4.  Presentation

   In addition to the video and audio streams showing the participants, additional streams are used for presentations.

   In systems available today, generally only one additional video stream is available for presentations.  Often this presentation stream is half-duplex in nature, with presenters taking turns.  The presentation stream may be captured from a PC screen, or it may come from a multimedia source such as a document camera, camcorder, or DVD.  In a multipoint meeting, the presentation streams for the currently active presentation are always distributed to all sites in the meeting, so that the presentations are viewed by all.

   Some systems display the presentation streams on a screen that is mounted either above or below the three participant screens.  Other systems provide screens on the conference table for observing presentations.  If multiple presentation screens are used, they generally display identical content.  There is considerable variation in the placement, number, and size of presentation screens.

   In some systems presentation audio is pre-mixed with the room audio.  In others, a separate presentation audio stream is provided (if the presentation includes audio).

   In H.323 [ITU.H323] systems, H.239 [ITU.H239] is typically used to control the video presentation stream.  In SIP systems, similar control mechanisms can be provided using BFCP [RFC4582] for the presentation token.  These mechanisms are suitable for managing a single presentation stream.

   Although today's systems remain limited to a single video presentation stream, there are obvious uses for multiple presentation streams:

   1.  Frequently the meeting convener is following a meeting agenda, and it is useful for her to be able to show that agenda to all participants during the meeting.  Other participants at various remote sites are able to make presentations during the meeting, with the presenters taking turns.  The presentations and the agenda are both shown, either on separate screens, or perhaps re-scaled and shown on a single screen.

   2.  A single multimedia presentation can itself include multiple video streams that should be shown together.  For instance, a presenter may be discussing the fairness of media coverage.  In addition to slides that support the presenter's conclusions, she also has video excerpts from various news programs that she shows to illustrate her findings.  She uses a DVD player for the video excerpts so that she can pause and reposition the video as needed.

   3.  An educator who is presenting a multi-screen slide show.  This show requires that the placement of the images on the multiple screens at each site be consistent.

   There are many other examples where multiple presentation streams are useful.
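   As a purely hypothetical illustration of the first example (an agenda pinned alongside rotating presentations), the following Python sketch labels concurrent presentation streams and assigns them to presentation screens.  Nothing here corresponds to H.239 or BFCP semantics; all names are invented.

      # Illustrative descriptors for concurrent presentation streams.
      presentations = [
          {"id": "agenda", "source": "convener-pc", "pinned": True},
          {"id": "slides", "source": "presenter-pc", "pinned": False},
          {"id": "news-clips", "source": "dvd-player", "pinned": False},
      ]

      def assign_screens(presentations, num_screens):
          """Give pinned streams (the agenda) their own screen; remaining
          screens go to the other active presentations in turn."""
          pinned = [p["id"] for p in presentations if p["pinned"]]
          others = [p["id"] for p in presentations if not p["pinned"]]
          layout = pinned[:num_screens]
          layout += others[:max(0, num_screens - len(layout))]
          return layout

      # With two presentation screens: ["agenda", "slides"]; the news
      # clips would be swapped in when the presenter changes sources.
      print(assign_screens(presentations, 2))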
3.5.  Heterogeneous Systems

   It is common in meeting scenarios for people to join the conference from a variety of environments, using different types of endpoint devices.  A multi-screen immersive telepresence conference may include someone on a PC-based video conferencing system, a participant calling in by phone, and (soon) someone on a handheld device.

   What experience/view will each of these devices have?

   Some may be able to handle multiple streams, and others can handle only a single stream.  (We are not talking here about legacy systems, but rather about systems built to participate in such a conference, although they are single-stream only.)  In a single video stream, the stream may contain one or more compositions depending on the available screen space on the device.  In most cases an intermediate transcoding device will be relied upon to produce a single stream, perhaps with some kind of continuous presence.

   Bit rates will vary, with the handheld and phone having lower bit rates than the PC and multi-screen systems.

   Layout is accomplished according to different policies.  For example, a handheld and a PC may receive the active speaker stream.  The decision can be made either explicitly by the receiver or by the sender if it can receive some kind of rendering hint.  The same is true for audio; i.e., the device receives a mixed stream, or a number of the loudest speakers if mixing is not available in the network.

   For the PC-based conferencing participant, the user's experience depends on the application.  It could be single stream, similar to a handheld but with a bigger screen.  Or, it could be multiple streams, similar to an immersive telepresence system but with a smaller screen.  Control for manipulation of streams can be local in the software application, or in another location and sent to the application over the network.

   The handheld device is the most extreme case.  How will that participant be viewed and heard?  It should be an equal participant, though the bandwidth will be significantly less than for an immersive system.  A receiver may choose to display output coming from a handheld differently based on the resolution, but that would be the case with any low-resolution video stream, e.g., from a powerful PC on a bad network.

   The handheld will send and receive a single video stream, which could be a composite or a subset of the conference.  The handheld could say what it wants or could accept whatever the sender (conference server or sending endpoint) thinks is best.  The handheld will have to signal any actions it wants to take the same way that an immersive system signals actions.
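   A small Python sketch of the sender-side decision this section implies, i.e., what to offer each class of endpoint given its capabilities.  The endpoint kinds and returned labels are invented for illustration.

      def streams_for_endpoint(kind: str, offered_video: list) -> dict:
          """Choose per-endpoint media along the lines described above."""
          if kind == "multi-screen":
              # Full set of participant streams plus multi-channel audio.
              return {"video": offered_video, "audio": "multi-channel"}
          if kind in ("pc-single", "handheld"):
              # One stream: a composite or the active speaker, at a lower
              # bit rate; audio arrives as a single mixed stream.
              return {"video": ["composite-or-active-speaker"],
                      "audio": "mixed"}
          if kind == "phone":
              return {"video": [], "audio": "mixed"}
          return {"video": ["composite-or-active-speaker"], "audio": "mixed"}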
3.6.  Multipoint Education Usage

   The importance of this example is that the multiple video streams are not used to create an immersive conferencing experience with panoramic views at all the sites.  Instead, the multiple streams are dynamically used to enable full participation of remote students in a university class.  In some instances the same video stream is displayed on multiple screens in the room; in other instances an available stream is not displayed at all.

   The main site is a university auditorium, which is equipped with three cameras.  One camera is focused on the professor at the podium.  A second camera is mounted on the wall behind the professor and captures the class in its entirety.  The third camera is co-located with the second and is designed to capture a close-up view of a questioner in the audience.  It automatically zooms in on that student using sound localization.

   Although the auditorium is equipped with three cameras, it is only equipped with two screens.  One is a large screen located at the front so that the class can see it.  The other is located at the rear so the professor can see it.  When someone asks a question, the front screen shows the questioner.  Otherwise it shows the professor (ensuring everyone can easily see her).

   The remote sites are typical immersive telepresence rooms with three camera/screen pairs.

   All remote sites display the professor on the center screen at full size.  A second screen shows the entire classroom view when the professor is speaking.  However, when a student asks a question, the second screen shows the close-up view of the student at full size.  Sometimes the student is in the auditorium; sometimes the speaking student is at another remote site.  The remote systems never display the students that are actually in that room.

   If someone at a remote site asks a question, then the screen in the auditorium will show the remote student at full size (as if they were present in the auditorium itself).  The screen in the rear also shows this questioner, allowing the professor to see and respond to the student without needing to turn her back on the main class.

   When no one is asking a question, the screen in the rear briefly shows a full-room view of each remote site in turn, allowing the professor to monitor the entire class (remote and local students).  The professor can also use a control on the podium to see a particular site; she can choose either a full-room view or a single camera view.

   Realization of this use case does not require any negotiation between the participating sites.  Endpoint devices (and an MCU, if present) need to know who is speaking and what video stream includes the view of that speaker.  The remote systems need some knowledge of which stream should be placed in the center.  The ability of the professor to see specific sites (or for the system to show all the sites in turn) would also require the auditorium system to know what sites are available, and to be able to request a particular view of any site.  Bandwidth is optimized if video that is not being shown at a particular site is not distributed to that site.

3.7.  Multipoint Multiview (Virtual space)

   This use case describes a virtual space multipoint meeting with good eye contact and spatial layout of participants.  The use case was proposed very early in the development of video conferencing systems, as described in 1983 by Allardyce and Randall [virtualspace]; the term "virtual space" comes from their report, and the use case is illustrated in figure 2-5 of that report.  Virtual space expands the point-to-point case by having all multipoint conference participants "seated" in a virtual room.  Each participant has a fixed "seat" in the virtual room, so each participant expects to see a different view, with a different participant on his left and right side.  Today, the use case is implemented in multiple telepresence-type video conferencing systems on the market.  The main difference between the results obtained with modern systems and those from 1983 is larger screen sizes.

   Virtual space multipoint as defined here assumes endpoints with multiple cameras and screens.  Usually there is the same number of cameras and screens at a given endpoint.  A camera is positioned above each screen.  A key aspect of virtual space multipoint is the details of how the cameras are aimed.  The cameras are all aimed at the same area of view of the participants at the site.  Thus, each camera takes a picture of the same set of people, but from a different angle.  Each endpoint sender in the virtual space multipoint meeting therefore offers a choice of video streams to remote receivers, each stream representing a different viewpoint.  For example, a camera positioned above a screen to a participant's left may take video pictures of the participant's left ear, while at the same time a camera positioned above a screen to the participant's right may take video pictures of the participant's right ear.

   Since a sending endpoint has a camera associated with each screen, an association is made between the receiving stream output on a particular screen and the corresponding sending stream from the camera associated with that screen.  These associations are repeated for each screen/camera pair in a meeting.  The result of this system is a horizontal arrangement of video images from remote sites, one per screen.  The image on each screen is paired with the output of the camera above that screen, resulting in excellent eye contact.
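   In Python, the per-pair association this section describes might look like the following sketch (the stream naming is invented):

      def virtual_space_pairing(num_pairs: int) -> dict:
          """Associate each local screen k with the remote stream captured
          by the camera mounted above the remote screen k, so every screen
          shows the view angle that matches gaze directed toward it."""
          return {f"screen-{k}": f"remote-camera-above-screen-{k}"
                  for k in range(num_pairs)}

      # A 3-screen endpoint receives three views of the same remote room:
      # {'screen-0': 'remote-camera-above-screen-0', ...}
      print(virtual_space_pairing(3))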
3.8.  Multiple presentation streams - Telemedicine

   This use case describes a scenario where multiple presentation streams are used.  In this use case, the local site is a surgery room connected to one or more remote sites that may have different capabilities.  At the local site, three main cameras capture the whole room (the typical 3-camera telepresence case).  In addition, multiple presentation inputs are available: a surgery camera that provides a zoomed view of the operation, an endoscopic monitor, an X-ray CT image output device, a B-ultrasonic apparatus, a cardiogram generator, an MRI image instrument, etc.  These devices are used to provide multiple local video presentation streams to help the surgeon monitor the status of the patient and assist in the surgical process.

   The local site may have three main screens and one (or more) presentation screen(s).  The main screens can be used to display the remote experts.  The presentation screen(s) can be used to display multiple presentation streams from local and remote sites simultaneously.  The three main cameras capture different parts of the surgery room.  The surgeon can decide the number, the size, and the placement of the presentations displayed on the local presentation screen(s).  He can also indicate which local presentation captures are provided to the remote sites.  The local site can send multiple presentation captures to the remote sites, and it can receive from them multiple presentations related to the patient or the procedure.

   One type of remote site is a single- or dual-screen, one-camera system used by a consulting expert.  In the general case, the remote sites can be part of a multipoint telepresence conference.  The presentation screens at the remote sites allow the experts to see the details of the operation and the related data.  As at the main site, the experts can decide the number, the size, and the placement of the presentations displayed on their presentation screens.  The presentation screens can display presentation streams from the surgery room or from other remote sites, as well as local presentation streams.  Thus, the experts can also start sending presentation streams, which can carry medical records, pathology data, or their references and analysis, etc.

   Another type of remote site is a typical immersive telepresence room with three camera/screen pairs, allowing more experts to join the consultation.  These sites can also be used for education.  The teacher, who is not necessarily the surgeon, and the students are at different remote sites.  Students can observe and learn the details of the whole procedure, while the teacher can explain and answer questions during the operation.

   All remote education sites can display the surgery room.  Another option is to display the surgery room on the center screen, and the rest of the screens can show the teacher and the student who is asking a question.
   For all the above sites, multiple presentation screens can be used to enhance visibility: one screen for the zoomed surgery stream and the others for medical image streams, such as MRI images, cardiograms, B-ultrasonic images, and pathology data.

4.  Acknowledgements

   The document has benefitted from input from a number of people, including Alex Eleftheriadis, Marshall Eubanks, Tommy Andre Nyquist, Mark Gorzynski, Charles Eckel, Nermeen Ismail, Mary Barnes, Pascal Buhler, and Jim Cole.

   Special acknowledgement to Lennard Xiao, who contributed the text for the telemedicine use case.

5.  IANA Considerations

   This document contains no IANA considerations.

6.  Security Considerations

   While there are likely to be security considerations for any solution for telepresence interoperability, this document has no security considerations.

7.  Informative References

   [ITU.H239]     ITU-T, "Role management and additional media channels for H.300-series terminals", ITU-T Recommendation H.239, September 2005.

   [ITU.H323]     ITU-T, "Packet-based Multimedia Communications Systems", ITU-T Recommendation H.323, December 2009.

   [RFC3261]      Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

   [RFC3550]      Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

   [RFC4582]      Camarillo, G., Ott, J., and K. Drage, "The Binary Floor Control Protocol (BFCP)", RFC 4582, November 2006.

   [virtualspace] Allardyce and Randall, "Development of Teleconferencing Methodologies With Emphasis on Virtual Space Video and Interactive Graphics", 1983.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA 95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA 01810
   US

   Email: stephen.botzko@polycom.com

   Mark Duckworth
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Roni Even (editor)
   Huawei Technologies
   Tel Aviv
   Israel

   Email: roni.even@mail01.huawei.com