CLUE WG                                                       A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: March 11, 2014                                     M. Duckworth
                                                                 Polycom
                                                            R. Even, Ed.
                                                     Huawei Technologies
                                                      September 07, 2013

                Use Cases for Telepresence Multi-streams
               draft-ietf-clue-telepresence-use-cases-07.txt

Abstract

   Telepresence conferencing systems seek to create an environment that gives non-co-located users or user groups a feeling of co-located presence through multimedia communication that includes at least high-fidelity audio and video signals.  A number of techniques for handling audio and video streams are used to create this experience.  When these techniques are not similar, interoperability between different systems is difficult at best, and often not possible.  Conveying information about the relationships between multiple streams of media would enable senders and receivers to make choices that allow telepresence systems to interwork.  This memo describes the most typical and important use cases for sending multiple streams in a telepresence conference.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 11, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Telepresence Scenarios Overview
   3.  Use Case Scenarios
       3.1.  Point-to-point meeting: symmetric
       3.2.  Point-to-point meeting: asymmetric
       3.3.  Multipoint meeting
       3.4.  Presentation
       3.5.  Heterogeneous Systems
       3.6.  Multipoint Education Usage
       3.7.  Multipoint Multiview (Virtual space)
       3.8.  Multiple presentation streams - Telemedicine
   4.  Acknowledgements
   5.  IANA Considerations
   6.  Security Considerations
   7.  Informative References
   Authors' Addresses

1.  Introduction

   Telepresence applications try to provide a "being there" experience for conversational video conferencing.  Often this telepresence application is described as "immersive telepresence" in order to distinguish it from traditional video conferencing and from other forms of remote presence not related to conversational video conferencing, such as avatars and robots.  The salient characteristics of telepresence are often described as: actual-sized, immersive video, preserving interpersonal interaction, and allowing non-verbal communication.

   Although telepresence systems are based on open standards such as RTP [RFC3550], SIP [RFC3261], H.264, and the H.323 [ITU.H323] suite of protocols, they cannot easily interoperate with each other without operator assistance and expensive additional equipment that translates from one vendor's protocol to another.

   The basic features that give telepresence its distinctive characteristics are implemented in disparate ways in different systems.  Currently, telepresence systems from diverse vendors interoperate to some extent, but this is not supported in a standards-based fashion.  Interworking requires that translation and transcoding devices be included in the architecture.  Such devices increase latency, reducing the quality of interpersonal interaction.  Use of these devices is often not automatic; it frequently requires substantial manual configuration and a detailed understanding of the nature of the underlying audio and video streams.  This state of affairs is not acceptable for the continued growth of telepresence: telepresence systems should have the same ease of interoperability as telephones do.  Thus, a standard way of describing the multiple streams constituting the media flows and the fundamental aspects of their behavior would allow telepresence systems to interwork.

   This document presents a set of use cases describing typical scenarios.
   Requirements will be derived from these use cases in a separate document.  The use cases are described from the viewpoint of the users.  They are illustrative of the user experience that needs to be supported.  It is possible to implement these use cases in a variety of different ways.

   Many different scenarios need to be supported.  This document describes in detail the most common and basic use cases.  These will cover most of the requirements.  There may be additional scenarios that bring new features and requirements, which can be used to extend the initial work.

   Point-to-point and multipoint telepresence conferences are considered.  In some use cases the number of screens is the same at all sites; in others the number of screens differs from site to site.  Both variations are considered.  Also included is a use case describing display of presentation material or content.

   The document structure is as follows: Section 2 gives an overview of scenarios, and Section 3 describes use cases.

2.  Telepresence Scenarios Overview

   This section describes the general characteristics of the use cases and what the scenarios are intended to show.  The typical setting is a business conference, which was the initial focus of telepresence.  Recently, consumer products have also been developed.  We specifically do not include in our scenarios the physical infrastructure aspects of telepresence, such as room construction, layout, and decoration.

   Telepresence systems are typically composed of one or more video cameras and encoders and one or more display screens of large size (diagonal around 60").  Microphones pick up sound, and audio codec(s) produce one or more audio streams.  The cameras used to capture the telepresence users are referred to as participant cameras (and likewise for screens).  There may also be other cameras, such as for document display.  These will be referred to as presentation or content cameras; they generally have different formats, aspect ratios, and frame rates from the participant cameras.  The presentation streams may be shown on participant screens or on auxiliary display screens.  A user's computer may also serve as a virtual content camera, generating an animation or playing a video for display to the remote participants.

   We describe such a telepresence system as sending one or more video streams, audio streams, and presentation streams to the remote system(s).  (Note that the numbers of audio, video, and presentation streams are generally not identical.)

   The fundamental parameters describing today's typical telepresence scenarios include the following (see the sketch after this list):

   1.   The number of participating sites

   2.   The number of visible seats at a site

   3.   The number of cameras

   4.   The number and type of microphones

   5.   The number of audio channels

   6.   The screen size

   7.   The screen capabilities, such as resolution, frame rate, and aspect ratio

   8.   The arrangement of the screens in relation to each other

   9.   The number of primary screens at each site

   10.  The type and number of presentation screens

   11.  Multipoint conference display strategies; for example, the camera-to-screen mappings may be static or dynamic

   12.  The cameras' points of capture

   13.  The cameras' fields of view and how they spatially relate to each other
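   As a purely illustrative aid (not part of any protocol, and not the CLUE data model), the following Python sketch shows how the per-site subset of these parameters might be captured in a data structure.  All class and field names are hypothetical.

      from dataclasses import dataclass, field
      from typing import List, Tuple

      @dataclass
      class Screen:
          diagonal_inches: float           # e.g., 60" participant screens
          resolution: Tuple[int, int]      # pixels, e.g., (1920, 1080)
          frame_rate: float                # frames per second
          aspect_ratio: str                # e.g., "16:9"
          role: str                        # "participant" or "presentation"

      @dataclass
      class Camera:
          point_of_capture: Tuple[float, float, float]  # room coordinates
          field_of_view_deg: float         # horizontal field of view
          role: str                        # "participant" or "content"

      @dataclass
      class SiteDescription:
          visible_seats: int
          cameras: List[Camera] = field(default_factory=list)
          microphone_count: int = 0
          microphone_type: str = "tabletop"
          audio_channels: int = 2          # e.g., stereo
          screens: List[Screen] = field(default_factory=list)
          screen_order: List[int] = field(default_factory=list)  # left to right
          display_strategy: str = "static" # camera-to-screen mapping:
                                           # "static" or "dynamic"

   The number of participating sites (parameter 1) is a property of the conference rather than of a single site, so it would live outside such a per-site description.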
   There is no agreed-upon way to adequately describe the semantics of how streams of various media types relate to each other.  Without a standard for stream semantics to describe the particular roles and activities of each stream in the conference, interoperability is cumbersome at best.

   In a multiple-screen conference, the video and audio streams sent from remote participants must be understood by receivers so that they can be presented in a coherent and life-like manner.  This includes the ability to present remote participants at their actual size for their apparent distance, while maintaining correct eye contact and gesticular cues, and simultaneously providing a spatial audio sound stage that is consistent with the displayed video.

   The receiving device that decides how to render incoming information needs to understand a number of variables, such as the spatial position of the speaker, the fields of view of the cameras, the camera zoom, and which media stream is related to each of the screens.  It is not simply that individual streams must be adequately described (to a large extent this already exists), but rather that the semantics of the relationships between the streams must be communicated.  Note that all of this is still required even if the basic aspects of the streams, such as the bit rate, frame rate, and aspect ratio, are known.  Thus, this problem has aspects considerably beyond those encountered in interoperation of single-camera/screen video conferencing systems.
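   To make the notion of "relationships between streams" concrete, here is a hypothetical Python sketch of the kind of per-stream metadata a sender might convey.  It is an illustration only; it does not represent the CLUE data model or any existing protocol, and all names are invented.

      from dataclasses import dataclass
      from typing import Optional, Tuple

      @dataclass
      class StreamSemantics:
          stream_id: str
          media_type: str                      # "audio" or "video"
          role: str                            # "participant" or "presentation"
          capture_extent: Tuple[float, float]  # horizontal span of the room
                                               # scene covered, 0.0 (left)
                                               # to 1.0 (right)
          camera_fov_deg: Optional[float] = None   # for video streams
          audio_channel: Optional[str] = None      # "left", "center", "right"

      def screen_index(sem: StreamSemantics, num_screens: int) -> int:
          """A receiver could place each video stream on a screen chosen by
          the horizontal midpoint of the scene area it captures, and use
          audio_channel to build a matching sound stage."""
          midpoint = (sem.capture_extent[0] + sem.capture_extent[1]) / 2.0
          return min(int(midpoint * num_screens), num_screens - 1)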
3.  Use Case Scenarios

   The use case scenarios focus on typical implementations.  There are a number of possible variants for these use cases; for example, the audio supported may differ at the endpoints (such as mono or stereo versus surround sound).

   Many of these systems offer a full conference room solution, where local participants sit on one side of a table and remote participants are displayed as if they are sitting on the other side of the table.  The cameras and screens are typically arranged to provide a panoramic (left to right from the local user's viewpoint) view of the remote room.

   The sense of immersion and non-verbal communication is fostered by a number of technical features, such as:

   1.  Good eye contact, which is achieved by careful placement of participants, cameras, and screens.

   2.  Camera fields of view and screen sizes are matched so that the images of the remote room appear to be full size.

   3.  The left side of each room is presented on the right screen at the far end; similarly, the right side of the room is presented on the left screen.  The effect of this is that participants at each site appear to be sitting across the table from each other.  If two participants at the same site glance at each other, all participants can observe it.  Likewise, if a participant at one site gestures to a participant at the other site, all participants observe the gesture itself and the participants it includes.

3.1.  Point-to-point meeting: symmetric

   In this case each of the two sites has an identical number of screens, with cameras having fixed fields of view, and one camera for each screen.  The sound type is the same at each end.  As an example, there could be 3 cameras and 3 screens in each room, with stereo sound being sent and received at each end.

   The important thing here is that each of the two sites has the same number of screens.  Each screen is paired with a corresponding camera.  Each camera/screen pair is typically connected to a separate codec, producing an encoded video stream for transmission to the remote site and receiving a similarly encoded stream from the remote site.

   Each system has one or multiple microphones for capturing audio.  In some cases, stereophonic microphones are employed.  In other systems, a microphone may be placed in front of each participant (or pair of participants).  In typical systems all the microphones are connected to a single codec that sends and receives the audio streams as either stereo or surround sound.  The number of microphones and the number of audio channels are often not the same as the number of cameras.  Also, the number of microphones is often not the same as the number of loudspeakers.

   The audio may be transmitted as multi-channel (stereo/surround sound) or as distinct and separate monophonic streams.  Audio levels should be matched, so that the sound levels at both sites are identical.  Loudspeaker and microphone placements are chosen so that the sound "stage" (orientation of apparent audio sources) is coordinated with the video.  That is, if a participant at one site speaks, the participants at the remote site perceive her voice as originating from her visual image.  In order to accomplish this, the audio needs to be mapped at the receiving site in the same fashion as the video.  That is, audio received from the right side of the room needs to be output from loudspeaker(s) on the left side at the remote site, and vice versa.
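   A minimal Python sketch of this left/right mapping, assuming audio arrives as named channels (the channel names and the function are illustrative, not from any specification):

      def render_sound_stage(received_channels: dict) -> dict:
          """Mirror the sound stage so a voice appears to come from the
          speaker's visual image: audio captured at the sender's right is
          played from the receiver's left loudspeaker, and vice versa."""
          mirror = {"left": "right", "right": "left", "center": "center"}
          return {mirror.get(name, name): samples
                  for name, samples in received_channels.items()}

      # Example: the remote right-side microphone feeds the local left
      # loudspeaker.
      print(render_sound_stage({"left": "L", "right": "R", "center": "C"}))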
3.2.  Point-to-point meeting: asymmetric

   In this case, each site has a different number of screens and cameras than the other site.  The important characteristic of this scenario is that the number of screens is different between the two sites.  This creates challenges that are handled differently by different telepresence systems.

   This use case builds on the basic scenario of 3 screens to 3 screens.  Here, we use the common case of 3 screens and 3 cameras at one site, and 1 screen and 1 camera at the other site, connected by a point-to-point call.  The screen sizes and camera fields of view at both sites are basically similar, such that each camera view is designed to show two people sitting side by side.  Thus, the 1-screen room has up to 2 people seated at the table, while the 3-screen room may have up to 6 people at the table.

   The basic considerations of defining left and right and indicating relative placement of the multiple audio and video streams are the same as in the 3-3 use case.  However, handling the mismatch between the two sites in the number of screens and cameras requires more complicated maneuvers.

   For the video sent from the 1-camera room to the 3-screen room, the usual approach is to simply use 1 of the 3 screens and keep the second and third screens inactive or, for example, put up the current date.  This maintains the "full size" image of the remote side.

   For the other direction, the 3-camera room sending video to the 1-screen room, there are more complicated variations to consider.  Here are several possible ways in which the video streams can be handled.

   1.  The 1-screen system might simply show only 1 of the 3 camera images, since the receiving side has only 1 screen.  Two people are seen at full size, but 4 people are not seen at all.  The choice of which 1 of the 3 streams to display could be fixed, or could be selected by the users.  It could also be made automatically based on who is speaking in the 3-screen room, such that the people in the 1-screen room always see the person who is speaking.  If the automatic selection is done at the sender, the transmission of streams that are not displayed could be suppressed, which would avoid wasting bandwidth.

   2.  The 1-screen system might be capable of receiving and decoding all 3 streams from all 3 cameras.  The 1-screen system could then compose the 3 streams into 1 local image for display on the single screen.  All six people would be seen, but smaller than full size.  This could be done in conjunction with reducing the image resolution of the streams, such that encode/decode resources and bandwidth are not wasted on streams that will be downsized for display anyway.

   3.  The 3-screen system might be capable of including all 6 people in a single stream to send to the 1-screen system.  For example, it could use PTZ (Pan Tilt Zoom) cameras to physically adjust the cameras such that 1 camera captures the whole room of six people.  Or it could recompose the 3 camera images into 1 encoded stream to send to the remote site.  These variations also show all six people, but at a reduced size.

   4.  Or, there could be a combination of these approaches, such as simultaneously showing the speaker at full size with a composite of all 6 participants at a smaller size.

   The receiving telepresence system needs to have information about the content of the streams it receives to make any of these decisions.  If the systems are capable of supporting more than one strategy, there needs to be some negotiation between the two sites to figure out which of the possible variations they will use in a specific point-to-point call.
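   The following Python fragment sketches one conceivable form such a negotiation could take: each option above is given an illustrative label, and the receiver's preference order is matched against what the sender supports.  The labels and logic are hypothetical, not part of any protocol.

      # Illustrative labels for the four options above.
      STRATEGIES = [
          "switched-single",         # 1: one camera, possibly speaker-switched
          "receiver-compose",        # 2: receiver decodes all 3, composes locally
          "sender-compose",          # 3: sender delivers one whole-room stream
          "speaker-plus-composite",  # 4: combination
      ]

      def choose_strategy(sender_supports, receiver_preference):
          """Pick the first strategy the receiver prefers that the sender
          also supports; fall back to a single switched stream."""
          for strategy in receiver_preference:
              if strategy in sender_supports:
                  return strategy
          return "switched-single"

      # Example: a 1-screen endpoint that would rather compose locally.
      chosen = choose_strategy(
          sender_supports={"switched-single", "sender-compose"},
          receiver_preference=["receiver-compose", "sender-compose"])
      # chosen == "sender-compose"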
3.3.  Multipoint meeting

   In a multipoint telepresence conference, there are more than two sites participating.  Additional complexity is required to enable media streams from each participant to show up on the screens of the other participants.

   Clearly, there are a great number of topologies that can be used to display the streams from multiple sites participating in a conference.

   One major objective for telepresence is to be able to preserve the "being there" user experience.  However, in multi-site conferences it is often (in fact, usually) not possible to simultaneously provide full-size video, eye contact, and common perception of gestures and gaze by all participants.  Several policies can be used for stream distribution and display: all provide good results, but they all make different compromises.

   One common policy is called site switching.  Let's say the speaker is at site A and everyone else is at a "remote" site.  When the room at site A is shown, all the camera images from site A are forwarded to the remote sites.  Therefore, at each receiving remote site, all the screens display camera images from site A.  This can be used to preserve full-size image display and also provide full visual context of the displayed far end, site A.  In site switching, there is a fixed relation between the cameras in each room and the screens in remote rooms.  The room or participants being shown is switched from time to time based on who is speaking or by manual control, e.g., from site A to site B.

   Segment switching is another policy choice.  Still using site A as the speaker's site, and "remote" to refer to all the other sites, in segment switching, rather than sending all the images from site A, only the speaker at site A is shown.  The camera images of the current speaker and previous speakers (if any) are forwarded to the other sites in the conference.  Therefore, the screens at each site are usually displaying images from different remote sites: the current speaker at site A and the previous ones.  This strategy can be used to preserve full-size image display and also capture the non-verbal communication between the speakers.  In segment switching, the display depends on the activity in the remote rooms (generally, but not necessarily, based on audio/speech detection).

   A third possibility is to reduce the image size so that multiple camera views can be composited onto one or more screens.  This does not preserve full-size image display, but provides the most visual context (since more sites or segments can be seen).  Typically in this case the display mapping is static, i.e., each part of each room is shown in the same location on the display screens throughout the conference.

   Other policies and combinations are also possible.  For example, there can be a static display of all screens from all remote rooms, with part or all of one screen being used to show the current speaker at full size.
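   To contrast the first two policies, here is a hypothetical Python sketch of the forwarding decision each one implies.  The Site type and its fields are invented for illustration.

      from dataclasses import dataclass
      from typing import Dict, List

      @dataclass
      class Site:
          camera_streams: List[str]  # camera stream ids, left to right
          speaker_segment: str       # id of the camera covering the talker

      def site_switching(sites: Dict[str, Site],
                         active: str) -> Dict[str, List[str]]:
          """Every screen at every receiver shows the active site,
          preserving the full visual context of that one room."""
          return {name: sites[active].camera_streams
                  for name in sites if name != active}

      def segment_switching(sites: Dict[str, Site],
                            recent_speaker_sites: List[str],
                            screens: int) -> Dict[str, List[str]]:
          """Each receiver's screens show the current and previous
          speakers, who may come from different rooms; a room is never
          shown to itself."""
          segments = [sites[s].speaker_segment for s in recent_speaker_sites]
          return {name: [seg for seg in segments
                         if seg not in sites[name].camera_streams][:screens]
                  for name in sites}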
3.4.  Presentation

   In addition to the video and audio streams showing the participants, additional streams are used for presentations.

   In systems available today, generally only one additional video stream is available for presentations.  Often this presentation stream is half-duplex in nature, with presenters taking turns.  The presentation stream may be captured from a PC screen, or it may come from a multimedia source such as a document camera, camcorder, or DVD.  In a multipoint meeting, the presentation streams for the currently active presentation are always distributed to all sites in the meeting, so that the presentations are viewed by all.

   Some systems display the presentation streams on a screen that is mounted either above or below the three participant screens.  Other systems provide screens on the conference table for observing presentations.  If multiple presentation screens are used, they generally display identical content.  There is considerable variation in the placement, number, and size of presentation screens.

   In some systems presentation audio is pre-mixed with the room audio.  In others, a separate presentation audio stream is provided (if the presentation includes audio).

   In H.323 [ITU.H323] systems, H.239 [ITU.H239] is typically used to control the video presentation stream.  In SIP systems, similar control mechanisms can be provided using BFCP [RFC4582] for the presentation token.  These mechanisms are suitable for managing a single presentation stream.

   Although today's systems remain limited to a single video presentation stream, there are obvious uses for multiple presentation streams:

   1.  Frequently the meeting convener is following a meeting agenda, and it is useful for her to be able to show that agenda to all participants during the meeting.  Other participants at various remote sites are able to make presentations during the meeting, with the presenters taking turns.  The presentations and the agenda are both shown, either on separate screens, or perhaps re-scaled and shown on a single screen.

   2.  A single multimedia presentation can itself include multiple video streams that should be shown together.  For instance, a presenter may be discussing the fairness of media coverage.  In addition to slides that support the presenter's conclusions, she also has video excerpts from various news programs that she shows to illustrate her findings.  She uses a DVD player for the video excerpts so that she can pause and reposition the video as needed.

   3.  An educator who is presenting a multi-screen slide show.  This show requires that the placement of the images on the multiple screens at each site be consistent.

   There are many other examples where multiple presentation streams are useful.
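   As a purely hypothetical illustration of the first example (an agenda pinned alongside rotating presentations), the following Python sketch labels concurrent presentation streams and assigns them to presentation screens.  Nothing here corresponds to H.239 or BFCP semantics; all names are invented.

      # Illustrative descriptors for concurrent presentation streams.
      presentations = [
          {"id": "agenda", "source": "convener-pc", "pinned": True},
          {"id": "slides", "source": "presenter-pc", "pinned": False},
          {"id": "news-clips", "source": "dvd-player", "pinned": False},
      ]

      def assign_screens(presentations, num_screens):
          """Give pinned streams (the agenda) their own screen; remaining
          screens go to the other active presentations in turn."""
          pinned = [p["id"] for p in presentations if p["pinned"]]
          others = [p["id"] for p in presentations if not p["pinned"]]
          layout = pinned[:num_screens]
          layout += others[:max(0, num_screens - len(layout))]
          return layout

      # With two presentation screens: ["agenda", "slides"]; the news
      # clips would be swapped in when the presenter changes sources.
      print(assign_screens(presentations, 2))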
3.5.  Heterogeneous Systems

   It is common in meeting scenarios for people to join the conference from a variety of environments, using different types of endpoint devices.  A multi-screen immersive telepresence conference may include someone on a PC-based video conferencing system, a participant calling in by phone, and (soon) someone on a handheld device.

   What experience/view will each of these devices have?

   Some may be able to handle multiple streams, and others can handle only a single stream.  (We are not talking here about legacy systems, but rather about systems built to participate in such a conference, although they are single-stream only.)  In a single video stream, the stream may contain one or more compositions depending on the available screen space on the device.  In most cases an intermediate transcoding device will be relied upon to produce a single stream, perhaps with some kind of continuous presence.

   Bit rates will vary, with the handheld and phone having lower bit rates than the PC and multi-screen systems.

   Layout is accomplished according to different policies.  For example, a handheld and a PC may receive the active speaker stream.  The decision can be made either explicitly by the receiver or by the sender if it can receive some kind of rendering hint.  The same is true for audio; i.e., the device receives a mixed stream, or a number of the loudest speakers if mixing is not available in the network.

   For the PC-based conferencing participant, the user's experience depends on the application.  It could be single stream, similar to a handheld but with a bigger screen.  Or, it could be multiple streams, similar to an immersive telepresence system but with a smaller screen.  Control for manipulation of streams can be local in the software application, or in another location and sent to the application over the network.

   The handheld device is the most extreme case.  How will that participant be viewed and heard?  It should be an equal participant, though the bandwidth will be significantly less than for an immersive system.  A receiver may choose to display output coming from a handheld differently based on the resolution, but that would be the case with any low-resolution video stream, e.g., from a powerful PC on a bad network.

   The handheld will send and receive a single video stream, which could be a composite or a subset of the conference.  The handheld could say what it wants or could accept whatever the sender (conference server or sending endpoint) thinks is best.  The handheld will have to signal any actions it wants to take the same way that an immersive system signals actions.
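   A small Python sketch of the sender-side decision this section implies, i.e., what to offer each class of endpoint given its capabilities.  The endpoint kinds and returned labels are invented for illustration.

      def streams_for_endpoint(kind: str, offered_video: list) -> dict:
          """Choose per-endpoint media along the lines described above."""
          if kind == "multi-screen":
              # Full set of participant streams plus multi-channel audio.
              return {"video": offered_video, "audio": "multi-channel"}
          if kind in ("pc-single", "handheld"):
              # One stream: a composite or the active speaker, at a lower
              # bit rate; audio arrives as a single mixed stream.
              return {"video": ["composite-or-active-speaker"],
                      "audio": "mixed"}
          if kind == "phone":
              return {"video": [], "audio": "mixed"}
          return {"video": ["composite-or-active-speaker"], "audio": "mixed"}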
3.6.  Multipoint Education Usage

   The importance of this example is that the multiple video streams are not used to create an immersive conferencing experience with panoramic views at all the sites.  Instead, the multiple streams are dynamically used to enable full participation of remote students in a university class.  In some instances the same video stream is displayed on multiple screens in the room; in other instances an available stream is not displayed at all.

   The main site is a university auditorium, which is equipped with three cameras.  One camera is focused on the professor at the podium.  A second camera is mounted on the wall behind the professor and captures the class in its entirety.  The third camera is co-located with the second and is designed to capture a close-up view of a questioner in the audience.  It automatically zooms in on that student using sound localization.

   Although the auditorium is equipped with three cameras, it is only equipped with two screens.  One is a large screen located at the front so that the class can see it.  The other is located at the rear so the professor can see it.  When someone asks a question, the front screen shows the questioner.  Otherwise it shows the professor (ensuring everyone can easily see her).

   The remote sites are typical immersive telepresence rooms with three camera/screen pairs.

   All remote sites display the professor on the center screen at full size.  A second screen shows the entire classroom view when the professor is speaking.  However, when a student asks a question, the second screen shows the close-up view of the student at full size.  Sometimes the student is in the auditorium; sometimes the speaking student is at another remote site.  The remote systems never display the students that are actually in that room.

   If someone at a remote site asks a question, then the screen in the auditorium will show the remote student at full size (as if they were present in the auditorium itself).  The screen in the rear also shows this questioner, allowing the professor to see and respond to the student without needing to turn her back on the main class.

   When no one is asking a question, the screen in the rear briefly shows a full-room view of each remote site in turn, allowing the professor to monitor the entire class (remote and local students).  The professor can also use a control on the podium to see a particular site; she can choose either a full-room view or a single camera view.

   Realization of this use case does not require any negotiation between the participating sites.  Endpoint devices (and an MCU, if present) need to know who is speaking and what video stream includes the view of that speaker.  The remote systems need some knowledge of which stream should be placed in the center.  The ability of the professor to see specific sites (or for the system to show all the sites in turn) would also require the auditorium system to know what sites are available, and to be able to request a particular view of any site.  Bandwidth is optimized if video that is not being shown at a particular site is not distributed to that site.

3.7.  Multipoint Multiview (Virtual space)

   This use case describes a virtual space multipoint meeting with good eye contact and spatial layout of participants.  The use case was proposed very early in the development of video conferencing systems, as described in 1983 by Allardyce and Randall [virtualspace]; the term "virtual space" comes from their report, and the use case is illustrated in figure 2-5 of that report.  Virtual space expands the point-to-point case by having all multipoint conference participants "seated" in a virtual room.  Each participant has a fixed "seat" in the virtual room, so each participant expects to see a different view, with a different participant on his left and right side.  Today, the use case is implemented in multiple telepresence-type video conferencing systems on the market.  The main difference between the results obtained with modern systems and those from 1983 is larger screen sizes.

   Virtual space multipoint as defined here assumes endpoints with multiple cameras and screens.  Usually there is the same number of cameras and screens at a given endpoint.  A camera is positioned above each screen.  A key aspect of virtual space multipoint is the details of how the cameras are aimed.  The cameras are all aimed at the same area of view of the participants at the site.  Thus, each camera takes a picture of the same set of people, but from a different angle.  Each endpoint sender in the virtual space multipoint meeting therefore offers a choice of video streams to remote receivers, each stream representing a different viewpoint.  For example, a camera positioned above a screen to a participant's left may take video pictures of the participant's left ear, while at the same time a camera positioned above a screen to the participant's right may take video pictures of the participant's right ear.

   Since a sending endpoint has a camera associated with each screen, an association is made between the receiving stream output on a particular screen and the corresponding sending stream from the camera associated with that screen.  These associations are repeated for each screen/camera pair in a meeting.  The result of this system is a horizontal arrangement of video images from remote sites, one per screen.  The image on each screen is paired with the output of the camera above that screen, resulting in excellent eye contact.
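   In Python, the per-pair association this section describes might look like the following sketch (the stream naming is invented):

      def virtual_space_pairing(num_pairs: int) -> dict:
          """Associate each local screen k with the remote stream captured
          by the camera mounted above the remote screen k, so every screen
          shows the view angle that matches gaze directed toward it."""
          return {f"screen-{k}": f"remote-camera-above-screen-{k}"
                  for k in range(num_pairs)}

      # A 3-screen endpoint receives three views of the same remote room:
      # {'screen-0': 'remote-camera-above-screen-0', ...}
      print(virtual_space_pairing(3))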
3.8.  Multiple presentation streams - Telemedicine

   This use case describes a scenario where multiple presentation streams are used.  In this use case, the local site is a surgery room connected to one or more remote sites that may have different capabilities.  At the local site, three main cameras capture the whole room (the typical 3-camera telepresence case).  In addition, multiple presentation inputs are available: a surgery camera that provides a zoomed view of the operation, an endoscopic monitor, an X-ray CT image output device, a B-ultrasonic apparatus, a cardiogram generator, an MRI image instrument, etc.  These devices are used to provide multiple local video presentation streams to help the surgeon monitor the status of the patient and assist in the surgical process.

   The local site may have three main screens and one (or more) presentation screen(s).  The main screens can be used to display the remote experts.  The presentation screen(s) can be used to display multiple presentation streams from local and remote sites simultaneously.  The three main cameras capture different parts of the surgery room.  The surgeon can decide the number, the size, and the placement of the presentations displayed on the local presentation screen(s).  He can also indicate which local presentation captures are provided to the remote sites.  The local site can send multiple presentation captures to the remote sites, and it can receive from them multiple presentations related to the patient or the procedure.

   One type of remote site is a single- or dual-screen, one-camera system used by a consulting expert.  In the general case, the remote sites can be part of a multipoint telepresence conference.  The presentation screens at the remote sites allow the experts to see the details of the operation and the related data.  As at the main site, the experts can decide the number, the size, and the placement of the presentations displayed on their presentation screens.  The presentation screens can display presentation streams from the surgery room or from other remote sites, as well as local presentation streams.  Thus, the experts can also start sending presentation streams, which can carry medical records, pathology data, or their references and analysis, etc.

   Another type of remote site is a typical immersive telepresence room with three camera/screen pairs, allowing more experts to join the consultation.  These sites can also be used for education.  The teacher, who is not necessarily the surgeon, and the students are at different remote sites.  Students can observe and learn the details of the whole procedure, while the teacher can explain and answer questions during the operation.

   All remote education sites can display the surgery room.  Another option is to display the surgery room on the center screen, and the rest of the screens can show the teacher and the student who is asking a question.
   For all the above sites, multiple presentation screens can be used to enhance visibility: one screen for the zoomed surgery stream and the others for medical image streams, such as MRI images, cardiograms, B-ultrasonic images, and pathology data.

4.  Acknowledgements

   The document has benefitted from input from a number of people, including Alex Eleftheriadis, Marshall Eubanks, Tommy Andre Nyquist, Mark Gorzynski, Charles Eckel, Nermeen Ismail, Mary Barnes, Pascal Buhler, and Jim Cole.

   Special acknowledgement to Lennard Xiao, who contributed the text for the telemedicine use case.

5.  IANA Considerations

   This document contains no IANA considerations.

6.  Security Considerations

   While there are likely to be security considerations for any solution for telepresence interoperability, this document has no security considerations.

7.  Informative References

   [ITU.H239]     ITU-T, "Role management and additional media channels for H.300-series terminals", ITU-T Recommendation H.239, September 2005.

   [ITU.H323]     ITU-T, "Packet-based Multimedia Communications Systems", ITU-T Recommendation H.323, December 2009.

   [RFC3261]      Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston, A., Peterson, J., Sparks, R., Handley, M., and E. Schooler, "SIP: Session Initiation Protocol", RFC 3261, June 2002.

   [RFC3550]      Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.

   [RFC4582]      Camarillo, G., Ott, J., and K. Drage, "The Binary Floor Control Protocol (BFCP)", RFC 4582, November 2006.

   [virtualspace] Allardyce and Randall, "Development of Teleconferencing Methodologies With Emphasis on Virtual Space Video and Interactive Graphics", 1983.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA 95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA 01810
   US

   Email: stephen.botzko@polycom.com

   Mark Duckworth
   Polycom
   Andover, MA 01810
   US

   Email: mark.duckworth@polycom.com

   Roni Even (editor)
   Huawei Technologies
   Tel Aviv
   Israel

   Email: roni.even@mail01.huawei.com