2	CLUE WG                                                       A. Romanow
3	Internet-Draft                                                     Cisco
4	Intended status: Informational                                 S. Botzko
5	Expires: October 08, 2013                                   M. Duckworth
6	                                                                 Polycom
7	                                                            R. Even, Ed.
8	                                                     Huawei Technologies
9	                                                          April 06, 2013

11	                Use Cases for Telepresence Multi-streams
12	             draft-ietf-clue-telepresence-use-cases-05.txt

14	Abstract

16	   Telepresence conferencing systems seek to create the sense of really
17	   being present for the participants.  A number of techniques for
18	   handling audio and video streams are used to create this experience.
19	   When these techniques are not similar, interoperability between
20	   different systems is difficult at best, and often not possible.
21	   Conveying information about the relationships between multiple
22	   streams of media would allow senders and receivers to make choices to
23	   allow telepresence systems to interwork.  This memo describes the
24	   most typical and important use cases for sending multiple streams in
25	   a telepresence conference.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on October 08, 2013.

44	Copyright Notice

46	   Copyright (c) 2013 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	   2.  Telepresence Scenarios Overview
63	   3.  Use Case Scenarios
64	     3.1.  Point to point meeting: symmetric
65	     3.2.  Point to point meeting: asymmetric
66	     3.3.  Multipoint meeting
67	     3.4.  Presentation
68	     3.5.  Heterogeneous Systems
69	     3.6.  Multipoint Education Usage
70	     3.7.  Multipoint Multiview (Virtual space)
71	     3.8.  Multiple presentations streams - Telemedicine
72	   4.  Acknowledgements
73	   5.  IANA Considerations
74	   6.  Security Considerations
75	   7.  Informative References
76	   Authors' Addresses

78	1.  Introduction

80	   Telepresence applications try to provide a "being there" experience
81	   for conversational video conferencing.  Often this telepresence
82	   application is described as "immersive telepresence" in order to
83	   distinguish it from traditional video conferencing, and from other
84	   forms of remote presence not related to conversational video
85	   conferencing, such as avatars and robots.  The salient
86	   characteristics of telepresence are often described as: actual sized,
87	   immersive video, preserving interpersonal interaction and allowing
88	   non-verbal communication.

90	   Although telepresence systems are based on open standards such as RTP
91	   [RFC3550], SIP [RFC3261], H.264, and the H.323 [ITU.H323] suite of
92	   protocols, they cannot easily interoperate with each other without
93	   operator assistance and expensive additional equipment that
94	   translates from one vendor's protocol to another.  A standard way of
95	   describing the multiple streams constituting the media flows and the
96	   fundamental aspects of their behavior would allow telepresence
97	   systems to interwork.

99	   This draft presents a set of use cases describing typical scenarios.
100	   Requirements will be derived from these use cases in a separate
101	   document.  The use cases are described from the viewpoint of the
102	   users.  They are illustrative of the user experience that needs to be
103	   supported.  It is possible to implement these use cases in a variety
104	   of different ways.

106	   Many different scenarios need to be supported.  This document
107	   describes in detail the most common and basic use cases.  These will
108	   cover most of the requirements.  There may be additional scenarios
109	   that bring new features and requirements which can be used to extend
110	   the initial work.

112	   Point-to-point and multipoint telepresence conferences are
113	   considered.  In some use cases, the number of displays is the same at
114	   all sites; in others, the number of displays differs between
115	   sites.  Both kinds of use case are considered.  Also included is a
116	   use case describing display of presentation material or content.

118	   The document structure is as follows: Section 2 gives an overview of
119	   scenarios, and Section 3 describes use cases.

121	2.  Telepresence Scenarios Overview

123	   This section describes the general characteristics of the use cases
124	   and what the scenarios are intended to show.  The typical setting is
125	   a business conference, which was the initial focus of telepresence.
126	   Recently, consumer products are also being developed.  We specifically
127	   do not include in our scenarios the infrastructure aspects of
128	   telepresence, such as room construction, layout and decoration.

130	   Telepresence systems are typically composed of one or more video
131	   cameras and encoders and one or more display monitors of large size
132	   (diagonal around 60").  Microphones pick up sound and audio
133	   codec(s) produce one or more audio streams.  The cameras used to
134	   capture the telepresence users we will call participant cameras (and
135	   likewise for displays).  There may also be other cameras, such as for
136	   document display.  These will be referred to as presentation or
137	   content cameras, which generally have different formats, aspect
138	   ratios, and frame rates from the participant cameras.  The
139	   presentation streams may be shown on a participant monitor, or on
140	   auxiliary display monitors.  A user's computer may also serve as a
141	   virtual content camera, generating an animation or playing back a
142	   video for display to the remote participants.

144	   We describe such a telepresence system as sending M video streams, N
145	   audio streams, and D content streams to the remote system(s).  (Note
146	   that the number of audio streams is generally not the same as the
147	   number of video streams.)
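
The M/N/D notation above can be written down as a small record.  The following is only an illustrative sketch: the name StreamCounts and the example values are hypothetical and do not come from this document or from any CLUE specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamCounts:
    """Streams one telepresence system sends to the remote system(s)."""

    video: int    # M: participant video streams
    audio: int    # N: audio streams (generally not equal to M)
    content: int  # D: presentation/content streams

    def total(self) -> int:
        """Total number of media streams sent."""
        return self.video + self.audio + self.content


# Hypothetical example: three participant cameras, one audio stream,
# and one presentation stream.
three_screen_room = StreamCounts(video=3, audio=1, content=1)
```

Keeping the three counts separate reflects the note above that the number of audio streams is generally not the same as the number of video streams.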

149	   The fundamental parameters describing today's typical telepresence
150	   scenario include:

152	   1.   The number of participating sites

154	   2.   The number of visible seats at a site

156	   3.   The number of cameras

158	   4.   The number and type of microphones

160	   5.   The number of audio channels

162	   6.   The screen size

164	   7.   The display capabilities - such as resolution, frame rate,
165	        aspect ratio

167	   8.   The arrangement of the monitors in relation to each other

169	   9.   The same or a different number of primary monitors at all sites

171	   10.  Type and number of presentation monitors

173	   11.  Multipoint conference display strategies - for example, the
174	        camera-to-display mappings may be static or dynamic

176	   12.  The camera viewpoint

178	   13.  The cameras' fields of view and how they do or do not overlap

180	   The basic features that give telepresence its distinctive
181	   characteristics are implemented in disparate ways in different
182	   systems.  Currently, telepresence systems from diverse vendors
183	   interoperate to some extent, but this is not supported in a
184	   standards-based fashion.  Interworking requires that translation and
185	   transcoding devices be included in the architecture.  Such devices
186	   increase latency, reducing the quality of interpersonal interaction.
187	   Use of these devices is often not automatic; it frequently requires
188	   substantial manual configuration and a detailed understanding of the
189	   nature of underlying audio and video streams.  This state of affairs
190	   is not acceptable for the continued growth of telepresence -
191	   telepresence systems should have the same ease of interoperability as
192	   do telephones.

194	   There is no agreed-upon way to adequately describe the semantics of
195	   how streams of various media types relate to each other.  Without a
196	   standard for stream semantics to describe the particular roles and
197	   activities of each stream in the conference, interoperability is
198	   cumbersome at best.

200	   In a multiple screen conference, the video and audio streams sent
201	   from remote participants must be understood by receivers so that they
202	   can be presented in a coherent and life-like manner.  This includes
203	   the ability to present remote participants at their actual size for
204	   their apparent distance, while maintaining correct eye contact,
205	   gesticular cues, and simultaneously providing a spatial audio sound
206	   stage that is consistent with the displayed video.

208	   The receiving device that decides how to display incoming information
209	   needs to understand a number of variables, such as the spatial
210	   position of the speaker, the field of view of the cameras, the camera
211	   zoom, and which media stream is related to each of the displays.  It
212	   is not simply that individual streams must be adequately described
213	   (to a large extent this already exists), but rather that the semantics
214	   of the relationships between the streams must be communicated.  Note
215	   that all of this is still required even if the basic aspects of the
216	   streams, such as the bit rate, frame rate, and aspect ratio, are
217	   known.  Thus, this problem has aspects considerably beyond those
218	   encountered in interoperation of single-node video conferencing
219	   units.

221	3.  Use Case Scenarios

223	   Our development of use cases is staged, initially focusing on what is
224	   currently typical and important.  Use cases that add future or more
225	   specialized features will be added later as needed.  Also, there are
226	   a number of possible variants for these use cases, for example, the
227	   audio supported may differ at the end points (such as mono or stereo
228	   versus surround sound), etc.

230	   The use cases here are intended to be hierarchical, in that the
231	   earlier use cases describe basics of telepresence that will also be
232	   used by later use cases.

234	   Many of these systems offer a full conference room solution where
235	   local participants sit on one side of a table and remote participants
236	   are displayed as if they are sitting on the other side of the table.
237	   The cameras and screens are typically arranged to provide a panoramic
238	   (left to right from the local user view point) view of the remote
239	   room.

241	   The sense of immersion and non-verbal communication is fostered by a
242	   number of technical features, such as:

244	   1.  Good eye contact, which is achieved by careful placement of
245	       participants, cameras and screens.

247	   2.  Camera field of view and screen sizes are matched so that the
248	       images of the remote room appear to be full size.

250	   3.  The left side of each room is presented on the right display at
251	       the far end; similarly the right side of the room is presented on
252	       the left display.  The effect of this is that participants of
253	       each site appear to be sitting across the table from each other.
254	       If two participants on the same site glance at each other, all
255	       participants can observe it.  Likewise, if a participant on one
256	       site gestures to a participant on the other site, all
257	       participants observe the gesture itself and the participants it
258	       includes.

260	3.1.  Point to point meeting: symmetric

262	   In this case each of the two sites has an identical number of
263	   screens, with cameras having fixed fields of view, and one camera for
264	   each screen.  The sound type is the same at each end.  As an example,
265	   there could be 3 cameras and 3 screens in each room, with stereo
266	   sound being sent and received at each end.

268	   The important thing here is that each of the 2 sites has the same
269	   number of screens.  Each screen is paired with a corresponding
270	   camera.  Each camera / screen pair is typically connected to a
271	   separate codec, producing an encoded video stream for transmission to
272	   the remote site, and receiving a similarly encoded stream from the
273	   remote site.

275	   Each system has one or multiple microphones for capturing audio.  In
276	   some cases, stereophonic microphones are employed.  In other systems,
277	   a microphone may be placed in front of each participant (or pair of
278	   participants).  In typical systems all the microphones are connected
279	   to a single codec that sends and receives the audio streams as either
280	   stereo or surround sound.  The number of microphones and the number
281	   of audio channels are often not the same as the number of cameras.
282	   Also the number of microphones is often not the same as the number of
283	   loudspeakers.

285	   The audio may be transmitted as multi-channel (stereo/surround sound)
286	   or as distinct and separate monophonic streams.  Audio levels should
287	   be matched, so the sound levels at both sites are identical.
288	   Loudspeaker and microphone placements are chosen so that the sound
289	   "stage" (orientation of apparent audio sources) is coordinated with
290	   the video.  That is, if a participant on one site speaks, the
291	   participants at the remote site perceive her voice as originating
292	   from her visual image.  In order to accomplish this, the audio needs
293	   to be mapped at the receiving site in the same fashion as the video.
294	   That is, audio received from the right side of the room needs to be
295	   output from loudspeaker(s) on the left side at the remote site, and
296	   vice versa.
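
The left/right reversal described above for both video and audio can be sketched as an index mapping.  This is a minimal illustration under one assumption that this document does not state: positions (cameras, displays, audio channels) are numbered 0..count-1 from left to right as seen by a participant seated in each room and facing the screens.  The function names are hypothetical.

```python
from typing import Dict


def mirror_index(source_index: int, count: int) -> int:
    """Map a capture position at the sending site to the rendering
    position at the receiving site (left becomes right, and vice versa)."""
    if not 0 <= source_index < count:
        raise ValueError("source_index out of range")
    return count - 1 - source_index


def mirror_mapping(count: int) -> Dict[int, int]:
    """Full capture-to-render mapping for a symmetric call with the
    same number of positions at both sites."""
    return {i: mirror_index(i, count) for i in range(count)}


# Symmetric 3-camera / 3-screen call: left and right swap, centre stays.
mapping = mirror_mapping(3)
```

Applying the same mapping to the audio channels keeps the sound stage consistent with the displayed video.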

298	3.2.  Point to point meeting: asymmetric

300	   In this case, each site has a different number of screens and cameras
301	   than the other site.  The important characteristic of this scenario
302	   is that the number of displays is different between the two sites.
303	   This creates challenges which are handled differently by different
304	   telepresence systems.

306	   This use case builds on the basic scenario of 3 screens to 3 screens.
307	   Here, we use the common case of 3 screens and 3 cameras at one site,
308	   and 1 screen and 1 camera at the other site, connected by a point to
309	   point call.  The display sizes and camera fields of view at both
310	   sites are basically similar, such that each camera view is designed
311	   to show two people sitting side by side.  Thus the 1 screen room has
312	   up to 2 people seated at the table, while the 3 screen room may have
313	   up to 6 people at the table.

315	   The basic considerations of defining left and right and indicating
316	   relative placement of the multiple audio and video streams are the
317	   same as in the 3-3 use case.  However, handling the mismatch between
318	   the two sites of the number of displays and cameras requires more
319	   complicated manoeuvres.

321	   For the video sent from the 1 camera room to the 3 screen room,
322	   usually what is done is to simply use 1 of the 3 displays and keep
323	   the second and third displays inactive, or put up the date, for
324	   example.  This would maintain the "full size" image of the remote
325	   side.

327	   For the other direction, the 3 camera room sending video to the 1
328	   screen room, there are more complicated variations to consider.  Here
329	   are several possible ways in which the video streams can be handled.

331	   1.  The 1 screen system might simply show only 1 of the 3 camera
332	       images, since the receiving side has only 1 screen.  Two people
333	       are seen at full size, but 4 people are not seen at all.  The
334	       choice of which 1 of the 3 streams to display could be fixed, or
335	       could be selected by the users.  It could also be made
336	       automatically based on who is speaking in the 3 screen room, such
337	       that the people in the 1 screen room always see the person who is
338	       speaking.  If the automatic selection is done at the sender, the
339	       transmission of streams that are not displayed could be
340	       suppressed, which would avoid wasting bandwidth.

342	   2.  The 1 screen system might be capable of receiving and decoding
343	       all 3 streams from all 3 cameras.  The 1 screen system could then
344	       compose the 3 streams into 1 local image for display on the
345	       single screen.  All six people would be seen, but smaller than
346	       full size.  This could be done in conjunction with reducing the
347	       image resolution of the streams, such that encode/decode
348	       resources and bandwidth are not wasted on streams that will be
349	       downsized for display anyway.

351	   3.  The 3 screen system might be capable of including all 6 people in
352	       a single stream to send to the 1 screen system.  For example, it
353	       could use PTZ (Pan Tilt Zoom) cameras to physically adjust the
354	       cameras such that 1 camera captures the whole room of six people.
355	       Or it could recompose the 3 camera images into 1 encoded stream
356	       to send to the remote site.  These variations also show all six
357	       people, but at a reduced size.

359	   4.  Or, there could be a combination of these approaches, such as
360	       simultaneously showing the speaker in full size with a composite
361	       of all the 6 participants in smaller size.

363	   The receiving telepresence system needs to have information about the
364	   content of the streams it receives to make any of these decisions.
365	   If the systems are capable of supporting more than one strategy,
366	   there needs to be some negotiation between the two sites to figure
367	   out which of the possible variations they will use in a specific
368	   point to point call.
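
The negotiation mentioned above could, for instance, pick the first mutually supported strategy from a preference list.  The sketch below is only one possible approach and is not defined by this document: the Strategy names, the choose_strategy function, and the idea of a preference order are all assumptions made for illustration.

```python
from enum import Enum
from typing import Iterable, Optional


class Strategy(Enum):
    """The four handling options listed above (names are hypothetical)."""

    SHOW_ONE_STREAM = 1          # option 1: display 1 of the 3 streams
    RECEIVER_COMPOSES = 2        # option 2: receiver composes 3 into 1
    SENDER_COMPOSES = 3          # option 3: sender sends a single stream
    SPEAKER_PLUS_COMPOSITE = 4   # option 4: combination of approaches


def choose_strategy(
    sender_supports: Iterable[Strategy],
    receiver_supports: Iterable[Strategy],
    preference: Iterable[Strategy],
) -> Optional[Strategy]:
    """Return the most preferred strategy both sites support, or None."""
    common = set(sender_supports) & set(receiver_supports)
    for strategy in preference:
        if strategy in common:
            return strategy
    return None


chosen = choose_strategy(
    sender_supports=[Strategy.SHOW_ONE_STREAM, Strategy.SENDER_COMPOSES],
    receiver_supports=[Strategy.SHOW_ONE_STREAM, Strategy.RECEIVER_COMPOSES],
    preference=[Strategy.SPEAKER_PLUS_COMPOSITE, Strategy.RECEIVER_COMPOSES,
                Strategy.SHOW_ONE_STREAM],
)
```

In this example the only strategy both sites support is showing a single stream, so that is what gets selected even though it is last in the preference list.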

370	3.3.  Multipoint meeting

372	   In a multipoint telepresence conference, there are more than two
373	   sites participating.  Additional complexity is required to enable
374	   media streams from each participant to show up on the displays of the
375	   other participants.

377	   Clearly, there are a great number of topologies that can be used to
378	   display the streams from multiple sites participating in a
379	   conference.

381	   One major objective for telepresence is to be able to preserve the
382	   "being there" user experience.  However, in multi-site conferences it
383	   is often (in fact usually) not possible to simultaneously provide
384	   full size video, eye contact, common perception of gestures and gaze
385	   by all participants.  Several policies can be used for stream
386	   distribution and display: all provide good results but they all make
387	   different compromises.

389	   One common policy is called site switching.  Let's say the speaker is
390	   at site A and everyone else is at a "remote" site.  When the room at
391	   site A is shown, all the camera images from site A are forwarded to the
392	   remote sites.  Therefore at each receiving remote site, all the
393	   screens display camera images from site A.  This can be used to
394	   preserve full size image display, and also provide full visual
395	   context of the displayed far end, site A.  In site switching, there
396	   is a fixed relation between the cameras in each room and the displays
397	   in remote rooms.  The room or participants being shown is switched
398	   from time to time based on who is speaking or by manual control,
399	   e.g., from site A to site B.

401	   Segment switching is another policy choice.  Still using site A as
402	   where the speaker is, and "remote" to refer to all the other sites,
403	   in segment switching, rather than sending all the images from site A,
404	   only the speaker at site A is shown.  The camera images of the
405	   current speaker and previous speakers (if any) are forwarded to the
406	   other sites in the conference.  Therefore the screens in each site
407	   are usually displaying images from different remote sites - the
408	   current speaker at site A and the previous ones.  This strategy can
409	   be used to preserve full size image display, and also capture the
410	   non-verbal communication between the speakers.  In segment switching,
411	   the display depends on the activity in the remote rooms (generally,
412	   but not necessarily, based on audio / speech detection).
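
The difference between site switching and segment switching can be sketched as two selection functions.  This is a simplified illustration, not a description of any particular product: a "segment" is modelled here as a (site, camera index) pair, and the function names and the speaker-history input are assumptions.  Which screen each selected segment lands on, and the exclusion of a receiving site's own segments, are not modelled.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int]  # (site name, camera index): hypothetical model


def site_switch(active_site: str,
                cameras_per_site: Dict[str, int]) -> List[Segment]:
    """Site switching: forward every camera image from the active site."""
    return [(active_site, cam)
            for cam in range(cameras_per_site[active_site])]


def segment_switch(speaker_history: List[Segment],
                   screens: int) -> List[Segment]:
    """Segment switching: forward the current speaker's segment plus the
    most recent previous speakers, up to the number of screens.

    speaker_history is ordered oldest first; the last entry is the
    current speaker.
    """
    shown: List[Segment] = []
    for segment in reversed(speaker_history):
        if segment not in shown:
            shown.append(segment)
        if len(shown) == screens:
            break
    return shown


site_view = site_switch("A", {"A": 3, "B": 3, "C": 3})
segment_view = segment_switch(
    [("B", 1), ("C", 0), ("A", 2), ("C", 0), ("A", 2)], screens=3)
```

With site switching all three screens show site A, whereas with segment switching the screens show the current speaker at site A together with the two most recent previous speakers from other sites.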

414	   A third possibility is to reduce the image size so that multiple
415	   camera views can be composited onto one or more screens.  This does
416	   not preserve full size image display, but provides the most visual
417	   context (since more sites or segments can be seen).  Typically in
418	   this case the display mapping is static, i.e., each part of each room
419	   is shown in the same location on the display screens throughout the
420	   conference.

422	   Other policies and combinations are also possible.  For example,
423	   there can be a static display of all screens from all remote rooms,
424	   with part or all of one screen being used to show the current speaker
425	   at full size.

427	3.4.  Presentation

428	   In addition to the video and audio streams showing the participants,
429	   additional streams are used for presentations.

431	   In systems available today, generally only one additional video
432	   stream is available for presentations.  Often this presentation
433	   stream is half-duplex in nature, with presenters taking turns.  The
434	   presentation stream may be captured from a PC screen, or it may come
435	   from a multimedia source such as a document camera, camcorder or a
436	   DVD.  In a multipoint meeting, the presentation streams for the
437	   currently active presentation are always distributed to all sites in
438	   the meeting, so that the presentations are viewed by all.

440	   Some systems display the presentation streams on a screen that is
441	   mounted either above or below the three participant screens.  Other
442	   systems provide monitors on the conference table for observing
443	   presentations.  If multiple presentation monitors are used, they
444	   generally display identical content.  There is considerable variation
445	   in the placement, number, and size of presentation displays.

447	   In some systems presentation audio is pre-mixed with the room audio.
448	   In others, a separate presentation audio stream is provided (if the
449	   presentation includes audio).

451	   In H.323 [ITU.H323] systems, H.239 [ITU.H239] is typically used to
452	   control the video presentation stream.  In SIP systems, similar
453	   control mechanisms can be provided using BFCP [RFC4582] for the
454	   presentation token.  These mechanisms are suitable for managing a
455	   single presentation stream.

457	   Although today's systems remain limited to a single video
458	   presentation stream, there are obvious uses for multiple presentation
459	   streams:

461	   1.  Frequently the meeting convener is following a meeting agenda,
462	       and it is useful for her to be able to show that agenda to all
463	       participants during the meeting.  Other participants at various
464	       remote sites are able to make presentations during the meeting,
465	       with the presenters taking turns.  The presentations and the
466	       agenda are both shown, either on separate displays, or perhaps
467	       re-scaled and shown on a single display.

469	   2.  A single multimedia presentation can itself include multiple
470	       video streams that should be shown together.  For instance, a
471	       presenter may be discussing the fairness of media coverage.  In
472	       addition to slides which support the presenter's conclusions, she
473	       also has video excerpts from various news programs which she
474	       shows to illustrate her findings.  She uses a DVD player for the
475	       video excerpts so that she can pause and reposition the video as
476	       needed.

478	   3.  An educator is presenting a multi-screen slide show.  This
479	       show requires that the placement of the images on the multiple
480	       displays at each site be consistent.

482	   There are many other examples where multiple presentation streams are
483	   useful.

485	3.5.  Heterogeneous Systems

487	   It is common in meeting scenarios for people to join the conference
488	   from a variety of environments, using different types of endpoint
489	   devices.  A multi-screen immersive telepresence conference may
490	   include someone on a PC-based video conferencing system, a
491	   participant calling in by phone, and (soon) someone on a handheld
492	   device.

494	   What experience/view will each of these devices have?

496	   Some may be able to handle multiple streams and others can handle
497	   only a single stream.  (We are not here talking about legacy systems,
498	   but rather systems built to participate in such a conference,
499	   although they are single stream only.)  In a single video stream,
500	   the stream may contain one or more compositions depending on the
501	   available screen space on the device.  In most cases an intermediate
502	   transcoding device will be relied upon to produce a single stream,
503	   perhaps with some kind of continuous presence.

505	   Bit rates will vary - the handheld and phone having lower bit rates
506	   than PC and multi-screen systems.

508	   Layout is accomplished according to different policies.  For example,
509	   a handheld and PC may receive the active speaker stream.  The
510	   decision can be made either explicitly by the receiver, or by the
511	   sender if it can receive some kind of rendering hint.  The same is
512	   true for audio: the device receives either a mixed stream or, if
513	   mixing is not available in the network, a number of the loudest
	   speakers.

515	   For the PC based conferencing participant, the user's experience
516	   depends on the application.  It could be single stream, similar to a
517	   handheld but with a bigger screen.  Or, it could be multiple streams,
518	   similar to an immersive telepresence system but with a smaller
519	   screen.  Control for manipulation of streams can be local in the
520	   software application, or in another location and sent to the
521	   application over the network.

523	   The handheld device is the most extreme.  How will that participant
524	   be viewed and heard?  It should be an equal participant, though the
525	   bandwidth will be significantly less than an immersive system.  A
526	   receiver may choose to display output coming from a handheld
527	   differently based on the resolution, but that would be the case with
528	   any low resolution video stream, e.g., from a powerful PC on a bad
529	   network.

531	   The handheld will send and receive a single video stream, which could
532	   be a composite or a subset of the conference.  The handheld could say
533	   what it wants or could accept whatever the sender (conference server
534	   or sending endpoint) thinks is best.  The handheld will have to
535	   signal any actions it wants to take the same way that an immersive
536	   system signals actions.

538	3.6.  Multipoint Education Usage

540	   The importance of this example is that the multiple video streams are
541	   not used to create an immersive conferencing experience with
542	   panoramic views at all the sites.  Instead the multiple streams are
543	   dynamically used to enable full participation of remote students in a
544	   university class.  In some instances the same video stream is
545	   displayed on multiple displays in the room; in other instances an
546	   available stream is not displayed at all.

548	   The main site is a university auditorium which is equipped with three
549	   cameras.  One camera is focused on the professor at the podium.  A
550	   second camera is mounted on the wall behind the professor and
551	   captures the class in its entirety.  The third camera is co-located
552	   with the second, and is designed to capture a close up view of a
553	   questioner in the audience.  It automatically zooms in on that
554	   student using sound localization.

556	   Although the auditorium is equipped with three cameras, it is only
557	   equipped with two screens.  One is a large screen located at the
558	   front so that the class can see it.  The other is located at the rear
559	   so the professor can see it.  When someone asks a question, the front
560	   screen shows the questioner.  Otherwise it shows the professor
561	   (ensuring everyone can easily see her).

563	   The remote sites are typical immersive telepresence rooms, each with
564	   three camera/screen pairs.

566	   All remote sites display the professor on the center screen at full
567	   size.  A second screen shows the entire classroom view when the
568	   professor is speaking.  However, when a student asks a question, the
569	   second screen shows the close up view of the student at full size.
570	   Sometimes the student is in the auditorium; sometimes the speaking
571	   student is at another remote site.  The remote systems never display
572	   the students that are actually in that room.
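
The remote-site display rule described above can be sketched as a small
selection function.  This is only an illustration, not a CLUE mechanism:
the function name and the stream labels ("professor", "classroom",
"questioner@<site>") are hypothetical.  The text does not say what the
second screen shows when the questioner is in the same room, so the
sketch assumes it falls back to the classroom view.

```python
from typing import Dict, Optional


def remote_site_layout(site: str, questioner_site: Optional[str]) -> Dict[str, str]:
    """Pick the streams for a remote site's center and second screens.

    questioner_site names the site of the student asking a question,
    or is None while only the professor is speaking.
    """
    layout = {"center": "professor"}  # always shown at full size
    if questioner_site is None or questioner_site == site:
        # No question, or the questioner is in this very room and must
        # not be displayed (assumed fallback: keep the classroom view).
        layout["second"] = "classroom"
    else:
        layout["second"] = f"questioner@{questioner_site}"
    return layout
```

For example, remote_site_layout("remote-1", "auditorium") places the
auditorium questioner on the second screen of site "remote-1".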

574	   If someone at a remote site asks a question, then the screen in the
575	   auditorium will show the remote student at full size (as if they were
576	   present in the auditorium itself).  The display in the rear also
577	   shows this questioner, allowing the professor to see and respond to
578	   the student without needing to turn her back on the main class.

580	   When no one is asking a question, the screen in the rear briefly
581	   shows a full-room view of each remote site in turn, allowing the
582	   professor to monitor the entire class (remote and local students).
583	   The professor can also use a control on the podium to see a
584	   particular site - she can choose either a full-room view or a single
585	   camera view.
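
The behaviour of the two auditorium screens can be sketched as follows.
The class, its attribute names, and the stream labels are hypothetical.
Two points are assumptions, because the text leaves them open: a remote
questioner takes priority over a view the professor has selected at the
podium, and the rear screen keeps its current behaviour when the
questioner is in the auditorium itself.  The sketch also assumes at
least one remote site.

```python
from itertools import cycle
from typing import Iterator, List, Optional


class AuditoriumScreens:
    """Chooses what the front and rear auditorium screens display."""

    def __init__(self, remote_sites: List[str]) -> None:
        # Endless rotation over the remote sites for the rear screen.
        self._rotation: Iterator[str] = cycle(remote_sites)
        # View chosen with the podium control, e.g. "camera-2@remote-2".
        self.pinned: Optional[str] = None

    def front(self, questioner_site: Optional[str]) -> str:
        """Front screen: the questioner if there is one, else the professor."""
        if questioner_site is None:
            return "professor"
        return f"questioner@{questioner_site}"

    def rear(self, questioner_site: Optional[str]) -> str:
        """Rear screen: remote questioner, pinned view, or next full-room view."""
        if questioner_site is not None and questioner_site != "auditorium":
            return f"questioner@{questioner_site}"
        if self.pinned is not None:
            return self.pinned
        return f"room@{next(self._rotation)}"
```

Each call to rear() with no questioner and no pinned view advances the
rotation by one site, which models showing each remote site in turn.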

587	   Realization of this use case does not require any negotiation between
588	   the participating sites.  Endpoint devices (and an MCU, if present)
589	   need to know who is speaking and what video stream includes the view
590	   of that speaker.  The remote systems need some knowledge of which
591	   stream should be placed in the center.  The ability of the professor
592	   to see specific sites (or for the system to show all the sites in
593	   turn) would also require the auditorium system to know what sites are
594	   available, and to be able to request a particular view of any site.
595	   Bandwidth is optimized if video that is not being shown at a
596	   particular site is not distributed to that site.
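
The bandwidth remark above amounts to a set difference between the
streams that exist and the streams a site is showing.  A minimal sketch,
assuming each site's current layout is known as a mapping from screen
name to stream label (the function name, the labels, and the layout
representation are all hypothetical):

```python
from typing import Dict, Set


def streams_to_withhold(
    available: Set[str], layouts: Dict[str, Dict[str, str]]
) -> Dict[str, Set[str]]:
    """For each site, list the available streams it is not displaying."""
    return {
        site: available - set(screens.values())
        for site, screens in layouts.items()
    }
```

A sender or MCU could use such a result to decide which streams need
not be distributed to a given site.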

598	3.7.  Multipoint Multiview (Virtual space)

600	   This use case describes a virtual space multipoint meeting with good
601	   eye contact and spatial layout of participants.  The use case was
602	   proposed very early in the development of video conferencing systems
603	   as described in 1983 by Allardyce and Randall [virtualspace].  The use
604	   case is illustrated in figure 2-5 of their report.  The virtual space
605	   expands the point to point case by having all multipoint conference
606	   participants "seated" in a virtual room.  In this case each participant
607	   has a fixed "seat" in the virtual room so each participant expects to
608	   see a different view having a different participant on his left and
609	   right side.  Today, the use case is implemented in multiple
610	   telepresence type video conferencing systems on the market.  The term
611	   "virtual space" was used in their report.  The main difference
612	   between the results obtained with modern systems and those from 1983
613	   is the larger display sizes.
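
The fixed-seat idea can be illustrated with a round virtual table.  The
sketch below is an assumption-laden illustration, not taken from the
cited report: it assumes the seats are listed counterclockwise as seen
from above and that every participant faces the center, and the function
name is hypothetical.  Under those assumptions each participant gets a
different left-to-right ordering of the others.

```python
from typing import List


def view_from(seats: List[str], viewer: str) -> List[str]:
    """Return the other participants, left to right, as seen by viewer.

    seats is assumed to be in counterclockwise order seen from above,
    so the next seat in the list is on the viewer's right and the
    previous seat is on the viewer's left.
    """
    i = seats.index(viewer)
    n = len(seats)
    # Walk backwards around the table: nearest left neighbour first,
    # nearest right neighbour last.
    return [seats[(i - k) % n] for k in range(1, n)]
```

With seats ["A", "B", "C", "D"], participant A sees D on the left and B
on the right, while participant B sees A on the left and C on the right.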

615	   Virtual space multipoint as defined here assumes endpoints with
616	   multiple cameras and displays.  Usually there is the same number of
617	   cameras and displays at a given endpoint.  A camera is positioned
618	   above each display.  A key aspect of virtual space multipoint is the
619	   details of how the cameras are aimed.  The cameras are each aimed at
620	   the same area of view of the participants at the site.  Thus each
621	   camera takes a picture of the same set of people but from a different
622	   angle.  Each endpoint sender in the virtual space multipoint meeting
623	   therefore offers a choice of video streams to remote receivers, each
624	   stream representing a different view point.  For example, a camera
625	   positioned above a display to a participant's left may take video
626	   pictures of the participant's left ear while at the same time, a
627	   camera positioned above a display to the participant's right may take
628	   video pictures of the participant's right ear.

630	   Since a sending endpoint has a camera associated with each display,
631	   an association is made between the receiving stream output on a
632	   particular display and the corresponding sending stream from the
633	   camera associated with that display.  These associations are repeated
634	   for each display/camera pair in a meeting.  The result of this system
635	   is a horizontal arrangement of video images from remote sites, one
636	   per display.  The image from each display is paired with the camera
637	   output from the camera above that display, resulting in excellent eye
638	   contact.
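
The display/camera association above can be written down as a simple
pairing.  This is a sketch under the stated arrangement (camera i is
mounted above display i); the type and function names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Association(NamedTuple):
    display: int        # local display index
    shows_site: str     # remote site whose stream is rendered there
    camera: int         # local camera mounted above that display
    sends_to_site: str  # remote site that receives this camera's stream


def associate(display_to_site: Dict[int, str]) -> List[Association]:
    """Pair each display with the camera above it.

    The stream from the camera above display d is sent to the site
    whose image is shown on display d.
    """
    return [
        Association(display=d, shows_site=site, camera=d, sends_to_site=site)
        for d, site in sorted(display_to_site.items())
    ]
```

For an endpoint showing site "X" on display 0 and site "Y" on display 1,
camera 0 feeds site "X" and camera 1 feeds site "Y".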

640	3.8.  Multiple Presentation Streams - Telemedicine

642	   This use case describes a scenario where multiple presentation
643	   streams are used.  In this use case, the local site is a surgery room
644	   connected to one or more remote sites that may have different
645	   capabilities.  At the local site three main cameras capture the whole
646	   room (typical 3 camera Telepresence case).  Also multiple
647	   presentation inputs are available: a surgery camera which is used to
648	   provide a zoomed view of the operation, an endoscopic monitor, an
649	   X-ray CT image output device, a B-ultrasonic apparatus, a cardiogram
650	   generator, an MRI image instrument, etc.  These devices are used to
651	   provide multiple local video presentation streams to help the surgeon
652	   monitor the status of the patient and assist the process of the
653	   surgery.

655	   The local site may have three main screens and one (or more)
656	   presentation screen(s).  The main screens can be used to display the
657	   remote experts.  The presentation screen(s) can be used to display
658	   multiple presentation streams from local and remote sites
659	   simultaneously.  The three main cameras capture different parts of
660	   the surgery room.  The surgeon can decide the number, the size and
661	   the placement of the presentations displayed on the local
662	   presentation screen(s).  He can also indicate which local
663	   presentation captures are provided for the remote sites.  The local
664	   site can send multiple presentation captures to remote sites and it
665	   can receive multiple presentations related to the patient or the
666	   procedure from them.

668	   One type of remote site is a single or dual screen and one camera
669	   system used by a consulting expert.  In the general case the remote
670	   sites can be part of a multipoint Telepresence conference.  The
671	   presentation screens at the remote sites allow the experts to see the
672	   details of the operation and related data.  As at the main site, the
673	   experts can decide the number, the size and the placement of the
674	   presentations displayed on the presentation screens.  The
675	   presentation screens can display presentation streams from the
676	   surgery room or from other remote sites and also local presentation
677	   streams.  Thus the experts can also start sending presentation
678	   streams, which can carry medical records, pathology data, or their
679	   reference and analysis, etc.

681	   Another type of remote site is a typical immersive Telepresence room
682	   with three camera/screen pairs allowing more experts to join the
683	   consultation.  These sites can also be used for education.  The
684	   teacher, who is not necessarily the surgeon, and the students are in
685	   different remote sites.  Students can observe and learn the details
686	   of the whole procedure, while the teacher can explain and answer
687	   questions during the operation.

689	   All remote education sites can display the surgery room.  Another
690	   option is to display the surgery room on the center screen, and the
691	   rest of the screens can show the teacher and the student who is
692	   asking a question.  For all the above sites, multiple presentation
693	   screens can be used to enhance visibility: one screen for the zoomed
694	   surgery stream and the others for medical image streams, such as MRI
695	   images, cardiogram, B-ultrasonic images and pathology data.

697	4.  Acknowledgements

699	   The draft has benefitted from input from a number of people including
700	   Alex Eleftheriadis, Marshall Eubanks, Tommy Andre Nyquist, Mark
701	   Gorzynski, Charles Eckel, Nermeen Ismail, Mary Barnes, Pascal Buhler,
702	   and Jim Cole.

704	   Special acknowledgement to Lennard Xiao who contributed the text for
705	   the telemedicine use case.

707	5.  IANA Considerations

709	   This document contains no IANA considerations.

711	6.  Security Considerations

713	   While there are likely to be security considerations for any solution
714	   for telepresence interoperability, this document has no security
715	   considerations.

717	7.  Informative References

719	   [ITU.H239]
720	              "Role management and additional media channels for
721	              H.300-series terminals", ITU-T Recommendation H.239,
722	              September 2005.

724	   [ITU.H323]
725	              "Packet-based Multimedia Communications Systems", ITU-T
726	              Recommendation H.323, December 2009.

728	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
729	              A., Peterson, J., Sparks, R., Handley, M., and E.
730	              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
731	              June 2002.

733	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
734	              Jacobson, "RTP: A Transport Protocol for Real-Time
735	              Applications", STD 64, RFC 3550, July 2003.

737	   [RFC4582]  Camarillo, G., Ott, J., and K. Drage, "The Binary Floor
738	              Control Protocol (BFCP)", RFC 4582, November 2006.

740	   [virtualspace]
741	              Allardyce and Randall, "Development of Teleconferencing
742	              Methodologies With Emphasis on Virtual Space Video and
743	              Interactive Graphics", 1983.

745	Authors' Addresses

747	   Allyn Romanow
748	   Cisco
749	   San Jose, CA  95134
750	   US

752	   Email: allyn@cisco.com

753	   Stephen Botzko
754	   Polycom
755	   Andover, MA  01810
756	   US

758	   Email: stephen.botzko@polycom.com

760	   Mark Duckworth
761	   Polycom
762	   Andover, MA  01810
763	   US

765	   Email: mark.duckworth@polycom.com

767	   Roni Even (editor)
768	   Huawei Technologies
769	   Tel Aviv
770	   Israel

772	   Email: roni.even@mail01.huawei.com