idnits 2.17.1 

draft-romanow-clue-telepresence-use-cases-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (May 25, 2011) is 4720 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 4582
     (Obsoleted by RFC 8855)


     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	CLUE WG                                                       A. Romanow
3	Internet-Draft                                                     Cisco
4	Intended status: Informational                                 S. Botzko
5	Expires: November 26, 2011                                  M. Duckworth
6	                                                                 Polycom
7	                                                                 R. Even
8	                                                     Huawei Technologies
9	                                                              T. Eubanks
10	                                                 Iformata Communications
11	                                                            May 25, 2011

13	                Use Cases for Telepresence Multi-streams
14	            draft-romanow-clue-telepresence-use-cases-02.txt

16	Abstract

18	   Telepresence conferencing systems seek to create the sense of really
19	   being present.  A number of techniques for handling audio and video
20	   streams are used to create this experience.  When these techniques
21	   are not similar, interoperability between different systems is
22	   difficult at best, and often not possible.  Conveying information
23	   about the relationships between multiple streams of media would allow
24	   senders and receivers to make choices to allow telepresence systems
25	   to interwork.  This memo describes the most typical and important use
26	   cases for sending multiple streams in a telepresence conference.

28	Status of this Memo

30	   This Internet-Draft is submitted in full conformance with the
31	   provisions of BCP 78 and BCP 79.

33	   Internet-Drafts are working documents of the Internet Engineering
34	   Task Force (IETF).  Note that other groups may also distribute
35	   working documents as Internet-Drafts.  The list of current Internet-
36	   Drafts is at http://datatracker.ietf.org/drafts/current/.

38	   Internet-Drafts are draft documents valid for a maximum of six months
39	   and may be updated, replaced, or obsoleted by other documents at any
40	   time.  It is inappropriate to use Internet-Drafts as reference
41	   material or to cite them other than as "work in progress."

43	   This Internet-Draft will expire on November 26, 2011.

45	Copyright Notice

47	   Copyright (c) 2011 IETF Trust and the persons identified as the
48	   document authors.  All rights reserved.

50	   This document is subject to BCP 78 and the IETF Trust's Legal
51	   Provisions Relating to IETF Documents
52	   (http://trustee.ietf.org/license-info) in effect on the date of
53	   publication of this document.  Please review these documents
54	   carefully, as they describe your rights and restrictions with respect
55	   to this document.  Code Components extracted from this document must
56	   include Simplified BSD License text as described in Section 4.e of
57	   the Trust Legal Provisions and are provided without warranty as
58	   described in the Simplified BSD License.

60	Table of Contents

62	   1.  Introduction
63	   2.  Terminology
64	   3.  Telepresence Scenarios Overview
65	   4.  Use Case Scenarios
66	     4.1.  Point to point meeting: symmetric
67	     4.2.  Point to point meeting: asymmetric
68	     4.3.  Multipoint meeting
69	     4.4.  Presentation
70	     4.5.  Heterogeneous Systems
71	     4.6.  Multipoint Education Usage
72	   5.  Acknowledgements
73	   6.  IANA Considerations
74	   7.  Security Considerations
75	   8.  Informative References
76	   Authors' Addresses

78	1.  Introduction

80	   Telepresence applications try to provide a "being there" experience
81	   for conversational video conferencing.  Often this telepresence
82	   application is described as "immersive telepresence" in order to
83	   distinguish it from traditional video conferencing, and from other
84	   forms of remote presence not related to conversational video
85	   conferencing, such as avatars and robots.  The salient
86	   characteristics of telepresence are often described as: full-sized,
87	   immersive video, preserving interpersonal interaction and allowing
88	   non-verbal communication.

90	   Although telepresence systems are based on open standards such as RTP
91	   [RFC3550], SIP [RFC3261], H.264, and the H.323 suite of protocols,
92	   they cannot easily interoperate with each other without operator
93	   assistance and expensive additional equipment which translates from
94	   one vendor to another.  A standard way of describing the multiple
95	   streams constituting the media flows and the fundamental aspects of
96	   their behavior would allow telepresence systems to interwork.

98	   This draft presents a set of use cases describing typical scenarios.
99	   Requirements will be derived from these use cases in a separate
100	   document.  The use cases are described from the viewpoint of the
101	   users.  They are illustrative of the user experience that needs to be
102	   supported.  It is possible to implement these use cases in a variety
103	   of different ways.

105	   Many different scenarios need to be supported.  Our strategy in this
106	   document is to describe in detail the most common and basic use
107	   cases.  These will cover most of the requirements.  Additional
108	   scenarios that bring new features and requirements will be added.

110	   We look at telepresence conferences that are point-to-point and
111	   multipoint.  In some settings, the number of displays is similar at
112	   all sites; in others, the number of displays differs at different
113	   sites.  Both cases are considered.  Also included is a use case
114	   describing display of presentation or content.

116	   The document structure is as follows: Section 2 presents the document
117	   terminology, Section 3 gives an overview of the scenarios, and
118	   Section 4 describes use cases.

120	2.  Terminology

122	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
123	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
124	   document are to be interpreted as described in RFC 2119 [RFC2119].

126	3.  Telepresence Scenarios Overview

128	   This section describes the general characteristics of the use cases
129	   and what the scenarios are intended to show.  The typical setting is
130	   a business conference, which was the initial focus of telepresence.
131	   Recently consumer products are also being developed.  We specifically
132	   do not include in our scenarios the infrastructure aspects of
133	   telepresence, such as room construction, layout and decoration.

135	   Telepresence systems are typically composed of one or more video
136	   cameras and encoders and one or more display monitors of large size
137	   (around 60").  Microphones pick up sound and audio codec(s) produce
138	   one or more audio streams.  The cameras used to present the
139	   telepresence users we will call participant cameras (and likewise for
140	   displays).  There may also be other cameras, such as for document
141	   display.  These will be referred to as presentation or content
142	   cameras, which generally have different formats, aspect ratios, and
143	   frame rates from the participant cameras.  The presentation videos
144	   may be shown on a participant screen or on auxiliary display screens.
145	   A user's computer may also serve as a virtual content camera,
146	   generating an animation or playing back a video for display to the
147	   remote participants.

149	   We describe such a telepresence system as sending M video streams, N
150	   audio streams, and D content streams to the remote system(s).  (Note
151	   that the number of audio streams is generally not the same as the
152	   number of video streams.)
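
   As a rough illustration of the M/N/D notation, the following Python
   sketch models the streams one endpoint sends.  The class name, field
   names, and stream labels are hypothetical; nothing in this document
   defines them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StreamSet:
    """Streams sent by one telepresence endpoint (illustrative model only)."""
    video: List[str] = field(default_factory=list)    # M participant video streams
    audio: List[str] = field(default_factory=list)    # N audio streams
    content: List[str] = field(default_factory=list)  # D content streams

    def counts(self) -> Tuple[int, int, int]:
        """Return the tuple (M, N, D)."""
        return (len(self.video), len(self.audio), len(self.content))


# Assumed example: three cameras, stereo audio, one presentation source.
room = StreamSet(
    video=["cam-left", "cam-center", "cam-right"],
    audio=["audio-left", "audio-right"],
    content=["slides"],
)
print(room.counts())  # (3, 2, 1): N differs from M, as noted above
```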

154	   The fundamental parameters describing today's typical telepresence
155	   scenario include:

157	   1.   The number of participating sites

159	   2.   The number of visible seats at a site

161	   3.   The number of cameras

163	   4.   The number of audio channels

165	   5.   The screen size

167	   6.   The display capabilities - such as resolution, frame rate,
168	        aspect ratio

170	   7.   The arrangement of the displays in relation to each other

172	   8.   Similar or dissimilar number of primary screens at all sites

173	   9.   Type and number of presentation displays

175	   10.  Multipoint conference display strategies - for example, the
176	        camera-to-display mappings may be static or dynamic

178	   11.  The camera viewpoint

180	   12.  The cameras' fields of view and how they do or do not overlap
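
   A few of these parameters can be collected into a per-site record, as
   in the sketch below.  It covers only a subset of the list, and the
   field names and sample values are assumptions made for illustration.
   The helper checks parameter 8, whether all sites have the same number
   of primary screens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteDescription:
    """Subset of the scenario parameters for one site (hypothetical fields)."""
    name: str
    visible_seats: int
    cameras: int
    audio_channels: int
    primary_screens: int
    presentation_displays: int
    screen_diagonal_inches: float


def same_primary_screen_count(sites: List[SiteDescription]) -> bool:
    """True when every site has the same number of primary screens."""
    return len({site.primary_screens for site in sites}) <= 1


# Sample values are invented for the example.
room_a = SiteDescription("A", 6, 3, 2, 3, 1, 60.0)
room_b = SiteDescription("B", 2, 1, 2, 1, 1, 60.0)
print(same_primary_screen_count([room_a, room_b]))  # False
```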

182	   The basic features that give telepresence its distinctive
183	   characteristics are implemented in disparate ways in different
184	   systems.  Currently, telepresence systems from diverse vendors
185	   interoperate to some extent, but this is not supported in a
186	   standards-based fashion.  Interworking requires that translation and
187	   transcoding devices be included in the architecture.  Such devices
188	   increase latency, reducing the quality of interpersonal interaction.
189	   Use of these devices is often not automatic; it frequently requires
190	   substantial manual configuration and a detailed understanding of the
191	   nature of underlying audio and video streams.  This state of affairs
192	   is not acceptable for the continued growth of telepresence - we
193	   believe telepresence systems should have the same ease of
194	   interoperability as do telephones.

196	   There is no agreed-upon way to adequately describe the semantics of
197	   how streams of various media types relate to each other.  Without a
198	   standard for stream semantics to describe the particular roles and
199	   activities of each stream in the conference, interoperability is
200	   cumbersome at best.

202	   In a multiple screen conference, the video and audio streams sent
203	   from remote participants must be understood by receivers so that they
204	   can be presented in a coherent and life-like manner.  This includes
205	   the ability to present remote participants at their true size for
206	   their apparent distance, while maintaining correct eye contact,
207	   gesticular cues, and simultaneously providing a spatial audio sound
208	   stage that is consistent with the video presentation.

210	   The receiving device that decides how to display incoming information
211	   needs to understand a number of variables, such as the spatial
212	   position of the speaker, the field of view of the cameras, the camera
213	   zoom, and which media stream is related to each of the displays.  It
214	   is not simply that individual streams must be adequately described
215	   (to a large extent this already exists), but rather that the semantics
216	   of the relationships between the streams must be communicated.  Note
217	   that all of this is still required even if the basic aspects of the
218	   streams, such as the bit rate, frame rate, and aspect ratio, are
219	   known.  Thus, this problem has aspects considerably beyond those
220	   encountered in interoperation of single-node video conferencing
221	   units.

223	4.  Use Case Scenarios

225	   Our development of use cases is staged, initially focusing on what is
226	   currently typical and important.  Use cases that add future or more
227	   specialized features will be added later as needed.  Also, there are
228	   a number of possible variants for these use cases, for example, the
229	   audio supported may differ at the end points (such as mono or stereo
230	   versus surround sound), etc.

232	   The use cases here are intended to be hierarchical, in that the
233	   earlier use cases describe basics of telepresence that will also be
234	   used by later use cases.

236	   Many of these systems offer a full conference room solution where
237	   local participants sit on one side of a table and remote participants
238	   are displayed as if they are sitting on the other side of the table.
239	   The cameras and screens are typically arranged to provide a panoramic
240	   (left to right from the local user view point) view of the remote
241	   room.

243	   The sense of immersion and non-verbal communication is fostered by a
244	   number of technical features, such as:

246	   1.  Good eye contact, which is achieved by careful placement of
247	       participants, cameras and screens.

249	   2.  Camera field of view and screen sizes are matched so that the
250	       images of the remote room appear to be full size.

252	   3.  The left side of each room is presented on the right display at
253	       the far end; similarly the right side of the room is presented on
254	       the left display.  The effect of this is that participants of
255	       each site appear to be sitting across the table from each other.
256	       If two participants on the same site glance at each other, all
257	       participants can observe it.  Likewise, if a participant on one
258	       site gestures to a participant on the other site, all
259	       participants observe the gesture itself and the participants it
260	       includes.
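
   The left/right reversal in item 3 can be sketched as an index
   mapping.  The sketch assumes that cameras and displays are numbered
   from 0, left to right, as seen by the participants in each room, and
   that both rooms have the same number of screens.  The function name
   is hypothetical.

```python
def far_end_display(camera_index: int, num_screens: int) -> int:
    """Display index at the far end that shows the given camera.

    The leftmost camera (index 0) maps to the rightmost display, and so on,
    so that the two rooms appear to face each other across a table.
    """
    if not 0 <= camera_index < num_screens:
        raise ValueError("camera_index out of range")
    return num_screens - 1 - camera_index


# With 3 camera/screen pairs: left -> right, center -> center, right -> left.
print([far_end_display(i, 3) for i in range(3)])  # [2, 1, 0]
```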

262	4.1.  Point to point meeting: symmetric

264	   In this case each of the two sites has an identical number of
265	   screens, with cameras having fixed fields of view, and one camera for
266	   each screen.  The sound type is the same at each end.  As an example,
267	   there could be 3 cameras and 3 screens in each room, with stereo
268	   sound being sent and received at each end.

270	   The important thing here is that each of the 2 sites has the same
271	   number of screens.  Each screen is paired with a corresponding
272	   camera.  Each camera / screen pair is typically connected to a
273	   separate codec, producing a video encoded stream for transmission to
274	   the remote site, and receiving a similarly encoded stream from the
275	   remote site.

277	   Each system has one or multiple microphones for capturing audio.  In
278	   some cases, stereophonic microphones are employed.  In other systems,
279	   a microphone may be placed in front of each participant (or pair of
280	   participants).  In typical systems all the microphones are connected
281	   to a single codec that sends and receives the audio streams as either
282	   stereo or surround sound.  The number of microphones and the number
283	   of audio channels are often not the same as the number of cameras.
284	   Also the number of microphones is often not the same as the number of
285	   loudspeakers.

287	   The audio may be transmitted as multi-channel (stereo/surround sound)
288	   or as distinct and separate monophonic streams.  Audio levels should
289	   be matched, so the sound levels at both sites are identical.
290	   Loudspeaker and microphone placements are chosen so that the sound
291	   "stage" (orientation of apparent audio sources) is coordinated with
292	   the video.  That is, if a participant on one site speaks, the
293	   participants at the remote site perceive her voice as originating
294	   from her visual image.  In order to accomplish this, the audio needs
295	   to be mapped at the receiving site in the same fashion as the video.
296	   That is, audio captured on the right side of the room needs to be
297	   output from loudspeaker(s) on the left side at the remote site, and
298	   vice versa.
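
   One way to picture this audio mapping is sketched below.  It assumes
   that a source's horizontal position is given as a fraction from 0.0
   (left edge of the capturing room) to 1.0 (right edge), and that the
   receiving site's loudspeakers are evenly spaced and indexed from 0,
   left to right.  These conventions and the function name are
   assumptions; real systems may pan between loudspeakers rather than
   pick one.

```python
def loudspeaker_for_source(position: float, num_loudspeakers: int) -> int:
    """Pick the receiving-site loudspeaker for a source at the far end.

    position is 0.0 at the left edge of the capturing room and 1.0 at the
    right edge.  The position is mirrored, then mapped to one of
    num_loudspeakers equal-width zones.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be within [0.0, 1.0]")
    if num_loudspeakers < 1:
        raise ValueError("need at least one loudspeaker")
    mirrored = 1.0 - position
    zone = int(mirrored * num_loudspeakers)
    # position == 0.0 mirrors to exactly 1.0, which belongs to the last zone.
    return min(zone, num_loudspeakers - 1)


# A talker on the far right of the sending room is heard from the left.
print(loudspeaker_for_source(1.0, 2))  # 0
```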

300	4.2.  Point to point meeting: asymmetric

302	   In this case, each site has a different number of screens and cameras
303	   than the other site.  The important characteristic of this scenario
304	   is that the number of displays is different between the two sites.
305	   This creates challenges which are handled differently by different
306	   telepresence systems.

308	   This use case builds on the basic scenario of 3 screens to 3 screens.
309	   Here, we use the common case of 3 screens and 3 cameras at one site,
310	   and 1 screen and 1 camera at the other site, connected by a point to
311	   point call.  The display sizes and camera fields of view at both
312	   sites are basically similar, such that each camera view is designed
313	   to show two people sitting side by side.  Thus the 1 screen room has
314	   up to 2 people seated at the table, while the 3 screen room may have
315	   up to 6 people at the table.

317	   The basic considerations of defining left and right and indicating
318	   relative placement of the multiple audio and video streams are the
319	   same as in the 3-3 use case.  However, handling the mismatch between
320	   the two sites of the number of displays and cameras requires more
321	   complicated maneuvers.

323	   For the video sent from the 1 camera room to the 3 screen room,
324	   usually what is done is to simply use 1 of the 3 displays and keep
325	   the second and third displays inactive, or put up the date, for
326	   example.  This would maintain the "full size" image of the remote
327	   side.

329	   For the other direction, the 3 camera room sending video to the 1
330	   screen room, there are more complicated variations to consider.  Here
331	   are several possible ways in which the video streams can be handled.

333	   1.  The 1 screen system might simply show only 1 of the 3 camera
334	       images, since the receiving side has only 1 screen.  Two people
335	       are seen at full size, but 4 people are not seen at all.  The
336	       choice of which 1 of the 3 streams to display could be fixed, or
337	       could be selected by the users.  It could also be made
338	       automatically based on who is speaking in the 3 screen room, such
339	       that the people in the 1 screen room always see the person who is
340	       speaking.  If the automatic selection is done at the sender, the
341	       transmission of streams that are not displayed could be
342	       suppressed, which would avoid wasting bandwidth.

344	   2.  The 1 screen system might be capable of receiving and decoding
345	       all 3 streams from all 3 cameras.  The 1 screen system could then
346	       compose the 3 streams into 1 local image for display on the
347	       single screen.  All six people would be seen, but smaller than
348	       full size.  This could be done in conjunction with reducing the
349	       image resolution of the streams, such that encode/decode
350	       resources and bandwidth are not wasted on streams that will be
351	       downsized for display anyway.

353	   3.  The 3 screen system might be capable of including all 6 people in
354	       a single stream to send to the 1 screen system.  For example, it
355	       could use PTZ (Pan Tilt Zoom) cameras to physically adjust the
356	       cameras such that 1 camera captures the whole room of six people.
357	       Or it could recompose the 3 camera images into 1 encoded stream
358	       to send to the remote site.  These variations also show all six
359	       people, but at a reduced size.

361	   4.  Or, there could be a combination of these approaches, such as
362	       simultaneously showing the speaker in full size with a composite
363	       of all the 6 participants in smaller size.

365	   The receiving telepresence system needs to have information about the
366	   content of the streams it receives to make any of these decisions.
367	   If the systems are capable of supporting more than one strategy,
368	   there needs to be some negotiation between the two sites to figure
369	   out which of the possible variations they will use in a specific
370	   point to point call.
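
   A minimal sketch of such a negotiation follows.  It assumes the
   receiver states its strategies in order of preference and the sender
   advertises the set it supports.  The strategy names are invented
   labels for the variations listed above, not protocol elements.

```python
def choose_strategy(sender_supported, receiver_preferences):
    """Return the receiver's most preferred strategy the sender supports.

    Returns None when the two sites have no strategy in common.
    """
    for strategy in receiver_preferences:
        if strategy in sender_supported:
            return strategy
    return None


# Hypothetical capabilities of the 3 screen room (sender) and the
# preference order of the 1 screen room (receiver).
three_screen_room = {"select_one", "sender_composite"}
one_screen_room = ["receiver_composite", "sender_composite", "select_one"]
print(choose_strategy(three_screen_room, one_screen_room))  # sender_composite
```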

372	4.3.  Multipoint meeting

374	   In a multipoint telepresence conference, there are more than two
375	   sites participating.  Additional complexity is required to enable
376	   media streams from each participant to show up on the displays of the
377	   other participants.

379	   Clearly, there are a great number of topologies that can be used to
380	   display the streams from multiple sites participating in a
381	   conference.

383	   One major objective for telepresence is to be able to preserve the
384	   "being there" user experience.  However, in multi-site conferences it
385	   is often (in fact usually) not possible to simultaneously provide
386	   full size video, eye contact, common perception of gestures and gaze
387	   by all participants.  Several policies can be used for stream
388	   distribution and display: all provide good results but they all make
389	   different compromises.

391	   One common policy is called site switching.  Let's say the speaker is
392	   at site A and everyone else is at a "remote" site.  When the room at
393	   site A is shown, all the camera images from site A are forwarded to
394	   the remote sites.  Therefore at each receiving remote site, all the
395	   screens display camera images from site A.  This can be used to
396	   preserve full size image display, and also provide full visual
397	   context of the displayed far end, site A.  In site switching, there is
398	   a fixed relation between the cameras in each room and the displays in
399	   remote rooms.  The room or participants being shown is switched from
400	   time to time based on who is speaking or by manual control, e.g.,
401	   from site A to site B.
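
   Site switching can be sketched as follows.  The class and method
   names are hypothetical.  What the active site itself sees is not
   specified above, so the sketch returns None for that site.  The
   returned list is ordered by camera index; the left/right placement on
   the receiving displays is not modeled.

```python
class SiteSwitcher:
    """Tracks the active site and what each remote site displays."""

    def __init__(self, cameras_per_site):
        self.cameras_per_site = dict(cameras_per_site)
        self.active_site = None

    def on_speaker(self, site):
        """Switch to a site when someone there speaks (or on manual control)."""
        if site not in self.cameras_per_site:
            raise KeyError(site)
        self.active_site = site

    def displays_for(self, receiving_site):
        """Return (site, camera) pairs shown at receiving_site, or None."""
        if self.active_site is None or receiving_site == self.active_site:
            return None  # left open in this sketch
        count = self.cameras_per_site[self.active_site]
        return [(self.active_site, camera) for camera in range(count)]


switcher = SiteSwitcher({"A": 3, "B": 3, "C": 3})
switcher.on_speaker("A")
print(switcher.displays_for("B"))  # [('A', 0), ('A', 1), ('A', 2)]
```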

403	   Segment switching is another policy choice.  Still using site A as
404	   where the speaker is, and "remote" to refer to all the other sites,
405	   in segment switching, rather than sending all the images from site A,
406	   only the speaker at site A is shown.  The camera images of the
407	   current speaker and previous speakers (if any) are forwarded to the
408	   other sites in the conference.  Therefore the screens in each site
409	   are usually displaying images from different remote sites - the
410	   current speaker at site A and the previous ones.  This strategy can
411	   be used to preserve full size image display, and also capture the
412	   non-verbal communication between the speakers.  In segment switching,
413	   the display depends on the activity in the remote rooms (generally,
414	   but not necessarily, based on audio / speech detection).
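
   A comparable sketch for segment switching keeps a most-recent-first
   list of speaker segments, where a segment is one camera view at one
   site.  The names are hypothetical, and the rule for assigning
   segments to particular screens is an assumption; real systems may
   keep a segment on the same screen to avoid images jumping around.

```python
class SegmentSwitcher:
    """Tracks current and previous speaker segments across sites."""

    def __init__(self, num_screens):
        self.num_screens = num_screens
        self.recent = []  # (site, camera) pairs, most recent speaker first

    def on_speaker(self, site, camera):
        """Move the speaking segment to the front of the history."""
        segment = (site, camera)
        if segment in self.recent:
            self.recent.remove(segment)
        self.recent.insert(0, segment)

    def displays_for(self, receiving_site):
        """Segments shown at receiving_site, excluding its own cameras."""
        remote = [seg for seg in self.recent if seg[0] != receiving_site]
        return remote[: self.num_screens]


segments = SegmentSwitcher(num_screens=3)
for site, camera in [("A", 1), ("B", 0), ("C", 2), ("A", 1)]:
    segments.on_speaker(site, camera)
print(segments.displays_for("D"))  # [('A', 1), ('C', 2), ('B', 0)]
```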

416	   A third possibility is to reduce the image size so that multiple
417	   camera views can be composited onto one or more screens.  This does
418	   not preserve full size image display, but provides the most visual
419	   context (since more sites or segments can be seen).  Typically in
420	   this case the display mapping is static, i.e., each part of each room
421	   is shown in the same location on the display screens throughout the
422	   conference.

424	   Other policies and combinations are also possible.  For example,
425	   there can be a static display of all screens from all remote rooms,
426	   with part or all of one screen being used to show the current speaker
427	   at full size.

429	4.4.  Presentation

431	   In addition to the video and audio streams showing the participants,
432	   additional streams are used for presentations.

434	   In systems available today, generally only one additional video
435	   stream is available for presentations.  Often this presentation
436	   stream is half-duplex in nature, with presenters taking turns.  The
437	   presentation video may be captured from a PC screen, or it may come
438	   from a multimedia source such as a document camera, camcorder or a
439	   DVD.  In a multipoint meeting, the presentation streams for the
440	   currently active presentation are always distributed to all sites in
441	   the meeting, so that the presentations are viewed by all.

443	   Some systems display the presentation video on a screen that is
444	   mounted either above or below the three participant screens.  Other
445	   systems provide monitors on the conference table for observing
446	   presentations.  If multiple presentation monitors are used, they
447	   generally display identical content.  There is considerable variation
448	   in the placement, number, and size of presentation displays.

450	   In some systems presentation audio is pre-mixed with the room audio.
451	   In others, a separate presentation audio stream is provided (if the
452	   presentation includes audio).

454	   In H.323 systems, H.239 is typically used to control the video
455	   presentation stream.  In SIP systems, similar control mechanisms can
456	   be provided using BFCP [RFC4582] for the presentation token.  These
457	   mechanisms are suitable for managing a single presentation stream.
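
   The half-duplex behavior of a single presentation stream can be
   pictured as a token with at most one holder.  The sketch below is an
   in-memory model only; it does not represent H.239 or BFCP messages,
   and the class and method names are hypothetical.

```python
class PresentationToken:
    """At most one site may hold the presentation token at a time."""

    def __init__(self):
        self.holder = None

    def request(self, site):
        """Grant the token if it is free or already held by this site."""
        if self.holder is None:
            self.holder = site
        return self.holder == site

    def release(self, site):
        """Free the token, but only if this site is the holder."""
        if self.holder == site:
            self.holder = None


token = PresentationToken()
print(token.request("A"))  # True: A starts presenting
print(token.request("B"))  # False: B must wait for A to release
token.release("A")
print(token.request("B"))  # True
```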

459	   Although today's systems remain limited to a single video
460	   presentation stream, there are obvious uses for multiple presentation
461	   streams.

463	   1.  Frequently the meeting convener is following a meeting agenda,
464	       and it is useful for her to be able to show that agenda to all
465	       participants during the meeting.  Other participants at various
466	       remote sites are able to make presentations during the meeting,
467	       with the presenters taking turns.  The presentations and the
468	       agenda are both shown, either on separate displays, or perhaps
469	       re-scaled and shown on a single display.

471	   2.  A single multimedia presentation can itself include multiple
472	       video streams that should be shown together.  For instance, a
473	       presenter may be discussing the fairness of media coverage.  In
474	       addition to slides which support the presenter's conclusions, she
475	       also has video excerpts from various news programs which she
476	       shows to illustrate her findings.  She uses a DVD player for the
477	       video excerpts so that she can pause and reposition the video as
478	       needed.  Another example is an educator who is presenting a
479	       multi-screen slide show.  This show requires that the placement
480	       of the images on the multiple displays at each site be
481	       consistent.

483	   There are many other examples where multiple presentation streams are
484	   useful.

486	4.5.  Heterogeneous Systems

488	   It is common in meeting scenarios for people to join the conference
489	   from a variety of environments, using different types of endpoint
490	   devices.  A multi-screen immersive telepresence conference may
491	   include someone on a PC-based video conferencing system, a
492	   participant calling in by phone, and (soon) someone on a handheld
493	   device.

495	   What experience/view will each of these devices have?

497	   Some may be able to handle multiple streams and others can handle
498	   only a single stream.  (We are not here talking about legacy systems,
499	   but rather systems built to participate in such a conference,
500	   although they are single stream only.)  In a single video stream,
501	   the stream may contain one or more compositions depending on the
502	   available screen space on the device.  In most cases a transcoding
503	   intermediate device will be relied upon to produce a single stream,
504	   perhaps with some kind of continuous presence.

506	   Bit rates will vary - the handheld and phone having lower bit rates
507	   than PC and multi-screen systems.

509	   Layout is accomplished according to different policies.  For example,
510	   a handheld and PC may receive the active speaker stream.  The
511	   decision can either be made explicitly by the receiver or by the
512	   sender if it can receive some kind of rendering hint.  The same is
513	   true for audio -- i.e., it receives either a mixed stream or a number
514	   of the loudest speakers if mixing is not available in the network.
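
   Selecting "a number of the loudest speakers" could look like the
   sketch below.  It assumes each source reports an audio level where a
   larger value means louder (for example, dBFS); the function name and
   the sample levels are invented for illustration.

```python
def loudest_speakers(levels, count):
    """Return the names of the count loudest sources, loudest first.

    levels maps a source name to its audio level; ties are broken by name
    so that the result is deterministic.
    """
    ranked = sorted(levels.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:count]]


levels = {"site-A": -30.0, "site-B": -12.0, "site-C": -20.0}
print(loudest_speakers(levels, 2))  # ['site-B', 'site-C']
```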

516	   For the software conferencing participant, the user's experience
517	   depends on the application.  It could be single stream, similar to a
518	   handheld but with a bigger screen.  Or, it could be multiple streams,
519	   similar to an immersive system but with a smaller screen.  Control
520	   for manipulation of streams can be local in the software application,
521	   or in another location and sent to the application over the network.

523	   The handheld device is the most extreme.  How will that participant
524	   be viewed and heard?  It should be an equal participant, though the
525	   bandwidth will be significantly less than an immersive system.  A
526	   receiver may choose to display output coming from a handheld
527	   differently based on the resolution, but that would be the case with
528	   any low resolution video stream, e.g., from a powerful PC on a bad
529	   network.

531	   The handheld will send and receive a single video stream, which could
532	   be a composite or a subset of the conference.  The handheld could say
533	   what it wants or could accept whatever the sender (conference server
534	   or sending endpoint) thinks is best.  The handheld will have to
535	   signal any actions it wants to take the same way that an immersive
536	   system signals.

538	4.6.  Multipoint Education Usage

540	   The importance of this example is that the multiple video streams are
541	   not used to create an immersive conferencing experience with
542	   panoramic views at all the sites.  Instead the multiple streams are
543	   dynamically used to enable full participation of remote students in a
544	   university class.  In some instances the same video stream is
545	   displayed on multiple displays in the room; in other instances an
546	   available stream is not displayed at all.

548	   The main site is a university auditorium which is equipped with three
549	   cameras.  One camera is focused on the professor at the podium.  A
550	   second camera is mounted on the wall behind the professor and
551	   captures the class in its entirety.  The third camera is co-located
552	   with the second, and is designed to capture a close up view of a
553	   questioner in the audience.  It automatically zooms in on that
554	   student using sound localization.

556	   Although the auditorium is equipped with three cameras, it is only
557	   equipped with two screens.  One is a large screen located at the
558	   front so that the class can see it.  The other is located at the rear
559	   so the professor can see it.  When someone asks a question, the front
560	   screen shows the questioner.  Otherwise it shows the professor
561	   (ensuring everyone can easily see her).

563	   The remote sites are typical immersive telepresence rooms, each with
564	   three camera/screen pairs.

566	   All remote sites display the professor on the center screen at full
567	   size.  A second screen shows the entire classroom view when the
568	   professor is speaking.  However, when a student asks a question, the
569	   second screen shows the close up view of the student at full size.
570	   Sometimes the student is in the auditorium; sometimes the speaking
571	   student is at another remote site.  The remote systems never display
572	   the students that are actually in that room.

574	   If someone at a remote site asks a question, then the front screen in
575	   the auditorium shows the remote student at full size (as if they
576	   were present in the auditorium itself).  The display in the rear also
577	   shows this questioner, allowing the professor to see and respond to
578	   the student without needing to turn her back on the main class.

580	   When no one is asking a question, the screen in the rear briefly
581	   shows a full-room view of each remote site in turn, allowing the
582	   professor to monitor the entire class (remote and local students).
583	   The professor can also use a control on the podium to see a
584	   particular site; she can choose either a full-room view or a
585	   single-camera view.
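
   The display behavior described in this section can be summarized as
   a small selection function.  The following Python sketch is
   illustrative only.  The function name select_views, the site names,
   and the "<site>/<view>" stream labels are hypothetical and are not
   defined by this document.  Two behaviors are assumptions, because
   the text does not specify them: the rear auditorium screen keeps
   cycling through remote sites when the questioner is local, and a
   remote site keeps showing the class view when the questioner is in
   its own room.

```python
from typing import Dict, Optional

AUDITORIUM = "auditorium"  # hypothetical name for the main site


def select_views(site: str,
                 questioner_site: Optional[str] = None,
                 rotation_site: Optional[str] = None) -> Dict[str, str]:
    """Map each screen at `site` to a hypothetical stream label.

    questioner_site: site of the student currently asking a question,
                     or None when no one is asking.
    rotation_site:   remote site whose full-room view is currently due
                     on the auditorium's rear screen.
    """
    if site == AUDITORIUM:
        # Front screen: the questioner if there is one, else the professor.
        if questioner_site is None:
            front = "auditorium/professor"
        else:
            front = f"{questioner_site}/questioner"
        # Rear screen: a remote questioner, else the site rotation.
        # (Assumption: a local questioner does not interrupt the rotation.)
        if questioner_site not in (None, AUDITORIUM):
            rear = f"{questioner_site}/questioner"
        elif rotation_site is not None:
            rear = f"{rotation_site}/room"
        else:
            rear = "none"
        return {"front": front, "rear": rear}

    # Remote site: the professor is always on the center screen.
    second = "auditorium/class"
    # A remote system never displays students who are in its own room.
    if questioner_site is not None and questioner_site != site:
        second = f"{questioner_site}/questioner"
    return {"center": "auditorium/professor", "second": second}
```

   For example, select_views("siteA", "siteB") places the close-up view
   of the student at siteB on the second screen at siteA, while
   select_views("siteA", "siteA") leaves the class view in place.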

587	   Realization of this use case does not require any negotiation between
588	   the participating sites.  Endpoint devices (and an MCU, if present)
589	   need to know who is speaking and which video stream includes the view
590	   of that speaker.  The remote systems need some knowledge of which
591	   stream should be placed in the center.  The ability of the professor
592	   to see specific sites (or for the system to show all the sites in
593	   turn) would also require the auditorium system to know what sites are
594	   available, and to be able to request a particular view of any site.
595	   Bandwidth is optimized if video that is not being shown at a
596	   particular site is not distributed to that site.
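
   The bandwidth observation above amounts to sending each site only
   the streams it is currently displaying.  The following Python sketch
   shows one way to derive that set, assuming each stream is labeled
   "<site>/<view>" and that locally captured streams need not be sent
   back to their own site.  The function name, the labels, and the
   "none" placeholder are hypothetical and are not defined by this
   document.

```python
from typing import Dict, Set


def streams_to_send(displayed: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """For each site, return the remote streams it is actually showing.

    `displayed` maps site name -> screen name -> stream label, where a
    label has the hypothetical form "<site>/<view>" or is "none".
    """
    needed: Dict[str, Set[str]] = {}
    for site, screens in displayed.items():
        needed[site] = {
            stream
            for stream in screens.values()
            # Skip blank screens and streams captured at the same site.
            if stream != "none" and not stream.startswith(site + "/")
        }
    return needed


# Example: no one is asking a question; the rear screen shows siteA.
layout = {
    "auditorium": {"front": "auditorium/professor", "rear": "siteA/room"},
    "siteA": {"center": "auditorium/professor", "second": "auditorium/class"},
}
plan = streams_to_send(layout)
```

   In this example the auditorium needs only the full-room view from
   siteA, and the questioner close-up from the auditorium is not
   distributed to any site because no site is displaying it.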

598	5.  Acknowledgements

600	   The draft has benefitted from input from a number of people including
601	   Alex Eleftheriadis, Tommy Andre Nyquist, Mark Gorzynski, Charles
602	   Eckel, Nermeen Ismail, Mary Barnes, Pascal Buhler, and Jim Cole.

604	6.  IANA Considerations

606	   This document contains no IANA considerations.

608	7.  Security Considerations

610	   While there are likely to be security considerations for any solution
611	   for telepresence interoperability, this document has no security
612	   considerations.

614	8.  Informative References

616	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
617	              Requirement Levels", BCP 14, RFC 2119, March 1997.

619	   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
620	              A., Peterson, J., Sparks, R., Handley, M., and E.
621	              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
622	              June 2002.

624	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
625	              Jacobson, "RTP: A Transport Protocol for Real-Time
626	              Applications", STD 64, RFC 3550, July 2003.

628	   [RFC4582]  Camarillo, G., Ott, J., and K. Drage, "The Binary Floor
629	              Control Protocol (BFCP)", RFC 4582, November 2006.

631	Authors' Addresses

633	   Allyn Romanow
634	   Cisco
635	   San Jose, CA  95134
636	   US

638	   Email: allyn@cisco.com

640	   Stephen Botzko
641	   Polycom
642	   Andover, MA  01810
643	   US

645	   Email: stephen.botzko@polycom.com

646	   Mark Duckworth
647	   Polycom
648	   Andover, MA  01810
649	   US

651	   Email: mark.duckworth@polycom.com

653	   Roni Even
654	   Huawei Technologies
655	   Tel Aviv,
656	   Israel

658	   Email: even.roni@huawei.com

660	   Marshall Eubanks
661	   Iformata Communications
662	   Dayton, Ohio  45402
663	   US

665	   Email: marshall.eubanks@ilformata.com