CLUE WG                                                       A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: July 16, 2011                                           Polycom
                                                        January 12, 2011


           Problem Statement for Telepresence Multi-streams
          draft-romanow-clue-telepresence-prob-statement-00.txt

Abstract

   Telepresence systems create a "being there" conferencing experience.
   A number of issues need to be solved largely by manipulating
   multiple audio and video streams.  Different systems take different
   approaches, employ different techniques, and convey information by
   using different vocabularies, making interoperability extremely
   challenging.  This problem statement describes the typical issues
   that must be solved and uses examples to illustrate the kind of
   diversity that makes interworking problematic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 16, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Fundamental Issues for Telepresence
   4.  Manipulating Media Streams
   5.  Examples of Interworking Issues
     5.1.  Designating Roles and Positions for Transmitted Streams
     5.2.  Multipoint
     5.3.  Capability Negotiation
     5.4.  Differences in Media Characteristics
       5.4.1.  Aspect Ratio
       5.4.2.  Visual Scale
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Informative References
   Authors' Addresses

1.  Introduction

   In a telepresence conference, the idea is to create a feeling of
   presence - that you are in the same room with the remote parties.
   In order to create the "being there" or telepresence experience, a
   number of technical issues need to be solved.  These issues are
   addressed by manipulating multiple media streams, video and audio -
   by describing them, controlling them, and signaling about them.
   The fundamental features of telepresence require handling multiple
   streams of media, and considering additional characteristics of
   those streams beyond those normally specified in existing
   videoconferencing standards.

   Different telepresence systems approach solving the basic issues
   differently.  They use disparate techniques, and they describe,
   control and signal media in dissimilar fashions.  Such diversity
   creates an interoperability problem.  The same issues are solved in
   different ways by different systems, so that they are not directly
   interoperable.  This makes interworking difficult at best and
   sometimes impossible.

   Some degree of interworking is possible through transcoding and
   translation.  This requires additional devices, which are expensive
   and not entirely automatic.  Specialized knowledge is required to
   operate a telepresence conference where the endpoints use different
   equipment and a transcoding and translating device is employed for
   interoperability.  Often such conferences are interrupted by
   difficulties that arise.

   The general problem that needs to be solved is this.  The
   transmitting side sends audio and video streams based upon a model
   for rendering a realistic depiction from this information.  If the
   receiving side belongs to the same vendor, it works with the same
   model and renders the information according to that shared model.
   However, if the receiver and the sender are from different vendors,
   the models they each have for rendering presence differ.

   It is as if Alice and Bob are at different sites.
   Alice needs to
   tell Bob information about what her camera and sound equipment see
   at her site so that Bob's receiver can create a display that will
   capture the important characteristics of her site.  Alice and Bob
   need to agree on what the salient characteristics are, as well as
   how to represent and communicate them.  The telepresence
   multi-stream work seeks to describe the sender's situation in a way
   that allows the receiver to render it realistically even though it
   may have a different rendering model than the sender.

   This problem statement identifies the fundamental issues that need
   to be addressed to provide telepresence in typical use case
   scenarios.  We show how different approaches to solving the problems
   and different techniques for handling multiple media create a
   challenge for interoperability.

   This document describes some of the problems that arise.  It is not
   a complete list; it is illustrative rather than exhaustive.
   Requirements, use cases and solutions are discussed in other
   documents.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Fundamental Issues for Telepresence

   The fundamental issues that must be handled to produce a typical
   telepresence conference, either point to point or multipoint,
   include:

   1.  Participant display

       A.  Placement of video
       B.  Size
       C.  Angle
       D.  Overlap
       E.  Display technology

   2.  Audio

       A.  Placement, emanating from the right place
       B.  Type of audio

   3.  Different number of screens on sender and receiver sides

   4.  Participant display for multipoint

       A.  Placement of video
       B.  Continuous presence
       C.  Control of display - how does it change? (automatic, user)

   5.  Maintaining eye contact and gaze connection

   6.  Panoramic view for site switching

   7.  Mismatches in media characteristics between sender and
       receiver, such as:

       A.  Aspect ratio
       B.  Format
       C.  Frame rate
       D.  Resolution

   8.  Presentation

       A.  What methodology?

   9.  Security

       A.  SRTP?
       B.  Key methodology

4.  Manipulating Media Streams

   In addressing the fundamental issues, multiple media streams are
   handled in the following ways:

   1.  Sender and receiver understand each other's capabilities (see
       the sketch after this list)

       A.  Number of video, audio and presentation streams that can be
           sent/received simultaneously
       B.  Which media signaling protocol is being used (SDP,
           proprietary, etc.)

   2.  Streaming control

   3.  Feedback mechanisms

   4.  Signaling about RTP payload

   5.  Media control signaling

       A.  Video refresh
       B.  Flow control

   6.  Signaling media formats and media capabilities

   7.  Signaling content type

   8.  Signaling device type

   9.  Signaling network characteristics per stream

   10. Floor control signaling
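   As a purely illustrative sketch - not a proposal, and not drawn
   from any existing protocol or data model - the per-stream
   information in item 1 above might be collected into a structure
   such as the following; all names here are hypothetical:

      # Purely illustrative sketch; the field names are hypothetical
      # and are not taken from any standard or existing system.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class StreamCapability:
          media_type: str         # "video", "audio", "presentation"
          max_simultaneous: int   # streams sendable/receivable at once
          signaling: str          # e.g. "SDP" or "proprietary"
          formats: List[str] = field(default_factory=list)

      @dataclass
      class EndpointCapabilities:
          streams: List[StreamCapability]
          floor_control: bool = False

      # Example: a three-camera endpoint advertising its capabilities.
      example = EndpointCapabilities(streams=[
          StreamCapability("video", 3, "SDP", ["H.264"]),
          StreamCapability("audio", 3, "SDP", ["AAC-LD"]),
      ])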
5.  Examples of Interworking Issues

   This section describes several examples that illustrate the kinds
   of incompatibilities that arise when different systems take
   different approaches to an issue.

5.1.  Designating Roles and Positions for Transmitted Streams

   Senders and receivers need to have the same vocabulary and
   understanding of stream roles and positions in order to place them
   appropriately.  For example, one system may define roles as:
   center, left, right, legacy center, legacy right, legacy left,
   auxiliary 1/5 fps, and auxiliary 30 fps positions.  These roles as
   defined are a combination of "input devices" + "codec type/format"
   for transmission positions, and a combination of "stream
   decoders/output devices" + "codec type/format" for receive
   positions.  Another system will not have the exact same vocabulary
   and meaning, though it still has to accomplish the same placement
   task.

   How the cameras and encoders are wired determines how the local
   scene is displayed on the remote screen.  In many systems right and
   left need to be exchanged to be seen properly, but this depends on
   the way the equipment is wired.

   In describing how to display the local scene, the language can be
   misleading if there is no agreed upon reference for right and left.
   [for example, more]

   Although often the video is displayed on separate monitors, it is
   also possible to use projectors to create a video wall.  In this
   case, there may be an overlap region between cameras which allows
   for projector blending.  Also, although cameras are generally
   arranged to create a seamless panoramic view of the participants,
   it is also possible for there to be gaps between cameras (and
   corresponding gaps between displays).

   There is also no reference for image size.  Some rooms use
   proportionally larger displays, and set the camera field of view to
   show participants either standing or sitting at life size.  Others
   use smaller displays, and set the field of view for sitting
   participants (cropping off heads when people stand).  In order to
   preserve full size display when these systems interoperate, both
   systems must rescale their video.
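   A minimal sketch of this scale mismatch, assuming hypothetical room
   measurements (none of these numbers come from a real system):

      # Hypothetical scale check; the widths are illustrative numbers,
      # not measurements from any actual system.
      def fitted_scale(sender_scene_width_m: float,
                       receiver_display_width_m: float) -> float:
          """Magnification when the full camera scene fills the display.

          A value of 1.0 means participants appear at life size; below
          1.0 they appear shrunken unless the receiver crops instead.
          """
          return receiver_display_width_m / sender_scene_width_m

      # A 3.0 m wide camera view fitted onto a 1.5 m wide display shows
      # people at half size; preserving full size requires cropping to
      # the central 1.5 m of the scene instead of scaling down.
      print(fitted_scale(3.0, 1.5))   # 0.5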
5.2.  Multipoint

   Multipoint conferences, where there are more than two endpoints,
   create a wealth of technical issues to be solved.  The primary one
   is which participants to display on each screen at each site.  If
   the number of sites is greater than can be shown on the number of
   displays at a site, this adds to the complexity.  There are, of
   course, almost unlimited ways this can be handled.  We discuss the
   common approaches and how they differ.

   The local screens can show all the camera images from a particular
   remote site (site switching); or each local screen can show a
   participant or two from each of the remote sites (segment
   switching); or local displays can show a composite of remote camera
   shots (continuous presence).

   The choice of whom to display on a screen can be determined
   statically, by users, or automatically according to some policy,
   such as voice activity level.

   [Add user-controlled personal telepresence scenario.]

   Policies are created and implemented in many ways.  They tend to be
   based on some combination of what H.323 defines as centralized and
   decentralized.  One of the challenges is that the endpoints in the
   conference may have different numbers of cameras and displays from
   each other, so a common understanding of the number of streams and
   their priority is required.  Also, the various endpoints might have
   different bandwidth constraints and support different codec
   profiles.

   A centralized multipoint conference is one in which all
   participating endpoints communicate in a point-to-point fashion
   with an MCU.  The endpoints transmit their control, audio, video,
   and/or data streams to the MCU.  The MCU centrally manages the
   conference, processes the audio, video and/or data streams, and
   returns the processed streams to each endpoint.  In this mode, the
   MCU will mix the audio streams; and if using centralized video, it
   will either use voice-activated video switching, where everyone
   sees the active speaker and the speaker sees the previous speaker,
   or continuous presence mode, where the MCU creates a video stream
   with sub-windows for each of the participants.  MCUs can support
   multiple video layouts, and these can be created automatically
   based on the number of participants or by a conference management
   application.

   There are three methods commonly used for video stream distribution
   in centralized multipoint conferences.  The three conference
   policies above can be implemented using any of these technologies.

   Simple video switching (forwarding) has the advantage of low
   latency and low complexity.  It can be used if all systems are
   capable of receiving the encodings used by the sending endpoints
   (including both the video codec and the image resolution/aspect
   ratio).  In some situations it can be wasteful of bandwidth.

   Full video transcoding usually has higher latency than switching.
   It does not require systems to be capable of receiving identical
   encodings, and different sites can connect with different
   bandwidths.

   Layered video encoding combines some of the benefits of video
   switching and video transcoding.  It is more complex than video
   switching, but less complex than video transcoding.  Bandwidth and
   resolution can be reduced for each site.  Since this is done by
   filtering out layers of the original encoding, the available
   bandwidths and resolutions are not as fine-grained as with full
   video transcoding.

   In decentralized mode, or full mesh mode, each endpoint composes
   its own display.  This requires each endpoint to receive multiple
   streams and send its video and audio to all participants, using
   multicast or unicast.

   In practice, multicast is not currently used in commercial systems,
   so the size of a strictly decentralized multipoint conference is
   limited.

   There are analogous issues for audio.  Like the video, the audio
   can be reversed, so there is no clarity on the meaning of left and
   right.  Since the number of streams, microphones, and speakers are
   not matched, the systems need to re-process the received audio in
   order to create the correct sound field for their respective rooms.

   There are two ways in which the audio might be handled in this use
   case:

   o  A single stereo audio stream is sent to the remote site, just as
      in standard videoconferencing.

   o  Three monaural audio streams are sent to the remote site, with
      proprietary signaling to associate each audio stream with a
      video stream.

   Microphone and speaker positions vary, and there is no agreed upon
   way to describe their placement.  There is no agreed upon reference
   for audio level.  In addition, audio may be sent as an independent
   stream from each microphone or as a multi-channel stream.
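   The following sketch illustrates the kind of re-processing
   described above, assuming three received mono streams and a
   receiving room with only two loudspeakers; the gain values are
   illustrative, not a recommended design:

      import math
      from typing import Dict, Tuple

      # Illustrative downmix of three received mono streams onto two
      # loudspeakers; the gains below are a sketch, not a standard.
      DOWNMIX: Dict[str, Tuple[float, float]] = {
          # stream: (gain to left speaker, gain to right speaker)
          "left":   (1.0, 0.0),
          "center": (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)),
          "right":  (0.0, 1.0),
      }

      def mix(samples: Dict[str, float]) -> Tuple[float, float]:
          """Produce one (left, right) output sample from mono inputs."""
          left = sum(DOWNMIX[s][0] * v for s, v in samples.items())
          right = sum(DOWNMIX[s][1] * v for s, v in samples.items())
          return (left, right)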
5.3.  Capability Negotiation

   Call setup for the telepresence conference will start with a single
   call establishing one video media stream.  After the connection is
   established, a proprietary capability negotiation takes place that
   enables both sides to identify that they are telepresence
   applications capable of having two or more video sessions, and
   provides the connectivity information.  The result is that two or
   more video sessions are established.  The system may use two new
   SIP call legs or just add the two new video streams to the existing
   dialog.

   [more to be added]
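   One plausible model of the outcome of such a negotiation - an
   assumption for illustration, not the actual proprietary exchange -
   is that each side advertises its maximum number of simultaneous
   video streams and the call uses the smaller value:

      # Sketch of one possible negotiation outcome; the function name
      # and the min() model are hypothetical.
      def negotiated_video_streams(local_max: int,
                                   remote_max: int) -> int:
          """Both sides must handle every stream, so take the minimum."""
          return min(local_max, remote_max)

      # e.g. a 3-screen system calling a 2-screen system:
      assert negotiated_video_streams(3, 2) == 2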
5.4.  Differences in Media Characteristics

   Media characteristics such as video format, aspect ratio, and
   visual scale can be handled differently at different sites,
   creating incompatibility.  To interwork, an adaptive strategy is
   necessary.  Although differences in media characteristics must also
   be handled in a typical video conference, the problem is made more
   complex in telepresence due to the multiple screens, cameras and
   streams.

   Two examples - aspect ratio and visual scale - are described here.

5.4.1.  Aspect Ratio

   If the aspect ratios in different sites are not the same, some
   technique needs to be applied to adjust for the difference.
   Although the same situation arises in normal video conferencing,
   the multiple streams in telepresence conferencing cause more
   difficulties.

   For simplicity let us assume a point to point case - two conference
   rooms on a point to point call.  Both rooms have 3 screens and 3
   cameras, as in Section 5.1 above.  Both rooms have identical visual
   scale - the display width and the distance between the participants
   and the displays are identical in both rooms.  However, the
   equipment - cameras and displays - in each room has a different
   aspect ratio: 16:9 in one room and 4:3 in the other.

   Although 4:3 is usually associated with standard definition TV and
   16:9 with HDTV, telepresence systems may choose the aspect ratio to
   obtain a particular field of view.  Projecting images in the 16:9
   aspect ratio offers a wider presentation angle that shows fine
   details well (the pixel density is greater than a 4:3 system of the
   same resolution and scale).  In the room with 16:9 media
   characteristics, people are shown at full size when they are
   seated.  However, when they stand up, the height of the display
   results in their image being cropped so that their heads are not
   shown.  The other room uses projectors to display HD images with
   4:3 aspect ratios.  This results in an increased image height - the
   vertical field of view is 33% greater than the 16:9 system.  The
   increased height allows most of the population to be shown full
   size whether they are standing or sitting.

   Some strategy is necessary to deal with the case of the two sites
   having a point to point call.  In order to convert formats of
   unequal ratios, a variety of techniques can be used, such as:
   zooming (enlarging) and cropping (removing), letterboxing (adding
   horizontal bars), pillarboxing (adding vertical bars) to retain the
   original format's aspect ratio, or scaling (which distorts) in a
   variety of ways.

   For the video sent from the 4:3 room to the 16:9 room, several
   techniques can be used:

   1.  The 16:9 system might simply crop the top 1/4 of each 4:3
       image.  This will result in full size display, eye contact, and
       gaze awareness for the individuals who are seated.  However,
       the standing presenter's head will be cropped.

   2.  The 16:9 system might stretch each of the 4:3 images to fully
       fit the 16:9 display.  This would reduce image height (creating
       geometric distortion) and create eye-contact error.  Continuity
       of the panoramic image would be preserved.

   3.  The 16:9 system could pillarbox each of the 4:3 images, placing
       vertical borders on the three displays.  This results in
       reducing the image size to less than full size.  It also
       destroys the continuity of the panoramic image, and introduces
       additional error in eye contact and gaze awareness.

   4.  The 16:9 system could pillarbox only the center display.  This
       reduces the size of the presenter, who is the focus of the
       meeting.

   5.  The 16:9 system could also crop the bottom of the center
       display.  Visually this reduces the height of the presenter,
       but maintains full size.  There is a vertical discontinuity in
       the panoramic image.  Whether this is objectionable or not
       depends on the room layout.

   Strategies 4 and 5 could be accomplished in response to a user
   command or automatically.  The details will be discussed in future
   documents.

   For the video sent from the 16:9 room to the 4:3 room, the
   receiving system simply letterboxes the video.  Since the scales
   are identical, the image is displayed at full size in the 4:3 room.
   When letterboxing, the common techniques for placing the border
   are:

   1.  The 4:3 system places the border above the image.  This
       maintains eye contact for those who are seated, but cannot
       maintain eye contact for the presenter.

   2.  The 4:3 system places the border below the images.  If the 16:9
       system crops the bottom of the center display, then this will
       maintain eye contact for the presenter and the remote site.

   3.  The 4:3 system centers the images.  Eye contact suffers for
       everyone, but the worst case eye contact error is better
       controlled.

   In this use case, negotiation between the systems is not strictly
   necessary, no matter which scheme is used.  However, the best user
   experience is obtained if both systems have knowledge about the
   aspect ratios being used and which participants are standing and
   which are sitting, so they can adjust optimally.
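   The fractions used in the techniques above follow directly from the
   two aspect ratios; the following short check of the arithmetic is
   illustrative only:

      # Worked arithmetic for the conversions described above; the
      # figures follow directly from the two aspect ratios.
      from fractions import Fraction

      ar_43 = Fraction(4, 3)
      ar_169 = Fraction(16, 9)

      # Vertical field of view of 4:3 relative to 16:9 at equal width:
      extra_height = ar_169 / ar_43 - 1
      print(f"4:3 shows {float(extra_height):.0%} more height")  # 33%

      # Cropping 4:3 down to 16:9 at equal width removes this fraction
      # of the image height (the "top 1/4" in technique 1 above):
      cropped = 1 - ar_43 / ar_169
      print(f"crop {float(cropped):.0%} of the image height")    # 25%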
5.4.2.  Visual Scale

   The visual scale of displays may differ between sites.  Again, let
   us use the point to point case as a simple example.  Assume two
   conference rooms in a point to point call.  One room is designed
   for 6 participants, and has three 16:9 screens and 3 cameras.  This
   room is designed to show participants at their normal size when
   seated (2 participants per camera/display).  It does not have
   adequate display height to capture those who are standing.  The
   second room is also designed for 6 participants, but shows 3
   participants per camera/display, also at their full size.
   Therefore, it only needs two 16:9 camera/display pairs.  Since the
   field of view in both the vertical and horizontal is increased by
   50%, it also shows those who are standing without cropping.

   For the video sent from the 2 screen (larger scale) room to the 3
   screen (smaller scale) room, two approaches can be used:

   1.  The 3 screen system might simply show the participants on two
       of its displays.  Participants will be shown at 67% of their
       full size.  Eye contact and gaze awareness will be lost.

   2.  The 3 screen system might construct and display a vertically
       cropped 3-screen view, showing 2 participants on each screen.
       Participants will be shown at full size, with preservation of
       eye contact and gaze awareness.

   For the video sent from the 3 screen to the 2 screen room, there
   are two analogous approaches:

   1.  The 2 screen system selects 2 streams and simply shows them on
       its displays.  Participants will be shown at 150% of their
       normal size.  Eye contact and gaze awareness will be lost, and
       some of the remote site is lost.

   2.  The 2 screen system might construct and display a 2 screen view
       (with a horizontal border at the top) which shows 3
       participants on each screen.  Participants will be shown at
       full size, with preservation of eye contact and gaze awareness.

   Although there is no need for negotiation between the systems, the
   best user experience is obtained if both systems have knowledge of
   the visual scale and where individuals are seated, and can then
   choose the best manner of display.

6.  IANA Considerations

   This document contains no IANA considerations.

7.  Security Considerations

   While there are likely to be security considerations for any
   solution for telepresence interoperability, this document has no
   security considerations.

8.  Acknowledgements

   The draft has benefitted from input from a number of people
   including Roni Even, Jim Cole, Nermeen Ismail, and Nathan Buckles.

9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA  95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA  01810
   US

   Email: stephen.botzko@polycom.com