DISPATCH WG                                                   A. Romanow
Internet-Draft                                                     Cisco
Intended status: Informational                                 S. Botzko
Expires: January 13, 2011                                        Polycom
                                                           July 12, 2010

           Problem Statement for Telepresence Multi-streams
        draft-romanow-dispatch-telepresence-prob-statement-01.txt

Abstract

   Telepresence systems create a "being there" conferencing experience.
   A number of issues need to be solved, largely by manipulating
   multiple audio and video streams.  Different systems take different
   approaches, employ different techniques, and convey information
   using different vocabularies, making interoperability extremely
   challenging.  This problem statement describes the typical issues
   that must be solved and uses examples to illustrate the kind of
   diversity that makes interworking problematic.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 13, 2011.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Fundamental Issues for Telepresence
   4.  Manipulating Media Streams
   5.  Examples of Interworking Issues
     5.1.  Designating Roles and Positions for Transmitted Streams
     5.2.  Multipoint
     5.3.  Capability Negotiation
     5.4.  Differences in Media Characteristics
       5.4.1.  Aspect Ratio
       5.4.2.  Visual Scale
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  Informative References
   Authors' Addresses

1.  Introduction

   In a telepresence conference, the idea is to create a feeling of
   presence - that you are in the same room with the remote parties.
   In order to create the "being there" or telepresence experience, a
   number of technical issues need to be solved.  These issues are
   addressed by manipulating multiple media streams, video and audio -
   by describing them, controlling them, and signaling about them.
   The fundamental features of telepresence require handling multiple
   streams of media and considering additional characteristics of
   those streams beyond those normally specified in existing
   videoconferencing standards.

   Different telepresence systems approach the basic issues
   differently.  They use disparate techniques, and they describe,
   control, and signal media in dissimilar fashions.  Such diversity
   creates an interoperability problem: because the same issues are
   solved in different ways by different systems, the systems are not
   directly interoperable.  This makes interworking difficult at best
   and sometimes impossible.

   Some degree of interworking is possible through transcoding and
   translation.  This requires additional devices, which are expensive
   and not entirely automatic.  Specialized knowledge is required to
   operate a telepresence conference in which the endpoints use
   different equipment and a transcoding and translating device is
   employed for interoperability.  Often such conferences are
   interrupted by difficulties that arise.

   The general problem that needs to be solved is this: the
   transmitting side sends audio and video streams based upon a model
   for rendering a realistic depiction from this information.  If the
   receiving side comes from the same vendor, it works with the same
   model and renders the information according to that shared model.
   However, if the receiver and the sender are from different vendors,
   the models they each have for rendering presence differ.
   It is as if Alice and Bob are at different sites.  Alice needs to
   tell Bob what her cameras and sound equipment capture at her site
   so that Bob's receiver can create a display that reproduces the
   important characteristics of her site.  Alice and Bob need to agree
   on what the salient characteristics are, as well as on how to
   represent and communicate them.  The telepresence multi-stream work
   seeks to describe the sender's situation in a way that allows the
   receiver to render it realistically, even though the receiver may
   have a different rendering model than the sender.

   This problem statement identifies the fundamental issues that need
   to be addressed to provide telepresence in typical use case
   scenarios.  We show how different approaches to solving the
   problems and different techniques for handling multiple media
   create a challenge for interoperability.

   This document describes some of the problems that arise; it is not
   a complete list, being illustrative rather than exhaustive.
   Requirements, use cases, and solutions are discussed in other
   documents.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Fundamental Issues for Telepresence

   The fundamental issues that must be handled to produce a typical
   telepresence conference, either point-to-point or multipoint,
   include:

   1.  Participant display

       A.  Placement of video

       B.  Size

       C.  Angle

       D.  Overlap

       E.  Display technology

   2.  Audio

       A.  Placement, so sound emanates from the right place

       B.  Type of audio

   3.  Different numbers of screens on the sender and receiver sides

   4.  Participant display for multipoint

       A.  Placement of video

       B.  Continuous presence

       C.  Control of display: how does it change?  Automatic or
           user-controlled

   5.  Maintaining eye contact and gaze connection

   6.  Panoramic view for site switching

   7.  Mismatches in media characteristics between sender and
       receiver, such as:

       A.  Aspect ratio

       B.  Format

       C.  Frame rate

       D.  Resolution

   8.  Presentation

       A.  What methodology?

   9.  Security

       A.  SRTP?

       B.  Key methodology

4.  Manipulating Media Streams

   In addressing the fundamental issues, multiple media streams are
   handled in the following ways (a sketch of how such per-stream
   information might be represented follows the list):

   1.   Sender and receiver understand each other's capabilities

        A.  Number of video, audio, and presentation streams that can
            be sent/received simultaneously

        B.  Which media signaling protocol is being used (SDP,
            proprietary, etc.)

   2.   Streaming control

   3.   Feedback mechanisms

   4.   Signaling about RTP payload

   5.   Media control signaling

        A.  Video refresh

        B.  Flow control

   6.   Signaling media formats and media capabilities

   7.   Signaling content type

   8.   Signaling device type

   9.   Signaling network characteristics per stream

   10.  Floor control signaling
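   As an illustration of the preceding list, the following Python
   sketch shows the kind of per-stream description and endpoint
   capability set that the items above imply.  It is a minimal sketch
   for exposition only; all class and field names are hypothetical and
   are not taken from any standard or product.

      from dataclasses import dataclass, field

      @dataclass
      class StreamDescription:
          media_type: str      # "video", "audio", or "presentation"
          codec: str           # media format, e.g. "H.264"
          content_type: str    # e.g. "participants", "presentation"
          role: str            # placement role, e.g. "center"
          resolution: tuple    # (width, height) in pixels
          frame_rate: float    # frames per second
          bandwidth_kbps: int  # per-stream network constraint

      @dataclass
      class EndpointCapabilities:
          max_video_streams: int
          max_audio_streams: int
          max_presentation_streams: int
          signaling_protocol: str          # "SDP", proprietary, etc.
          streams: list = field(default_factory=list)

   Today no such common representation exists; each vendor captures
   roughly this information in its own form, which is the root of the
   interoperability problem illustrated next.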
5.  Examples of Interworking Issues

   This section describes several examples that illustrate the kinds
   of incompatibilities that arise when different systems take
   different approaches to an issue.

5.1.  Designating Roles and Positions for Transmitted Streams

   Senders and receivers need to have the same vocabulary and
   understanding of stream roles and positions in order to place the
   streams appropriately.  For example, one system may define roles
   as: center, left, right, legacy center, legacy right, legacy left,
   auxiliary 1/5 fps, and auxiliary 30 fps positions.  These roles as
   defined are a combination of "input devices" + "codec type/format"
   for transmission positions, and a combination of "stream decoders/
   output devices" + "codec type/format" for receive positions.
   Another system will not have the exact same vocabulary and meaning,
   though it still has to accomplish the same placement task; a sketch
   of the resulting mapping problem appears at the end of this
   section.

   How the cameras and encoders are wired determines how the local
   scene is displayed on the remote screen.  In many systems right and
   left need to be exchanged to be seen properly, but this depends on
   the way the equipment is wired.

   In describing how to display the local scene, the language can be
   misleading if there is no agreed upon reference for right and left.
   [for example, more]

   Although the video is often displayed on separate monitors, it is
   also possible to use projectors to create a video wall.  In this
   case, there may be an overlap region between cameras that allows
   for projector blending.  Also, although cameras are generally
   arranged to create a seamless panoramic view of the participants,
   it is also possible for there to be gaps between cameras (and
   corresponding gaps between displays).

   There is also no reference for image size.  Some rooms use
   proportionally larger displays and set the camera field of view to
   show participants at life size, whether standing or sitting.
   Others use smaller displays and set the field of view for sitting
   participants (cropping off heads when people stand).  In order to
   preserve full size display when these systems interoperate, both
   systems must rescale their video.
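   The following sketch makes the vocabulary mismatch concrete.  The
   role names for "vendor B" are invented for illustration; in
   practice neither the names nor the mapping are published, and a
   mapping like this would be needed for every pair of vendors.

      # Hypothetical mapping from one vendor's role vocabulary to
      # another's.  Neither vocabulary is taken from a real product.
      VENDOR_A_TO_B = {
          "center": "main",
          "left": "wide_1",
          "right": "wide_2",
          "auxiliary 30 fps": "content_video",
          "auxiliary 1/5 fps": "content_still",
      }

      def translate_role(role_a: str) -> str:
          """Map vendor A's stream role to vendor B's nearest
          equivalent."""
          try:
              return VENDOR_A_TO_B[role_a]
          except KeyError:
              # Roles such as "legacy center" may have no counterpart
              # at all, so placement information is simply lost.
              raise ValueError(f"no equivalent for role {role_a!r}")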
5.2.  Multipoint

   Multipoint conferences, in which there are more than two endpoints,
   create a wealth of technical issues to be solved.  The primary one
   is which participants to display on each screen at each site.  If
   the number of sites is greater than can be shown on the number of
   displays at a site, this adds to the complexity.  There are, of
   course, almost unlimited ways this can be handled.  We discuss the
   common approaches and how they differ.

   The local screens can show all the camera images from a particular
   remote site (site switching); or each local screen can show a
   participant or two from each of the remote sites (segment
   switching); or the local displays can show a composite of remote
   camera shots (continuous presence).  The choice of whom to display
   on a screen can be made by users or, more often, automated
   according to voice activity level.

   [Add user-controlled personal telepresence scenario.]

   Policies are created and implemented in many ways.  They tend to be
   based on some combination of what H.323 defines as centralized and
   decentralized conferencing.  One of the challenges is that the
   endpoints in the conference may have different numbers of cameras
   and displays from each other, so agreement on a common mode
   covering the number of streams and their priority is required.
   Also, the various endpoints might have different bandwidth
   constraints and support different codec profiles.

   A centralized multipoint conference is one in which all
   participating endpoints communicate in a point-to-point fashion
   with an MCU.  The endpoints transmit their control, audio, video,
   and/or data streams to the MCU.  The MCU centrally manages the
   conference, processes the audio, video, and/or data streams, and
   returns the processed streams to each endpoint.  In this mode, the
   MCU mixes the audio streams; if using centralized video, it will
   either use voice-activated video switching, in which everyone sees
   the active speaker and the speaker sees the previous speaker, or
   continuous presence mode, in which the MCU creates a video stream
   with sub-windows for each of the participants.  MCUs can support
   multiple video layouts, which can be created automatically based on
   the number of participants or by a conference management
   application.

   There are three methods commonly used for video stream distribution
   in centralized multipoint conferences.  The three conference
   policies above can be implemented using any of these technologies.

   Simple video switching (forwarding) has the advantage of low
   latency and low complexity.  It can be used if all systems are
   capable of receiving the encodings used by the sending endpoints
   (including both the video codec and the image resolution/aspect
   ratio).  In some situations it can be wasteful of bandwidth.

   Full video transcoding usually has higher latency than switching.
   It does not require systems to be capable of receiving identical
   encodings, and different sites can connect with different
   bandwidths.

   Layered video encoding combines some of the benefits of video
   switching and video transcoding.  It is more complex than video
   switching, but less complex than video transcoding.  Bandwidth and
   resolution can be reduced for each site.  Since this is done by
   filtering out layers of the original encoding, the available
   bandwidths and resolutions are not as fine-grained as with full
   video transcoding.

   In decentralized (full mesh) mode, each endpoint composes its own
   display.  This requires each endpoint to receive multiple streams
   and to send its video and audio to all participants, using
   multicast or unicast.

   In practice, multicast is not currently used in commercial systems,
   so the size of a strictly decentralized multipoint conference is
   limited.

   There are analogous issues for audio.  As with video, right and
   left may need to be exchanged, so there is no clarity on the
   meaning of left and right.  Since the numbers of streams,
   microphones, and speakers are not matched, the systems need to
   re-process the received audio in order to create the correct sound
   field for their respective rooms.

   There are two ways in which the audio might be handled in this use
   case:

   o  A single stereo audio stream is sent to the remote site, just as
      in standard videoconferencing.

   o  Three monaural audio streams are sent to the remote site, with
      proprietary signaling to associate each audio stream with a
      video stream.

   Microphone and speaker positions vary, and there is no agreed upon
   way to describe their placement.  There is no agreed upon reference
   for audio level.  In addition, audio may be sent as an independent
   stream from each microphone or as a multi-channel stream.
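   The second bullet above hides a pairing problem: the receiver must
   know which audio stream belongs with which video stream.  The
   following sketch shows the association step under the assumption
   that both sides share position labels; today those labels, and the
   signaling that carries them, are proprietary, so the assumption
   does not hold between vendors.

      # Position labels ("left", "center", "right") are assumed to be
      # shared by both sides; in practice they are vendor-specific.
      audio_streams = {"a1": "left", "a2": "center", "a3": "right"}
      video_streams = {"v1": "left", "v2": "center", "v3": "right"}

      def pair_streams(audio, video):
          """Pair each audio stream with the video stream at the same
          position, so the sound field matches the image."""
          pairs = []
          for a_id, a_pos in audio.items():
              match = next((v_id for v_id, v_pos in video.items()
                            if v_pos == a_pos), None)
              if match is None:
                  # With no shared reference for left and right, an
                  # interworking receiver cannot make this match.
                  continue
              pairs.append((a_id, match))
          return pairs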
5.3.  Capability Negotiation

   Call setup for the telepresence conference starts with a single
   call establishing one video media stream.  After the connection is
   established, a proprietary capability negotiation takes place that
   enables both sides to identify that they are telepresence
   applications capable of carrying two or more video sessions, and to
   provide the connectivity information.  The result is that two or
   more video sessions are established.  The system may use new SIP
   call legs or simply add the new video streams to the existing
   dialog.

   [more to be added]

5.4.  Differences in Media Characteristics

   Media characteristics such as video format, aspect ratio, and
   visual scale can be handled differently at different sites,
   creating incompatibility.  To interwork, an adaptive strategy is
   necessary.  Although differences in media characteristics must also
   be handled in a typical video conference, the problem is made more
   complex in telepresence by the multiple screens, cameras, and
   streams.

   Two examples - aspect ratio and visual scale - are described here.

5.4.1.  Aspect Ratio

   If the aspect ratios at different sites are not the same, some
   technique needs to be applied to adjust for the difference.
   Although the same situation arises in normal video conferencing,
   the multiple streams in telepresence conferencing cause more
   difficulties.

   For simplicity, let us assume a point-to-point case - two
   conference rooms in a point-to-point call.  Both rooms have 3
   screens and 3 cameras, as in the examples above.  Both rooms have
   identical visual scale - the display width and the distance between
   the participants and the displays are identical in both rooms.
   However, the equipment - cameras and displays - in each room has a
   different aspect ratio: 16:9 in one room and 4:3 in the other.

   Although 4:3 is usually associated with standard definition TV and
   16:9 with HDTV, telepresence systems may choose the aspect ratio to
   obtain a particular field of view.  Projecting images in the 16:9
   aspect ratio offers a wider presentation angle that shows fine
   details well (the pixel density is greater than in a 4:3 system of
   the same resolution and scale).  In the room with 16:9 equipment,
   people are shown at full size when they are seated.  However, when
   they stand up, the height of the display results in their image
   being cropped so that their heads are not shown.  The other room
   uses projectors to display HD images with 4:3 aspect ratios.  This
   results in an increased image height - the vertical field of view
   is 33% greater than in the 16:9 system.  The increased height
   allows most of the population to be shown full size whether they
   are standing or sitting.

   Some strategy is necessary to deal with the case of these two sites
   having a point-to-point call.  In order to convert formats of
   unequal ratios, a variety of techniques can be used, such as:
   zooming (enlarging) and cropping (removing), letterboxing (adding
   horizontal bars) or pillarboxing (adding vertical bars) to retain
   the original format's aspect ratio, or scaling (which distorts) in
   a variety of ways.
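   The fractions involved in these techniques follow directly from the
   two aspect ratios.  The following sketch, with illustrative
   function names only, computes the cropped or bar fraction for the
   4:3 and 16:9 rooms; it is a worked example, not part of any
   specified behavior.

      from fractions import Fraction

      def ar(w, h):
          """Aspect ratio expressed as width/height."""
          return Fraction(w, h)

      def crop_fraction(src, dst):
          """Fraction of source height removed when cropping a taller
          source (e.g. 4:3) to fill a wider target (e.g. 16:9)."""
          return 1 - src / dst

      def bar_fraction(a, b):
          """Fraction of the target occupied by bars when one ratio is
          fitted inside the other: horizontal bars (letterbox) if the
          source is wider, vertical bars (pillarbox) if narrower."""
          return 1 - min(a, b) / max(a, b)

      print(crop_fraction(ar(4, 3), ar(16, 9)))  # 1/4 of the height
      print(bar_fraction(ar(16, 9), ar(4, 3)))   # 1/4 letterbox bars
      print(bar_fraction(ar(4, 3), ar(16, 9)))   # 1/4 pillarbox bars

   These numbers explain the "crop the top 1/4" figure in the first
   technique below.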
   For the video sent from the 4:3 room to the 16:9 room, several
   techniques can be used:

   1.  The 16:9 system might simply crop the top 1/4 of each 4:3
       image.  This will result in full size display, eye contact, and
       gaze awareness for the individuals who are seated.  However,
       the standing presenter's head will be cropped.

   2.  The 16:9 system might stretch each of the 4:3 images to fully
       fit the 16:9 display.  This would reduce image height (creating
       geometric distortion) and create eye-contact error.  Continuity
       of the panoramic image would be preserved.

   3.  The 16:9 system could pillarbox each of the 4:3 images, placing
       vertical borders on the three displays.  This reduces the image
       size to less than full size.  It also destroys the continuity
       of the panoramic image and introduces additional error in eye
       contact and gaze awareness.

   4.  The 16:9 system could pillarbox only the center display.  This
       reduces the size of the presenter, who is the focus of the
       meeting.

   5.  The 16:9 system could also crop the bottom of the center
       display.  Visually this reduces the height of the presenter but
       maintains full size.  There is a vertical discontinuity in the
       panoramic image.  Whether this is objectionable or not depends
       on the room layout.

   Strategies 4 and 5 could be applied in response to a user command
   or automatically.  The details will be discussed in future
   documents.

   For the video sent from the 16:9 room to the 4:3 room, the
   receiving system simply letterboxes the video.  Since the scales
   are identical, the image is displayed at full size in the 4:3 room.
   The common techniques for placing the letterbox border are:

   1.  The 4:3 system places the border above the image.  This
       maintains eye contact for those who are seated, but cannot
       maintain eye contact for the presenter.

   2.  The 4:3 system places the border below the images.  If the 16:9
       system crops the bottom of the center display, then this will
       maintain eye contact for the presenter and the remote site.

   3.  The 4:3 system centers the images.  Eye contact suffers for
       everyone, but the worst case eye contact error is better
       controlled.

   In this use case, negotiation between the systems is not strictly
   necessary, no matter which scheme is used.  However, the best user
   experience is obtained if both systems have knowledge of the aspect
   ratios being used and of which participants are standing and which
   are sitting, so they can adjust optimally.

5.4.2.  Visual Scale

   The visual scale of displays may differ between sites.  Again, let
   us use the point-to-point case as a simple example.  Assume two
   conference rooms in a point-to-point call.  One room is designed
   for 6 participants and has three 16:9 screens and 3 cameras.  This
   room is designed to show participants at their normal size when
   seated (2 participants per camera/display).  It does not have
   adequate display height to capture those who are standing.  The
   second room is also designed for 6 participants, but shows 3
   participants per camera/display, also at their full size.
   Therefore, it only needs two 16:9 camera/display pairs.  Since the
   field of view in both the vertical and the horizontal is increased
   by 50%, it also shows those who are standing, without cropping.
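   The size figures in the approaches below follow from the ratio of
   how many participants a stream was framed for to how many the
   receiving screen was scaled to show at life size.  A minimal worked
   example, with illustrative names only:

      def displayed_size(people_in_stream, screen_capacity):
          """Relative size at which participants appear when a stream
          framed for people_in_stream participants fills a display
          whose visual scale shows screen_capacity participants at
          life size."""
          return screen_capacity / people_in_stream

      # 2-screen room's streams (3 people each) shown unadapted in
      # the 3-screen room (screens scaled for 2 people): 2/3 = 67%.
      print(displayed_size(3, 2))

      # 3-screen room's streams (2 people each) shown unadapted in
      # the 2-screen room (screens scaled for 3 people): 3/2 = 150%.
      print(displayed_size(2, 3))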
   For the video sent from the 2 screen (larger scale) room to the 3
   screen (smaller scale) room, two approaches can be used:

   1.  The 3 screen system might simply show the participants on two
       of its displays.  Participants will be shown at 67% of their
       full size.  Eye contact and gaze awareness will be lost.

   2.  The 3 screen system might construct and display a vertically
       cropped 3-screen view, showing 2 participants on each screen.
       Participants will be shown at full size, with preservation of
       eye contact and gaze awareness.

   For the video sent from the 3 screen room to the 2 screen room,
   there are two analogous approaches:

   1.  The 2 screen system selects 2 of the streams and simply shows
       them on its displays.  Participants will be shown at 150% of
       their normal size.  Eye contact and gaze awareness will be
       lost, and part of the remote site is not shown.

   2.  The 2 screen system might construct and display a 2 screen view
       (with a horizontal border at the top) that shows 3 participants
       on each screen.  Participants will be shown at full size, with
       preservation of eye contact and gaze awareness.

   Although there is no need for negotiation between the systems, the
   best user experience is obtained if both systems have knowledge of
   the visual scale and of where individuals are seated, and can then
   choose the best manner of display.

6.  IANA Considerations

   This document contains no IANA considerations.

7.  Security Considerations

   While there are likely to be security considerations for any
   solution for telepresence interoperability, this document has no
   security considerations.

8.  Acknowledgements

   The draft has benefited from input from a number of people,
   including Roni Even, Jim Cole, Nermeen Ismail, and Nathan Buckles.

9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Allyn Romanow
   Cisco
   San Jose, CA  95134
   US

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA  01810
   US

   Email: stephen.botzko@polycom.com