CLUE                                                      C. Groves, Ed.
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: August 22, 2013                                          Huawei
                                                       February 18, 2013

                     CLUE media capture description
                   draft-groves-clue-capture-attr-01

Abstract

   This memo discusses how media captures are described in the current
   CLUE framework document, in particular via the content attribute,
   and proposes several alternative capture description attributes.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   One of the fundamental aspects of the CLUE framework is the concept
   of media captures. Media captures are sent from a provider to a
   consumer. The consumer then selects which captures it is interested
   in and replies back to the provider. The question is: how does the
   consumer choose between what may be many different media captures?

   In order to choose between the different media captures, the
   consumer must have enough information about what each media capture
   represents to distinguish between the captures.

   The CLUE framework draft currently defines several media capture
   attributes which provide information regarding a capture. The draft
   indicates that media capture attributes describe static information
   about the captures. A provider uses the media capture attributes to
   describe the media captures to the consumer; the consumer then
   selects the captures it wants to receive. Attributes are defined by
   a variable and its value.

   One of the media capture attributes is the content attribute. As
   indicated in the draft, it is a field with enumerated values which
   describes the role of the media capture and can be applied to any
   media type. The enumerated values are defined by RFC 4796 [RFC4796]:
   the values for this attribute are the same as the "mediacnt" values
   for the content attribute in RFC 4796 [RFC4796]. The attribute can
   have multiple values, for example content={main, speaker}.
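   As a purely illustrative sketch (not part of any CLUE
   specification), the following Python fragment models the multi-
   valued content attribute as a set drawn from the RFC 4796 value set,
   whose definitions are quoted next; the class and field names are
   hypothetical.

      # Hypothetical sketch: the multi-valued "content" attribute as a
      # set of the RFC 4796 enumerated values. Names are illustrative;
      # the CLUE framework defines no such API.

      RFC4796_CONTENT_VALUES = {"slides", "speaker", "sl", "main",
                                "alt"}

      class MediaCapture:
          def __init__(self, capture_id, media_type, content=()):
              self.capture_id = capture_id
              self.media_type = media_type     # e.g. "audio" or "video"
              self.content = set(content)      # multi-valued attribute
              unknown = self.content - RFC4796_CONTENT_VALUES
              if unknown:
                  raise ValueError(f"unknown content values: {unknown}")

      # A video capture tagged content={main, speaker}, as above.
      vc = MediaCapture("VC1", "video", content={"main", "speaker"})
      print("speaker" in vc.content)           # True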
   RFC 4796 [RFC4796] defines the values as:

      slides: the media stream includes presentation slides. The media
      type can be, for example, a video stream or a number of instant
      messages with pictures. Typical use cases for this are online
      seminars and courses. This is similar to the 'presentation' role
      in H.239.

      speaker: the media stream contains the image of the speaker. The
      media can be, for example, a video stream or a still image.
      Typical use cases for this are online seminars and courses.

      sl: the media stream contains sign language. A typical use case
      for this is an audio stream that is translated into sign
      language, which is sent over a video stream.

   RFC 4796 [RFC4796] also defines the values "main" and "alt",
   indicating that the media stream is taken from the main source or
   from an alternative source; these values are discussed further in
   Section 3.

   Whilst the above values appear to be a simple way of conveying the
   content of a stream, the contributors believe that there are
   multiple issues that make the use of the existing "Content" tag
   insufficient for CLUE and multi-stream telepresence systems. These
   issues are described in Section 3. Section 4 proposes new capture
   description attributes.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   This document draws liberally from the terminology defined in the
   CLUE framework [I-D.ietf-clue-framework].

3.  Issues with Content attribute

3.1.  Ambiguous definition

   There is ambiguity in the definitions that may cause problems for
   interoperability. A clear example is "slides", which could be any
   form of presentation media. Another example is the difference
   between "main" and "alt". In a telepresence scenario the room would
   be captured by the "main" cameras and a speaker would be captured by
   an alternative camera. This runs counter to the definition of "alt".

   Another example is a university use case where:

      The main site is a university auditorium which is equipped with
      three cameras. One camera is focused on the professor at the
      podium. A second camera is mounted on the wall behind the
      professor and captures the class in its entirety. The third
      camera is co-located with the second, and is designed to capture
      a close-up view of a questioner in the audience. It automatically
      zooms in on that student using sound localization.

   For the first camera, it is not clear whether to use "main" or
   "speaker"; according to the definition and example of "speaker" in
   RFC 4796 [RFC4796], it may be more appropriate to use "speaker". The
   third camera could fit the definition of "main", "alt" or "speaker".

3.2.  Multiple functions

   It appears that the definitions cover disparate functions. "Main"
   and "alt" appear to describe the source from which media is sent.
   "Speaker" indicates a role associated with the media stream.
   "Slides" and "sl" (sign language) indicate the actual content. Some
   prioritization is also indirectly applied to these parameters; for
   example, the IMTC document on best practices for H.239 indicates a
   display priority between "main" and "alt". This mixing of functions
   per code point can lead to ambiguous behaviour and interoperability
   problems. It is also an issue when extending the values.

3.3.  Limited Stream Support

   The values above appear to be defined based on the small number of
   video streams typically supported by legacy video conferencing,
   e.g. a main video stream (main), a secondary one (alt) and perhaps a
   presentation stream (slides). It is not clear how these values scale
   when many media streams are present. For example, if there are
   several main streams and several presentation streams, how would an
   endpoint distinguish between them?

3.4.  Insufficient information for individual parameters

   Related to the above point, some individual values do not provide
   sufficient information for an endpoint to make an educated decision
   about the content. For example, sign language (sl): if a conference
   provides multiple streams, each containing an interpretation into a
   different sign language, how does an endpoint distinguish between
   the languages if "sl" is the only label? Accessible services also
   require support for other functions, such as real-time captioning
   and video description, where an additional audio channel is used to
   describe the conference for vision-impaired people.

   Note: SDP provides a language attribute.

3.5.  Insufficient information for negotiation

   CLUE negotiation is likely to occur at the start of session
   initiation. At this point only a very simple SDP description
   (i.e. limited media description) may be available, depending on the
   call flow. In most cases the supported media captures may be agreed
   upon before the full SDP information for each media stream is
   available. The effect of this is that detailed information would not
   be available for the initial decision about which capture to choose.
   The obvious solution is to provide "enough" data in the CLUE
   provider messages so that a consumer can choose the appropriate
   media captures. The current CLUE framework already partly addresses
   this through the "Content" attribute; however, based on the current
   "Content" values it appears that the information is not sufficient
   to fully describe the content of the captures.

   The purpose of the CLUE work is to supply enough information for
   negotiating multiple streams. The CLUE framework
   [I-D.ietf-clue-framework] addresses the spatial relation between the
   streams, but it does not appear to provide enough information about
   the semantic content of the streams to allow interoperability.

   Some information is available in SDP and may be available before the
   CLUE exchange, but some information is still missing.

4.  Capture description attributes

   As indicated above, it is proposed to introduce one or more new
   attributes that allow the definition of various pieces of
   information providing metadata about a particular media capture.
   Each piece of information should be described in a way that supplies
   only one atomic function. It should also be applicable in a multi-
   stream environment, and extensible so that new information elements
   can be introduced in the future.

   As an initial list, the following attributes are proposed for use as
   metadata associated with media captures. Further attributes may be
   identified in the future.

   This document proposes to remove the "Content" attribute. Rather
   than describing the "source device" in this way, it may be better to
   describe its characteristics, i.e.:

      An attribute to indicate "Presentation" rather than the value
      "Slides".

      An attribute to describe the "Role" of a capture rather than the
      value "Speaker".

      An attribute to indicate the actual language used rather than a
      value "Sign Language". This is also applicable to multiple audio
      streams.

      With respect to "main" and "alt", in a multiple-stream
      environment it is not clear that these values are needed if the
      characteristics of the capture are described. An assumption may
      be that a capture is "main" unless described otherwise.

   Note: CLUE may have missed a media type "text". Consider real-time
   captioning or a real-time text conversation associated with a video
   meeting: it is a text-based service, not necessarily a presentation
   stream, and neither audio nor visual, but it is a valid component of
   a conference.

   The sections below contain an initial list of attributes.

4.1.  Presentation

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data, etc. Where more information is known about the capture it may
   be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image,
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

4.2.  View

   The area-of-capture attribute provides a physical indication of the
   region that a media capture captures; however, the consumer does not
   know what this physical region relates to. In discussions on the
   IETF mailing list it is apparent that some people propose to use the
   "Description" attribute to describe a scene. This is a free-text
   field and as such can be used to signal any piece of information,
   which leads to problems with interoperability if the field is
   automatically processed. For interoperability purposes it is
   therefore proposed to introduce a set of keywords (that may be
   expanded) indicating what the spatial region relates to, e.g. room,
   table, etc., that could be used as a basis for the selection of
   captures. It is envisaged that this list would be extensible to
   allow for future uses not covered by the initial specification. This
   is an initial description of an attribute introducing these
   keywords.

   This attribute provides a textual description of the area that a
   media capture captures. It provides supplementary information in
   addition to the spatial information (i.e. area of capture) regarding
   the region that is captured. A hypothetical selection sketch follows
   the list below.

      Room - Captures the entire scene.

      Table - Captures the conference table with seated participants.

      Individual - Captures an individual participant.

      Lectern - Captures the region of the lectern including the
      presenter in a classroom-style conference.

      Audience - Captures a region showing the audience in a classroom-
      style conference.

      Others - TBD
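   The following Python sketch (illustrative only; the structures and
   function are hypothetical) shows how a fixed keyword set could
   support automated capture selection in a way a free-text
   "Description" field cannot.

      # Hypothetical sketch: selecting captures by a "view" keyword
      # rather than parsing free text. Keywords follow the initial
      # list proposed above; the data structures are illustrative.

      VIEW_KEYWORDS = {"room", "table", "individual", "lectern",
                       "audience"}

      def captures_with_view(captures, wanted_view):
          """Return the captures whose view matches wanted_view."""
          if wanted_view not in VIEW_KEYWORDS:
              raise ValueError(f"unknown view keyword: {wanted_view}")
          return [c for c in captures if c.get("view") == wanted_view]

      captures = [
          {"id": "VC1", "view": "room"},
          {"id": "VC2", "view": "lectern"},
          {"id": "VC3", "view": "audience"},
      ]
      print(captures_with_view(captures, "lectern"))
      # [{'id': 'VC2', 'view': 'lectern'}]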
4.3.  Language

   Captures may be offered in different languages in the case of multi-
   lingual and/or accessible conferences, and it is important to allow
   the remote end to distinguish between them. It is noted that SDP
   already contains a language attribute; however, this may not be
   available at the time an initial CLUE message is sent. Therefore a
   language attribute is needed in CLUE to indicate the language used
   by a capture.

   This attribute indicates which language is associated with the
   capture. For example, it may provide a language associated with an
   audio capture, or a language associated with a video capture when
   sign interpretation or text is used.

   An example where multiple languages may be used is where a capture
   includes multiple conference participants who use different
   languages.

   The possible values for the language tag are the values of the
   'Subtag' column for the "Type: language" entries in the "Language
   Subtag Registry" defined by RFC 5646 [RFC5646].
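   A consumer holding a list of preferred languages might use the
   attribute as in the following hypothetical Python sketch; comparing
   bare primary subtags is a deliberate simplification of full
   RFC 5646 / BCP 47 matching.

      # Hypothetical sketch: choosing an audio capture by the proposed
      # CLUE language attribute. Only primary language subtags are
      # compared, a simplification of full BCP 47 matching.

      def pick_capture_by_language(captures, preferred):
          """Return the first capture matching a preferred language,
          trying the preferences in order."""
          for lang in preferred:
              for cap in captures:
                  # The attribute may be a list (multiple speakers).
                  if lang in cap.get("language", []):
                      return cap
          return None

      captures = [
          {"id": "AC1", "language": ["en"]},
          {"id": "AC2", "language": ["zh", "en"]},
          {"id": "AC3", "language": ["fr"]},
      ]
      print(pick_capture_by_language(captures, ["fr", "en"])["id"])
      # AC3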
4.4.  Role

   The original definition of "Content" allows the indication that a
   particular media stream is related to the speaker; CLUE should also
   allow this identification for captures. In addition, with the advent
   of XCON there may be other formal roles associated with media/
   captures. For instance, a remote end may wish to always view the
   floor controller. It is envisaged that a remote end may also choose
   captures depending on the role of the person or persons captured;
   for example, the people at the remote end may wish to always view
   the chairman.

   This attribute indicates that the capture is associated with an
   entity that has a particular role in the conference. It is possible
   for the attribute to have multiple values where the capture has
   multiple roles.

   The values are grouped into two types: person roles and conference
   roles.

4.4.1.  Person Roles

   These roles are related to the titles of the person or persons
   associated with the capture.

      Manager - indicates that the capture is assigned to a person with
      a senior position.

      Chairman - indicates who the chairman of the meeting is.

      Secretary - indicates that the capture is associated with the
      conference secretary.

      Lecturer - indicates that the capture is associated with the
      conference lecturer.

      Audience - indicates that the capture is associated with the
      conference audience.

      Others

4.4.2.  Conference Roles

   These roles are related to the establishment and maintenance of the
   multimedia conference and are related to the conference system.

      Speaker - indicates that the capture relates to the current
      speaker.

      Controller - indicates that the capture relates to the current
      floor controller of the conference.

      Others

   An example is:

      AC1 [Role=Speaker]
      VC1 [Role=Lecturer,Speaker]
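   Since the attribute is multi-valued, set-based matching on the
   consumer side is natural; the following Python sketch is a
   hypothetical illustration using the example values above.

      # Hypothetical sketch: matching a consumer preference such as
      # "always view the current speaker" against multi-valued roles.
      # Role strings follow the example above; the API is illustrative.

      captures = {
          "AC1": {"roles": {"speaker"}},
          "VC1": {"roles": {"lecturer", "speaker"}},
          "VC2": {"roles": {"audience"}},
      }

      def captures_with_role(captures, role):
          """Return IDs of captures whose role set contains role."""
          return [cid for cid, c in captures.items()
                  if role in c["roles"]]

      print(captures_with_role(captures, "speaker"))  # ['AC1', 'VC1']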
4.5.  Priority

   As has been highlighted in discussions on the CLUE mailing list,
   there appears to be some desire to provide a relative priority
   between captures when multiple alternatives are supplied. This
   priority can be used to determine which captures contain the most
   important information (according to the provider). This may be
   important where the consumer has limited resources and can only
   render a subset of the captures. Priority may also be advantageous
   in congestion scenarios, where media from one capture may be
   favoured over other captures in any control algorithms. Priority
   could be supplied via "ordering" in a CLUE data structure; however,
   this may be problematic if people assume some spatial meaning behind
   ordering. For example, given three captures VC1, VC2 and VC3, it
   would be natural to send VC1, VC2, VC3 if the images are composed
   this way, but if your boss sits in the middle view the priority may
   be VC2, VC1, VC3. Explicit signalling is better.

   Additionally, there are currently no hints as to relative priority
   among captures from different capture scenes. In order to prevent
   any misunderstanding through implicit ordering, a numeric value may
   be assigned to each capture.

   The "priority" attribute indicates a relative priority between
   captures. For example, it is possible to assign a priority between
   two presentation captures, allowing a remote endpoint to determine
   which presentation is more important. Priority is assigned at the
   individual capture level and represents the provider's view of the
   relative priority between captures that carry a priority. The same
   priority number may be used across multiple captures, indicating
   that they are equally important. If no priority is assigned, no
   assumptions regarding the relative importance of the capture can be
   made.
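   The selection this enables might look like the following Python
   sketch (illustrative only). It assumes that a lower number means
   higher priority and that captures without a priority sort last;
   neither convention is specified here.

      # Hypothetical sketch: a consumer with resources for only k
      # captures keeps the k most important ones. Assumed conventions:
      # lower number = higher priority, equal numbers = equal
      # importance, missing priority sorts last.

      def select_top_captures(captures, k):
          """Return up to k captures, most important first."""
          def key(cap):
              # No priority means no relative-importance claim; sort
              # such captures after every prioritised capture.
              return cap.get("priority", float("inf"))
          return sorted(captures, key=key)[:k]

      captures = [
          {"id": "VC1", "priority": 2},
          {"id": "VC2", "priority": 1},   # the boss in the middle view
          {"id": "VC3", "priority": 2},
          {"id": "PC1"},                  # no priority assigned
      ]
      print([c["id"] for c in select_top_captures(captures, 3)])
      # ['VC2', 'VC1', 'VC3']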
4.6.  Others

4.6.1.  Dynamic

   The framework assumes that the capture point is a fixed point within
   a telepresence session. However, depending on the conference
   scenario this may not be the case. In tele-medical or tele-education
   cases a conference may include cameras that move during the
   conference. For example, a camera may be placed at different
   positions in order to provide the best angle to capture a work task,
   or a camera may be worn by a participant. This has the effect of
   changing the capture point, capture axis and area of capture. So
   that the remote endpoint can choose to lay out and render the
   capture appropriately, whether the camera is dynamic should be
   indicated in the initial capture description.

   This attribute indicates that the spatial information related to the
   capture may change during the conference. Captures may be
   characterised as static, dynamic or highly dynamic. The capture
   point of a static capture does not move for the life of the
   conference. The capture point of a dynamic capture is characterised
   by a change in position followed by a reasonable period of
   stability. Highly dynamic captures are characterised by a capture
   point that is constantly moving. This may assist an endpoint in
   determining the correct display layout. If the "area of capture",
   "capture point" and "line of capture" attributes are included with
   dynamic or highly dynamic captures, they indicate the spatial
   information at the time the CLUE message is sent; no information
   regarding future spatial positions should be assumed.
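   A consumer might map the three categories to different rendering
   policies, as in this hypothetical Python sketch; the category names
   follow the text above, while the policy strings are purely
   illustrative.

      # Hypothetical sketch: turning the static/dynamic/highly-dynamic
      # categorisation into a layout decision. The policies shown are
      # illustrative only.

      def layout_hint(capture):
          category = capture.get("dynamic", "static")
          if category == "static":
              return "fixed placement; spatial attributes stay valid"
          if category == "dynamic":
              return "re-evaluate placement after position changes"
          if category == "highly dynamic":
              return "treat spatial attributes as a snapshot only"
          raise ValueError(f"unknown category: {category}")

      print(layout_hint({"id": "VC4", "dynamic": "highly dynamic"}))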
4.6.2.  Embedded Text

   In accessible conferences, textual information may be added to a
   capture before it is transmitted to the remote end. Where multiple
   video captures are presented, the remote end may benefit from the
   ability to choose a video stream containing text over one that does
   not.

   This attribute indicates that a capture provides embedded textual
   information. For example, the video capture may contain speech-to-
   text information composed with the video image. The attribute is
   only applicable to video captures and presentation streams with
   visual information.

   The EmbeddedText attribute contains a language value according to
   RFC 5646 [RFC5646] and may use a script subtag. For example:

      EmbeddedText=zh-Hans

   This indicates embedded text in Chinese written using the simplified
   Chinese script.

4.6.3.  Complementary Feed

   Some conferences utilise translators or facilitators who provide an
   additional audio stream (i.e. a translation or description of the
   conference). These persons may not be pictured in a video capture.
   Where multiple audio captures are presented, it may be advantageous
   for an endpoint to select a complementary stream instead of, or in
   addition to, an audio feed associated with the participants from a
   main video capture.

   This attribute indicates that a capture provides an additional
   description of the conference: for example, an additional audio
   stream that provides a commentary on the conference with
   complementary information (e.g. a translation), or extra information
   for participants in accessible conferences. The complementary feed
   attribute names the capture to which it provides the additional
   information.

   An example is where an additional capture provides a translation of
   another capture:

      AC1 [Language = English]
      AC2 [ComplementaryFeed = AC1, Language=Chinese]
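   Because the attribute names another capture, translation pairs can
   be resolved mechanically, as in this hypothetical Python sketch
   based on the example above.

      # Hypothetical sketch: resolving complementary-feed links so a
      # consumer can offer AC2 (Chinese) as an alternative to AC1
      # (English). Data structures are illustrative only.

      captures = {
          "AC1": {"language": ["en"]},
          "AC2": {"language": ["zh"], "complementary_feed": "AC1"},
      }

      def alternatives_for(captures, capture_id):
          """Return captures declaring themselves complementary to
          capture_id (e.g. translations or descriptions of it)."""
          return [cid for cid, c in captures.items()
                  if c.get("complementary_feed") == capture_id]

      print(alternatives_for(captures, "AC1"))  # ['AC2']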
5.  Summary

   The main proposal is to remove the Content attribute in favour of
   describing the characteristics of captures in a more functional
   (atomic) way, using the attributes above as the metadata describing
   a capture.

6.  Acknowledgements

   This template was derived from an initial version written by Pekka
   Savola and contributed by him to the xml2rfc project.

7.  IANA Considerations

   This memo includes no request to IANA.

8.  Security Considerations

   TBD

9.  Changes and Status Since Last Version

   Changes from 00 to 01:

   1.  Changed source to XML.

   2.  4.1 Presentation: No comments or concerns. No changes.

   3.  4.2 View: No comments or concerns. No changes.

   4.  4.3 Language: There were comments that multiple languages need
       to be supported, e.g. audio in one language, embedded text in
       another. The text needed to be clear whether this is the
       supported or the preferred language; it was clarified that it is
       neither, but rather the language of the content/capture. It was
       also noted that different speakers using different languages
       could talk on the main speaker's capture, therefore language
       should be a list. There seemed to be support for this. Text was
       adapted accordingly.

   5.  4.4 Role: There were a couple of responses in support of this
       attribute. The actual values still need some work. It was noted
       that there are two possible sets of roles: one group related to
       the titles of the person, i.e. boss, chairman, secretary,
       lecturer, audience; another group related to conference
       functions, i.e. conference initiator, controller, speaker. Text
       was adapted accordingly.

   6.  4.5 Priority: No direct comment on the proposal. There appeared
       to be some interest in a prioritisation scheme during
       discussions on the framework. No changes.

   7.  4.6.1 Dynamic: No comments or concerns. No changes.

   8.  4.6.2 Embedded text: There was a comment that a "text" media
       capture was needed. It was also indicated that it should be
       possible to associate a language with embedded text, and to
       specify both language and script, i.e. embedded text could have
       its own language. Text adapted accordingly.

   9.  4.6.3 Supplementary Description: There were comments that it
       could be interpreted as a free-text field. The intention is that
       it is more of a flag; a better name could be "Complementary
       feed". There was also a comment that perhaps a specific
       "translator flag" is needed. It was noted the usage was like:
       AC1 Language=English, or AC2 Supplementary Description = TRUE,
       Language=Chinese. Text updated accordingly.

   10. 4.6.4 Telepresence: There were a couple of comments questioning
       the need for this parameter. Attribute removed.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [I-D.ietf-clue-framework]
              Duckworth, M., Pepperell, A., and S. Wenger, "Framework
              for Telepresence Multi-Streams",
              draft-ietf-clue-framework-08 (work in progress).

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

Authors' Addresses

   Christian Groves (editor)
   Huawei
   Melbourne
   Australia

   Email: Christian.Groves@nteczone.com

   Weiwei Yang
   Huawei
   P.R.China

   Email: tommy@huawei.com

   Roni Even
   Huawei
   Tel Aviv
   Israel

   Email: roni.even@mail01.huawei.com