CLUE                                                      C. Groves, Ed.
Internet-Draft                                                   W. Yang
Intended status: Informational                                   R. Even
Expires: August 22, 2013                                          Huawei
                                                       February 18, 2013

                     CLUE media capture description
                   draft-groves-clue-capture-attr-01

Abstract

   This memo discusses how media captures are described in the current
   CLUE framework document, in particular via the content attribute,
   and proposes several alternative capture description attributes.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

1.  Introduction

   One of the fundamental aspects of the CLUE framework is the concept
   of media captures. Media captures are sent from a provider to a
   consumer. The consumer then selects which captures it is interested
   in and replies back to the provider. The question is: how does the
   consumer choose between what may be many different media captures?

   In order to choose between the different media captures, the
   consumer must have enough information about what each media capture
   represents to distinguish between the captures.

   The CLUE framework draft currently defines several media capture
   attributes which provide information regarding a capture. The draft
   indicates that media capture attributes describe static information
   about the captures. A provider uses the media capture attributes to
   describe the media captures to the consumer; the consumer then
   selects the captures it wants to receive. Attributes are defined by
   a variable and its value.

   One of the media capture attributes is the content attribute. As
   indicated in the draft, it is a field with enumerated values which
   describes the role of the media capture and can be applied to any
   media type. The enumerated values are defined by RFC 4796 [RFC4796]:
   the values for this attribute are the same as the "mediacnt" values
   for the content attribute in RFC 4796 [RFC4796]. The attribute can
   have multiple values, for example content={main, speaker}.
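   As a purely illustrative sketch (not part of any CLUE
   specification), the following Python fragment models the multi-
   valued content attribute as a set drawn from the RFC 4796 value set,
   whose definitions are quoted next; the class and field names are
   hypothetical.

      # Hypothetical sketch: the multi-valued "content" attribute as a
      # set of the RFC 4796 enumerated values. Names are illustrative;
      # the CLUE framework defines no such API.

      RFC4796_CONTENT_VALUES = {"slides", "speaker", "sl", "main",
                                "alt"}

      class MediaCapture:
          def __init__(self, capture_id, media_type, content=()):
              self.capture_id = capture_id
              self.media_type = media_type     # e.g. "audio" or "video"
              self.content = set(content)      # multi-valued attribute
              unknown = self.content - RFC4796_CONTENT_VALUES
              if unknown:
                  raise ValueError(f"unknown content values: {unknown}")

      # A video capture tagged content={main, speaker}, as above.
      vc = MediaCapture("VC1", "video", content={"main", "speaker"})
      print("speaker" in vc.content)           # True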
   RFC 4796 [RFC4796] defines the values as:

      slides: the media stream includes presentation slides. The media
      type can be, for example, a video stream or a number of instant
      messages with pictures. Typical use cases for this are online
      seminars and courses. This is similar to the 'presentation' role
      in H.239.

      speaker: the media stream contains the image of the speaker. The
      media can be, for example, a video stream or a still image.
      Typical use cases for this are online seminars and courses.

      sl: the media stream contains sign language. A typical use case
      for this is an audio stream that is translated into sign
      language, which is sent over a video stream.

   RFC 4796 [RFC4796] also defines the values "main" and "alt",
   indicating that the media stream is taken from the main source or
   from an alternative source; these values are discussed further in
   Section 3.

   Whilst the above values appear to be a simple way of conveying the
   content of a stream, the contributors believe that there are
   multiple issues that make the use of the existing "Content" tag
   insufficient for CLUE and multi-stream telepresence systems. These
   issues are described in Section 3. Section 4 proposes new capture
   description attributes.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   This document draws liberally from the terminology defined in the
   CLUE framework [I-D.ietf-clue-framework].

3.  Issues with Content attribute

3.1.  Ambiguous definition

   There is ambiguity in the definitions that may cause problems for
   interoperability. A clear example is "slides", which could be any
   form of presentation media. Another example is the difference
   between "main" and "alt". In a telepresence scenario the room would
   be captured by the "main" cameras and a speaker would be captured by
   an alternative camera. This runs counter to the definition of "alt".

   Another example is a university use case where:

      The main site is a university auditorium which is equipped with
      three cameras. One camera is focused on the professor at the
      podium. A second camera is mounted on the wall behind the
      professor and captures the class in its entirety. The third
      camera is co-located with the second, and is designed to capture
      a close-up view of a questioner in the audience. It automatically
      zooms in on that student using sound localization.

   For the first camera, it is not clear whether to use "main" or
   "speaker"; according to the definition and example of "speaker" in
   RFC 4796 [RFC4796], it may be more appropriate to use "speaker". The
   third camera could fit the definition of "main", "alt" or "speaker".

3.2.  Multiple functions

   It appears that the definitions cover disparate functions. "Main"
   and "alt" appear to describe the source from which media is sent.
   "Speaker" indicates a role associated with the media stream.
   "Slides" and "sl" (sign language) indicate the actual content. Some
   prioritization is also indirectly applied to these parameters; for
   example, the IMTC document on best practices for H.239 indicates a
   display priority between "main" and "alt". This mixing of functions
   per code point can lead to ambiguous behaviour and interoperability
   problems. It is also an issue when extending the values.

3.3.  Limited Stream Support

   The values above appear to be defined based on the small number of
   video streams typically supported by legacy video conferencing,
   e.g. a main video stream (main), a secondary one (alt) and perhaps a
   presentation stream (slides). It is not clear how these values scale
   when many media streams are present. For example, if there are
   several main streams and several presentation streams, how would an
   endpoint distinguish between them?

3.4.  Insufficient information for individual parameters

   Related to the above point, some individual values do not provide
   sufficient information for an endpoint to make an educated decision
   about the content. For example, sign language (sl): if a conference
   provides multiple streams, each containing an interpretation into a
   different sign language, how does an endpoint distinguish between
   the languages if "sl" is the only label? Accessible services also
   require support for other functions, such as real-time captioning
   and video description, where an additional audio channel is used to
   describe the conference for vision-impaired people.

   Note: SDP provides a language attribute.

3.5.  Insufficient information for negotiation

   CLUE negotiation is likely to occur at the start of session
   initiation. At this point only a very simple SDP description
   (i.e. limited media description) may be available, depending on the
   call flow. In most cases the supported media captures may be agreed
   upon before the full SDP information for each media stream is
   available. The effect of this is that detailed information would not
   be available for the initial decision about which capture to choose.
   The obvious solution is to provide "enough" data in the CLUE
   provider messages so that a consumer can choose the appropriate
   media captures. The current CLUE framework already partly addresses
   this through the "Content" attribute; however, based on the current
   "Content" values it appears that the information is not sufficient
   to fully describe the content of the captures.

   The purpose of the CLUE work is to supply enough information for
   negotiating multiple streams. The CLUE framework
   [I-D.ietf-clue-framework] addresses the spatial relation between the
   streams, but it does not appear to provide enough information about
   the semantic content of the streams to allow interoperability.

   Some information is available in SDP and may be available before the
   CLUE exchange, but some information is still missing.

4.  Capture description attributes

   As indicated above, it is proposed to introduce one or more new
   attributes that allow the definition of various pieces of
   information providing metadata about a particular media capture.
   Each piece of information should be described in a way that supplies
   only one atomic function. It should also be applicable in a multi-
   stream environment, and extensible so that new information elements
   can be introduced in the future.

   As an initial list, the following attributes are proposed for use as
   metadata associated with media captures. Further attributes may be
   identified in the future.

   This document proposes to remove the "Content" attribute. Rather
   than describing the "source device" in this way, it may be better to
   describe its characteristics, i.e.:

      An attribute to indicate "Presentation" rather than the value
      "Slides".

      An attribute to describe the "Role" of a capture rather than the
      value "Speaker".

      An attribute to indicate the actual language used rather than a
      value "Sign Language". This is also applicable to multiple audio
      streams.

      With respect to "main" and "alt", in a multiple-stream
      environment it is not clear that these values are needed if the
      characteristics of the capture are described. An assumption may
      be that a capture is "main" unless described otherwise.

   Note: CLUE may have missed a media type "text". Consider real-time
   captioning or a real-time text conversation associated with a video
   meeting: it is a text-based service, not necessarily a presentation
   stream, and neither audio nor visual, but it is a valid component of
   a conference.

   The sections below contain an initial list of attributes.

4.1.  Presentation

   This attribute indicates that the capture originates from a
   presentation device, that is, one that provides supplementary
   information to a conference through slides, video, still images,
   data, etc. Where more information is known about the capture it may
   be expanded hierarchically to indicate the different types of
   presentation media, e.g. presentation.slides, presentation.image,
   etc.

   Note: It is expected that a number of keywords will be defined that
   provide more detail on the type of presentation.

4.2.  View

   The area-of-capture attribute provides a physical indication of the
   region that a media capture captures; however, the consumer does not
   know what this physical region relates to. In discussions on the
   IETF mailing list it is apparent that some people propose to use the
   "Description" attribute to describe a scene. This is a free-text
   field and as such can be used to signal any piece of information,
   which leads to problems with interoperability if the field is
   automatically processed. For interoperability purposes it is
   therefore proposed to introduce a set of keywords (that may be
   expanded) indicating what the spatial region relates to, e.g. room,
   table, etc., that could be used as a basis for the selection of
   captures. It is envisaged that this list would be extensible to
   allow for future uses not covered by the initial specification. This
   is an initial description of an attribute introducing these
   keywords.

   This attribute provides a textual description of the area that a
   media capture captures. It provides supplementary information in
   addition to the spatial information (i.e. area of capture) regarding
   the region that is captured. A hypothetical selection sketch follows
   the list below.

      Room - Captures the entire scene.

      Table - Captures the conference table with seated participants.

      Individual - Captures an individual participant.

      Lectern - Captures the region of the lectern including the
      presenter in a classroom-style conference.

      Audience - Captures a region showing the audience in a classroom-
      style conference.

      Others - TBD
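   The following Python sketch (illustrative only; the structures and
   function are hypothetical) shows how a fixed keyword set could
   support automated capture selection in a way a free-text
   "Description" field cannot.

      # Hypothetical sketch: selecting captures by a "view" keyword
      # rather than parsing free text. Keywords follow the initial
      # list proposed above; the data structures are illustrative.

      VIEW_KEYWORDS = {"room", "table", "individual", "lectern",
                       "audience"}

      def captures_with_view(captures, wanted_view):
          """Return the captures whose view matches wanted_view."""
          if wanted_view not in VIEW_KEYWORDS:
              raise ValueError(f"unknown view keyword: {wanted_view}")
          return [c for c in captures if c.get("view") == wanted_view]

      captures = [
          {"id": "VC1", "view": "room"},
          {"id": "VC2", "view": "lectern"},
          {"id": "VC3", "view": "audience"},
      ]
      print(captures_with_view(captures, "lectern"))
      # [{'id': 'VC2', 'view': 'lectern'}]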
4.3.  Language

   Captures may be offered in different languages in the case of multi-
   lingual and/or accessible conferences, and it is important to allow
   the remote end to distinguish between them. It is noted that SDP
   already contains a language attribute; however, this may not be
   available at the time an initial CLUE message is sent. Therefore a
   language attribute is needed in CLUE to indicate the language used
   by a capture.

   This attribute indicates which language is associated with the
   capture. For example, it may provide a language associated with an
   audio capture, or a language associated with a video capture when
   sign interpretation or text is used.

   An example where multiple languages may be used is where a capture
   includes multiple conference participants who use different
   languages.

   The possible values for the language tag are the values of the
   'Subtag' column for the "Type: language" entries in the "Language
   Subtag Registry" defined by RFC 5646 [RFC5646].
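   A consumer holding a list of preferred languages might use the
   attribute as in the following hypothetical Python sketch; comparing
   bare primary subtags is a deliberate simplification of full
   RFC 5646 / BCP 47 matching.

      # Hypothetical sketch: choosing an audio capture by the proposed
      # CLUE language attribute. Only primary language subtags are
      # compared, a simplification of full BCP 47 matching.

      def pick_capture_by_language(captures, preferred):
          """Return the first capture matching a preferred language,
          trying the preferences in order."""
          for lang in preferred:
              for cap in captures:
                  # The attribute may be a list (multiple speakers).
                  if lang in cap.get("language", []):
                      return cap
          return None

      captures = [
          {"id": "AC1", "language": ["en"]},
          {"id": "AC2", "language": ["zh", "en"]},
          {"id": "AC3", "language": ["fr"]},
      ]
      print(pick_capture_by_language(captures, ["fr", "en"])["id"])
      # AC3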
4.4.  Role

   The original definition of "Content" allows the indication that a
   particular media stream is related to the speaker; CLUE should also
   allow this identification for captures. In addition, with the advent
   of XCON there may be other formal roles associated with media/
   captures. For instance, a remote end may wish to always view the
   floor controller. It is envisaged that a remote end may also choose
   captures depending on the role of the person or persons captured;
   for example, the people at the remote end may wish to always view
   the chairman.

   This attribute indicates that the capture is associated with an
   entity that has a particular role in the conference. It is possible
   for the attribute to have multiple values where the capture has
   multiple roles.

   The values are grouped into two types: person roles and conference
   roles.

4.4.1.  Person Roles

   These roles are related to the titles of the person or persons
   associated with the capture.

      Manager - indicates that the capture is assigned to a person with
      a senior position.

      Chairman - indicates who the chairman of the meeting is.

      Secretary - indicates that the capture is associated with the
      conference secretary.

      Lecturer - indicates that the capture is associated with the
      conference lecturer.

      Audience - indicates that the capture is associated with the
      conference audience.

      Others

4.4.2.  Conference Roles

   These roles are related to the establishment and maintenance of the
   multimedia conference and are related to the conference system.

      Speaker - indicates that the capture relates to the current
      speaker.

      Controller - indicates that the capture relates to the current
      floor controller of the conference.

      Others

   An example is:

      AC1 [Role=Speaker]
      VC1 [Role=Lecturer,Speaker]
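   Since the attribute is multi-valued, set-based matching on the
   consumer side is natural; the following Python sketch is a
   hypothetical illustration using the example values above.

      # Hypothetical sketch: matching a consumer preference such as
      # "always view the current speaker" against multi-valued roles.
      # Role strings follow the example above; the API is illustrative.

      captures = {
          "AC1": {"roles": {"speaker"}},
          "VC1": {"roles": {"lecturer", "speaker"}},
          "VC2": {"roles": {"audience"}},
      }

      def captures_with_role(captures, role):
          """Return IDs of captures whose role set contains role."""
          return [cid for cid, c in captures.items()
                  if role in c["roles"]]

      print(captures_with_role(captures, "speaker"))  # ['AC1', 'VC1']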
4.5.  Priority

   As has been highlighted in discussions on the CLUE mailing list,
   there appears to be some desire to provide a relative priority
   between captures when multiple alternatives are supplied. This
   priority can be used to determine which captures contain the most
   important information (according to the provider). This may be
   important where the consumer has limited resources and can only
   render a subset of the captures. Priority may also be advantageous
   in congestion scenarios, where media from one capture may be
   favoured over other captures in any control algorithms. Priority
   could be supplied via "ordering" in a CLUE data structure; however,
   this may be problematic if people assume some spatial meaning behind
   ordering. For example, given three captures VC1, VC2 and VC3, it
   would be natural to send VC1, VC2, VC3 if the images are composed
   this way, but if your boss sits in the middle view the priority may
   be VC2, VC1, VC3. Explicit signalling is better.

   Additionally, there are currently no hints as to relative priority
   among captures from different capture scenes. In order to prevent
   any misunderstanding through implicit ordering, a numeric value may
   be assigned to each capture.

   The "priority" attribute indicates a relative priority between
   captures. For example, it is possible to assign a priority between
   two presentation captures, allowing a remote endpoint to determine
   which presentation is more important. Priority is assigned at the
   individual capture level and represents the provider's view of the
   relative priority between captures that carry a priority. The same
   priority number may be used across multiple captures, indicating
   that they are equally important. If no priority is assigned, no
   assumptions regarding the relative importance of the capture can be
   made.
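   The selection this enables might look like the following Python
   sketch (illustrative only). It assumes that a lower number means
   higher priority and that captures without a priority sort last;
   neither convention is specified here.

      # Hypothetical sketch: a consumer with resources for only k
      # captures keeps the k most important ones. Assumed conventions:
      # lower number = higher priority, equal numbers = equal
      # importance, missing priority sorts last.

      def select_top_captures(captures, k):
          """Return up to k captures, most important first."""
          def key(cap):
              # No priority means no relative-importance claim; sort
              # such captures after every prioritised capture.
              return cap.get("priority", float("inf"))
          return sorted(captures, key=key)[:k]

      captures = [
          {"id": "VC1", "priority": 2},
          {"id": "VC2", "priority": 1},   # the boss in the middle view
          {"id": "VC3", "priority": 2},
          {"id": "PC1"},                  # no priority assigned
      ]
      print([c["id"] for c in select_top_captures(captures, 3)])
      # ['VC2', 'VC1', 'VC3']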
4.6.  Others

4.6.1.  Dynamic

   The framework assumes that the capture point is a fixed point within
   a telepresence session. However, depending on the conference
   scenario this may not be the case. In tele-medical or tele-education
   cases a conference may include cameras that move during the
   conference. For example, a camera may be placed at different
   positions in order to provide the best angle to capture a work task,
   or a camera may be worn by a participant. This has the effect of
   changing the capture point, capture axis and area of capture. So
   that the remote endpoint can choose to lay out and render the
   capture appropriately, whether the camera is dynamic should be
   indicated in the initial capture description.

   This attribute indicates that the spatial information related to the
   capture may change during the conference. Captures may be
   characterised as static, dynamic or highly dynamic. The capture
   point of a static capture does not move for the life of the
   conference. The capture point of a dynamic capture is characterised
   by a change in position followed by a reasonable period of
   stability. Highly dynamic captures are characterised by a capture
   point that is constantly moving. This may assist an endpoint in
   determining the correct display layout. If the "area of capture",
   "capture point" and "line of capture" attributes are included with
   dynamic or highly dynamic captures, they indicate the spatial
   information at the time the CLUE message is sent; no information
   regarding future spatial positions should be assumed.
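   A consumer might map the three categories to different rendering
   policies, as in this hypothetical Python sketch; the category names
   follow the text above, while the policy strings are purely
   illustrative.

      # Hypothetical sketch: turning the static/dynamic/highly-dynamic
      # categorisation into a layout decision. The policies shown are
      # illustrative only.

      def layout_hint(capture):
          category = capture.get("dynamic", "static")
          if category == "static":
              return "fixed placement; spatial attributes stay valid"
          if category == "dynamic":
              return "re-evaluate placement after position changes"
          if category == "highly dynamic":
              return "treat spatial attributes as a snapshot only"
          raise ValueError(f"unknown category: {category}")

      print(layout_hint({"id": "VC4", "dynamic": "highly dynamic"}))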
4.6.2.  Embedded Text

   In accessible conferences, textual information may be added to a
   capture before it is transmitted to the remote end. Where multiple
   video captures are presented, the remote end may benefit from the
   ability to choose a video stream containing text over one that does
   not.

   This attribute indicates that a capture provides embedded textual
   information. For example, the video capture may contain speech-to-
   text information composed with the video image. The attribute is
   only applicable to video captures and presentation streams with
   visual information.

   The EmbeddedText attribute contains a language value according to
   RFC 5646 [RFC5646] and may use a script subtag. For example:

      EmbeddedText=zh-Hans

   This indicates embedded text in Chinese written using the simplified
   Chinese script.

4.6.3.  Complementary Feed

   Some conferences utilise translators or facilitators who provide an
   additional audio stream (i.e. a translation or description of the
   conference). These persons may not be pictured in a video capture.
   Where multiple audio captures are presented, it may be advantageous
   for an endpoint to select a complementary stream instead of, or in
   addition to, an audio feed associated with the participants from a
   main video capture.

   This attribute indicates that a capture provides an additional
   description of the conference: for example, an additional audio
   stream that provides a commentary on the conference with
   complementary information (e.g. a translation), or extra information
   for participants in accessible conferences. The complementary feed
   attribute names the capture to which it provides the additional
   information.

   An example is where an additional capture provides a translation of
   another capture:

      AC1 [Language = English]
      AC2 [ComplementaryFeed = AC1, Language=Chinese]
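   Because the attribute names another capture, translation pairs can
   be resolved mechanically, as in this hypothetical Python sketch
   based on the example above.

      # Hypothetical sketch: resolving complementary-feed links so a
      # consumer can offer AC2 (Chinese) as an alternative to AC1
      # (English). Data structures are illustrative only.

      captures = {
          "AC1": {"language": ["en"]},
          "AC2": {"language": ["zh"], "complementary_feed": "AC1"},
      }

      def alternatives_for(captures, capture_id):
          """Return captures declaring themselves complementary to
          capture_id (e.g. translations or descriptions of it)."""
          return [cid for cid, c in captures.items()
                  if c.get("complementary_feed") == capture_id]

      print(alternatives_for(captures, "AC1"))  # ['AC2']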
5.  Summary

   The main proposal is to remove the Content attribute in favour of
   describing the characteristics of captures in a more functional
   (atomic) way, using the attributes above as the metadata describing
   a capture.

6.  Acknowledgements

   This template was derived from an initial version written by Pekka
   Savola and contributed by him to the xml2rfc project.

7.  IANA Considerations

   This memo includes no request to IANA.

8.  Security Considerations

   TBD

9.  Changes and Status Since Last Version

   Changes from 00 to 01:

   1.  Changed source to XML.

   2.  4.1 Presentation: No comments or concerns. No changes.

   3.  4.2 View: No comments or concerns. No changes.

   4.  4.3 Language: There were comments that multiple languages need
       to be supported, e.g. audio in one language, embedded text in
       another. The text needed to be clear whether this is the
       supported or the preferred language; it was clarified that it is
       neither, but rather the language of the content/capture. It was
       also noted that different speakers using different languages
       could talk on the main speaker's capture, therefore language
       should be a list. There seemed to be support for this. Text was
       adapted accordingly.

   5.  4.4 Role: There were a couple of responses in support of this
       attribute. The actual values still need some work. It was noted
       that there are two possible sets of roles: one group related to
       the titles of the person, i.e. boss, chairman, secretary,
       lecturer, audience; another group related to conference
       functions, i.e. conference initiator, controller, speaker. Text
       was adapted accordingly.

   6.  4.5 Priority: No direct comment on the proposal. There appeared
       to be some interest in a prioritisation scheme during
       discussions on the framework. No changes.

   7.  4.6.1 Dynamic: No comments or concerns. No changes.

   8.  4.6.2 Embedded text: There was a comment that a "text" media
       capture was needed. It was also indicated that it should be
       possible to associate a language with embedded text, and to
       specify both language and script, i.e. embedded text could have
       its own language. Text adapted accordingly.

   9.  4.6.3 Supplementary Description: There were comments that it
       could be interpreted as a free-text field. The intention is that
       it is more of a flag; a better name could be "Complementary
       feed". There was also a comment that perhaps a specific
       "translator flag" is needed. It was noted the usage was like:
       AC1 Language=English, or AC2 Supplementary Description = TRUE,
       Language=Chinese. Text updated accordingly.

   10. 4.6.4 Telepresence: There were a couple of comments questioning
       the need for this parameter. Attribute removed.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

10.2.  Informative References

   [I-D.ietf-clue-framework]
              Duckworth, M., Pepperell, A., and S. Wenger, "Framework
              for Telepresence Multi-Streams",
              draft-ietf-clue-framework-08 (work in progress).

   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.

Authors' Addresses

   Christian Groves (editor)
   Huawei
   Melbourne
   Australia

   Email: Christian.Groves@nteczone.com

   Weiwei Yang
   Huawei
   P.R.China

   Email: tommy@huawei.com

   Roni Even
   Huawei
   Tel Aviv
   Israel

   Email: roni.even@mail01.huawei.com