CLUE WG                                                       A. Romanow
Internet-Draft                                             Cisco Systems
Intended status: Informational                                 S. Botzko
Expires: May 3, 2012                                             Polycom
                                                        October 31, 2011

              Requirements for Telepresence Multi-Streams
            draft-ietf-clue-telepresence-requirements-01.txt

Abstract

   This memo discusses the requirements for a specification that
   enables telepresence interoperability by describing the relationship
   between multiple RTP streams.  The problem statement and related
   definitions are also covered herein.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Definitions
   4.  Problem Statement
   5.  Requirements
   6.  Acknowledgements
   7.  IANA Considerations
   8.  Security Considerations
   9.  Informative References
   Appendix A.  Open issues
   Appendix B.  Changes From Earlier Versions
     B.1.  Changes From Draft -00
   Authors' Addresses

1.  Introduction

   Telepresence systems greatly improve collaboration.  In a
   telepresence conference (as used herein), the goal is to create an
   environment that gives the users a feeling of (co-located) presence
   - the feeling that a local user is in the same room with other local
   users and the remote parties.  Currently, systems from different
   vendors often do not interoperate because they do the same tasks
   differently, as discussed in the Problem Statement section below.

   The approach taken in this memo is to set requirements for a future
   specification that, when fulfilled by an implementation, provides
   for interoperability between IETF-protocol-based telepresence
   systems.  It is anticipated that a solution for the requirements set
   out in this memo will likely involve the exchange of adequate
   information about participating sites, information that is not
   currently standardized by the IETF.

   The purpose of this document is to describe the requirements for a
   specification that enables interworking between different SIP-based
   [RFC3261] telepresence systems by exchanging and negotiating
   appropriate information.  Systems based on non-IETF protocols, such
   as ITU-T Rec. H.323, are out of scope.  These requirements are for
   the specification; they are not requirements on the telepresence
   systems implementing the solution/protocol that will be specified.

   Today, telepresence systems of different vendors can follow
   radically different architectural approaches while offering a
   similar user experience.  It is not the intention of CLUE to dictate
   telepresence architectural and implementation choices.  CLUE enables
   interoperability between telepresence systems by exchanging
   information about the systems' characteristics.  Systems can use
   this information to control their behavior so as to interoperate.

   A telepresence session requires at least one sending and one
   receiving endpoint.  Most telepresence endpoints are full-duplex in
   that they both send and receive.  Some telepresence sessions,
   especially multiparty ones, include more than two endpoints, as well
   as centralized infrastructure such as Multipoint Control Units
   (MCUs) or equivalent.  CLUE specifies the syntax, semantics, and
   control flow of information to enable the best possible user
   experience at those endpoints.

   Sending endpoints, or MCUs, are not mandated to use any of the CLUE
   specifications that describe their capabilities, attributes, or
   behavior.  Similarly, it is not envisioned that endpoints or MCUs
   must ever take into account information received.  However, by
   making available as much information as possible, and by taking
   into account as much information as has been received or exchanged,
   MCUs and endpoints are expected to select operation modes that
   enable the best possible user experience under their constraints.

   The document structure is as follows: definitions are set out,
   followed by a description of the problem of telepresence
   interoperability that led to this work.  Then the requirements for
   a specification addressing the current shortcomings are enumerated
   and discussed.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Definitions

   The definitions are from draft-wenger-clue-definitions-01.txt.  The
   editor's notes are not included here.

   Audio Mixing: refers to the accumulation of scaled audio signals to
   produce a single audio stream.  See RTP Topologies [RFC5117].

   Conference: used as defined in [RFC4353], A Framework for
   Conferencing within the Session Initiation Protocol (SIP).

   Endpoint: the logical point of final termination through receiving,
   decoding and rendering, and/or initiation through capturing,
   encoding, and sending of media streams.  An endpoint consists of
   one or more physical devices which source and sink media streams,
   and exactly one [RFC4353] Participant (which, in turn, includes
   exactly one SIP User Agent).  In contrast to an endpoint, an MCU
   may also send and receive media streams, but it is not the
   initiator nor the final terminator in the sense that Media is
   Captured or Rendered.  Endpoints can be anything from multiscreen/
   multicamera rooms to handheld devices.

   Endpoint Characteristics: include placement of Capture and
   Rendering Devices, capture/render angle, resolution of cameras and
   screens, spatial location, and mixing parameters of microphones.
   Endpoint characteristics are not specific to individual media
   streams sent by the endpoint.

   Layout: how rendered media streams are spatially arranged with
   respect to each other on a single-screen/mono-audio telepresence
   endpoint, and how rendered media streams are arranged with respect
   to each other on a multiple-screen/speaker telepresence endpoint.
   Note that audio as well as video is encompassed by the term layout
   - in other words, included is the placement of audio streams on
   speakers as well as video streams on video screens.

   Left: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].

   Local: sender and/or receiver physically co-located ("local") in
   the context of the discussion.

   MCU: Multipoint Control Unit (MCU) - a device that connects two or
   more endpoints together into one single multimedia conference
   [RFC5117].  An MCU includes an [RFC4353] Mixer.

   Media: any data that, after suitable encoding, can be conveyed over
   RTP, including audio, video, or timed text.

   Model: a set of assumptions a telepresence system of a given vendor
   adheres to and expects the remote telepresence system(s) also to
   adhere to.

   Remote: sender and/or receiver on the other side of the
   communication channel (depending on context); not Local.  A remote
   can be an Endpoint or an MCU.

   Render: the process of generating a representation from a media,
   such as displayed motion video or sound emitted from loudspeakers.

   Right: to be interpreted as a stage direction, see also
   [StageDirection(Wikipedia)].

   Telepresence: an environment that gives non-co-located users or
   user groups a feeling of (co-located) presence - the feeling that a
   Local user is in the same room with other Local users and the
   Remote parties.  The inclusion of Remote parties is achieved
   through multimedia communication including at least audio and video
   signals of high fidelity.

4.  Problem Statement

   In order to create the "being there" or telepresence experience,
   media inputs need to be transported, received, and coordinated.
   Different telepresence systems take diverse approaches in crafting
   a solution, or implement similar solutions quite differently.  They
   use disparate techniques, and they describe, control, and negotiate
   media in dissimilar fashions.  Such diversity creates an
   interoperability problem.  The same issues are solved in different
   ways by different systems, so that they are not directly
   interoperable.  This makes interworking difficult at best and
   sometimes impossible.

   Worse, many telepresence systems use proprietary protocol
   extensions to solve telepresence-related problems, even if those
   extensions are based on common standards such as SIP.

   Some degree of interworking between systems from different vendors
   is possible through transcoding and translation.  This requires
   additional devices, which are expensive, often not entirely
   automatic, and sometimes introduce unwelcome side effects such as
   additional delay or degraded performance.  Specialized knowledge is
   currently required to operate a telepresence conference with
   endpoints from different vendors, for example to configure
   transcoding and translating devices.  Often such conferences do not
   commence as planned, or are interrupted by difficulties that arise.

   The general problem that needs to be solved can be described as
   follows.  Today, the transmitting side sends audio and video
   captures based upon an implicitly assumed model for rendering a
   realistic depiction from this information.  If the receiving side
   belongs to the same vendor, it works with the same model and
   renders the information according to the model implicitly assumed
   by the vendor.  However, if the receiver and the sender are from
   different vendors, the models they each have for rendering presence
   can and usually do differ.  The result can be that the telepresence
   systems actually connect, but the user experience suffers, for
   example because one system assumes that the first video stream
   stems from the right camera, whereas the other assumes the first
   video stream stems from the left camera.

   It is as if Alice and Bob are at different sites.  Alice needs to
   tell Bob information about what her camera and sound equipment see
   at her site so that Bob's receiver can create a display that will
   capture the important characteristics of her site.  Alice and Bob
   need to agree on what the salient characteristics are, as well as
   how to represent and communicate them.  Characteristics include
   number, placement, capture/render angle, and resolution of cameras
   and screens, and spatial location and audio mixing parameters of
   microphones.

   The telepresence multi-stream work seeks to describe the sender's
   situation in a way that allows the receiver to render it
   realistically even though it may have a different rendering model
   than the sender, and to let the receiver provide information to the
   sender that helps the sender create adequate content for
   interworking.
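   As a non-normative illustration of the kind of information exchange
   this work targets, the following Python sketch shows a sender
   labeling its video captures with explicit spatial indices instead
   of relying on an implicitly assumed model.  The sketch is
   illustrative only; all names are hypothetical and do not anticipate
   the syntax of the eventual specification.

      # Hypothetical sketch: the sender describes its captures
      # explicitly rather than assuming the receiver shares its
      # implicit model (e.g., "first stream = leftmost camera").
      from dataclasses import dataclass

      @dataclass
      class VideoCaptureDescription:
          stream_id: str         # the RTP stream being described
          horizontal_index: int  # 0 = leftmost, as a stage direction

      captures = [
          VideoCaptureDescription("v1", 2),  # right camera
          VideoCaptureDescription("v2", 0),  # left camera
          VideoCaptureDescription("v3", 1),  # center camera
      ]

      # The receiver orders rendered images by the advertised indices
      # rather than by stream order, preserving left and right.
      render_order = sorted(captures, key=lambda c: c.horizontal_index)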
5.  Requirements

   Although some aspects of these requirements can be met by existing
   technology, such as SDP or H.264, we state them here to have a
   complete record of the requirements for CLUE, whether new work is
   needed or they can be met by existing technology.  Figuring this
   out will be part of the solution development, rather than part of
   the requirements.

   REQMT-1:   The solution MUST support a description of the spatial
              arrangement of source video images sent in video streams
              which enables a satisfactory reproduction at the
              receiver of the original scene.  This applies to each
              site in a point-to-point or a multipoint meeting and
              refers to the spatial ordering within a site, not to the
              ordering of images between sites.

              Use case point to point symmetric, and all other use
              cases.

              REQMT-1a:  The solution MUST support a means of allowing
                         the preservation of the order of images in
                         the captured scene.  For example, if John is
                         to Susan's right in the image capture, John
                         is also to Susan's right in the rendered
                         image.

              REQMT-1b:  The solution MUST support a means of allowing
                         the preservation of order of images in the
                         scene in two dimensions - horizontal and
                         vertical.

   REQMT-2:   The solution MUST support a description of the spatial
              arrangement of captured source audio sent in audio
              streams which enables a satisfactory reproduction at the
              receiver in a spatially correct manner.  This applies to
              each site in a point-to-point or a multipoint meeting
              and refers to the spatial ordering within a site, not
              the ordering of channels between sites.

              Use case point to point symmetric, and all use cases,
              especially heterogeneous.

              REQMT-2a:  The solution MUST support a means of
                         preserving the spatial order of audio in the
                         captured scene.  For example, if John sounds
                         as if he is at Susan's right in the captured
                         audio, John's voice is also placed at Susan's
                         right in the rendered audio.

              REQMT-2b:  The solution MUST support a means to identify
                         the number and spatial arrangement of audio
                         channels, including monaural, stereophonic
                         (2.0), and 3.0 (left, center, right) audio
                         channels.

              REQMT-2c:  The solution MUST NOT preclude the use of
                         binaural audio.  [Edt: This is an outstanding
                         issue.  Text will be changed when the issue
                         is resolved.]
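   As a non-normative illustration of REQMT-2b, the following Python
   sketch enumerates the audio channel arrangements named in the
   requirement.  The names are hypothetical and illustrative only.

      # Hypothetical sketch: identifying the number and spatial
      # arrangement of audio channels (REQMT-2b).
      from enum import Enum

      class AudioChannelLayout(Enum):
          MONAURAL   = (1, ("center",))
          STEREO_2_0 = (2, ("left", "right"))
          THREE_0    = (3, ("left", "center", "right"))

          def __init__(self, channel_count, positions):
              self.channel_count = channel_count
              self.positions = positions  # stage directions

      layout = AudioChannelLayout.THREE_0
      print(layout.channel_count, layout.positions)
      # -> 3 ('left', 'center', 'right')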
   REQMT-3:   The solution MUST support a mechanism to enable a
              satisfactory spatial matching between audio and video
              streams coming from the same endpoints.

              Use case point to point symmetric, and all use cases.

              REQMT-3a:  The solution MUST enable individual audio
                         streams to be associated with one or more
                         video image captures, and individual video
                         image captures to be associated with one or
                         more audio captures, for the purpose of
                         rendering proper position.

              REQMT-3b:  The solution MUST enable individual audio
                         streams to be rendered in any desired spatial
                         position.  [Edt: Rendering is an open issue.
                         Text will be changed when it is resolved.]
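   As a non-normative illustration of REQMT-3a, the following Python
   sketch shows a many-to-many association between audio captures and
   video image captures.  The identifiers are hypothetical.

      # Hypothetical sketch: associating audio streams with video
      # image captures (and vice versa) for spatial matching.
      audio_to_video = {
          "a1": ["v1"],        # left microphone covers the left view
          "a2": ["v1", "v2"],  # center microphone spans two views
      }

      def videos_for_audio(audio_id):
          """Video captures against which an audio stream should be
          spatially positioned when rendered."""
          return audio_to_video.get(audio_id, [])

      def audios_for_video(video_id):
          """The inverse association, derived from the same table."""
          return [a for a, v in audio_to_video.items() if video_id in v]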
   REQMT-4:   The solution MUST enable interoperability between
              endpoints that have a different number of similar
              devices.  For example, one endpoint may have 1 screen, 1
              speaker, 1 camera, and 1 mic, and another endpoint may
              have 3 screens, 2 speakers, 3 cameras, and 2 mics.  Or,
              in a multipoint conference, one endpoint may have one
              screen, another may have 2 screens, and a third may have
              3 screens.  This includes endpoints where the number of
              devices of a given type is zero.

              Use case is asymmetric point to point and multipoint.

   REQMT-5:   The solution MUST support means of enabling
              interoperability between telepresence endpoints where
              cameras are of different picture aspect ratios.

   REQMT-6:   The solution MUST provide scaling information which
              enables rendering of a video image at the actual size of
              the captured scene.
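   As a non-normative illustration of REQMT-6, the following Python
   sketch shows the arithmetic a receiver could apply if the sender
   advertises the physical width of the captured scene.  The figures
   are invented for the example.

      # Hypothetical sketch: rendering a captured scene at actual
      # size from advertised scaling information (REQMT-6).
      captured_scene_width_mm = 2400.0  # advertised by the sender
      display_width_mm = 1200.0         # local display geometry
      display_width_px = 1920

      # Pixel width needed for the scene to appear life-size locally:
      life_size_px = captured_scene_width_mm * (display_width_px /
                                                display_width_mm)
      print(life_size_px)  # 3840.0 px: only half the scene fits at
                           # actual size, so the receiver must crop,
                           # scale, or span additional screens.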
   REQMT-7:   The solution MUST support means of enabling
              interoperability between telepresence endpoints where
              displays are of different resolutions.

   REQMT-8:   The solution MUST support methods for handling different
              bit rates in the same conference.

   REQMT-9:   The solution MUST support means of enabling
              interoperability between endpoints that send and receive
              different numbers of media streams.

              Use case heterogeneous and multipoint.

   REQMT-10:  The solution MUST make it possible for endpoints without
              support for telepresence extensions to participate in a
              telepresence session with those that do.

   REQMT-11:  The solution MUST support a mechanism for determining
              whether or not an endpoint or MCU is capable of
              telepresence extensions.
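   As a non-normative illustration of REQMT-10 and REQMT-11, the
   following Python sketch shows a system that first determines
   whether the remote party supports the telepresence extensions and
   otherwise falls back to an ordinary SIP audio/video call.  The
   feature-tag mechanism shown is hypothetical; the actual mechanism
   is left to the solution.

      # Hypothetical sketch: capability determination (REQMT-11) with
      # fallback for non-telepresence endpoints (REQMT-10).
      def negotiate(remote_capabilities):
          """remote_capabilities: feature tags learned during session
          setup (mechanism to be defined by the solution)."""
          if "telepresence" in remote_capabilities:
              return "clue"       # exchange multi-stream descriptions
          return "basic-call"     # interoperate as a plain SIP endpoint

      assert negotiate({"telepresence", "audio", "video"}) == "clue"
      assert negotiate({"audio", "video"}) == "basic-call"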
   REQMT-12:  The solution MUST support a means to enable more than
              two sites to participate in a teleconference.

              Use case multipoint.

   REQMT-13:  The solution MUST support both transcoding and switching
              approaches to providing multipoint conferences.

   REQMT-14:  The solution MUST support mechanisms to make possible
              either site switching, segment switching, or both.
              [Edt: This needs rewording.  Deferred until the layout
              discussion is resolved.]

   REQMT-15:  The solution MUST support mechanisms for presentations
              in such a way that:

              *  Presentations can have different sources

              *  Presentations can be seen by all

              *  There can be variation in placement, number, and size
                 of presentations

   REQMT-16:  The solution MUST include extensibility mechanisms.

6.  Acknowledgements

   This draft has benefitted from all the comments on the mailing list
   and a number of discussions.  So many people contributed that it is
   not possible to list them all.

7.  IANA Considerations

   TBD

8.  Security Considerations

   TBD

9.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353, February
              2006.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC
              5117, January 2008.

   [StageDirection(Wikipedia)]
              Wikipedia, "Blocking (stage)", May 2011,
              <http://en.wikipedia.org/wiki/
              Stage_direction#Stage_directions>.

Appendix A.  Open issues

   OPEN-1  Binaural Audio [REQMT-2c]: The need to support binaural
           audio is unresolved, and the "MUST NOT preclude" language
           in this requirement is problematic.  The authors believe
           this requirement needs to be either changed or withdrawn,
           depending on how the issue is resolved.

   OPEN-2  Reference to Rendering [REQMT-3b]: This is the only
           requirement which refers to rendering.  It may also be
           empty, since receivers can render audio captures as they
           wish.  This is deferred until the broader discussion on
           rendering requirements is concluded.

   OPEN-3  Conference modes [REQMT-14]: The wording of this
           requirement is problematic, in part because the conference
           modes (site switching and segment switching) are not
           defined.  It at least needs rewording.  This is deferred
           until the broader discussion on layout is concluded.

   OPEN-4  Need to capture the requirement that attributes can change
           at any time during the call.

   OPEN-5  Need to add a requirement for three dimensions in the right
           place.

   OPEN-6  Multi-view: is a requirement needed?

Appendix B.  Changes From Earlier Versions

   Note to the RFC Editor: please remove this section prior to
   publication as an RFC.

B.1.  Changes From Draft -00

   o  Requirement #2, "The solution MUST support a means to identify
      monaural, stereophonic (2.0), and 3.0 (left, center, right)
      audio channels."

      changed to

      "The solution MUST support a means to identify the number and
      spatial arrangement of audio channels including monaural,
      stereophonic (2.0), and 3.0 (left, center, right) audio
      channels."

   o  Added back references to the Use case document.

      *  Requirement #1: Use case point to point symmetric, and all
         other use cases.

      *  Requirement #2: Use case point to point symmetric, and all
         use cases, especially heterogeneous.

      *  Requirement #3: Use case point to point symmetric, and all
         use cases.

      *  Requirement #4: Use case is asymmetric point to point, and
         multipoint.

      *  Requirement #9: Use case heterogeneous and multipoint.

      *  Requirement #12: Use case multipoint.

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA 95134
   USA

   Email: allyn@cisco.com

   Stephen Botzko
   Polycom
   Andover, MA 01810
   US

   Email: stephen.botzko@polycom.com