CLUE                                                          A. Romanow
Internet-Draft                                                 R. Hansen
Intended status: Standards Track                           Cisco Systems
Expires: December 2, 2012                                   A. Pepperell
                                                             Silverflare
                                                              B. Baldino
                                                           Cisco Systems
                                                            May 31, 2012


   The Need for an Audio Rendering Tag Mechanism in the CLUE Framework
                draft-romanow-clue-audio-rendering-tag-00

Abstract

   The purpose of this draft is to stimulate discussion in the CLUE
   working group.

   It proposes adding an audio rendering tag to the CLUE framework
   (draft-ietf-clue-framework), which makes it possible for the
   consumer to correctly render audio with respect to video in a
   multistream video conference.  The proposed solution is a partial
   response to CLUE Task #10, "Does the framework provide sufficient
   information for the receiver?"

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 2, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Motivation: the issue
   2.  Terminology
   3.  Audio Rendering Tag Mechanism
   4.  Use of the RTP header extension
   5.  Use case note
   6.  Security Considerations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Motivation: the issue

   A goal for CLUE audio is that listeners perceive the direction of a
   sound source to be the same as that of the visual image of the
   source; this is referred to as directional audio.  In some
   situations the existing CLUE mechanisms are adequate: when the
   provider advertisement includes spatial information (point of origin
   and capture area) giving a static relationship between video and
   associated audio captures, the consumer can use that information to
   place the audio correctly.

   However, in some circumstances the audio and/or video spatial
   information is not sent in the provider advertisement.  Consider,
   for instance, a three-screen system advertising three video captures
   and one switched audio capture, where the audio is switched from the
   loudest of three microphones.
   In this case, how will the consumer know how to associate the audio
   with the correct video so that it can be played out in the correct
   location?

   Here we suggest a simple mechanism: audio rendering tagging.

   When audio and video cannot be matched through spatial information
   in the provider advertisement, we would like the ability to play out
   audio on multiple loudspeakers, matching the position of the speaker
   in the original scene.  Audio may also be assigned to a loudspeaker
   in real time, and may need to be mixed locally and played out on any
   loudspeaker.  For example, if the consumer wants to hear the top
   three speakers regardless of where they are located remotely, and
   all three happen to be on the left, then the three audio streams
   need to be mixed, perhaps locally, and played out on the left.

   Note: Several typical scenarios are described in the use case note
   near the end of this document (Section 5).

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119] and indicate requirement levels for compliant
   implementations.

3.  Audio Rendering Tag Mechanism

   We propose an audio tagging mechanism in order to cope with a
   changing mapping between the most significant audio and video
   participants (i.e., normal MCU operation in the presence of more
   participants' media streams than can be rendered simultaneously) and
   to get audio played out correctly on multiple loudspeakers.  A
   consumer optionally tells the provider an audio tag value for each
   of its chosen video captures, which enables received audio to be
   associated with the correct video stream even when the set of
   audible participants changes.
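   As a non-normative sketch of this consumer-side association (the
   class, method, and position names below are illustrative only and
   not part of the CLUE framework):

```python
# Non-normative sketch: the consumer assigns a tag to each requested
# video capture and remembers the local playout position it will use
# for audio carrying that tag.

class AudioTagMap:
    def __init__(self):
        self._positions = {}  # audio tag -> local playout position

    def request_capture(self, video_capture, tag, position):
        # The consumer configure message carries "capture, tag" pairs,
        # e.g. VC1 with ATag1; the position is purely local state.
        self._positions[tag] = position

    def positions_for(self, tags):
        # 'tags' is the (possibly empty) list of audio tags found in a
        # received audio packet's RTP header extension.
        if not tags:
            return None  # untagged audio: no video association
        # Streams sharing a tag (e.g. AC1, AC2, AC3 all tagged ATag1)
        # are mixed and played out at the same position.
        return [self._positions[t] for t in tags if t in self._positions]


tag_map = AudioTagMap()
tag_map.request_capture("VC1", 1, "left")
tag_map.request_capture("VC2", 2, "center")
tag_map.request_capture("VC3", 3, "right")

assert tag_map.positions_for([1]) == ["left"]            # play on left
assert tag_map.positions_for([1, 2]) == ["left", "center"]
assert tag_map.positions_for([]) is None                 # audio-only
```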
   This information is included with the consumer request, so no
   additional CLUE message exchanges are needed (specifically, no
   additional provider capture advertisements or consumer requests).

   The audio tags are defined in the consumer request, as opposed to in
   a capture advertised by the provider.  The reason for this is that
   it is valid for a consumer to request a capture multiple times (with
   different encodings, for example), and hence a method is required
   for differentiating between these streams.

   When the consumer configures the provider, saying which captures it
   wants, it also optionally includes an audio tag with each capture
   request; for example, VC1, ATag1; VC2, ATag2.  When the provider
   sends audio packets to the consumer, it includes the appropriate
   audio tag in an RTP header extension.  For example, if the provider
   is sending audio packets that are associated with VC1, it tags the
   packets with ATag1.  The consumer can then play out the audio in a
   position appropriate for video from VC1.

   Suppose that several audio streams need to be played out through the
   same loudspeaker -- for example, three audio streams (AC1, AC2, AC3)
   need to be played out at the loudspeaker associated with VC1.  The
   provider would send:

      AC1  ATag1
      AC2  ATag1
      AC3  ATag1

   AC1, AC2, and AC3 are all played out on the same loudspeaker, the
   audio output associated with VC1.  This takes care of the issue of
   dynamic audio output: assigning the right loudspeaker to audio
   streams.

   Figure 1 illustrates an example showing 3 screens, each with a main
   video and 3 PIPs.  Below each screen is a list of the video captures
   (VCs) with the associated audio tag.
   ----------------------- 3 Screens -----------------------

   +------------------+-------------------+------------------+
   |                  |                   |                  |
   |       VC1        |        VC2        |       VC3        |
   |                  |                   |                  |
   |                  |                   |                  |
   |  +---+---+---+   |  +---+---+---+    | +----+----+----+ |
   |  |VC4|VC5|VC6|   |  |VC7|VC8|VC9|    | |VC10|VC11|VC12| |
   +------------------+-------------------+------------------+

    VC1                 VC2                 VC3
    VC4  Audio Tag 1    VC7  Audio Tag 2    VC10  Audio Tag 3
    VC5                 VC8                 VC11
    VC6                 VC9                 VC12

          Figure 1: Audio rendering tags for 3 screen example

   The provider may choose not to include the extension header in an
   audio packet, signaling that there is no association between the
   current audio and current video (e.g., an audio-only participant).
   It may also include more than one audio tag in the extension header,
   signaling that this audio is associated with multiple current video
   participants, perhaps because a capture is being received multiple
   times at different resolutions, or because two video captures both
   include the current speaker.

   This mechanism also allows multiple audio streams to be associated
   with a single video stream (e.g., for a composed video stream); this
   simply requires the appropriate audio packets to be tagged with the
   same tag.

4.  Use of the RTP header extension

   We propose that audio tags are integers between 0 and 255,
   optionally set by the consumer per requested capture.  This allows
   up to 16 tags to be included in a one-byte RTP header extension
   [RFC5285].  An example header extension for an audio packet with one
   tag follows; the audio tag extension is ID1.
   The example includes another header extension (ID0) to show how the
   proposal would interact with [I-D.lennox-clue-rtp-usage]:

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      0xBE     |      0xDE     |           length=1            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID0  |  L=0  |     data      |  ID1  |  L=0  |      Tag      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      RTP extension headers for audio rendering tag and capture ID

   The absence of the RTP header extension in a packet means that the
   audio packet is not associated with any of the requested video
   streams that included audio tags.

5.  Use case note

   o  An endpoint can receive multiple video and audio streams and
      render complex layouts locally.
   o  It may have a wide display area, so directional audio is
      important.
   o  It may have one loudspeaker per display, or perhaps some entirely
      different multi-loudspeaker setup known only to the endpoint
      itself.
   o  The endpoint may therefore have the capability of playing back
      audio from a wide range of positions, either from a few fixed
      zones or with fine granularity, and either by routing a sound
      source to a single loudspeaker, by panning between pairs of
      loudspeakers, or by some other advanced distribution scheme
      involving several or even all loudspeakers.

6.  Security Considerations

   TBD

7.  Acknowledgements

   Thanks to Johan Nielsen for discussions and for adding the use case
   note.

8.  IANA Considerations

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

9.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B.
              Baldino, "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-05 (work in progress),
              May 2012.

   [I-D.lennox-clue-rtp-usage]
              Lennox, J., Witty, P., and A. Romanow, "Real-Time
              Transport Protocol (RTP) Usage for Telepresence
              Sessions", draft-lennox-clue-rtp-usage-03 (work in
              progress), March 2012.

Authors' Addresses

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com


   Robert Hansen
   Cisco Systems
   Langley
   UK

   Email: rohanse2@cisco.com


   Andy Pepperell
   Silverflare

   Email: andy.pepperell@silverflare.com


   Brian Baldino
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: bbaldino@cisco.com
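   As a non-normative companion to Section 4, the following sketch
   shows how a one-byte header extension carrying an audio tag element
   could be built and parsed per the general mechanism of [RFC5285].
   The helper names and the ID values used here are illustrative only;
   a real implementation would follow [RFC5285] and any CLUE RTP usage
   specification exactly.

```python
import struct

def build_extension(elements):
    """elements: list of (id, data_bytes) pairs; returns the full
    one-byte header extension block (0xBEDE word + elements + pad)."""
    body = b""
    for ext_id, data in elements:
        # One-byte form: 4-bit ID (1-14), 4-bit length-minus-one.
        assert 1 <= ext_id <= 14 and 1 <= len(data) <= 16
        body += bytes([(ext_id << 4) | (len(data) - 1)]) + data
    while len(body) % 4:          # pad to a 32-bit boundary
        body += b"\x00"
    return struct.pack("!HH", 0xBEDE, len(body) // 4) + body

def parse_extension(block):
    """Returns {id: data_bytes} parsed from a one-byte extension."""
    magic, length = struct.unpack("!HH", block[:4])
    assert magic == 0xBEDE
    body, out, i = block[4:4 + 4 * length], {}, 0
    while i < len(body):
        if body[i] == 0:          # padding byte
            i += 1
            continue
        ext_id, n = body[i] >> 4, (body[i] & 0x0F) + 1
        out[ext_id] = body[i + 1:i + 1 + n]
        i += 1 + n
    return out

# One element for capture-ID-style data (ID 1) and one carrying a
# single-byte audio tag (ID 2), mirroring the Section 4 figure's
# length=1 layout (i.e., one 32-bit word of element data).
block = build_extension([(1, b"\x05"), (2, b"\x01")])
assert len(block) == 8
assert parse_extension(block)[2] == b"\x01"   # audio tag value 1
```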