Network Working Group                                       E. Ivov, Ed.
Internet-Draft                                          SIP Communicator
Intended status: Informational                           E. Marocco, Ed.
Expires: August 22, 2011                                  Telecom Italia
                                                               J. Lennox
                                                             Vidyo, Inc.
                                                       February 18, 2011


        A Real-Time Transport Protocol (RTP) Header Extension for
                  Mixer-to-Client Audio Level Indication
            draft-ietf-avtext-mixer-to-client-audio-level-00

Abstract

   This document describes a mechanism for RTP-level mixers in audio
   conferences to deliver information about the audio level of the
   individual participants.  Such audio level indicators are transported
   in the same RTP packets as the audio data they pertain to.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Protocol Operation
   4.  Header Format
   5.  Audio level encoding
   6.  Signaling Information
   7.  Security Considerations
   8.  IANA Considerations
   9.  Open Issues
   10. Acknowledgments
   11. Appendix: Design choices
     11.1.  SIP event package for conference state
     11.2.  The RTP Control Protocol (RTCP)
     11.3.  Encoding levels in the payload
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Framework for Conferencing with the Session Initiation Protocol
   (SIP) defined in RFC 4353 [RFC4353] presents an overall architecture
   for multi-party conferencing.  Among other things, the framework
   borrows from RTP [RFC3550] and extends the concept of a mixer entity
   "responsible for combining the media streams that make up a
   conference, and generating one or more output streams that are
   delivered to recipients".  Every participant would hence receive, in
   a flat single stream, media originating from all the others.

   Using such centralized mixer-based architectures simplifies support
   for conference calls on the client side since they would hardly
   differ from one-to-one conversations.  However, the method also
   introduces a few limitations.
   The flat nature of the streams that a mixer outputs and sends to
   participants makes it difficult for users to identify the original
   source of what they are hearing.

   Mechanisms that allow the mixer to send participants cues on current
   speakers (e.g., the CSRC fields in RTP [RFC3550]) only provide binary
   speaking/silent indications.  There are, however, a number of use
   cases where one would require more detailed information.  Possible
   examples include the presence of background chat/noise/music/typing,
   someone breathing noisily into their microphone, or other cases where
   identifying the source of the disturbance would make it easy to
   remove it (e.g., by sending a private IM to the concerned party
   asking them to mute their microphone).  A more advanced scenario
   could involve an intense discussion between multiple participants
   that the user does not personally know.  Audio level information
   would help the user recognize the speakers by associating them with
   complex (but still human-readable) characteristics such as loudness
   and speed.

   One way of presenting such information in a user-friendly manner
   would be for a conferencing client to attach audio level indicators
   to the corresponding participant-related components in the user
   interface, as displayed in Figure 1.

                     ________________________
                    |                        |
                    |  00:42 |  Weekly Call  |
                    |________________________|
                    |                        |
                    |                        |
                    | Alice |======     | (S)|
                    |                        |
                    | Bob   |=          |    |
                    |                        |
                    | Carol |           | (M)|
                    |                        |
                    | Dave  |===        |    |
                    |                        |
                    |________________________|

   Figure 1: Displaying detailed speaker information to the user by
   including audio level for every participant.

   Implementing a user interface like the above requires analysis of the
   media sent from other participants.
   In a conventional audio conference this is only possible for the
   mixer, since all other conference participants generally receive a
   single, flat audio stream and therefore have no immediate way of
   determining individual audio levels.

   This document specifies an RTP extension header that allows such
   mixers to deliver audio level information to conference participants
   by including it directly in the RTP packets transporting the
   corresponding audio data.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Protocol Operation

   According to RFC 3550 [RFC3550], a mixer is expected to include in
   outgoing RTP packets a list of identifiers (CSRC IDs) indicating the
   sources that contributed to the resulting stream.  The presence of
   such CSRC IDs allows an RTP client to determine, in a binary way, the
   active speaker(s) at any given moment.  RTCP also provides a basic
   mechanism to map the CSRC IDs to user identities through the CNAME
   field.  More advanced mechanisms may exist depending on the signaling
   protocol used to establish and control a conference.  In the case of
   the Session Initiation Protocol [RFC3261], for example, the Event
   Package for Conference State [RFC4575] defines a <src-id> tag which
   binds CSRC IDs to media streams and SIP URIs.

   This document describes an RTP header extension that allows mixers to
   indicate the audio level of every conference participant (CSRC) in
   addition to simply indicating their on/off status.  This new header
   extension is based on the "General Mechanism for RTP Header
   Extensions" [RFC5285].
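
   As an illustration of the binary, CSRC-based indication described
   above, the following minimal Python sketch extracts the CSRC list
   from the fixed RTP header laid out in RFC 3550 [RFC3550].  The
   function name parse_csrc_list and the sample packet are hypothetical
   and are not defined by this document.

```python
import struct
from typing import List


def parse_csrc_list(packet: bytes) -> List[int]:
    """Return the CSRC identifiers found in an RTP fixed header (RFC 3550).

    Hypothetical helper for illustration only.
    """
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    if packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    csrc_count = packet[0] & 0x0F  # CC field: low 4 bits of the first octet
    end = 12 + 4 * csrc_count      # CSRC list follows the 12-byte fixed header
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    return list(struct.unpack(f"!{csrc_count}I", packet[12:end]))


# Sample packet (assumed values): V=2, CC=2, PT=0, seq=1, ts=160, one SSRC,
# followed by two CSRC identifiers.
packet = (
    bytes([0x82, 0x00])
    + struct.pack("!HII", 1, 160, 0x11111111)
    + struct.pack("!II", 0xAAAA0001, 0xBBBB0002)
)
print([hex(csrc) for csrc in parse_csrc_list(packet)])
```

   From this list alone a receiver can only tell that a source
   contributed to the packet, not how loud that source was.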

   Each instance of this header contains a list of one-octet audio
   levels expressed in -dBov, with values from 0 to 127 representing 0
   to -127 dBov (see Section 4 and Section 5).

   Every audio level value pertains to the CSRC identifier located at
   the corresponding position in the CSRC list.  In other words, the
   first value indicates the audio level of the conference participant
   represented by the first CSRC identifier in that packet, and so
   forth.  The number and order of these values MUST therefore match the
   number and order of the CSRC IDs present in the same packet.

   When encoding audio level information, a mixer SHOULD include in a
   packet information that corresponds to the audio data being
   transported in that same packet.  It is important that these values
   follow the actual stream as closely as possible.  Therefore, a mixer
   SHOULD also calculate the values after the original contributing
   stream has undergone possible processing such as level normalization
   and noise reduction.

   Note that in some cases a mixer may be sending an RTP audio stream
   that only contains audio level information and no actual audio.
   Updating a (web) interface conference module may be one reason for
   this to happen.

   It may sometimes happen that a conference involves more than a single
   mixer.  In such cases each of the mixers MAY choose to relay the CSRC
   list and audio level information they receive from peer mixers (as
   long as the total CSRC count remains below 16).  Given that the
   maximum audio level is not precisely defined by this specification,
   it is likely that in such situations average audio levels would be
   perceptibly different for the participants located behind the
   different mixers.

4.  Header Format

   The audio level indicators are delivered to the receivers in-band
   using the "General Mechanism for RTP Header Extensions" [RFC5285].
   The payload of this extension is an ordered sequence of 8-bit audio
   level indicators encoded as per Section 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID   | len   |0|   level 1   |0|   level 2   |0|   level 3    ...
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Figure 2: Audio level indicators extension format

   The 4-bit len field is the number, minus one, of data bytes (i.e.,
   audio level values) transported in this header extension element
   following the one-byte header.  Therefore, the value zero in this
   field indicates that one byte of data follows.  A value of 15, which
   would announce 16 values, is not allowed by this specification and
   MUST NOT be used, as the RTP header can carry a maximum of 15 CSRC
   IDs.  The maximum value allowed is therefore 14, indicating a
   following sequence of 15 audio level values.

   Note that use of the two-byte header defined in RFC 5285 [RFC5285]
   follows the same rules, the only change being the length of the ID
   and len fields.

5.  Audio level encoding

   Audio level indicators are encoded in the same manner as the audio
   noise level in the RTP Payload Comfort Noise specification [RFC3389]
   and the audio level in the RTP Extension Header for Client-to-mixer
   Audio Level Notification [I-D.lennox-avt-rtp-audio-level-exthdr]
   specification.  The magnitude of the audio level is packed into the
   least significant bits of one audio level byte, with the most
   significant bit unused and always set to 0, as shown below in
   Figure 3.

                              0 1 2 3 4 5 6 7
                             +-+-+-+-+-+-+-+-+
                             |0|   level     |
                             +-+-+-+-+-+-+-+-+

                       Figure 3: Audio Level Encoding

   The audio level is expressed in -dBov, with values from 0 to 127
   representing 0 to -127 dBov.  dBov is the level, in decibels,
   relative to the overload point of the system, i.e.
   the maximum-amplitude signal that can be handled by the system
   without clipping.  (Note: Representation relative to the overload
   point of a system is particularly useful for digital implementations,
   since one does not need to know the relative calibration of the
   analog circuitry.)  For example, in the case of u-law (audio/pcmu)
   audio [ITU.G.711], the 0 dBov reference would be a square wave with
   values +/- 8031.  (This translates to 6.18 dBm0, relative to u-law's
   dBm0 definition in Table 6 of G.711.)

6.  Signaling Information

   The URI for declaring the audio level header extension in an SDP
   extmap attribute and mapping it to a local extension header
   identifier is "urn:ietf:params:rtp-hdrext:csrc-audio-level".  There
   is no additional setup information needed for this extension (i.e.,
   no extensionattributes).

   An example attribute line in the SDP for a conference might be:

      a=extmap:7 urn:ietf:params:rtp-hdrext:csrc-audio-level

   The above mapping will most often be provided per media stream (in
   the media-level section(s) of SDP, i.e., after an "m=" line) or
   globally if there is more than one stream containing audio level
   indicators in a session.

   Presence of the above attribute in the SDP description of a media
   stream indicates that some or all RTP packets in that stream would
   contain the audio level information RTP extension header.

   Conferencing clients that support audio level indicators and have no
   mixing capabilities SHOULD always include the direction parameter in
   the "extmap" attribute, setting it to "recvonly".  Conference focus
   entities with mixing capabilities MAY omit the direction or set it to
   "sendrecv" in SDP offers.  Such entities SHOULD set it to "sendonly"
   in SDP answers to offers with a "recvonly" parameter and to
   "sendrecv" when answering other "sendrecv" offers.
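
   The answer-side rule above for a focus with mixing capabilities could
   be sketched as follows in Python.  The helper name focus_answer_line
   and the regular expression are hypothetical.  Offers that carry no
   direction, or a direction other than "recvonly" or "sendrecv", are
   not covered by the text above, so this sketch returns None for them
   as an assumption rather than as specified behavior.

```python
import re
from typing import Optional

CSRC_AUDIO_LEVEL_URI = "urn:ietf:params:rtp-hdrext:csrc-audio-level"

# Matches "a=extmap:<id>[/<direction>] <URI>" (hypothetical, simplified pattern).
_EXTMAP = re.compile(r"^a=extmap:(\d+)(?:/(\w+))?\s+(\S+)")


def focus_answer_line(offer_line: str) -> Optional[str]:
    """Build the extmap line a mixing focus would place in its SDP answer."""
    match = _EXTMAP.match(offer_line)
    if match is None or match.group(3) != CSRC_AUDIO_LEVEL_URI:
        return None  # not an extmap line for this extension
    ext_id, direction = match.group(1), match.group(2)
    if direction == "recvonly":
        answer = "sendonly"   # client only receives levels, focus only sends
    elif direction == "sendrecv":
        answer = "sendrecv"   # peer focus: levels flow in both directions
    else:
        return None           # assumption: case not covered by the text
    return f"a=extmap:{ext_id}/{answer} {CSRC_AUDIO_LEVEL_URI}"


print(focus_answer_line(f"a=extmap:1/recvonly {CSRC_AUDIO_LEVEL_URI}"))
print(focus_answer_line(f"a=extmap:1/sendrecv {CSRC_AUDIO_LEVEL_URI}"))
```

   The sketch reuses the extension identifier from the offer in its
   answer.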

   Figure 4 and Figure 5 show two example offer/answer exchanges:
   between a conferencing client and a focus, and between two
   conference focus entities.

      v=0
      o=alice 2890844526 2890844526 IN IP6 host.example.com
      c=IN IP6 host.example.com
      t=0 0
      m=audio 49170 RTP/AVP 0 4
      a=rtpmap:0 PCMU/8000
      a=rtpmap:4 G723/8000
      a=extmap:1/recvonly urn:ietf:params:rtp-hdrext:csrc-audio-level

      v=0
      i=A Seminar on the session description protocol
      o=conf-focus 2890844730 2890844730 IN IP6 focus.example.net
      c=IN IP6 focus.example.net
      t=0 0
      m=audio 52543 RTP/AVP 0
      a=rtpmap:0 PCMU/8000
      a=extmap:1/sendonly urn:ietf:params:rtp-hdrext:csrc-audio-level

   A client-initiated example SDP offer/answer exchange negotiating an
   audio stream with one-way flow of audio level information.

                                 Figure 4

      v=0
      i=Un seminaire sur le protocole de description des sessions
      o=fr-focus 2890844730 2890844730 IN IP6 focus.fr.example.net
      c=IN IP6 focus.fr.example.net
      t=0 0
      m=audio 49170 RTP/AVP 0
      a=rtpmap:0 PCMU/8000
      a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-audio-level

      v=0
      i=A Seminar on the session description protocol
      o=us-focus 2890844526 2890844526 IN IP6 focus.us.example.net
      c=IN IP6 focus.us.example.net
      t=0 0
      m=audio 52543 RTP/AVP 0
      a=rtpmap:0 PCMU/8000
      a=extmap:1/sendrecv urn:ietf:params:rtp-hdrext:csrc-audio-level

   An example SDP offer/answer exchange between two conference focus
   entities with mixing capabilities negotiating an audio stream with
   bidirectional flow of audio level information.

                                 Figure 5

7.  Security Considerations

   1.  This document defines a means of attributing audio level to a
       particular participant in a conference.  An attacker may try to
       modify the content of RTP packets in a way that would make audio
       activity from one participant appear as coming from another.
   2.
       Furthermore, the fact that audio level values would not be
       protected even in an SRTP session may be of concern in some cases
       where the activity of a particular participant in a conference is
       confidential.
   3.  Both of the above are concerns that stem from the design of the
       RTP protocol itself, and they would probably also apply when
       using CSRC identifiers the way they were specified in RFC 3550
       [RFC3550].  It is therefore important that, according to the
       needs of a particular scenario, implementors and deployers
       consider use of a lower-level security and authentication
       mechanism.

8.  IANA Considerations

   This document defines a new extension URI that, if approved, would
   need to be added to the RTP Compact Header Extensions sub-registry of
   the Real-Time Transport Protocol (RTP) Parameters registry, according
   to the following data:

      Extension URI: urn:ietf:params:rtp-hdrext:csrc-audio-level
      Description:   Mixer-to-client audio level indicators
      Contact:       emcho@sip-communicator.org
      Reference:     RFC XXXX

9.  Open Issues

   At the time of writing of this document the authors have no clear
   view on whether and how the following issues should be addressed
   here:
   1.  Audio levels in video streams.  This specification allows use of
       audio level values in "silent" audio streams that don't otherwise
       carry any payload, thus allowing their delivery within systems
       where the various focus/mixer components communicate with each
       other as conference participants.  The same train of thought may
       very well justify audio level transport in video streams.
   2.  It has been suggested to reference ITU P.56 [ITU.P56.1993] for
       level measurement.  This needs to be investigated.

10.  Acknowledgments

   Roni Even, Ingemar Johansson, Michael Ramalho and several others
   provided helpful feedback over the dispatch mailing list.

   SIP Communicator's participation in this specification is funded by
   the NLnet Foundation.

11.  Appendix: Design choices

   During discussions on the subject of audio levels, the decision to
   transport audio levels in RTP packets rather than in another protocol
   was questioned several times, which is why the authors find it worth
   explaining here.  The following subsections describe alternative
   mechanisms for delivering audio levels and the reasons why the
   authors decided not to use them.

11.1.  SIP event package for conference state

   RFC 4575 [RFC4575] defines a conference event package for tightly
   coupled conferences using the Session Initiation Protocol (SIP)
   events framework.  It allows for the delivery of various
   conference-related details such as conference descriptions,
   participant count and identity.  The document also provides a way of
   indicating who the speakers are at any given moment by specifying a
   mechanism for mapping conference participants to RTP SSRC/CSRC
   identifiers.  All these details are dispatched in an asynchronous
   manner using the SIP events framework or, in other words, through
   NOTIFY SIP requests following an initial SUBSCRIBE from a
   participant.

   Contrary to "plain" active speaker information, where significant
   changes only occur once every several seconds, audio level in human
   speech is a very time-sensitive characteristic which would require
   frequent updates (i.e., approximately once every 50-100 ms).  In
   order for the update of the user interface to appear "natural" to
   the user, audio level information would probably have to be delivered
   for every one or two RTP packets.  Using RFC 4575 [RFC4575] or SIP in
   general for this would generate traffic on the (often low-bandwidth)
   signalling path comparable to, if not exceeding, the media itself.
   It may also prove relatively hard for client developers to
   synchronize the information they receive from SIP messages with the
   information they obtain from the media flows.

   It is probably also worth mentioning that the use of RFC 4575
   [RFC4575] for such a feature would make the mechanism incompatible
   with non-SIP signaling protocols such as XMPP [RFC3920] and its
   Jingle extensions.

11.2.  The RTP Control Protocol (RTCP)

   Similar to using SIP, delivering audio levels through RTCP would
   cause bandwidth and synchronization issues.  Furthermore, the RTP
   specification [RFC3550] explicitly recommends that the fraction of
   the session bandwidth added for RTCP be fixed at 5%, which may not be
   sufficient for the transport of audio level indicators.

11.3.  Encoding levels in the payload

   Given the content-specific nature of audio levels, it has been
   suggested that audio level information be encoded and transmitted as
   part of the payload.  While this is indeed a feasible approach,
   client developers would need to explicitly handle it in all
   individual codec modules of their application.  Compared to RTP
   extensions, the mechanism would therefore represent a substantial
   additional effort without offering any meaningful advantages.

12.  References

12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

12.2.  Informative References

   [I-D.lennox-avt-rtp-audio-level-exthdr]
              Lennox, J., Ivov, E., and E.
              Marocco, "A Real-Time Transport Protocol (RTP) Header
              Extension for Client-to-Mixer Audio Level Indication",
              draft-lennox-avt-rtp-audio-level-exthdr-02 (work in
              progress), July 2010.

   [ITU.G.711]
              International Telecommunications Union, "Pulse Code
              Modulation (PCM) of Voice Frequencies",
              ITU-T Recommendation G.711, November 1988.

   [ITU.P56.1993]
              International Telecommunications Union, "Objective
              Measurement of Active Speech Level", ITU-T Recommendation
              P.56, March 1988.

   [RFC3261]  Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
              A., Peterson, J., Sparks, R., Handley, M., and E.
              Schooler, "SIP: Session Initiation Protocol", RFC 3261,
              June 2002.

   [RFC3389]  Zopf, R., "Real-time Transport Protocol (RTP) Payload for
              Comfort Noise (CN)", RFC 3389, September 2002.

   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
              Video Conferences with Minimal Control", STD 65, RFC 3551,
              July 2003.

   [RFC3920]  Saint-Andre, P., Ed., "Extensible Messaging and Presence
              Protocol (XMPP): Core", RFC 3920, October 2004.

   [RFC4353]  Rosenberg, J., "A Framework for Conferencing with the
              Session Initiation Protocol (SIP)", RFC 4353,
              February 2006.

   [RFC4575]  Rosenberg, J., Schulzrinne, H., and O. Levin, "A Session
              Initiation Protocol (SIP) Event Package for Conference
              State", RFC 4575, August 2006.

Authors' Addresses

   Emil Ivov (editor)
   SIP Communicator
   Strasbourg  67000
   France

   Email: emcho@sip-communicator.org


   Enrico Marocco (editor)
   Telecom Italia
   Via G. Reiss Romoli, 274
   Turin  10148
   Italy

   Email: enrico.marocco@telecomitalia.it


   Jonathan Lennox
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com