AVT                                                       J. Lennox, Ed.
Internet-Draft                                                     Vidyo
Intended status: Standards Track                                 E. Ivov
Expires: August 22, 2011                                SIP Communicator
                                                              E. Marocco
                                                          Telecom Italia
                                                       February 18, 2011


       A Real-Time Transport Protocol (RTP) Header Extension for
                Client-to-Mixer Audio Level Indication
           draft-ietf-avtext-client-to-mixer-audio-level-00

Abstract

   This document defines a mechanism by which packets of Real-Time
   Transport Protocol (RTP) audio streams can indicate, in an RTP header
   extension, the audio level of the audio sample carried in the RTP
   packet.  In large conferences, this can reduce the load on an audio
   mixer or other middlebox which wants to forward only a few of the
   loudest audio streams, without requiring it to decode and measure
   every stream that is received.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 22, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Audio Levels
   4.  Signaling (Setup) Information
   5.  Considerations on Use
   6.  Limitations
   7.  Security Considerations
   8.  IANA Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Open issues
   Appendix B.  Changes From Earlier Versions
     B.1.  Changes From Individual Submission Draft -01
     B.2.  Changes From Individual Submission Draft -00
   Authors' Addresses

1.  Introduction

   In a centralized Real-Time Transport Protocol (RTP) [RFC3550] audio
   conference, an audio mixer or forwarder receives audio streams from
   many or all of the conference participants.  It then selectively
   forwards some of them to other participants in the conference.  In
   large conferences, such a server might receive a large number of
   streams, of which only a few should be forwarded to the other
   conference participants.

   In such a scenario, in order to pick the audio streams to forward, a
   centralized server needs to decode, measure audio levels, and
   possibly perform voice activity detection on audio data from a large
   number of streams.  The need for such processing limits the size or
   number of conferences such a server can support.

   As an alternative, this document defines an RTP header extension
   [RFC5285] through which senders of audio packets can indicate the
   audio level of the packets' payload, reducing the processing load for
   a server.

   The header extension in this draft is different from, but
   complementary to, the one defined in [I-D.ivov-avt-slic], which
   defines a mechanism by which audio mixers can indicate to clients the
   levels of the contributing sources that made up the mixed audio.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119] and
   indicate requirement levels for compliant implementations.

3.  Audio Levels

   The audio level header extension carries both the level of the audio
   in the RTP payload of the packet it is associated with and an
   indication of whether voice activity has been detected in the packet.

   The form of the audio level extension block is as follows:

       0                   1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  ID   | len=0 |V|    level    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                               Figure 1

   The length field takes the value 0 to indicate that 1 byte follows.

   The audio level is defined in the same manner as is audio noise level
   in the RTP Comfort Noise [RFC3389] specification.  In that
   specification, the overall magnitude of the noise level is encoded
   into the first byte of the payload, with spectral information about
   the noise in subsequent bytes.  This specification's audio level
   parameter is defined so as to be identical to the comfort noise
   payload's noise-level byte.

   The magnitude of the audio level is packed into the seven least
   significant bits of the single byte of the header extension, shown in
   Figure 1.  The least significant bit of the audio level magnitude is
   packed into the least significant bit of the byte.  The most
   significant bit of the byte is used as a separate flag bit "V",
   defined below.

   The audio level is expressed in -dBov, with values from 0 to 127
   representing 0 to -127 dBov.  dBov is the level, in decibels,
   relative to the overload point of the system, i.e., the
   maximum-amplitude signal that can be handled by the system without
   clipping.  (Note: Representation relative to the overload point of a
   system is particularly useful for digital implementations, since one
   does not need to know the relative calibration of the analog
   circuitry.)  For example, in the case of u-law (audio/pcmu) audio
   [ITU.G711.1988], the 0 dBov reference would be a square wave with
   values +/- 8031.
   (This translates to 6.18 dBm0, relative to u-law's dBm0 definition in
   Table 6 of G.711.)

   In addition, a flag bit (labeled V) indicates whether the encoder
   believes the audio packet contains voice activity (1) or does not
   (0).  The voice activity detection algorithm is unspecified and left
   implementation-specific.

   The audio level for digital silence (e.g., all-zero PCMU audio), such
   as that from a muted audio source, MAY be represented as 127 (-127
   dBov), regardless of the dynamic range of the encoded audio format.

   When this header extension is used with RTP data sent using the RTP
   Payload for Redundant Audio Data [RFC2198], the header's data
   describes the contents of the primary encoding.

4.  Signaling (Setup) Information

   The URI for declaring this header extension in an extmap attribute is
   "urn:ietf:params:rtp-hdrext:audio-level".  There is no additional
   setup information needed for this extension (no extension
   attributes).

5.  Considerations on Use

   Mixers and forwarders generally should not base audio forwarding
   decisions directly on packet-by-packet audio level information, but
   rather should apply some analysis of the audio levels and trends.
   This general rule applies whether audio levels are provided by
   endpoints (as defined in this document) or are calculated at a
   server, as would be done in the absence of this information.  This
   section discusses several issues that mixers and forwarders may wish
   to take into account.  (Note that this section provides design
   guidance only, and is not normative.)

   First of all, audio levels should generally be measured over longer
   intervals than that of a single audio packet.  In order to avoid
   false positives for short bursts of sound (such as a cough or a
   dropped microphone), it is often useful to require that a
   participant's audio level be maintained for some period of time
   before considering it to be "real", i.e., some type of low-pass
   filter should be applied to the audio levels.  Note, though, that
   such filtering must be balanced with the need to avoid clipping the
   beginning of a speaker's speech.

   Additionally, different participants may have their audio input set
   differently.  It may be useful to apply some sort of automatic gain
   control to the audio levels.  There are a number of possible
   approaches to achieving this, e.g., measuring peak audio levels,
   average audio levels during speech, or background audio levels
   (average audio levels during non-speech).

6.  Limitations

   The audio levels carried by the header extension defined by this
   document are defined as dBov, decibels below system overload.

   In principle, it could be more useful to have, instead, dB SPL,
   decibels of sound pressure level.  In traditional telephony systems,
   telephone handsets were calibrated such that a particular audio level
   (e.g., a u-law level) or analog voltage corresponded to a particular
   sound pressure level at the handset's mouthpiece.

   However, in many environments, this information is not available.
   Notably, PC sound card hardware can only determine the levels of
   mic-in or line-in at the hardware input, and operating systems
   usually allow further adjustments of audio input levels without
   providing information about these transformations to applications.
   Furthermore, in many circumstances, such as speech synthesis or mixed
   audio, an "audio" signal may in fact never have existed as sound
   pressure at all.
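
   Unlike dB SPL, the dBov level defined in Section 3 can be computed
   from the digital samples alone, with no knowledge of acoustic
   calibration.  The following Python sketch illustrates this and is not
   part of this specification.  It assumes 16-bit linear PCM input with
   a full-scale value of 32767 and measures RMS power over one packet's
   samples; this document mandates neither (see the open issue on ITU-T
   P.56 in Appendix A).  The names FULL_SCALE, audio_level,
   pack_level_byte, unpack_level_byte, and extension_element are
   hypothetical, and the extension ID is assumed to be the value
   negotiated in the extmap attribute (Section 4), which lies in the
   range 1-14 for the one-byte header form of [RFC5285].

```python
import math

# Assumption: 16-bit linear PCM, so a full-scale square wave swings +/- 32767.
# (For u-law, Section 3 gives the 0 dBov reference as +/- 8031.)
FULL_SCALE = 32767.0


def audio_level(samples, full_scale=FULL_SCALE):
    """Return the 7-bit level (0..127, meaning 0 to -127 dBov).

    The level is RMS power relative to a full-scale square wave, whose
    mean square equals full_scale ** 2.  This measurement method is an
    assumption; the document does not specify one.
    """
    if not samples:
        return 127
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return 127  # digital silence, e.g. a muted source (MAY be 127)
    dbov = 10 * math.log10(mean_square / (full_scale * full_scale))
    # -dBov rounded to whole decibels and clamped to the 7-bit range.
    return max(0, min(127, round(-dbov)))


def pack_level_byte(level, voice_active):
    """V flag in the most significant bit, level in the low seven bits."""
    if not 0 <= level <= 127:
        raise ValueError("level must be in 0..127")
    return bytes([(0x80 if voice_active else 0x00) | level])


def unpack_level_byte(data):
    """Return (voice_active, level) from the extension's single data byte."""
    value = data[0]
    return bool(value & 0x80), value & 0x7F


def extension_element(ext_id, level, voice_active):
    """One-byte-header element: ID (4 bits), len=0 (1 data byte), data."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header IDs are 1..14")
    return bytes([ext_id << 4]) + pack_level_byte(level, voice_active)


if __name__ == "__main__":
    # A square wave at one tenth of full scale is about -20 dBov.
    frame = [3277, -3277] * 80  # 160 samples, i.e. 20 ms at 8 kHz
    lvl = audio_level(frame)
    print(lvl, extension_element(1, lvl, True).hex())  # 20 1094
```

   Clamping to 127 means that digital silence and any signal quieter
   than -127 dBov are reported identically, which matches the MAY in
   Section 3 for muted sources.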

   Thus, while information about the correspondence between dB SPL and
   dBov, or encoded audio, could be useful, this document does not
   attempt to define it.  If there are circumstances in which this
   information would be useful, a separate header extension would be
   straightforward to define.  (The information carried by such a header
   extension could indeed be useful independently of the information in
   the header extension defined by this document.)

7.  Security Considerations

   A malicious endpoint could choose to set the values in this header
   extension falsely, so as to falsely claim that audio or voice is or
   is not present.  It is not clear what could be gained by falsely
   claiming that audio is not present, but an endpoint falsely claiming
   that audio is present could mount a denial-of-service attack on an
   audio conference by sending silence while claiming audio, thereby
   suppressing other conference members' audio.  Thus, a device relying
   on audio level data from untrusted endpoints SHOULD periodically
   audit the level information transmitted, taking appropriate
   corrective action if endpoints appear to be sending incorrect data.
   (Note that endpoints MAY choose to measure audio levels prior to
   encoding, so some degree of discrepancy SHOULD be tolerated.)

   In the Secure Real-Time Transport Protocol (SRTP) [RFC3711], RTP
   header extensions are authenticated but not encrypted.  When this
   header extension is used, audio levels are therefore visible on a
   packet-by-packet basis to an attacker passively observing the audio
   stream.  As discussed in [I-D.perkins-avt-srtp-vbr-audio], such an
   attacker might be able to infer information about the conversation,
   possibly with phoneme-level resolution.  In scenarios where this is a
   concern, additional mechanisms SHOULD be used to protect the
   confidentiality of the header extension.  One solution would be
   header extension encryption
   [I-D.lennox-avt-srtp-encrypted-extension-headers].

8.  IANA Considerations

   This document defines a new extension URI in the RTP Compact Header
   Extensions subregistry of the Real-Time Transport Protocol (RTP)
   Parameters registry, according to the following data:

      Extension URI: urn:ietf:params:rtp-hdrext:audio-level
      Description:   Audio Level
      Contact:       jonathan@vidyo.com
      Reference:     RFC XXXX

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2198]  Perkins, C., Kouvelas, I., Hodson, O., Hardman, V.,
              Handley, M., Bolot, J., Vega-Garcia, A., and S.
              Fosse-Parisis, "RTP Payload for Redundant Audio Data",
              RFC 2198, September 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

9.2.  Informative References

   [I-D.ivov-avt-slic]
              Ivov, E., Marocco, E., and J. Lennox, "A Real-Time
              Transport Protocol (RTP) Extension Header for
              Mixer-to-client Audio Level Indication",
              draft-ivov-avt-slic-04 (work in progress), January 2011.

   [I-D.lennox-avt-srtp-encrypted-extension-headers]
              Lennox, J., "Encryption of Header Extensions in the Secure
              Real-Time Transport Protocol (SRTP)",
              draft-lennox-avt-srtp-encrypted-extension-headers-02 (work
              in progress), October 2010.

   [I-D.perkins-avt-srtp-vbr-audio]
              Perkins, C. and J. Valin, "Guidelines for the use of
              Variable Bit Rate Audio with Secure RTP",
              draft-perkins-avt-srtp-vbr-audio-05 (work in progress),
              December 2010.

   [ITU.G711.1988]
              International Telecommunication Union, "Pulse Code
              Modulation (PCM) of Voice Frequencies", ITU-T
              Recommendation G.711, November 1988.

   [ITU.P56.1993]
              International Telecommunication Union, "Objective
              Measurement of Active Speech Level", ITU-T Recommendation
              P.56, March 1993.

   [RFC3389]  Zopf, R., "Real-time Transport Protocol (RTP) Payload for
              Comfort Noise (CN)", RFC 3389, September 2002.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, March 2004.

Appendix A.  Open issues

   o  In order to more accurately determine signal-to-noise ratio, it
      would be useful for a sender to also send its estimate of its
      current audio noise floor.  If so, it is unclear whether this
      would be better as a separate header extension element, or added
      to this header extension element.
   o  It has been suggested to reference ITU-T P.56 [ITU.P56.1993] for
      level measurement.  This needs to be investigated.

Appendix B.  Changes From Earlier Versions

   Note to the RFC-Editor: please remove this section prior to
   publication as an RFC.

B.1.  Changes From Individual Submission Draft -01

   o  This version is primarily a document refresh.
   o  Emil Ivov and Enrico Marocco have been added as co-authors.
   o  Additional open issues listed.

B.2.  Changes From Individual Submission Draft -00

   o  The draft name has been changed to clarify that this document
      defines Client-To-Mixer Audio Levels, to more clearly distinguish
      it from [I-D.ivov-avt-slic].
   o  The header extension format has been changed from a two-byte to a
      one-byte payload, eliminating the 7 reserved bits and the one
      must-be-zero bit.
   o  The sections Considerations on Use (Section 5) and Limitations
      (Section 6) have been added.
   o  It has been noted that senders MAY indicate -127 dBov for digital
      silence, and that level measurement MAY be done prior to encoding
      audio.
   o  A reference to [I-D.lennox-avt-srtp-encrypted-extension-headers]
      has been added to the security considerations.
   o  The term "header extension" is now used consistently throughout
      the document (as opposed to "extension header").

Authors' Addresses

   Jonathan Lennox (editor)
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com


   Emil Ivov
   SIP Communicator
   Strasbourg  67000
   France

   Email: emcho@sip-communicator.org


   Enrico Marocco
   Telecom Italia
   Via G. Reiss Romoli, 274
   Turin  10148
   Italy

   Email: enrico.marocco@telecomitalia.it