idnits 2.17.1

draft-ietf-avtext-client-to-mixer-audio-level-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (November 14, 2011) is 4509 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 5285 (Obsoleted by RFC 8285)

  == Outdated reference: A later version (-05) exists of
     draft-ietf-avtcore-srtp-encrypted-header-ext-01

  == Outdated reference: A later version (-04) exists of
     draft-ietf-avtcore-srtp-vbr-audio-03

  == Outdated reference: A later version (-06) exists of
     draft-ietf-avtext-mixer-to-client-audio-level-05

     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

AVT                                     J. Lennox, Ed.
Internet-Draft                          Vidyo
Intended status: Standards Track        E. Ivov
Expires: May 17, 2012                   Jitsi
                                        E.
                                        Marocco
                                        Telecom Italia
                                        November 14, 2011


      A Real-Time Transport Protocol (RTP) Header Extension for
                Client-to-Mixer Audio Level Indication
           draft-ietf-avtext-client-to-mixer-audio-level-06

Abstract

   This document defines a mechanism by which packets of Real-Time
   Transport Protocol (RTP) audio streams can indicate, in an RTP header
   extension, the audio level of the audio sample carried in the RTP
   packet.  In large conferences, this can reduce the load on an audio
   mixer or other middlebox which wants to forward only a few of the
   loudest audio streams, without requiring it to decode and measure
   every stream that is received.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 17, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Audio Levels
   4.  Signaling (Setup) Information
   5.  Considerations on Use
   6.  Security Considerations
   7.  IANA Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Changes From Earlier Versions
     A.1.  Changes From Draft -05
     A.2.  Changes From Draft -04
     A.3.  Changes From Draft -03
     A.4.  Changes From Draft -02
     A.5.  Changes From Draft -01
     A.6.  Changes From Individual Submission Draft -01
     A.7.  Changes From Individual Submission Draft -00
   Authors' Addresses

1.  Introduction

   In a centralized Real-Time Transport Protocol (RTP) [RFC3550] audio
   conference, an audio mixer or forwarder receives audio streams from
   many or all of the conference participants.  It then selectively
   forwards some of them to other participants in the conference.
   In large conferences, it is possible that such a server might be
   receiving a large number of streams, of which only a few are intended
   to be forwarded to the other conference participants.

   In such a scenario, in order to pick the audio streams to forward, a
   centralized server needs to decode, measure audio levels, and
   possibly perform voice activity detection on audio data from a large
   number of streams.  The need for such processing limits the size or
   number of conferences such a server can support.

   As an alternative, this document defines an RTP header extension
   [RFC5285] through which senders of audio packets can indicate the
   audio level of the packets' payload, reducing the processing load for
   a server.

   The header extension in this draft is different than, but
   complementary with, the one defined in
   [I-D.ietf-avtext-mixer-to-client-audio-level], which defines a
   mechanism by which audio mixers can indicate to clients the levels of
   the contributing sources that made up the mixed audio.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119] and
   indicate requirement levels for compliant implementations.

3.  Audio Levels

   The audio level header extension carries the level of the audio in
   the RTP [RFC3550] payload of the packet it is associated with.  This
   information is carried in an RTP header extension element as defined
   by the "General Mechanism for RTP Header Extensions" [RFC5285].

   The payload of the audio level header extension element can be
   encoded using the one-byte or the two-byte header defined in
   [RFC5285].  Figure 1 and Figure 2 show sample audio level encodings
   with each of them.

       0                   1
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  ID   | len=0 |V|   level     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

       Sample audio level encoding using the one-byte header format

                                 Figure 1

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |      ID       |     len=1     |V|    level    |    0 (pad)    |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

       Sample audio level encoding using the two-byte header format

                                 Figure 2

   Note that, as indicated in [RFC5285], the length field in the
   one-byte header format takes the value 0 to indicate that 1 byte
   follows.  In the two-byte header format, on the other hand, it takes
   the value 1.

   The magnitude of the audio level itself is packed into the seven
   least significant bits of the single byte of the header extension,
   shown in Figure 1 and Figure 2.  The least significant bit of the
   audio level magnitude is packed into the least significant bit of the
   byte.  The most significant bit of the byte is used as a separate
   flag bit "V", defined below.

   The audio level is expressed in -dBov, with values from 0 to 127
   representing 0 to -127 dBov.  dBov is the level, in decibels, relative
   to the overload point of the system, i.e. the highest-intensity
   signal encodable by the payload format.  (Note: Representation
   relative to the overload point of a system is particularly useful for
   digital implementations, since one does not need to know the relative
   calibration of the analog circuitry.)  For example, in the case of
   u-law (audio/pcmu) audio [ITU.G711.1988], the 0 dBov reference would
   be a square wave with values +/- 8031.  (This translates to 6.18
   dBm0, relative to u-law's dBm0 definition in Table 6 of G.711.)
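
   As an illustration only (not part of the specification), the bit
   layout of Figure 1 and Figure 2 could be sketched in Python as
   follows.  The function names are hypothetical, and the permitted ID
   ranges (1-14 for the one-byte header, 1-255 for the two-byte header)
   are taken from [RFC5285] rather than from this document.  The
   trailing pad byte of Figure 2 is padding of the extension block and
   is therefore not emitted here.

```python
from typing import Tuple


def pack_level_byte(level: int, voice: bool = False) -> int:
    """Pack a level (0..127, meaning 0 to -127 dBov) and the V flag.

    The level occupies the seven least significant bits; the V flag is
    the most significant bit.
    """
    if not 0 <= level <= 127:
        raise ValueError("level must be in 0..127")
    return (0x80 if voice else 0x00) | level


def unpack_level_byte(byte: int) -> Tuple[bool, int]:
    """Return (V flag, level) from the single data byte."""
    return bool(byte & 0x80), byte & 0x7F


def one_byte_element(ext_id: int, level: int, voice: bool = False) -> bytes:
    """Element as in Figure 1: 4-bit ID, 4-bit len=0, one data byte."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header IDs are 1..14")
    return bytes([(ext_id << 4) | 0x0, pack_level_byte(level, voice)])


def two_byte_element(ext_id: int, level: int, voice: bool = False) -> bytes:
    """Element as in Figure 2 (without pad): 8-bit ID, len=1, data byte."""
    if not 1 <= ext_id <= 255:
        raise ValueError("two-byte header IDs are 1..255")
    return bytes([ext_id, 1, pack_level_byte(level, voice)])
```

   For instance, with extension ID 6, a level of 30 (-30 dBov) and the V
   bit set, the one-byte form is the two bytes 0x60 0x9E.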

   The audio level for digital silence, for example for a muted audio
   source, MUST be represented as 127 (-127 dBov), regardless of the
   dynamic range of the encoded audio format.

   The audio level header extension only carries the level of the audio
   in the RTP payload of the packet it is associated with, with no
   long-term averaging or smoothing applied.  For payload formats that
   contain extra error-correction bits or loss-concealment information,
   the level corresponds only to the data that would result from the
   payload's normal decoding process, not what it would produce under
   error or packet loss concealment.  The level is measured as a root
   mean square of all the samples in the audio encoded by the packet.

   To simplify implementation of the encoding procedures described here,
   the reference implementation section in
   [I-D.ietf-avtext-mixer-to-client-audio-level] provides a sample Java
   implementation of an audio level calculator that helps obtain such
   values from raw linear PCM audio samples.

   In addition, a flag bit (labeled V) optionally indicates whether the
   encoder believes the audio packet contains voice activity.  If the V
   bit is in use, the value 1 indicates that the encoder believes the
   audio packet contains voice activity, and the value 0 indicates that
   the encoder believes it does not.  (The voice activity detection
   algorithm is unspecified and left implementation-specific.)  If the V
   bit is not in use, its value is unspecified and MUST be ignored by
   receivers.  The use of the V bit is signaled using the extension
   attribute "vad", discussed in Section 4.

   When this header extension is used with RTP data sent using the RTP
   Payload for Redundant Audio Data [RFC2198], the header's data
   describes the contents of the primary encoding.
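
   As a rough sketch of the root-mean-square measurement described
   above (this is not the Java calculator referenced earlier), the level
   of one packet's worth of samples might be computed as shown below.
   Several points are assumptions of this sketch rather than
   requirements of this document: the input is 16-bit signed linear
   PCM, the overload point is taken as an amplitude of 32767, the result
   is rounded to the nearest integer, and an empty sample list is
   reported as silence.  The function name is hypothetical.

```python
import math
from typing import Sequence


def audio_level(samples: Sequence[int], overload: float = 32767.0) -> int:
    """Return the level in -dBov (0..127) for one packet of PCM samples.

    Assumes linear PCM with the given overload amplitude (an assumption
    of this sketch); digital silence maps to 127.
    """
    if not samples:
        return 127  # assumption: treat "no samples" as silence
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms <= 0.0:
        return 127  # digital silence is represented as 127
    db = 20.0 * math.log10(rms / overload)  # 0 dBov at the overload point
    # Negate to obtain -dBov, round (assumption), and clamp to 7 bits.
    return max(0, min(127, round(-db)))
```

   A full-scale square wave yields 0, a square wave at half the
   overload amplitude yields 6, and an all-zero packet yields 127.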

   Note: This audio level is defined in the same manner as is audio
   noise level in the RTP Payload Comfort Noise specification [RFC3389].
   In the comfort noise specification, the overall magnitude of the
   noise level in comfort noise is encoded into the first byte of the
   payload, with spectral information about the noise in subsequent
   bytes.  This specification's audio level parameter is defined so as
   to be identical to the comfort noise payload's noise-level byte.

4.  Signaling (Setup) Information

   The URI for declaring this header extension in an extmap attribute is
   "urn:ietf:params:rtp-hdrext:ssrc-audio-level".

   It has a single extension attribute, named "vad".  It takes the form
   "vad=on" or "vad=off".  If the header extension element is signaled
   with "vad=on", the "V" bit described in Section 3 is in use, and MUST
   be set by senders.  If the header extension element is signaled with
   "vad=off", the "V" bit is not in use, and its value MUST be ignored
   by receivers.  If the "vad" extension attribute is not specified, the
   default is "vad=on".

   An example attribute line in the SDP for a conference might hence
   be:

      a=extmap:6 urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on

   The "vad" extension attribute only controls the semantics of this
   header extension attribute, and does not make any statement about
   whether the sender is using any other voice activity detection
   features such as discontinuous transmission, comfort noise, or
   silence suppression.

   Using the mechanisms of [RFC5285], an endpoint MAY signal multiple
   instances of the header extension element, with different values of
   the vad attribute, so long as these instances use different values
   for the extension identifier.
   However, again following the rules of
   [RFC5285], the semantics chosen for a header extension element
   (including its vad setting) for a particular extension identifier
   value MUST NOT be changed within an RTP session.

5.  Considerations on Use

   Mixers and forwarders generally ought not base audio forwarding
   decisions directly on packet-by-packet audio level information, but
   rather ought to apply some analysis of the audio levels and trends.
   This general rule applies whether audio levels are provided by
   endpoints (as defined in this document), or are calculated at a
   server, as would be done in the absence of this information.  This
   section discusses several issues that mixers and forwarders may wish
   to take into account.  (Note that this section provides design
   guidance only, and is not normative.)

   First of all, audio levels generally ought to be measured over longer
   intervals than that of a single audio packet.  In order to avoid
   false positives for short bursts of sound (such as a cough or a
   dropped microphone), it is often useful to require that a
   participant's audio level be maintained for some period of time
   before considering it to be "real", i.e. some type of low-pass filter
   ought to be applied to the audio levels.  Note, though, that such
   filtering must be balanced with the need to avoid clipping of the
   beginning of a speaker's speech.

   Additionally, different participants may have their audio input set
   differently.  It may be useful to apply some sort of automatic gain
   control to the audio levels.  There are a number of possible
   approaches to achieving this, e.g. by measuring peak audio levels,
   average audio levels during speech, or background audio levels
   (average audio levels during non-speech).

6.
Security Considerations

   A malicious endpoint could choose to set the values in this header
   extension falsely, so as to falsely claim that audio or voice is or
   is not present.  It is not clear what could be gained by falsely
   claiming that audio is not present, but an endpoint falsely claiming
   that audio is present could perform a denial-of-service attack on an
   audio conference, so as to send silence to suppress other conference
   members' audio, or could dominate a conference (by seizing its
   speaker-selection algorithm) without actually speaking.  Thus, if a
   device relies on audio level data from untrusted endpoints, it SHOULD
   periodically audit the level information transmitted, taking
   appropriate corrective action against endpoints that appear to be
   sending incorrect data.  (However, as it is valid for an endpoint to
   choose to measure audio levels prior to encoding, some degree of
   discrepancy could be present.  This would not indicate that an
   endpoint is malicious.)

   In the Secure Real-Time Transport Protocol (SRTP) [RFC3711], RTP
   header extensions are authenticated but not encrypted.  When this
   header extension is used, audio levels are therefore visible on a
   packet-by-packet basis to an attacker passively observing the audio
   stream.  As discussed in [I-D.ietf-avtcore-srtp-vbr-audio], such an
   attacker might be able to infer information about the conversation,
   possibly with phoneme-level resolution.  In scenarios where this is a
   concern, additional mechanisms MUST be used to protect the
   confidentiality of the header extension.  This mechanism could be
   header extension encryption
   [I-D.ietf-avtcore-srtp-encrypted-header-ext], or a lower-level
   security and authentication mechanism such as IPsec [RFC4301].

7.
IANA Considerations

   This document defines a new extension URI to the RTP Compact Header
   Extensions subregistry of the Real-Time Transport Protocol (RTP)
   Parameters registry, according to the following data:

      Extension URI: urn:ietf:params:rtp-hdrext:ssrc-audio-level
      Description:   Audio Level
      Contact:       jonathan@vidyo.com
      Reference:     RFC XXXX

   Note to RFC Editor: please replace "RFC XXXX" with the number of this
   RFC.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2198]  Perkins, C., Kouvelas, I., Hodson, O., Hardman, V.,
              Handley, M., Bolot, J., Vega-Garcia, A., and S.
              Fosse-Parisis, "RTP Payload for Redundant Audio Data",
              RFC 2198, September 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
              Internet Protocol", RFC 4301, December 2005.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

8.2.  Informative References

   [I-D.ietf-avtcore-srtp-encrypted-header-ext]
              Lennox, J., "Encryption of Header Extensions in the Secure
              Real-Time Transport Protocol (SRTP)",
              draft-ietf-avtcore-srtp-encrypted-header-ext-01 (work in
              progress), October 2011.

   [I-D.ietf-avtcore-srtp-vbr-audio]
              Perkins, C. and J. Valin, "Guidelines for the use of
              Variable Bit Rate Audio with Secure RTP",
              draft-ietf-avtcore-srtp-vbr-audio-03 (work in progress),
              July 2011.

   [I-D.ietf-avtext-mixer-to-client-audio-level]
              Ivov, E., Marocco, E., and J.
              Lennox, "A Real-Time Transport Protocol (RTP) Header
              Extension for Mixer-to-Client Audio Level Indication",
              draft-ietf-avtext-mixer-to-client-audio-level-05 (work in
              progress), September 2011.

   [ITU.G711.1988]
              International Telecommunications Union, "Pulse Code
              Modulation (PCM) of Voice Frequencies",
              ITU-T Recommendation G.711, November 1988.

   [RFC3389]  Zopf, R., "Real-time Transport Protocol (RTP) Payload for
              Comfort Noise (CN)", RFC 3389, September 2002.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, March 2004.

Appendix A.  Changes From Earlier Versions

   Note to the RFC-Editor: please remove this section prior to
   publication as an RFC.

A.1.  Changes From Draft -05

   o  Added an informative reference to RFC 4301 (IPsec).  (Brought up
      by Stephen Farrell)
   o  Clarified the meaning of "overload point of the system".  (Brought
      up by Robert Sparks).
   o  Clarified that levels correspond only to the audio carried in the
      normal decoding process, not error or packet loss concealment.
      (Brought up by Robert Sparks).
   o  Added security consideration that false audio levels could be used
      to seize a speaker-selection algorithm (Brought up by Robert
      Sparks and Stewart Bryant).
   o  Updated reference to [I-D.ietf-avtcore-srtp-vbr-audio].

A.2.  Changes From Draft -04

   o  Adjusted IPR header.

A.3.  Changes From Draft -03

   o  Added vad extension attribute to negotiate use of the V bit.
   o  Addressed editorial comments made on the mailing list.

A.4.  Changes From Draft -02

   o  Changed encoding related text so that it would cover both the
      one-byte and the two-byte header formats.
   o  Clarified use of root mean square for dBov calculation
   o  Added references to the sample level calculator in
      [I-D.ietf-avtext-mixer-to-client-audio-level].
   o  Changed affiliation for Emil Ivov.
   o  Other minor editorial changes.

A.5.  Changes From Draft -01

   o  Changed the URI for declaring this header extension from
      "urn:ietf:params:rtp-hdrext:audio-level" to
      "urn:ietf:params:rtp-hdrext:ssrc-audio-level" for consistency with
      [I-D.ietf-avtext-mixer-to-client-audio-level].
   o  Removed the "Limitations" section; it was discussing a potential
      extension that consensus indicated was out of scope of this
      document.
   o  Closed the P.56 open issue.  It was agreed at IETF 80 that P.56 is
      mostly about speech levels and the levels transported by the
      extension defined here should also be able to serve as an
      indication for noise.
   o  Closed the open issue about transmitting noise floor information.
      Noise floor is (loosely) inferrable by observing the per-packet
      level information over a period of time, so the additional
      complexity seemed unnecessary.
   o  Editorial changes for consistency with
      [I-D.ietf-avtext-mixer-to-client-audio-level].
   o  Moved several descriptions of normative items that previously had
      only been described in informative sections of the text.
   o  Other editorial clarifications.

A.6.  Changes From Individual Submission Draft -01

   o  This version is primarily a document refresh.
   o  Emil Ivov and Enrico Marocco have been added as co-authors.
   o  Additional open issues listed.

A.7.  Changes From Individual Submission Draft -00

   o  The draft name has been changed to clarify that this document
      defines Client-To-Mixer Audio Levels, to more clearly distinguish
      it from [I-D.ietf-avtext-mixer-to-client-audio-level].
   o  The header extension format has been changed from a two-byte to a
      one-byte payload, eliminating the 7 reserved bits and the one
      must-be-zero bit.
   o  The sections Considerations on Use (Section 5) and Limitations
      have been added.
   o  It has been noted that senders MAY indicate -127 dBov for digital
      silence, and that level measurement MAY be done prior to encoding
      audio.
   o  A reference to [I-D.ietf-avtcore-srtp-encrypted-header-ext] has
      been added to the security considerations.
   o  The term "header extension" is now used consistently throughout
      the document (as opposed to "extension header").

Authors' Addresses

   Jonathan Lennox (editor)
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com


   Emil Ivov
   Jitsi
   Strasbourg  67000
   France

   Email: emcho@jitsi.org


   Enrico Marocco
   Telecom Italia
   Via G. Reiss Romoli, 274
   Turin  10148
   Italy

   Email: enrico.marocco@telecomitalia.it