idnits 2.17.1

draft-ietf-codec-oggopus-11.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC5334, but the
     abstract doesn't seem to mention this, which it should.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5334, updated by this document, for
     RFC5378 checks: 2007-12-03)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (January 28, 2016) is 3011 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 1453

  -- Looks like a reference, but probably isn't: '8' on line 1289

  == Missing Reference: 'RFCXXXX' is mentioned on line 1323, but not defined

  ** Downref: Normative reference to an Informational RFC: RFC 3533

  ** Downref: Normative reference to an Informational RFC: RFC 4732

  ** Obsolete normative reference: RFC 5226 (Obsoleted by RFC 8126)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'EBU-R128'

  -- Obsolete informational reference (is this intentional?): RFC 6982
     (Obsoleted by RFC 7942)

     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    codec                                                      T. Terriberry
3    Internet-Draft                                       Mozilla Corporation
4    Updates: 5334 (if approved)                                       R. Lee
5    Intended status: Standards Track                             Voicetronix
6    Expires: July 31, 2016                                          R. Giles
7                                                         Mozilla Corporation
8                                                            January 28, 2016

10                Ogg Encapsulation for the Opus Audio Codec
11                       draft-ietf-codec-oggopus-11

13   Abstract

15      This document defines the Ogg encapsulation for the Opus interactive
16      speech and audio codec. This allows data encoded in the Opus format
17      to be stored in an Ogg logical bitstream.

19   Status of This Memo

21      This Internet-Draft is submitted in full conformance with the
22      provisions of BCP 78 and BCP 79.

24      Internet-Drafts are working documents of the Internet Engineering
25      Task Force (IETF). Note that other groups may also distribute
26      working documents as Internet-Drafts. The list of current Internet-
27      Drafts is at http://datatracker.ietf.org/drafts/current/.

29      Internet-Drafts are draft documents valid for a maximum of six months
30      and may be updated, replaced, or obsoleted by other documents at any
31      time. It is inappropriate to use Internet-Drafts as reference
32      material or to cite them other than as "work in progress."

34      This Internet-Draft will expire on July 31, 2016.

36   Copyright Notice

38      Copyright (c) 2016 IETF Trust and the persons identified as the
39      document authors. All rights reserved.

41      This document is subject to BCP 78 and the IETF Trust's Legal
42      Provisions Relating to IETF Documents
43      (http://trustee.ietf.org/license-info) in effect on the date of
44      publication of this document. Please review these documents
45      carefully, as they describe your rights and restrictions with respect
46      to this document. Code Components extracted from this document must
47      include Simplified BSD License text as described in Section 4.e of
48      the Trust Legal Provisions and are provided without warranty as
49      described in the Simplified BSD License.

51   Table of Contents

53      1. Introduction
54      2. Terminology
55      3. Packet Organization
56      4. Granule Position
57        4.1. Repairing Gaps in Real-time Streams
58        4.2. Pre-skip
59        4.3. PCM Sample Position
60        4.4. End Trimming
61        4.5. Restrictions on the Initial Granule Position
62        4.6. Seeking and Pre-roll
63      5. Header Packets
64        5.1. Identification Header
65          5.1.1. Channel Mapping
66        5.2. Comment Header
67          5.2.1. Tag Definitions
68      6. Packet Size Limits
69      7. Encoder Guidelines
70        7.1. LPC Extrapolation
71        7.2. Continuous Chaining
72      8. Implementation Status
73      9. Security Considerations
74      10. Content Type
75      11. IANA Considerations
76      12. Acknowledgments
77      13. RFC Editor Notes
78      14. References
79        14.1. Normative References
80        14.2. Informative References
81        14.3. URIs
82      Authors' Addresses

84   1. Introduction

86      The IETF Opus codec is a low-latency audio codec optimized for both
87      voice and general-purpose audio. See [RFC6716] for technical
88      details. This document defines the encapsulation of Opus in a
89      continuous, logical Ogg bitstream [RFC3533]. Ogg encapsulation
90      provides Opus with a long-term storage format supporting all of the
91      essential features, including metadata, fast and accurate seeking,
92      corruption detection, recapture after errors, low overhead, and the
93      ability to multiplex Opus with other codecs (including video) with
94      minimal buffering. It also provides a live streamable format,
95      capable of delivery over a reliable stream-oriented transport,
96      without requiring all the data, or even the total length of the data,
97      up-front, in a form that is identical to the on-disk storage format.

99      Ogg bitstreams are made up of a series of 'pages', each of which
100     contains data from one or more 'packets'. Pages are the fundamental
101     unit of multiplexing in an Ogg stream. Each page is associated with
102     a particular logical stream and contains a capture pattern and
103     checksum, flags to mark the beginning and end of the logical stream,
104     and a 'granule position' that represents an absolute position in the
105     stream, to aid seeking. A single page can contain up to 65,025
106     octets of packet data from up to 255 different packets. Packets can
107     be split arbitrarily across pages, and continued from one page to the
108     next (allowing packets much larger than would fit on a single page).
109     Each page contains 'lacing values' that indicate how the data is
110     partitioned into packets, allowing a demultiplexer (demuxer) to
111     recover the packet boundaries without examining the encoded data. A
112     packet is said to 'complete' on a page when the page contains the
113     final lacing value corresponding to that packet.

115     This encapsulation defines the contents of the packet data, including
116     the necessary headers, the organization of those packets into a
117     logical stream, and the interpretation of the codec-specific granule
118     position field. It does not attempt to describe or specify the
119     existing Ogg container format. Readers unfamiliar with the basic
120     concepts mentioned above are encouraged to review the details in
121     [RFC3533].

123  2. Terminology

125     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
126     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
127     "OPTIONAL" in this document are to be interpreted as described in
128     [RFC2119].

130  3. Packet Organization

132     An Ogg Opus stream is organized as follows.

134     There are two mandatory header packets. The first packet in the
135     logical Ogg bitstream MUST contain the identification (ID) header,
136     which uniquely identifies a stream as Opus audio. The format of this
137     header is defined in Section 5.1. It is placed alone (without any
138     other packet data) on the first page of the logical Ogg bitstream,
139     and completes on that page. This page has its 'beginning of stream'
140     flag set.

142     The second packet in the logical Ogg bitstream MUST contain the
143     comment header, which contains user-supplied metadata. The format of
144     this header is defined in Section 5.2. It MAY span multiple pages,
145     beginning on the second page of the logical stream. However many
146     pages it spans, the comment header packet MUST finish the page on
147     which it completes.

149     All subsequent pages are audio data pages, and the Ogg packets they
150     contain are audio data packets. Each audio data packet contains one
151     Opus packet for each of N different streams, where N is typically one
152     for mono or stereo, but MAY be greater than one for multichannel
153     audio. The value N is specified in the ID header (see
154     Section 5.1.1), and is fixed over the entire length of the logical
155     Ogg bitstream.

157     The first (N - 1) Opus packets, if any, are packed one after another
158     into the Ogg packet, using the self-delimiting framing from
159     Appendix B of [RFC6716]. The remaining Opus packet is packed at the
160     end of the Ogg packet using the regular, undelimited framing from
161     Section 3 of [RFC6716]. All of the Opus packets in a single Ogg
162     packet MUST be constrained to have the same duration. An
163     implementation of this specification SHOULD treat any Opus packet
164     whose duration is different from that of the first Opus packet in an
165     Ogg packet as if it were a malformed Opus packet with an invalid
166     Table Of Contents (TOC) sequence.

168     The TOC sequence at the beginning of each Opus packet indicates the
169     coding mode, audio bandwidth, channel count, duration (frame size),
170     and number of frames per packet, as described in Section 3.1
171     of [RFC6716]. The coding mode is one of SILK, Hybrid, or Constrained
172     Energy Lapped Transform (CELT). The combination of coding mode,
173     audio bandwidth, and frame size is referred to as the configuration
174     of an Opus packet.

176     Packets are placed into Ogg pages in order until the end of stream.
177     Audio data packets might span page boundaries. The first audio data
178     page could have the 'continued packet' flag set (indicating the first
179     audio data packet is continued from a previous page) if, for example,
180     it was a live stream joined mid-broadcast, with the headers pasted on
181     the front. A demuxer SHOULD NOT attempt to decode the data for the
182     first packet on a page with the 'continued packet' flag set if the
183     previous page with packet data does not end in a continued packet
184     (i.e., did not end with a lacing value of 255) or if the page
185     sequence numbers are not consecutive, unless the demuxer has some
186     special knowledge that would allow it to interpret this data despite
187     the missing pieces. An implementation MUST treat a zero-octet audio
188     data packet as if it were a malformed Opus packet as described in
189     Section 3.4 of [RFC6716].

191     A logical stream ends with a page with the 'end of stream' flag set,
192     but implementations need to be prepared to deal with truncated
193     streams that do not have a page marked 'end of stream'. There is no
194     reason for the final packet on the last page to be a continued
195     packet, i.e., for the final lacing value to be 255. However,
196     demuxers might encounter such streams, possibly as the result of a
197     transfer that did not complete or of corruption. A demuxer SHOULD
198     NOT attempt to decode the data from a packet that continues onto a
199     subsequent page (i.e., when the page ends with a lacing value of 255)
200     if the next page with packet data does not have the 'continued
201     packet' flag set or does not exist, or if the page sequence numbers
202     are not consecutive, unless the demuxer has some special knowledge
203     that would allow it to interpret this data despite the missing
204     pieces. There MUST NOT be any more pages in an Opus logical
205     bitstream after a page marked 'end of stream'.

207  4. Granule Position

209     The granule position MUST be zero for the ID header page and the page
210     where the comment header completes. That is, the first page in the
211     logical stream, and the last header page before the first audio data
212     page both have a granule position of zero.

214     The granule position of an audio data page encodes the total number
215     of PCM samples in the stream up to and including the last fully-
216     decodable sample from the last packet completed on that page. The
217     granule position of the first audio data page will usually be larger
218     than zero, as described in Section 4.5.

220     A page that is entirely spanned by a single packet (that completes on
221     a subsequent page) has no granule position, and the granule position
222     field is set to the special value '-1' in two's complement.

224     The granule position of an audio data page is in units of PCM audio
225     samples at a fixed rate of 48 kHz (per channel; a stereo stream's
226     granule position does not increment at twice the speed of a mono
227     stream). It is possible to run an Opus decoder at other sampling
228     rates, but all of them evenly divide 48 kHz. Therefore, the value in
229     the granule position field always counts samples assuming a 48 kHz
230     decoding rate, and the rest of this specification makes the same
231     assumption.
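
Because granule positions count 48 kHz samples, a demuxer needs each packet's duration in those units. The following Python sketch is illustrative only and not part of the specification. It reads the duration from the TOC sequence of Section 3, using the configuration table and frame count codes of Section 3.1 of [RFC6716]. The function names are hypothetical, and the sketch assumes that inspecting the first Opus packet in the Ogg packet is sufficient, since all Opus packets in one Ogg packet are required to have the same duration.

```python
OPUS_DECODE_RATES = (8000, 12000, 16000, 24000, 48000)

def opus_packet_samples_48k(packet: bytes) -> int:
    """Return the duration of an Opus packet in samples at 48 kHz."""
    if len(packet) < 1:
        raise ValueError("zero-octet packet is malformed")
    toc = packet[0]
    config = toc >> 3                      # upper five bits of the TOC byte
    if config < 12:                        # SILK-only: 10, 20, 40, 60 ms
        frame = (480, 960, 1920, 2880)[config & 3]
    elif config < 16:                      # Hybrid: 10, 20 ms
        frame = (480, 960)[config & 1]
    else:                                  # CELT-only: 2.5, 5, 10, 20 ms
        frame = (120, 240, 480, 960)[config & 3]
    code = toc & 3                         # frame count code, lower two bits
    if code == 0:
        count = 1
    elif code in (1, 2):
        count = 2
    else:                                  # code 3: count is in the next byte
        if len(packet) < 2:
            raise ValueError("code 3 packet lacks a frame count byte")
        count = packet[1] & 0x3F
        if count == 0:
            raise ValueError("code 3 packet with zero frames")
    total = frame * count
    if total > 5760:                       # 120 ms at 48 kHz
        raise ValueError("packet longer than 120 ms")
    return total

def samples_at_rate(samples_48k: int, rate: int) -> int:
    """Convert a 48 kHz sample count to another decoder rate."""
    if rate not in OPUS_DECODE_RATES:
        raise ValueError("not a decoder rate that evenly divides 48 kHz")
    return samples_48k * rate // 48000
```

The conversion in samples_at_rate is exact for whole packets, because every packet duration is a multiple of 2.5 ms (120 samples at 48 kHz).
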

233     The duration of an Opus packet as defined in [RFC6716] can be any
234     multiple of 2.5 ms, up to a maximum of 120 ms. This duration is
235     encoded in the TOC sequence at the beginning of each packet. The
236     number of samples returned by a decoder corresponds to this duration
237     exactly, even for the first few packets. For example, a 20 ms packet
238     fed to a decoder running at 48 kHz will always return 960 samples. A
239     demuxer can parse the TOC sequence at the beginning of each Ogg
240     packet to work backwards or forwards from a packet with a known
241     granule position (i.e., the last packet completed on some page) in
242     order to assign granule positions to every packet, or even every
243     individual sample. The one exception is the last page in the stream,
244     as described below.

246     All other pages with completed packets after the first MUST have a
247     granule position equal to the number of samples contained in packets
248     that complete on that page plus the granule position of the most
249     recent page with completed packets. This guarantees that a demuxer
250     can assign individual packets the same granule position when working
251     forwards as when working backwards. For this to work, there cannot
252     be any gaps.

254  4.1. Repairing Gaps in Real-time Streams

256     In order to support capturing a real-time stream that has lost or not
257     transmitted packets, a multiplexer (muxer) SHOULD emit packets that
258     explicitly request the use of Packet Loss Concealment (PLC) in place
259     of the missing packets. Implementations that fail to do so still
260     MUST NOT increment the granule position for a page by anything other
261     than the number of samples contained in packets that actually
262     complete on that page.

264     Only gaps that are a multiple of 2.5 ms are repairable, as these are
265     the only durations that can be created by packet loss or
266     discontinuous transmission. Muxers need not handle other gap sizes.
267     Creating the necessary packets involves synthesizing a TOC byte
268     (defined in Section 3.1 of [RFC6716])--and whatever additional
269     internal framing is needed--to indicate the packet duration for each
270     stream. The actual length of each missing Opus frame inside the
271     packet is zero bytes, as defined in Section 3.2.1 of [RFC6716].

273     Zero-byte frames MAY be packed into packets using any of codes 0, 1,
274     2, or 3. When successive frames have the same configuration, the
275     higher code packings reduce overhead. Likewise, if the TOC
276     configuration matches, the muxer MAY further combine the empty frames
277     with previous or subsequent non-zero-length frames (using code 2 or
278     VBR code 3).

280     [RFC6716] does not impose any requirements on the PLC, but this
281     section outlines choices that are expected to have a positive
282     influence on most PLC implementations, including the reference
283     implementation. Synthesized TOC sequences SHOULD maintain the same
284     mode, audio bandwidth, channel count, and frame size as the previous
285     packet (if any). This is the simplest and usually the most well-
286     tested case for the PLC to handle and it covers all losses that do
287     not include a configuration switch, as defined in Section 4.5
288     of [RFC6716].

290     When a previous packet is available, keeping the audio bandwidth and
291     channel count the same allows the PLC to provide maximum continuity
292     in the concealment data it generates. However, if the size of the
293     gap is not a multiple of the most recent frame size, then the frame
294     size will have to change for at least some frames. Such changes
295     SHOULD be delayed as long as possible to simplify things for PLC
296     implementations.

298     As an example, a 95 ms gap could be encoded as nineteen 5 ms frames
299     in two bytes with a single CBR code 3 packet. If the previous frame
300     size was 20 ms, using four 20 ms frames followed by three 5 ms frames
301     requires 4 bytes (plus an extra byte of Ogg lacing overhead), but
302     allows the PLC to use its well-tested steady state behavior for as
303     long as possible. The total bitrate of the latter approach,
304     including Ogg overhead, is about 0.4 kbps, so the impact on file size
305     is minimal.

307     Changing modes is discouraged, since this causes some decoder
308     implementations to reset their PLC state. However, SILK and Hybrid
309     mode frames cannot fill gaps that are not a multiple of 10 ms. If
310     switching to CELT mode is needed to match the gap size, a muxer
311     SHOULD do so at the end of the gap to allow the PLC to function for
312     as long as possible.

314     In the example above, if the previous frame was a 20 ms SILK mode
315     frame, the better solution is to synthesize a packet describing four
316     20 ms SILK frames, followed by a packet with a single 10 ms SILK
317     frame, and finally a packet with a 5 ms CELT frame, to fill the 95 ms
318     gap. This also requires four bytes to describe the synthesized
319     packet data (two bytes for a CBR code 3 and one byte each for two
320     code 0 packets) but three bytes of Ogg lacing overhead are needed to
321     mark the packet boundaries. At 0.6 kbps, this is still a minimal
322     bitrate impact over a naive, low quality solution.

324     Since medium-band audio is an option only in the SILK mode, wideband
325     frames SHOULD be generated if switching from that configuration to
326     CELT mode, to ensure that any PLC implementation which does try to
327     migrate state between the modes will be able to preserve all of the
328     available audio bandwidth.

330  4.2. Pre-skip

332     There is some amount of latency introduced during the decoding
333     process, to allow for overlap in the CELT mode, stereo mixing in the
334     SILK mode, and resampling. The encoder might have introduced
335     additional latency through its own resampling and analysis (though
336     the exact amount is not specified). Therefore, the first few samples
337     produced by the decoder do not correspond to real input audio, but
338     are instead composed of padding inserted by the encoder to compensate
339     for this latency. These samples need to be stored and decoded, as
340     Opus is an asymptotically convergent predictive codec, meaning the
341     decoded contents of each frame depend on the recent history of
342     decoder inputs. However, a player will want to skip these samples
343     after decoding them.

345     A 'pre-skip' field in the ID header (see Section 5.1) signals the
346     number of samples that SHOULD be skipped (decoded but discarded) at
347     the beginning of the stream, though some specific applications might
348     have a reason for looking at that data. This amount need not be a
349     multiple of 2.5 ms, MAY be smaller than a single packet, or MAY span
350     the contents of several packets. These samples are not valid audio.

352     For example, if the first Opus frame uses the CELT mode, it will
353     always produce 120 samples of windowed overlap-add data. However,
354     the overlap data is initially all zeros (since there is no prior
355     frame), meaning this cannot, in general, accurately represent the
356     original audio. The SILK mode requires additional delay to account
357     for its analysis and resampling latency. The encoder delays the
358     original audio to avoid this problem.

360     The pre-skip field MAY also be used to perform sample-accurate
361     cropping of already encoded streams. In this case, a value of at
362     least 3840 samples (80 ms) provides sufficient history to the decoder
363     that it will have converged before the stream's output begins.

365  4.3. PCM Sample Position

367     The PCM sample position is determined from the granule position using
368     the formula

370        'PCM sample position' = 'granule position' - 'pre-skip' .
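
The formula above, together with the pre-skip discard of Section 4.2, can be sketched in Python as follows. This is a minimal illustration and not normative text. The function names are hypothetical, and decoded audio is assumed to arrive as a sequence of frames, each a list holding one entry per 48 kHz sample instant.

```python
from typing import Iterable, Iterator, List

def pcm_sample_position(granule_position: int, pre_skip: int) -> int:
    """'PCM sample position' = 'granule position' - 'pre-skip'."""
    return granule_position - pre_skip

def apply_pre_skip(frames: Iterable[List], pre_skip: int) -> Iterator[List]:
    """Yield decoded frames with the first pre_skip samples discarded.

    The pre-skip amount may be smaller than one frame or may span
    several frames, so the remaining count is carried across frames.
    """
    remaining = pre_skip
    for frame in frames:
        if remaining >= len(frame):
            remaining -= len(frame)        # whole frame is pre-skip
            continue
        yield frame[remaining:]            # drop only the leading part
        remaining = 0
```

Every frame is still decoded before it is discarded, which matches the requirement that the skipped samples be fed through the decoder so that its state converges.
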

372     For example, if the granule position of the first audio data page is
373     59,971, and the pre-skip is 11,971, then the PCM sample position of
374     the last decoded sample from that page is 48,000.

376     This can be converted into a playback time using the formula

378                          'PCM sample position'
379        'playback time' = --------------------- .
380                                 48000.0

382     The initial PCM sample position before any samples are played is
383     normally '0'. In this case, the PCM sample position of the first
384     audio sample to be played starts at '1', because it marks the time on
385     the clock _after_ that sample has been played, and a stream that is
386     exactly one second long has a final PCM sample position of '48000',
387     as in the example here.

389     Vorbis streams use a granule position smaller than the number of
390     audio samples contained in the first audio data page to indicate that
391     some of those samples are trimmed from the output (see
392     [vorbis-trim]). However, to do so, Vorbis requires that the first
393     audio data page contains exactly two packets, in order to allow the
394     decoder to perform PCM position adjustments before needing to return
395     any PCM data. Opus uses the pre-skip mechanism for this purpose
396     instead, since the encoder might introduce more than a single
397     packet's worth of latency, and since very large packets in streams
398     with a very large number of channels might not fit on a single page.

400  4.4. End Trimming

402     The page with the 'end of stream' flag set MAY have a granule
403     position that indicates the page contains less audio data than would
404     normally be returned by decoding up through the final packet. This
405     is used to end the stream somewhere other than an even frame
406     boundary. The granule position of the most recent audio data page
407     with completed packets is used to make this determination, or '0' is
408     used if there were no previous audio data pages with a completed
409     packet. The difference between these granule positions indicates how
410     many samples to keep after decoding the packets that completed on the
411     final page. The remaining samples are discarded. The number of
412     discarded samples SHOULD be no larger than the number decoded from
413     the last packet.

415  4.5. Restrictions on the Initial Granule Position

417     The granule position of the first audio data page with a completed
418     packet MAY be larger than the number of samples contained in packets
419     that complete on that page, however it MUST NOT be smaller, unless
420     that page has the 'end of stream' flag set. Allowing a granule
421     position larger than the number of samples allows the beginning of a
422     stream to be cropped or a live stream to be joined without rewriting
423     the granule position of all the remaining pages. This means that the
424     PCM sample position just before the first sample to be played MAY be
425     larger than '0'. Synchronization when multiplexing with other
426     logical streams still uses the PCM sample position relative to '0' to
427     compute sample times. This does not affect the behavior of pre-skip:
428     exactly 'pre-skip' samples SHOULD be skipped from the beginning of
429     the decoded output, even if the initial PCM sample position is
430     greater than zero.

432     On the other hand, a granule position that is smaller than the number
433     of decoded samples prevents a demuxer from working backwards to
434     assign each packet or each individual sample a valid granule
435     position, since granule positions are non-negative. An
436     implementation MUST treat any stream as invalid if the granule
437     position is smaller than the number of samples contained in packets
438     that complete on the first audio data page with a completed packet,
439     unless that page has the 'end of stream' flag set. It MAY defer this
440     action until it decodes the last packet completed on that page.
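
The checks in Sections 4.4 and 4.5 up to this point can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation. The function names are hypothetical, sample counts are at 48 kHz, and the first function covers only a first audio data page that does not carry the 'end of stream' flag. Clamping the discard count at zero when the final page claims more samples than were decoded is a choice made by this sketch.

```python
from typing import Tuple

def initial_granule_position(first_page_granule: int,
                             samples_on_page: int) -> int:
    """Work backwards from the first audio data page (no 'end of stream').

    samples_on_page is the total duration of the packets that complete
    on that page. The result is the granule position just before the
    first decoded sample, which may be larger than zero.
    """
    if first_page_granule < samples_on_page:
        raise ValueError("invalid stream: initial granule position "
                         "is smaller than the samples on the page")
    return first_page_granule - samples_on_page

def end_trim(final_granule: int, previous_granule: int,
             decoded_on_final_page: int) -> Tuple[int, int]:
    """Return (samples to keep, samples to discard) for the final page.

    previous_granule is the granule position of the most recent audio
    data page with completed packets, or 0 if there is none.
    """
    keep = final_granule - previous_granule
    if keep < 0:
        raise ValueError("final granule position moves backwards")
    discard = max(decoded_on_final_page - keep, 0)
    return min(keep, decoded_on_final_page), discard

def playback_time(pcm_sample_position: int) -> float:
    """Playback time in seconds for a PCM sample position."""
    return pcm_sample_position / 48000.0
```

For example, a final page that completes one 960-sample packet, with a granule position 460 samples past the previous page, keeps 460 samples and discards 500.
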
442 If that page has the 'end of stream' flag set, a demuxer MUST treat 443 any stream as invalid if its granule position is smaller than the 444 'pre-skip' amount. This would indicate that there are more samples 445 to be skipped from the initial decoded output than exist in the 446 stream. If the granule position is smaller than the number of 447 decoded samples produced by the packets that complete on that page, 448 then a demuxer MUST use an initial granule position of '0', and can 449 work forwards from '0' to timestamp individual packets. If the 450 granule position is larger than the number of decoded samples 451 available, then the demuxer MUST still work backwards as described 452 above, even if the 'end of stream' flag is set, to determine the 453 initial granule position, and thus the initial PCM sample position. 454 Both of these will be greater than '0' in this case. 456 4.6. Seeking and Pre-roll 458 Seeking in Ogg files is best performed using a bisection search for a 459 page whose granule position corresponds to a PCM position at or 460 before the seek target. With appropriately weighted bisection, 461 accurate seeking can be performed in just one or two bisections on 462 average, even in multi-gigabyte files. See [seeking] for an example 463 of general implementation guidance. 465 When seeking within an Ogg Opus stream, an implementation SHOULD 466 start decoding (and discarding the output) at least 3840 samples 467 (80 ms) prior to the seek target in order to ensure that the output 468 audio is correct by the time it reaches the seek target. This 'pre- 469 roll' is separate from, and unrelated to, the 'pre-skip' used at the 470 beginning of the stream. 
If the point 80 ms prior to the seek target 471 comes before the initial PCM sample position, an implementation 472 SHOULD start decoding from the beginning of the stream, applying pre- 473 skip as normal, regardless of whether the pre-skip is larger or 474 smaller than 80 ms, and then continue to discard samples to reach the 475 seek target (if any). 477 5. Header Packets 479 An Ogg Opus logical stream contains exactly two mandatory header 480 packets: an identification header and a comment header. 482 5.1. Identification Header 484 0 1 2 3 485 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 486 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 487 | 'O' | 'p' | 'u' | 's' | 488 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 489 | 'H' | 'e' | 'a' | 'd' | 490 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 491 | Version = 1 | Channel Count | Pre-skip | 492 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 493 | Input Sample Rate (Hz) | 494 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 495 | Output Gain (Q7.8 in dB) | Mapping Family| | 496 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : 497 | | 498 : Optional Channel Mapping Table... : 499 | | 500 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 502 Figure 1: ID Header Packet 504 The fields in the identification (ID) header have the following 505 meaning: 507 1. Magic Signature: 509 This is an 8-octet (64-bit) field that allows codec 510 identification and is human-readable. It contains, in order, the 511 magic numbers: 513 0x4F 'O' 515 0x70 'p' 517 0x75 'u' 519 0x73 's' 521 0x48 'H' 523 0x65 'e' 524 0x61 'a' 526 0x64 'd' 528 Starting with "Op" helps distinguish it from audio data packets, 529 as this is an invalid TOC sequence. 531 2. Version (8 bits, unsigned): 533 The version number MUST always be '1' for this version of the 534 encapsulation specification. 
Implementations SHOULD treat 535 streams where the upper four bits of the version number match 536 that of a recognized specification as backwards-compatible with 537 that specification. That is, the version number can be split 538 into "major" and "minor" version sub-fields, with changes to the 539 "minor" sub-field (in the lower four bits) signaling compatible 540 changes. For example, an implementation of this specification 541 SHOULD accept any stream with a version number of '15' or less, 542 and SHOULD assume any stream with a version number '16' or 543 greater is incompatible. The initial version '1' was chosen to 544 keep implementations from relying on this octet as a null 545 terminator for the "OpusHead" string. 547 3. Output Channel Count 'C' (8 bits, unsigned): 549 This is the number of output channels. This might be different 550 than the number of encoded channels, which can change on a 551 packet-by-packet basis. This value MUST NOT be zero. The 552 maximum allowable value depends on the channel mapping family, 553 and might be as large as 255. See Section 5.1.1 for details. 555 4. Pre-skip (16 bits, unsigned, little endian): 557 This is the number of samples (at 48 kHz) to discard from the 558 decoder output when starting playback, and also the number to 559 subtract from a page's granule position to calculate its PCM 560 sample position. When cropping the beginning of existing Ogg 561 Opus streams, a pre-skip of at least 3,840 samples (80 ms) is 562 RECOMMENDED to ensure complete convergence in the decoder. 564 5. Input Sample Rate (32 bits, unsigned, little endian): 566 This is the sample rate of the original input (before encoding), 567 in Hz. This field is _not_ the sample rate to use for playback 568 of the encoded data. 570 Opus can switch between internal audio bandwidths of 4, 6, 8, 12, 571 and 20 kHz. Each packet in the stream can have a different audio 572 bandwidth. 
       Regardless of the audio bandwidth, the reference decoder supports
       decoding any stream at a sample rate of 8, 12, 16, 24, or 48 kHz.
       The original sample rate of the audio passed to the encoder is
       not preserved by the lossy compression.

       An Ogg Opus player SHOULD select the playback sample rate
       according to the following procedure:

       1.  If the hardware supports 48 kHz playback, decode at 48 kHz.

       2.  Otherwise, if the hardware's highest available sample rate is
           a supported rate, decode at this sample rate.

       3.  Otherwise, if the hardware's highest available sample rate is
           less than 48 kHz, decode at the next higher Opus supported
           rate above the highest available hardware rate and resample.

       4.  Otherwise, decode at 48 kHz and resample.

       However, the 'Input Sample Rate' field allows the muxer to pass
       the sample rate of the original input stream as metadata. This
       is useful when the user requires the output sample rate to match
       the input sample rate. For example, when not playing the output,
       an implementation writing PCM format samples to disk might choose
       to resample the audio back to the original input sample rate to
       reduce surprise to the user, who might reasonably expect to get
       back a file with the same sample rate.

       A value of zero indicates 'unspecified'. Muxers SHOULD write the
       actual input sample rate or zero, but implementations which do
       something with this field SHOULD take care to behave sanely if
       given crazy values (e.g., do not actually upsample the output to
       10 MHz if requested). Implementations SHOULD support input
       sample rates between 8 kHz and 192 kHz (inclusive). Rates
       outside this range MAY be ignored by falling back to the default
       rate of 48 kHz instead.

   6.  Output Gain (16 bits, signed, little endian):

       This is a gain to be applied when decoding.
       It is 20*log10 of the factor by which to scale the decoder output
       to achieve the desired playback volume, stored in a 16-bit,
       signed, two's complement fixed-point value with 8 fractional bits
       (i.e., Q7.8).

       To apply the gain, an implementation could use

          sample *= pow(10, output_gain/(20.0*256)) ,

       where output_gain is the raw 16-bit value from the header.

       Players and media frameworks SHOULD apply it by default. If a
       player chooses to apply any volume adjustment or gain
       modification, such as the R128_TRACK_GAIN (see Section 5.2), the
       adjustment MUST be applied in addition to this output gain in
       order to achieve playback at the normalized volume.

       A muxer SHOULD set this field to zero, and instead apply any gain
       prior to encoding, when this is possible and does not conflict
       with the user's wishes. A nonzero output gain indicates the gain
       was adjusted after encoding, or that a user wished to adjust the
       gain for playback while preserving the ability to recover the
       original signal amplitude.

       Although the output gain has enormous range (+/- 128 dB, enough
       to amplify inaudible sounds to the threshold of physical pain),
       most applications can only reasonably use a small portion of this
       range around zero. The large range serves in part to ensure that
       gain can always be losslessly transferred between OpusHead and
       R128 gain tags (see below) without saturating.

   7.  Channel Mapping Family (8 bits, unsigned):

       This octet indicates the order and semantic meaning of the output
       channels.

       Each currently specified value of this octet indicates a mapping
       family, which defines a set of allowed channel counts, and the
       ordered set of channel names for each allowed channel count. The
       details are described in Section 5.1.1.

   8.  Channel Mapping Table: This table defines the mapping from
       encoded streams to output channels. Its contents are specified
       in Section 5.1.1.
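
   The fixed-size fields listed above occupy the first 19 octets of the
   packet, so they can be read with a handful of bounds-checked
   little-endian loads. The following is a minimal, non-normative sketch
   in C. The type opus_head_fixed and the functions read_le16(),
   read_le32(), and parse_opus_head_fixed() are hypothetical names
   introduced only for illustration; the channel mapping table is
   deliberately not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fixed 19-octet part of the ID header. */
typedef struct {
    uint8_t  version;
    uint8_t  channel_count;      /* 'C', never zero */
    uint16_t pre_skip;           /* samples at 48 kHz */
    uint32_t input_sample_rate;  /* Hz; 0 means unspecified */
    int16_t  output_gain;        /* Q7.8, in dB */
    uint8_t  mapping_family;
} opus_head_fixed;

static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the packet is rejected. */
int parse_opus_head_fixed(const uint8_t *data, size_t len,
                          opus_head_fixed *out)
{
    uint16_t raw_gain;

    assert(data != NULL && out != NULL);
    if (len < 19 || memcmp(data, "OpusHead", 8) != 0)
        return -1;                 /* too short, or wrong magic */
    if ((data[8] & 0xF0) != 0)
        return -1;                 /* version 16 or greater */
    if (data[9] == 0)
        return -1;                 /* channel count MUST NOT be zero */

    out->version = data[8];
    out->channel_count = data[9];
    out->pre_skip = read_le16(data + 10);
    out->input_sample_rate = read_le32(data + 12);
    raw_gain = read_le16(data + 16);
    /* Reinterpret the raw value as two's complement without relying
       on an implementation-defined narrowing conversion. */
    out->output_gain = (int16_t)(raw_gain >= 0x8000
                                     ? (int)raw_gain - 0x10000
                                     : (int)raw_gain);
    out->mapping_family = data[18];
    return 0;
}
```

   A caller would typically run such a check on the first packet of the
   logical stream and only then look at the mapping family to decide
   whether a channel mapping table follows.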

   All fields in the ID header are REQUIRED, except for the channel
   mapping table, which MUST be omitted when the channel mapping family
   is 0, but is REQUIRED otherwise. Implementations SHOULD treat a
   stream as invalid if it contains an ID header that does not have
   enough data for these fields, even if it contains a valid Magic
   Signature. Future versions of this specification, even backwards-
   compatible versions, might include additional fields in the ID
   header. If an ID header has a compatible major version, but a larger
   minor version, an implementation MUST NOT treat it as invalid for
   containing additional data not specified here, provided it still
   completes on the first page.

5.1.1.  Channel Mapping

   An Ogg Opus stream allows mapping one number of Opus streams (N) to a
   possibly larger number of decoded channels (M + N) to yet another
   number of output channels (C), which might be larger or smaller than
   the number of decoded channels. The order and meaning of these
   channels are defined by a channel mapping, which consists of the
   'channel mapping family' octet and, for channel mapping families
   other than family 0, a channel mapping table, as illustrated in
   Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                                   +-+-+-+-+-+-+-+-+
                                                   | Stream Count  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Coupled Count |              Channel Mapping...               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 2: Channel Mapping Table

   The fields in the channel mapping table have the following meaning:

   1.  Stream Count 'N' (8 bits, unsigned):

       This is the total number of streams encoded in each Ogg packet.
       This value is necessary to correctly parse the packed Opus
       packets inside an Ogg packet, as described in Section 3.
       This value MUST NOT be zero, as without at least one Opus packet
       with a valid TOC sequence, a demuxer cannot recover the duration
       of an Ogg packet.

       For channel mapping family 0, this value defaults to 1, and is
       not coded.

   2.  Coupled Stream Count 'M' (8 bits, unsigned): This is the number
       of streams whose decoders are to be configured to produce two
       channels (stereo). This MUST be no larger than the total number
       of streams, N.

       Each packet in an Opus stream has an internal channel count of 1
       or 2, which can change from packet to packet. This is selected
       by the encoder depending on the bitrate and the audio being
       encoded. The original channel count of the audio passed to the
       encoder is not necessarily preserved by the lossy compression.

       Regardless of the internal channel count, any Opus stream can be
       decoded as mono (a single channel) or stereo (two channels) by
       appropriate initialization of the decoder. The 'coupled stream
       count' field indicates that the decoders for the first M Opus
       streams are to be initialized for stereo (two-channel) output,
       and the remaining (N - M) decoders are to be initialized for mono
       (a single channel) only. The total number of decoded channels,
       (M + N), MUST be no larger than 255, as there is no way to index
       more channels than that in the channel mapping.

       For channel mapping family 0, this value defaults to (C - 1)
       (i.e., 0 for mono and 1 for stereo), and is not coded.

   3.  Channel Mapping (8*C bits): This contains one octet per output
       channel, indicating which decoded channel is to be used for each
       one. Let 'index' be the value of this octet for a particular
       output channel. This value MUST either be smaller than (M + N),
       or be the special value 255.
       If 'index' is less than 2*M, the output MUST be taken from
       decoding stream ('index'/2) as stereo and selecting the left
       channel if 'index' is even, and the right channel if 'index' is
       odd. If 'index' is 2*M or larger, but less than 255, the output
       MUST be taken from decoding stream ('index' - M) as mono. If
       'index' is 255, the corresponding output channel MUST contain
       pure silence.

       The number of output channels, C, is not constrained to match the
       number of decoded channels (M + N). A single index value MAY
       appear multiple times, i.e., the same decoded channel might be
       mapped to multiple output channels. Some decoded channels might
       not be assigned to any output channel, as well.

       For channel mapping family 0, the first index defaults to 0, and
       if C == 2, the second index defaults to 1. Neither index is
       coded.

   After producing the output channels, the channel mapping family
   determines the semantic meaning of each one. There are three defined
   mapping families in this specification.

5.1.1.1.  Channel Mapping Family 0

   Allowed numbers of channels: 1 or 2. RTP mapping. This is the same
   channel interpretation as [RFC7587].

   o  1 channel: monophonic (mono).

   o  2 channels: stereo (left, right).

   Special mapping: This channel mapping value also indicates that the
   contents consist of a single Opus stream that is stereo if and only
   if C == 2, with stream index 0 mapped to output channel 0 (mono, or
   left channel) and stream index 1 mapped to output channel 1 (right
   channel) if stereo. When the 'channel mapping family' octet has this
   value, the channel mapping table MUST be omitted from the ID header
   packet.

5.1.1.2.  Channel Mapping Family 1

   Allowed numbers of channels: 1...8. Vorbis channel order (see
   below).

   Each channel is assigned to a speaker location in a conventional
   surround arrangement.
   Specific locations depend on the number of channels, and are given
   below in order of the corresponding channel indices.

   o  1 channel: monophonic (mono).

   o  2 channels: stereo (left, right).

   o  3 channels: linear surround (left, center, right).

   o  4 channels: quadraphonic (front left, front right, rear left,
      rear right).

   o  5 channels: 5.0 surround (front left, front center, front right,
      rear left, rear right).

   o  6 channels: 5.1 surround (front left, front center, front right,
      rear left, rear right, LFE).

   o  7 channels: 6.1 surround (front left, front center, front right,
      side left, side right, rear center, LFE).

   o  8 channels: 7.1 surround (front left, front center, front right,
      side left, side right, rear left, rear right, LFE).

   This set of surround options and speaker location orderings is the
   same as those used by the Vorbis codec [vorbis-mapping]. The
   ordering is different from the one used by the WAVE
   [wave-multichannel] and Free Lossless Audio Codec (FLAC) [flac]
   formats, so correct ordering requires permutation of the output
   channels when decoding to or encoding from those formats. 'LFE' here
   refers to a Low Frequency Effects channel, often mapped to a
   subwoofer with no particular spatial position. Implementations
   SHOULD identify 'side' or 'rear' speaker locations with 'surround'
   and 'back' as appropriate when interfacing with audio formats or
   systems which prefer that terminology.

5.1.1.3.  Channel Mapping Family 255

   Allowed numbers of channels: 1...255. No defined channel meaning.

   Channels are unidentified. General-purpose players SHOULD NOT
   attempt to play these streams. Offline implementations MAY
   deinterleave the output into separate PCM files, one per channel.
   Implementations SHOULD NOT produce output for channels mapped to
   stream index 255 (pure silence) unless they have no other way to
   indicate the index of non-silent channels.

5.1.1.4.  Undefined Channel Mappings

   The remaining channel mapping families (2...254) are reserved. A
   demuxer implementation encountering a reserved channel mapping family
   value SHOULD act as though the value is 255.

5.1.1.5.  Downmixing

   An Ogg Opus player MUST support any valid channel mapping with a
   channel mapping family of 0 or 1, even if the number of channels does
   not match the physically connected audio hardware. Players SHOULD
   perform channel mixing to increase or reduce the number of channels
   as needed.

   Implementations MAY use the following matrices to implement
   downmixing from multichannel files using Channel Mapping Family 1
   (Section 5.1.1.2), which are known to give acceptable results for
   stereo. Matrices for 3 and 4 channels are normalized so each
   coefficient row sums to 1 to avoid clipping. For 5 or more channels
   they are normalized to 2 as a compromise between clipping and dynamic
   range reduction.

   In these matrices the front left and front right channels are
   generally passed through directly. When a surround channel is split
   between both the left and right stereo channels, coefficients are
   chosen so their squares sum to 1, which helps preserve the perceived
   intensity. Rear channels are mixed more diffusely or attenuated to
   maintain focus on the front channels.

   L output = ( 0.585786 * left + 0.414214 * center                    )
   R output = (                   0.414214 * center + 0.585786 * right )

   Exact coefficient values are 1 and 1/sqrt(2), multiplied by
   1/(1 + 1/sqrt(2)) for normalization.

      Figure 3: Stereo downmix matrix for the linear surround channel
                                  mapping

   /          \   /                                     \ / FL \
   | L output |   | 0.422650 0.000000 0.366025 0.211325 | | FR |
   | R output | = | 0.000000 0.422650 0.211325 0.366025 | | RL |
   \          /   \                                     / \ RR /

   Exact coefficient values are 1, sqrt(3)/2 and 1/2, multiplied by
   1/(1 + sqrt(3)/2 + 1/2) for normalization.

   Figure 4: Stereo downmix matrix for the quadraphonic channel mapping

                                                            / FL \
   /   \   /                                              \ | FC |
   | L |   | 0.650802 0.460186 0.000000 0.563611 0.325401 | | FR |
   | R | = | 0.000000 0.460186 0.650802 0.325401 0.563611 | | RL |
   \   /   \                                              / | RR |
                                                            \    /

   Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
   multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2) for normalization.

       Figure 5: Stereo downmix matrix for the 5.0 surround mapping

                                                                   /FL \
   / \   /                                                       \ |FC |
   |L|   | 0.529067 0.374107 0.000000 0.458186 0.264534 0.374107 | |FR |
   |R| = | 0.000000 0.374107 0.529067 0.264534 0.458186 0.374107 | |RL |
   \ /   \                                                       / |RR |
                                                                   \LFE/

   Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
   multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + 1/sqrt(2)) for
   normalization.

       Figure 6: Stereo downmix matrix for the 5.1 surround mapping

   /                                                                \
   | 0.455310 0.321953 0.000000 0.394310 0.227655 0.278819 0.321953 |
   | 0.000000 0.321953 0.455310 0.227655 0.394310 0.278819 0.321953 |
   \                                                                /

   Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2, 1/2 and
   sqrt(3)/2/sqrt(2), multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 +
   sqrt(3)/2/sqrt(2) + 1/sqrt(2)) for normalization. The coefficients
   are in the same order as in Section 5.1.1.2, and the matrices above.

       Figure 7: Stereo downmix matrix for the 6.1 surround mapping

   /                                                                 \
   | .388631 .274804 .000000 .336565 .194316 .336565 .194316 .274804 |
   | .000000 .274804 .388631 .194316 .336565 .194316 .336565 .274804 |
   \                                                                 /

   Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
   multiplied by 2/(2 + 2/sqrt(2) + sqrt(3)) for normalization. The
   coefficients are in the same order as in Section 5.1.1.2, and the
   matrices above.

       Figure 8: Stereo downmix matrix for the 7.1 surround mapping

5.2.  Comment Header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      'O'      |      'p'      |      'u'      |      's'      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      'T'      |      'a'      |      'g'      |      's'      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Vendor String Length                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   :                       Vendor String...                        :
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   User Comment List Length                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 User Comment #0 String Length                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   :                   User Comment #0 String...                   :
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                 User Comment #1 String Length                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :

                     Figure 9: Comment Header Packet

   The comment header consists of a 64-bit magic signature, followed by
   data in the same format as the [vorbis-comment] header used in Ogg
   Vorbis, except (like Ogg Theora and Speex) the final "framing bit"
   specified in the Vorbis spec is not present.

   1.  Magic Signature:

       This is an 8-octet (64-bit) field that allows codec
       identification and is human-readable.
       It contains, in order, the magic numbers:

          0x4F 'O'

          0x70 'p'

          0x75 'u'

          0x73 's'

          0x54 'T'

          0x61 'a'

          0x67 'g'

          0x73 's'

       Starting with "Op" helps distinguish it from audio data packets,
       as this is an invalid TOC sequence.

   2.  Vendor String Length (32 bits, unsigned, little endian):

       This field gives the length of the following vendor string, in
       octets. It MUST NOT indicate that the vendor string is longer
       than the rest of the packet.

   3.  Vendor String (variable length, UTF-8 vector):

       This is a simple human-readable tag for vendor information,
       encoded as a UTF-8 string [RFC3629]. No terminating null octet
       is necessary.

       This tag is intended to identify the codec encoder and
       encapsulation implementations, for tracing differences in
       technical behavior. User-facing applications can use the
       'ENCODER' user comment tag to identify themselves.

   4.  User Comment List Length (32 bits, unsigned, little endian):

       This field indicates the number of user-supplied comments. It
       MAY indicate there are zero user-supplied comments, in which case
       there are no additional fields in the packet. It MUST NOT
       indicate that there are so many comments that the comment string
       lengths would require more data than is available in the rest of
       the packet.

   5.  User Comment #i String Length (32 bits, unsigned, little endian):

       This field gives the length of the following user comment string,
       in octets. There is one for each user comment indicated by the
       'user comment list length' field. It MUST NOT indicate that the
       string is longer than the rest of the packet.

   6.  User Comment #i String (variable length, UTF-8 vector):

       This field contains a single user comment string. There is one
       for each user comment indicated by the 'user comment list length'
       field.
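
   Each length field above is bounded by the data remaining in the
   packet, so a demuxer can check the whole layout before it allocates
   anything. The following is a minimal, non-normative sketch of such a
   walk in C. The names read_le32() and walk_opus_tags() are
   hypothetical, and the sketch only validates lengths; it does not
   copy or interpret the strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walks the length fields of a comment header without allocating.
   Returns 0 and stores the comment count on success, -1 otherwise. */
int walk_opus_tags(const uint8_t *data, size_t len,
                   uint32_t *comment_count)
{
    size_t pos = 8;
    uint32_t vendor_len, count, i;

    assert(data != NULL && comment_count != NULL);
    /* Magic (8) + vendor length (4) + comment list length (4). */
    if (len < 16 || memcmp(data, "OpusTags", 8) != 0)
        return -1;

    vendor_len = read_le32(data + pos);
    pos += 4;
    if (vendor_len > len - pos)
        return -1;                 /* vendor string overruns packet */
    pos += vendor_len;

    if (len - pos < 4)
        return -1;                 /* no room for the list length */
    count = read_le32(data + pos);
    pos += 4;
    /* Every comment needs at least its own 4-octet length field. */
    if (count > (len - pos) / 4)
        return -1;

    for (i = 0; i < count; i++) {
        uint32_t comment_len;

        if (len - pos < 4)
            return -1;
        comment_len = read_le32(data + pos);
        pos += 4;
        if (comment_len > len - pos)
            return -1;             /* comment overruns packet */
        pos += comment_len;
    }
    *comment_count = count;
    return 0;
}
```

   Comparing each claimed length against (len - pos), rather than adding
   it to pos first, keeps the arithmetic from wrapping around on
   hostile input.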

   The vendor string length and user comment list length are REQUIRED,
   and implementations SHOULD treat a stream as invalid if it contains a
   comment header that does not have enough data for these fields, or
   that does not contain enough data for the corresponding vendor string
   or user comments they describe. Making this check before allocating
   the associated memory to contain the data helps prevent a possible
   Denial-of-Service (DoS) attack from small comment headers that claim
   to contain strings longer than the entire packet or more user
   comments than could possibly fit in the packet.

   Immediately following the user comment list, the comment header MAY
   contain zero-padding or other binary data which is not specified
   here. If the least-significant bit of the first byte of this data is
   1, then editors SHOULD preserve the contents of this data when
   updating the tags, but if this bit is 0, all such data MAY be treated
   as padding, and truncated or discarded as desired. This allows
   informal experimentation with the format of this binary data until it
   can be specified later.

   The comment header can be arbitrarily large and might be spread over
   a large number of Ogg pages. Implementations MUST avoid attempting
   to allocate excessive amounts of memory when presented with a very
   large comment header. To accomplish this, implementations MAY treat
   a stream as invalid if it has a comment header larger than
   125,829,120 octets, and MAY ignore individual comments that are not
   fully contained within the first 61,440 octets of the comment header.

5.2.1.  Tag Definitions

   The user comment strings follow the NAME=value format described by
   [vorbis-comment] with the same recommended tag names: ARTIST, TITLE,
   DATE, ALBUM, and so on.
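
   As a small illustration of the NAME=value convention, the sketch
   below looks up the value of a given tag in one length-delimited
   comment string. It is non-normative: the function name
   comment_value() is hypothetical, and the case-insensitive comparison
   of the field name reflects the rule in [vorbis-comment] rather than
   anything defined in this document.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the value part of "NAME=value" if the comment
   carries the requested name, or NULL otherwise. The comment is not
   NUL-terminated, so the value length is returned separately. */
const char *comment_value(const char *comment, size_t len,
                          const char *name, size_t *value_len)
{
    size_t name_len, i;

    assert(comment != NULL && name != NULL && value_len != NULL);
    name_len = strlen(name);
    if (len <= name_len || comment[name_len] != '=')
        return NULL;               /* too short, or no '=' there */
    for (i = 0; i < name_len; i++) {
        if (toupper((unsigned char)comment[i]) !=
            toupper((unsigned char)name[i]))
            return NULL;           /* field name differs */
    }
    *value_len = len - name_len - 1;
    return comment + name_len + 1;
}
```

   The returned pointer aliases the original packet data, so the value
   stays valid only as long as the packet buffer does.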

   Two new comment tags are introduced here:

   First, an optional gain for track normalization:

      R128_TRACK_GAIN=-573

   representing the volume shift needed to normalize the track's volume
   during isolated playback, in random shuffle, and so on. The gain is
   a Q7.8 fixed point number in dB, as in the ID header's 'output gain'
   field. This tag is similar to the REPLAYGAIN_TRACK_GAIN tag in
   Vorbis [replay-gain], except that the normal volume reference is the
   [EBU-R128] standard.

   Second, an optional gain for album normalization:

      R128_ALBUM_GAIN=111

   representing the volume shift needed to normalize the overall volume
   when played as part of a particular collection of tracks. The gain
   is also a Q7.8 fixed point number in dB, as in the ID header's
   'output gain' field.

   An Ogg Opus stream MUST NOT have more than one of each of these tags,
   and if present their values MUST be an integer from -32768 to 32767,
   inclusive, represented in ASCII as a base 10 number with no
   whitespace. A leading '+' or '-' character is valid. Leading zeros
   are also permitted, but the value MUST be represented by no more than
   6 characters. Other non-digit characters MUST NOT be present.

   If present, R128_TRACK_GAIN and R128_ALBUM_GAIN MUST correctly
   represent the R128 normalization gain relative to the 'output gain'
   field specified in the ID header. If a player chooses to make use of
   the R128_TRACK_GAIN tag or the R128_ALBUM_GAIN tag, it MUST apply
   those gains _in addition_ to the 'output gain' value. If a tool
   modifies the ID header's 'output gain' field, it MUST also update or
   remove the R128_TRACK_GAIN and R128_ALBUM_GAIN comment tags if
   present. A muxer SHOULD place the gain it wants other tools to use
   by default into the 'output gain' field, and not the comment tag.
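
   The syntax rules for these tag values are strict enough that a
   hand-written check is often simpler than a general-purpose number
   parser. The following non-normative sketch in C validates one tag
   value and combines it with the ID header's output gain. The function
   names parse_r128_gain() and total_gain_q8() are hypothetical. The
   combined result is still a Q7.8 value in dB, so it can be converted
   to a linear scale factor in the same way as the output gain in
   Section 5.1.

```c
#include <assert.h>
#include <stddef.h>

/* Validates the value part of an R128_*_GAIN tag (not NUL-terminated)
   and stores it as a Q7.8 integer. Returns 0 on success, -1 if the
   value breaks any of the syntax rules. */
int parse_r128_gain(const char *value, size_t len, int *gain_q8)
{
    size_t i = 0;
    long magnitude = 0;
    int negative = 0;

    assert(value != NULL && gain_q8 != NULL);
    if (len == 0 || len > 6)
        return -1;                 /* empty, or more than 6 characters */
    if (value[0] == '+' || value[0] == '-') {
        negative = (value[0] == '-');
        i = 1;
    }
    if (i == len)
        return -1;                 /* a sign with no digits */
    for (; i < len; i++) {
        if (value[i] < '0' || value[i] > '9')
            return -1;             /* whitespace or other non-digit */
        magnitude = magnitude * 10 + (value[i] - '0');
    }
    if (negative)
        magnitude = -magnitude;
    if (magnitude < -32768 || magnitude > 32767)
        return -1;
    *gain_q8 = (int)magnitude;
    return 0;
}

/* The tag gain is applied in addition to the header's output gain;
   both are Q7.8 values in dB, so they simply add. */
int total_gain_q8(int output_gain_q8, int r128_gain_q8)
{
    return output_gain_q8 + r128_gain_q8;
}
```

   Because the input is capped at 6 characters, the accumulated
   magnitude cannot exceed 999999 and so cannot overflow a long.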

   To avoid confusion with multiple normalization schemes, an Opus
   comment header SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN,
   REPLAYGAIN_TRACK_PEAK, REPLAYGAIN_ALBUM_GAIN, or
   REPLAYGAIN_ALBUM_PEAK tags, unless they are only to be used in some
   context where there is guaranteed to be no such confusion.
   [EBU-R128] normalization is preferred to the earlier REPLAYGAIN
   schemes because of its clear definition and adoption by industry.
   Peak normalizations are difficult to calculate reliably for lossy
   codecs because of variation in excursion heights due to decoder
   differences. In the authors' investigations they were not applied
   consistently or broadly enough to merit inclusion here.

6.  Packet Size Limits

   Technically, valid Opus packets can be arbitrarily large due to the
   padding format, although the amount of non-padding data they can
   contain is bounded. These packets might be spread over a similarly
   enormous number of Ogg pages. When encoding, implementations SHOULD
   limit the use of padding in audio data packets to no more than is
   necessary to make a variable bitrate (VBR) stream constant bitrate
   (CBR), unless they have no reasonable way to determine what is
   necessary. Demuxers SHOULD treat audio data packets as invalid
   (treat them as if they were malformed Opus packets with an invalid
   TOC sequence) if they are larger than 61,440 octets per Opus stream,
   unless they have a specific reason for allowing extra padding. Such
   packets necessarily contain more padding than needed to make a stream
   CBR. Demuxers MUST avoid attempting to allocate excessive amounts of
   memory when presented with a very large packet. Demuxers MAY treat
   audio data packets as invalid or partially process them if they are
   larger than 61,440 octets in an Ogg Opus stream with channel mapping
   families 0 or 1.
   Demuxers MAY treat audio data packets as invalid or partially process
   them in any Ogg Opus stream if the packet is larger than 61,440
   octets and also larger than 7,680 octets per Opus stream. The
   presence of an extremely large packet in the stream could indicate a
   memory exhaustion attack or stream corruption.

   In an Ogg Opus stream, the largest possible valid packet that does
   not use padding has a size of (61,298*N - 2) octets. With
   255 streams, this is 15,630,988 octets and can span up to 61,298 Ogg
   pages, all but one of which will have a granule position of -1. This
   is of course a very extreme packet, consisting of 255 streams, each
   containing 120 ms of audio encoded as 2.5 ms frames, each frame using
   the maximum possible number of octets (1275) and stored in the least
   efficient manner allowed (a VBR code 3 Opus packet). Even in such a
   packet, most of the data will be zeros as 2.5 ms frames cannot
   actually use all 1275 octets.

   The largest packet consisting of entirely useful data is
   (15,326*N - 2) octets. This corresponds to 120 ms of audio encoded
   as 10 ms frames in either SILK or Hybrid mode, but at a data rate of
   over 1 Mbps, which makes little sense for the quality achieved.

   A more reasonable limit is (7,664*N - 2) octets. This corresponds to
   120 ms of audio encoded as 20 ms stereo CELT mode frames, with a
   total bitrate just under 511 kbps (not counting the Ogg encapsulation
   overhead). For channel mapping family 1, N=8 provides a reasonable
   upper bound, as it allows for each of the 8 possible output channels
   to be decoded from a separate stereo Opus stream. This gives a size
   of 61,310 octets, which is rounded up to a multiple of 1,024 octets
   to yield the audio data packet size of 61,440 octets that any
   implementation is expected to be able to process successfully.

7.  Encoder Guidelines

   When encoding Opus streams, Ogg muxers SHOULD take into account the
   algorithmic delay of the Opus encoder.

   In encoders derived from the reference implementation [RFC6716], the
   number of samples can be queried with:

      opus_encoder_ctl(encoder_state, OPUS_GET_LOOKAHEAD(&delay_samples));

   To achieve good quality in the very first samples of a stream,
   implementations MAY use linear predictive coding (LPC) extrapolation
   to generate at least 120 extra samples at the beginning to avoid the
   Opus encoder having to encode a discontinuous signal. For more
   information on linear prediction, see [linear-prediction]. For an
   input file containing 'length' samples, the implementation SHOULD set
   the pre-skip header value to (delay_samples + extra_samples), encode
   at least (length + delay_samples + extra_samples) samples, and set
   the granule position of the last page to
   (length + delay_samples + extra_samples). This ensures that the
   encoded file has the same duration as the original, with no time
   offset. The best way to pad the end of the stream is to also use LPC
   extrapolation, but zero-padding is also acceptable.

7.1.  LPC Extrapolation

   The first step in LPC extrapolation is to compute linear prediction
   coefficients. [lpc-sample] When extending the end of the signal,
   order-N (typically with N ranging from 8 to 40) LPC analysis is
   performed on a window near the end of the signal. The last N samples
   are used as memory to an infinite impulse response (IIR) filter.

   The filter is then applied on a zero input to extrapolate the end of
   the signal. Let a(k) be the kth LPC coefficient and x(n) be the nth
   sample of the signal. Each new sample past the end of the signal is
   computed as:

                           N
                          ---
                   x(n) = \   a(k)*x(n-k)
                          /
                          ---
                          k=1

   The process is repeated independently for each channel.
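
   The recurrence above can be sketched in a few lines of C. This is a
   non-normative illustration that assumes the LPC coefficients have
   already been computed by an analysis step not shown here; the
   function name lpc_extrapolate() and its parameter names are
   hypothetical. The parameter 'order' plays the role of N in the
   formula, and a[k-1] holds a(k).

```c
#include <assert.h>
#include <stddef.h>

/* Extends one channel of signal x, which already holds 'known'
   samples and has room for 'known + extra' samples, by running the
   all-pole filter on zero input. */
void lpc_extrapolate(const double *a, size_t order,
                     double *x, size_t known, size_t extra)
{
    size_t n, k;

    assert(a != NULL && x != NULL);
    assert(order >= 1 && known >= order);  /* need N samples of memory */
    for (n = known; n < known + extra; n++) {
        double acc = 0.0;

        /* x(n) = sum over k = 1..N of a(k) * x(n-k) */
        for (k = 1; k <= order; k++)
            acc += a[k - 1] * x[n - k];
        x[n] = acc;
    }
}
```

   Each newly produced sample feeds back into later iterations, which
   is what makes the filter recursive rather than a plain weighted
   average of the original samples.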
   It is possible to extend the beginning of the signal by applying the
   same process backward in time. When extending the beginning of the
   signal, it is best to apply a "fade in" to the extrapolated signal,
   e.g. by multiplying it by a half-Hanning window [hanning].

7.2.  Continuous Chaining

   In some applications, such as Internet radio, it is desirable to cut
   a long stream into smaller chains, e.g. so the comment header can be
   updated. This can be done simply by separating the input streams
   into segments and encoding each segment independently. The drawback
   of this approach is that it creates a small discontinuity at the
   boundary due to the lossy nature of Opus. A muxer MAY avoid this
   discontinuity by using the following procedure:

   1.  Encode the last frame of the first segment as an independent
       frame by turning off all forms of inter-frame prediction. De-
       emphasis is allowed.

   2.  Set the granule position of the last page to a point near the end
       of the last frame.

   3.  Begin the second segment with a copy of the last frame of the
       first segment.

   4.  Set the pre-skip value of the second stream in such a way as to
       properly join the two streams.

   5.  Continue the encoding process normally from there, without any
       reset to the encoder.

   In encoders derived from the reference implementation, inter-frame
   prediction can be turned off by calling:

      opus_encoder_ctl(encoder_state, OPUS_SET_PREDICTION_DISABLED(1));

   For best results, this implementation requires that prediction be
   explicitly enabled again before resuming normal encoding, even after
   a reset.

8.  Implementation Status

   A brief summary of major implementations of this draft is available
   at [1], along with their status.

   [Note to RFC Editor: please remove this entire section before final
   publication per [RFC6982], along with its references.]

9.  Security Considerations

   Implementations of the Opus codec need to take appropriate security
   considerations into account, as outlined in [RFC4732]. This is just
   as much a problem for the container as it is for the codec itself.
   Robustness against malicious payloads is extremely important.
   Malicious payloads MUST NOT cause an implementation to overrun its
   allocated memory or to take an excessive amount of resources to
   decode. Although problems in encoding applications are typically
   rarer, the same applies to the muxer. Malicious audio input streams
   MUST NOT cause an implementation to overrun its allocated memory or
   consume excessive resources because this would allow an attacker to
   attack transcoding gateways.

   Like most other container formats, Ogg Opus streams SHOULD NOT be
   used with insecure ciphers or cipher modes that are vulnerable to
   known-plaintext attacks. Elements such as the Ogg page capture
   pattern and the magic signatures in the ID header and the comment
   header all have easily predictable values, in addition to various
   elements of the codec data itself.

10.  Content Type

   An "Ogg Opus file" consists of one or more sequentially multiplexed
   segments, each containing exactly one Ogg Opus stream. The
   RECOMMENDED mime-type for Ogg Opus files is "audio/ogg".

   If more specificity is desired, one MAY indicate the presence of Opus
   streams using the codecs parameter defined in [RFC6381] and
   [RFC5334], e.g.,

      audio/ogg; codecs=opus

   for an Ogg Opus file.

   The RECOMMENDED filename extension for Ogg Opus files is '.opus'.

   When Opus is concurrently multiplexed with other streams in an Ogg
   container, one SHOULD use one of the "audio/ogg", "video/ogg", or
   "application/ogg" mime-types, as defined in [RFC5334].
   Such streams are not strictly "Ogg Opus files" as described above,
   since they contain more than a single Opus stream per sequentially
   multiplexed segment, e.g. video or multiple audio tracks. In such
   cases the '.opus' filename extension is NOT RECOMMENDED.

   In either case, this document updates [RFC5334] to add 'opus' as a
   codecs parameter value with char[8]: 'OpusHead' as Codec Identifier.

11. IANA Considerations

   This document updates the IANA Media Types registry to add .opus as a
   file extension for "audio/ogg", and to add itself as a reference
   alongside [RFC5334] for "audio/ogg", "video/ogg", and
   "application/ogg" Media Types.

   This document defines a new registry "Opus Channel Mapping Families"
   to indicate how the semantic meanings of the channels in a multi-
   channel Opus stream are described. IANA is requested to create a new
   name space of "Opus Channel Mapping Families". This will be a new
   registry on the IANA Matrix, and not a subregistry of an existing
   registry. Modifications to this registry follow the "Specification
   Required with Expert Review" registration policy as defined in
   [RFC5226]. Each registry entry consists of a Channel Mapping Family
   Number, which is specified in decimal in the range 0 to 255,
   inclusive, and a Reference (or list of references). Each Reference
   must point to sufficient documentation to describe what information
   is coded in the Opus identification header for this channel mapping
   family, how a demuxer determines the Stream Count ('N') and Coupled
   Stream Count ('M') from this information, and how it determines the
   proper interpretation of each of the decoded channels.

   This document defines three initial assignments for this registry.

   +-------+---------------------------+
   | Value | Reference                 |
   +-------+---------------------------+
   | 0     | [RFCXXXX] Section 5.1.1.1 |
   |       |                           |
   | 1     | [RFCXXXX] Section 5.1.1.2 |
   |       |                           |
   | 255   | [RFCXXXX] Section 5.1.1.3 |
   +-------+---------------------------+

   The designated expert will determine if the Reference points to a
   specification that meets the requirements for permanence and ready
   availability laid out in [RFC5226] and that it specifies the
   information described above with sufficient clarity to allow
   interoperable implementations.

12. Acknowledgments

   Thanks to Ben Campbell, Mark Harris, Greg Maxwell, Christopher
   "Monty" Montgomery, Jean-Marc Valin, and Mo Zanaty for their valuable
   contributions to this document. Additional thanks to Andrew
   D'Addesio, Greg Maxwell, and Vincent Penquerc'h for their feedback
   based on early implementations.

13. RFC Editor Notes

   In Section 11, "RFCXXXX" is to be replaced with the RFC number
   assigned to this draft.

   In the Copyright Notice at the start of the document, the following
   paragraph is to be appended after the regular copyright notice text:

   "The licenses granted by the IETF Trust to this RFC under Section 3.c
   of the Trust Legal Provisions shall also include the right to extract
   text from Sections 1 through 14 of this RFC and create derivative
   works from these extracts, and to copy, publish, display, and
   distribute such derivative works in any medium and for any purpose,
   provided that no such derivative work shall be presented, displayed,
   or published in a manner that states or implies that it is part of
   this RFC or any other IETF Document."

14. References

14.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3533]  Pfeiffer, S., "The Ogg Encapsulation Format Version 0",
              RFC 3533, May 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC4732]  Handley, M., Rescorla, E., and IAB, "Internet Denial-of-
              Service Considerations", RFC 4732, December 2006.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
              DOI 10.17487/RFC5226, May 2008.

   [RFC5334]  Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media
              Types", RFC 5334, September 2008.

   [RFC6381]  Gellens, R., Singer, D., and P. Frojdh, "The 'Codecs' and
              'Profiles' Parameters for "Bucket" Media Types", RFC 6381,
              August 2011.

   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the
              Opus Audio Codec", RFC 6716, September 2012.

   [EBU-R128]
              EBU Technical Committee, "Loudness Recommendation EBU
              R128", August 2011.

   [vorbis-comment]
              Montgomery, C., "Ogg Vorbis I Format Specification:
              Comment Field and Header Specification", July 2002.

14.2. Informative References

   [RFC6982]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
              Code: The Implementation Status Section", RFC 6982, July
              2013.

   [RFC7587]  Spittka, J., Vos, K., and JM. Valin, "RTP Payload Format
              for the Opus Speech and Audio Codec", RFC 7587, DOI
              10.17487/RFC7587, June 2015.

   [flac]     Coalson, J., "FLAC - Free Lossless Audio Codec Format
              Description", January 2008.

   [hanning]  Wikipedia, "Hann window", May 2013.

   [linear-prediction]
              Wikipedia, "Linear Predictive Coding", January 2014.

   [lpc-sample]
              Degener, J. and C. Bormann, "Autocorrelation LPC coeff
              generation algorithm (Vorbis source code)", November 1994.

   [replay-gain]
              Parker, C. and M. Leese, "VorbisComment: Replay Gain",
              June 2009.

   [seeking]  Pfeiffer, S., Parker, C., and G. Maxwell, "Granulepos
              Encoding and How Seeking Really Works", May 2012.

   [vorbis-mapping]
              Montgomery, C., "The Vorbis I Specification, Section 4.3.9
              Output Channel Order", January 2010.

   [vorbis-trim]
              Montgomery, C., "The Vorbis I Specification, Appendix A:
              Embedding Vorbis into an Ogg stream", November 2008.

   [wave-multichannel]
              Microsoft Corporation, "Multiple Channel Audio Data and
              WAVE Files", March 2007.

14.3. URIs

   [1] https://wiki.xiph.org/OggOpusImplementation

Authors' Addresses

   Timothy B. Terriberry
   Mozilla Corporation
   650 Castro Street
   Mountain View, CA 94041
   USA

   Phone: +1 650 903-0800
   Email: tterribe@xiph.org

   Ron Lee
   Voicetronix
   246 Pulteney Street, Level 1
   Adelaide, SA 5000
   Australia

   Phone: +61 8 8232 9112
   Email: ron@debian.org

   Ralph Giles
   Mozilla Corporation
   163 West Hastings Street
   Vancouver, BC V6B 1H5
   Canada

   Phone: +1 778 785 1540
   Email: giles@xiph.org