2	codec                                                      T. Terriberry
3	Internet-Draft                                       Mozilla Corporation
4	Intended status: Standards Track                                  R. Lee
5	Expires: January 17, 2013                                    Voicetronix
6	                                                                R. Giles
7	                                                     Mozilla Corporation
8	                                                           July 16, 2012

10	               Ogg Encapsulation for the Opus Audio Codec
11	                      draft-terriberry-oggopus-01

13	Abstract

15	   This document defines the Ogg encapsulation for the Opus interactive
16	   speech and audio codec.  This allows data encoded in the Opus format
17	   to be stored in an Ogg logical bitstream.  Ogg encapsulation provides
18	   Opus with a long-term storage format supporting all of the essential
19	   features, including metadata, fast and accurate seeking, corruption
20	   detection, recapture after errors, low overhead, and the ability to
21	   multiplex Opus with other codecs (including video) with minimal
22	   buffering.  It also provides a live streamable format, capable of
23	   delivery over a reliable stream-oriented transport, without requiring
24	   all the data, or even the total length of the data, up-front, in a
25	   form that is identical to the on-disk storage format.

27	Status of this Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on January 17, 2013.

44	Copyright Notice

46	   Copyright (c) 2012 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	   2.  Terminology
63	   3.  Packet Organization
64	   4.  Granule Position
65	     4.1.  Pre-skip
66	     4.2.  PCM Sample Position
67	     4.3.  End Trimming
68	     4.4.  Restrictions on the Initial Granule Position
69	     4.5.  Seeking and Pre-roll
70	   5.  Header Packets
71	     5.1.  Identification Header
72	       5.1.1.  Channel Mapping
73	     5.2.  Comment Header
74	   6.  Packet Size Limits
75	   7.  Security Considerations
76	   8.  Content Type
77	   9.  IANA Considerations
78	   10. Acknowledgments
79	   11. Copying Conditions
80	   12. References
81	     12.1. Normative References
82	     12.2. Informative References
83	   Authors' Addresses

85	1.  Introduction

87	   The IETF Opus codec is a low-latency audio codec optimized for both
88	   voice and general-purpose audio.  See [RFCOpus] for technical
89	   details.  This document defines the encapsulation of Opus in a
90	   continuous, logical Ogg bitstream [RFC3533].

92	   Ogg bitstreams are made up of a series of 'pages', each of which
93	   contains data from one or more 'packets'.  Pages are the fundamental
94	   unit of multiplexing in an Ogg stream.  Each page is associated with
95	   a particular logical stream and contains a capture pattern and
96	   checksum, flags to mark the beginning and end of the logical stream,
97	   and a 'granule position' that represents an absolute position in the
98	   stream, to aid seeking.  A single page can contain up to 65,025
99	   octets of packet data from up to 255 different packets.  Packets may
100	   be split arbitrarily across pages, and continued from one page to the
101	   next (allowing packets much larger than would fit on a single page).
102	   Each page contains 'lacing values' that indicate how the data is
103	   partitioned into packets, allowing a demuxer to recover the packet
104	   boundaries without examining the encoded data.  A packet is said to
105	   'complete' on a page when the page contains the final lacing value
106	   corresponding to that packet.
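
As a rough illustration of how lacing values partition a page, the following Python sketch recovers packet lengths from a page's list of lacing values. The function name `packet_lengths` and its return convention are assumptions made for this example only; [RFC3533] remains the authoritative description. The sketch relies on the usual Ogg rule that a lacing value of 255 means the packet continues into the next segment, while any smaller value ends the packet. It ignores the 'continued packet' flag, so the first length reported may be the tail of a packet that began on an earlier page.

```python
def packet_lengths(lacing_values):
    """Split a page's lacing values into packet lengths.

    Returns (lengths, continues). 'lengths' holds the size in octets of
    each packet (or packet fragment) on the page. 'continues' is True
    when the last entry does not complete on this page.
    """
    if len(lacing_values) > 255:
        raise ValueError("a page holds at most 255 lacing values")
    lengths = []
    current = 0
    open_packet = False
    for value in lacing_values:
        if not 0 <= value <= 255:
            raise ValueError("each lacing value is a single octet")
        current += value
        open_packet = True
        if value < 255:
            # A value below 255 is the final lacing value of a packet.
            lengths.append(current)
            current = 0
            open_packet = False
    if open_packet:
        # The page ended on a 255: the packet continues on the next page.
        lengths.append(current)
    return lengths, open_packet
```

A page filled with 255 lacing values of 255 carries the maximum of 65,025 octets, all belonging to one packet that does not complete on that page.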

108	   This encapsulation defines the required contents of the packet data,
109	   including the necessary headers, the organization of those packets
110	   into a logical stream, and the interpretation of the codec-specific
111	   granule position field.  It does not attempt to describe or specify
112	   the existing Ogg container format.  Readers unfamiliar with the basic
113	   concepts mentioned above are encouraged to review the details in
114	   [RFC3533].

116	2.  Terminology

118	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
119	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
120	   document are to be interpreted as described in [RFC2119].

122	   Implementations that fail to satisfy one or more "MUST" requirements
123	   are considered non-compliant.  Implementations that satisfy all
124	   "MUST" requirements, but fail to satisfy one or more "SHOULD"
125	   requirements are said to be "conditionally compliant".  All other
126	   implementations are "unconditionally compliant".

128	3.  Packet Organization

130	   An Opus stream is organized as follows.

132	   There are two mandatory header packets.  The granule position of the
133	   pages on which these packets complete MUST be zero.

135	   The first packet in the logical Ogg bitstream MUST contain the
136	   identification (ID) header, which uniquely identifies a stream as
137	   Opus audio.  The format of this header is defined in Section 5.1.  It
138	   MUST be placed alone (without any other packet data) on the first
139	   page of the logical Ogg bitstream, and MUST complete on that page.
140	   This page MUST have its 'beginning of stream' flag set.

142	   The second packet in the logical Ogg bitstream MUST contain the
143	   comment header, which contains user-supplied metadata.  The format of
144	   this header is defined in Section 5.2.  It MAY span one or more
145	   pages, beginning on the second page of the logical stream.  However
146	   many pages it spans, the comment header packet MUST finish the page
147	   on which it completes.

149	   All subsequent pages are audio data pages, and the Ogg packets they
150	   contain are audio data packets.  Each audio data packet contains one
151	   Opus packet for each of N different streams, where N is typically one
152	   for mono or stereo, but may be greater than one for, e.g.,
153	   multichannel audio.  The value N is specified in the ID header (see
154	   Section 5.1.1), and is fixed over the entire length of the logical
155	   Ogg bitstream.

157	   The first N-1 Opus packets, if any, are packed one after another into
158	   the Ogg packet, using the self-delimiting framing from Appendix B of
159	   [RFCOpus].  The remaining Opus packet is packed at the end of the Ogg
160	   packet using the regular, undelimited framing from Section 3 of
161	   [RFCOpus].  All of the Opus packets in a single Ogg packet MUST be
162	   constrained to have the same duration.  The duration and coding modes
163	   of each Opus packet are contained in the TOC (table of contents)
164	   sequence in the first few bytes.  A decoder SHOULD treat any Opus
165	   packet whose duration is different from that of the first Opus packet
166	   in an Ogg packet as if it were an Opus packet with an illegal TOC
167	   sequence.

169	   The first audio data page SHOULD NOT have the 'continued packet' flag
170	   set (which would indicate that the first audio data packet is continued
171	   from a previous page).  Packets MUST be placed into Ogg pages in
172	   order until the end of stream.  Audio packets MAY span page
173	   boundaries.  A decoder MUST treat a zero-octet audio data packet as
174	   if it were an Opus packet with an illegal TOC sequence.  The last
175	   page SHOULD have the 'end of stream' flag set, but implementations
176	   should be prepared to deal with truncated streams that do not have a
177	   page marked 'end of stream'.  The final packet on the last page
178	   SHOULD NOT be a continued packet, i.e., the final lacing value should
179	   be less than 255.  There MUST NOT be any more pages in an Opus
180	   logical bitstream after a page marked 'end of stream'.

182	4.  Granule Position

184	   The granule position of an audio data page encodes the total number
185	   of PCM samples in the stream up to and including the last fully-
186	   decodable sample from the last packet completed on that page.  A page
187	   that is entirely spanned by a single packet (that completes on a
188	   subsequent page) has no granule position, and the granule position
189	   field MUST be set to the special value '-1' in two's complement.

191	   The granule position of an audio data page is in units of PCM audio
192	   samples at a fixed rate of 48 kHz (per channel; a stereo stream's
193	   granule position does not increment at twice the speed of a mono
194	   stream).  It is possible to run an Opus decoder at other sampling
195	   rates, but the value in the granule position field always counts
196	   samples assuming a 48 kHz decoding rate, and the rest of this
197	   specification makes the same assumption.

199	   The duration of an Opus packet may be any multiple of 2.5 ms, up to a
200	   maximum of 120 ms.  This duration is encoded in the TOC sequence at
201	   the beginning of each packet.  The number of samples returned by a
202	   decoder corresponds to this duration exactly, even for the first few
203	   packets.  For example, a 20 ms packet fed to a decoder running at
204	   48 kHz will always return 960 samples.  A demuxer can parse the TOC
205	   sequence at the beginning of each Ogg packet to work backwards or
206	   forwards from a packet with a known granule position (i.e., the last
207	   packet completed on some page) in order to assign granule positions
208	   to every packet, or even every individual sample.  The one exception
209	   is the last page in the stream, as described below.
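
The following Python sketch shows one way a demuxer might compute a packet's duration from its TOC sequence. The function names are hypothetical, and the frame-size table and frame-count codes reflect a reading of Section 3 of [RFCOpus] that should be checked against that document before use. Durations are returned in samples at 48 kHz.

```python
MAX_PACKET_SAMPLES = 5760  # 120 ms at 48 kHz


def samples_per_frame(toc):
    """Frame size in 48 kHz samples for a TOC octet."""
    config = toc >> 3
    if config < 12:
        # SILK-only configurations: 10, 20, 40, or 60 ms frames.
        return (480, 960, 1920, 2880)[config & 3]
    if config < 16:
        # Hybrid configurations: 10 or 20 ms frames.
        return (480, 960)[config & 1]
    # CELT-only configurations: 2.5, 5, 10, or 20 ms frames.
    return 120 << (config & 3)


def packet_duration(packet):
    """Duration of an Opus packet (a bytes object) in 48 kHz samples."""
    if len(packet) < 1:
        raise ValueError("zero-octet packet")
    toc = packet[0]
    code = toc & 0x03
    if code == 0:
        frames = 1
    elif code in (1, 2):
        frames = 2
    else:
        # Code 3: the frame count is in the low 6 bits of the next octet.
        if len(packet) < 2:
            raise ValueError("missing frame count octet")
        frames = packet[1] & 0x3F
        if frames == 0:
            raise ValueError("frame count of zero")
    total = frames * samples_per_frame(toc)
    if total > MAX_PACKET_SAMPLES:
        raise ValueError("packet longer than 120 ms")
    return total
```

Under this reading, a packet beginning with the octets "Op" signals 48 frames of 20 ms each, which exceeds 120 ms, so `packet_duration(b"OpusHead")` raises `ValueError`.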

211	   All other pages with completed packets after the first MUST have a
212	   granule position equal to the number of samples contained in packets
213	   that complete on that page plus the granule position of the most
214	   recent page with completed packets.  This guarantees that a demuxer
215	   can assign individual packets the same granule position when working
216	   forwards as when working backwards.  For this to work, there cannot
217	   be any gaps.  In order to support capturing a stream that uses
218	   discontinuous transmission (DTX), an encoder SHOULD emit packets that
219	   explicitly request the use of Packet Loss Concealment (PLC) (i.e.,
220	   with a frame length of 0, as defined in Section 3.2.1 of [RFCOpus])
221	   in place of the packets that were not transmitted.
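
A minimal Python sketch of the two operations described above, with hypothetical function names. It assumes `durations` lists the sample counts (at 48 kHz) of the packets that complete on a page, in order. The first audio data page and the final page are exceptions to the continuity rule and are not handled here.

```python
def packet_end_positions(page_granule, durations):
    """Assign each packet completed on a page its ending granule
    position, working backwards from the page's granule position."""
    positions = []
    position = page_granule
    for duration in reversed(durations):
        positions.append(position)
        position -= duration
    positions.reverse()
    return positions


def is_continuous(previous_granule, page_granule, durations):
    """Check that a page's granule position equals the previous page's
    granule position plus the samples completed on this page."""
    return page_granule == previous_granule + sum(durations)
```

For a page with granule position 2880 holding three 20 ms packets, `packet_end_positions(2880, [960, 960, 960])` assigns the positions 960, 1920, and 2880.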

223	4.1.  Pre-skip

225	   There is some amount of latency introduced during the decoding
226	   process, to allow for overlap in the MDCT modes, stereo mixing in the
227	   LP modes, and resampling, and the encoder will introduce even more
228	   latency (though the exact amount is not specified).  Therefore, the
229	   first few samples produced by the decoder do not correspond to real
230	   input audio, but are instead composed of padding inserted by the
231	   encoder to compensate for this latency.  These samples need to be
232	   stored and decoded, as Opus is an asymptotically convergent
233	   predictive codec, meaning the decoded contents of each frame depend
234	   on the recent history of decoder inputs.  However, a decoder will
235	   want to skip these samples after decoding them.

237	   A 'pre-skip' field in the ID header (see Section 5.1) signals the
238	   number of samples which should be skipped (decoded but discarded) at
239	   the beginning of the stream.  This provides sufficient history to the
240	   decoder so that it has already converged before the stream's output
241	   begins.  It may also be used to perform sample-accurate cropping of
242	   existing encoded streams.  This amount need not be a multiple of
243	   2.5 ms, may be smaller than a single packet, or may span the contents
244	   of several packets.

246	4.2.  PCM Sample Position

248	   The PCM sample position is determined from the granule position using
249	   the formula

251	         'PCM sample position' = 'granule position' - 'pre-skip' .

253	   For example, if the granule position of the first audio data page is
254	   59,971, and the pre-skip is 11,971, then the PCM sample position of
255	   the last decoded sample from that page is 48,000.  This can be
256	   converted into a playback time using the formula

258	                                   'PCM sample position'
259	                 'playback time' = --------------------- .
260	                                          48000.0

262	   The initial PCM sample position before any samples are played is
263	   normally '0'.  In this case, the PCM sample position of the first
264	   audio sample to be played starts at '1', because it marks the time on
265	   the clock _after_ that sample has been played, and a stream that is
266	   exactly one second long has a final PCM sample position of '48000',
267	   as in the example here.
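
The two formulas can be written out directly. This is an illustrative Python sketch with hypothetical function names, using the numbers from the example above.

```python
def pcm_sample_position(granule_position, pre_skip):
    """PCM sample position = granule position - pre-skip."""
    return granule_position - pre_skip


def playback_time(granule_position, pre_skip):
    """Playback time in seconds, at the fixed 48 kHz granule rate."""
    return pcm_sample_position(granule_position, pre_skip) / 48000.0


# Values from the example in the text.
position = pcm_sample_position(59971, 11971)  # 48000
seconds = playback_time(59971, 11971)         # 1.0
```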

269	   Vorbis streams use a granule position smaller than the number of
270	   audio samples contained in the first audio data page to indicate that
271	   some of those samples must be trimmed from the output (see
272	   [vorbis-trim]).  However, to do so, Vorbis requires that the first
273	   audio data page contains exactly two packets, in order to allow the
274	   decoder to perform PCM position adjustments before needing to return
275	   any PCM data.  Opus uses the pre-skip mechanism for this purpose
276	   instead, since the encoder may introduce more than a single packet's
277	   worth of latency, and since very large packets in streams with a very
278	   large number of channels might not fit on a single page.

280	4.3.  End Trimming

282	   The page with the 'end of stream' flag set MAY have a granule
283	   position that indicates the page contains less audio data than would
284	   normally be returned by decoding up through the final packet.  This
285	   is used to end the stream somewhere other than an even frame
286	   boundary.  The granule position of the most recent audio data page
287	   with completed packets is used to make this determination, or '0' is
288	   used if there were no previous audio data pages with a completed
289	   packet.  The difference between these granule positions indicates how
290	   many samples to keep after decoding the packets that completed on the
291	   final page.  The remaining samples are discarded.  The number of
292	   discarded samples SHOULD be no larger than the number decoded from
293	   the last packet.
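
The arithmetic for end trimming might look like the following Python sketch. The function name and the choice to raise an error for out-of-range values are assumptions of this example. `previous_granule` is the granule position of the most recent audio data page with completed packets, or 0 if there is none, and `decoded_samples` is the number of samples decoded from the packets that complete on the final page.

```python
def end_trim(final_granule, previous_granule, decoded_samples):
    """Return (keep, discard) for the packets completed on the page
    with the 'end of stream' flag set."""
    keep = final_granule - previous_granule
    if keep < 0:
        raise ValueError("final granule position moves backwards")
    discard = decoded_samples - keep
    if discard < 0:
        # More samples claimed than were decoded: this is not end
        # trimming and is outside the scope of this sketch.
        raise ValueError("granule position exceeds decoded samples")
    return keep, discard
```

For example, if the previous page has granule position 47040, the final page has granule position 47500, and the final page completes one 960-sample packet, then 460 samples are kept and 500 are discarded.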

295	4.4.  Restrictions on the Initial Granule Position

297	   The granule position of the first audio data page with a completed
298	   packet MAY be larger than the number of samples contained in packets
299	   that complete on that page; however, it MUST NOT be smaller, unless
300	   that page has the 'end of stream' flag set.  Allowing a granule
301	   position larger than the number of samples allows the beginning of a
302	   stream to be cropped or a live stream to be joined without rewriting
303	   the granule position of all the remaining pages.  This means that the
304	   PCM sample position just before the first sample to be played may be
305	   larger than '0'.  Synchronization when multiplexing with other
306	   logical streams still uses the PCM sample position relative to '0' to
307	   compute sample times.  This does not affect the behavior of pre-skip:
308	   exactly 'pre-skip' samples should be skipped from the beginning of
309	   the decoded output, even if the initial PCM sample position is
310	   greater than zero.

312	   On the other hand, a granule position that is smaller than the number
313	   of decoded samples prevents a demuxer from working backwards to
314	   assign each packet or each individual sample a valid granule
315	   position, since granule positions must be non-negative.  A decoder
316	   MUST reject as invalid any stream where the granule position is
317	   smaller than the number of samples contained in packets that complete
318	   on the first audio data page with a completed packet, unless that
319	   page has the 'end of stream' flag set.  It MAY defer this action
320	   until it decodes the last packet completed on that page.  If that
321	   page has the 'end of stream' flag set, a demuxer can work forwards
322	   from the granule position '0', but MUST reject as invalid any stream
323	   where the granule position is smaller than the 'pre-skip' amount.
324	   This would indicate that more samples should be skipped from the
325	   initial decoded output than exist in the stream.
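
The checks in this section can be summarized in a short Python sketch. The function name and boolean return convention are assumptions of this example. `samples_on_page` is the number of samples contained in packets that complete on the first audio data page with a completed packet.

```python
def first_page_granule_ok(granule_position, samples_on_page,
                          pre_skip, end_of_stream):
    """Return True if the granule position of the first audio data
    page with a completed packet is acceptable."""
    if not end_of_stream:
        # A larger value is allowed (cropped or joined streams);
        # a smaller one is not.
        return granule_position >= samples_on_page
    # On an 'end of stream' page the value may be smaller because of
    # end trimming, but not smaller than the pre-skip amount.
    return granule_position >= pre_skip
```

A decoder following this section would reject the stream as invalid when the function returns False.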

327	4.5.  Seeking and Pre-roll

329	   Seeking in Ogg files is best performed using a bisection search for a
330	   page whose granule position corresponds to a PCM position at or
331	   before the seek target.  With appropriately weighted bisection,
332	   accurate seeking can be performed with just three or four bisections
333	   even in multi-gigabyte files.  See [seeking] for general
334	   implementation guidance.

336	   When seeking within an Ogg Opus stream, the decoder SHOULD start
337	   decoding (and discarding the output) at least 3840 samples (80 ms)
338	   prior to the seek target in order to ensure that the output audio is
339	   correct by the time it reaches the seek target.  This 'pre-roll' is
340	   separate from, and unrelated to, the 'pre-skip' used at the beginning
341	   of the stream.  If the point 80 ms prior to the seek target comes
342	   before the initial PCM sample position, the decoder SHOULD start
343	   decoding from the beginning of the stream, applying pre-skip as
344	   normal, regardless of whether the pre-skip is larger or smaller than
345	   80 ms.
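
A sketch of the pre-roll calculation in Python, with a hypothetical function name. Both arguments are PCM sample positions. The second element of the result reports whether decoding falls back to the beginning of the stream, in which case pre-skip applies as normal.

```python
PRE_ROLL_SAMPLES = 3840  # 80 ms at 48 kHz


def decode_start(seek_target, initial_position):
    """Return (position, from_beginning) for a seek.

    'position' is where decoding should begin. Output decoded before
    'seek_target' is discarded.
    """
    start = seek_target - PRE_ROLL_SAMPLES
    if start < initial_position:
        # Not enough audio before the target: decode from the start
        # of the stream and apply pre-skip as usual.
        return initial_position, True
    return start, False
```

Seeking to the one-second mark of a stream whose initial PCM sample position is 0 gives `decode_start(48000, 0) == (44160, False)`.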

347	5.  Header Packets

349	   An Opus stream contains exactly two mandatory header packets.

351	5.1.  Identification Header

353	      0                   1                   2                   3
354	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
355	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
356	     |      'O'      |      'p'      |      'u'      |      's'      |
357	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
358	     |      'H'      |      'e'      |      'a'      |      'd'      |
359	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
360	     |  Version = 1  | Channel Count |           Pre-skip            |
361	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
362	     |                     Input Sample Rate (Hz)                    |
363	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
364	     |   Output Gain (Q7.8 in dB)    | Mapping Family|               |
365	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
366	     |                                                               |
367	     :               Optional Channel Mapping Table...               :
368	     |                                                               |
369	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

371	                        Figure 1: ID Header Packet

373	   The fields in the identification (ID) header have the following
374	   meaning:

376	   1.  *Magic Signature*:

378	       This is an 8-octet (64-bit) field that allows codec
379	       identification and is human-readable.  It contains, in order, the
380	       magic numbers:

382	          0x4F 'O'

384	          0x70 'p'

386	          0x75 'u'

388	          0x73 's'

390	          0x48 'H'

392	          0x65 'e'

393	          0x61 'a'

395	          0x64 'd'

397	       Starting with "Op" helps distinguish it from audio data packets,
398	       as this is an invalid TOC sequence.

400	   2.  *Version* (8 bits, unsigned):

402	       The version number MUST always be '1' for this version of the
403	       encapsulation specification.  Implementations SHOULD treat
404	       streams where the upper four bits of the version number match
405	       that of a recognized specification as backwards-compatible with
406	       that specification.  That is, the version number can be split
407	       into "major" and "minor" version sub-fields, with changes to the
408	       "minor" sub-field (in the lower four bits) signaling compatible
409	       changes.  For example, a decoder implementing this specification
410	       SHOULD accept any stream with a version number of '15' or less,
411	       and SHOULD assume any stream with a version number '16' or
412	       greater is incompatible.  The initial version '1' was chosen to
413	       keep implementations from relying on this octet as a null
414	       terminator for the "OpusHead" string.

416	   3.  *Output Channel Count* 'C' (8 bits, unsigned):

418	       This is the number of output channels.  This might be different
419	       than the number of encoded channels, which can change on a
420	       packet-by-packet basis.  This value MUST NOT be zero.  The
421	       maximum allowable value depends on the channel mapping family,
422	       and might be as large as 255.  See Section 5.1.1 for details.

424	   4.  *Pre-skip* (16 bits, unsigned, little endian):

426	       This is the number of samples (at 48 kHz) to discard from the
427	       decoder output when starting playback, and also the number to
428	       subtract from a page's granule position to calculate its PCM
429	       sample position.  When constructing cropped Ogg Opus streams, a
430	       pre-skip of at least 3,840 samples (80 ms) is RECOMMENDED to
431	       ensure complete convergence.

433	   5.  *Input Sample Rate* (32 bits, unsigned, little endian):

435	       This field is _not_ the sample rate to use for playback of the
436	       encoded data.

438	       Opus has a handful of coding modes, with internal audio
439	       bandwidths of 4, 6, 8, 12, and 20 kHz.  Each packet in the stream
440	       may have a different audio bandwidth.  Regardless of the audio
441	       bandwidth, the reference decoder supports decoding any stream at
442	       a sample rate of 8, 12, 16, 24, or 48 kHz.  The original sample
443	       rate of the encoder input is not preserved by the lossy
444	       compression.

446	       An Ogg Opus player SHOULD select the playback sample rate
447	       according to the following procedure:

449	       1.  If the hardware supports 48 kHz playback, decode at 48 kHz.

451	       2.  Otherwise, if the hardware's highest available sample rate is
452	           a supported rate, decode at this sample rate.

454	       3.  Otherwise, if the hardware's highest available sample rate is
455	           less than 48 kHz, decode at the next higher supported rate
456	           above this and resample.

458	       4.  Otherwise, decode at 48 kHz and resample.

460	       However, the 'Input Sample Rate' field allows the encoder to pass
461	       the sample rate of the original input stream as metadata.  This
462	       may be useful when the user requires the output sample rate to
463	       match the input sample rate.  For example, a non-player decoder
464	       writing PCM format samples to disk might choose to resample the
465	       output audio back to the original input sample rate to reduce
466	       surprise to the user, who might reasonably expect to get back a
467	       file with the same sample rate as the one they fed to the
468	       encoder.

470	       A value of zero indicates 'unspecified'.  Encoders SHOULD write
471	       the actual input sample rate or zero, but decoder implementations
472	       which do something with this field SHOULD take care to behave
473	       sanely if given crazy values (e.g., do not actually upsample the
474	       output to 10 MHz if requested).

476	   6.  *Output Gain* (16 bits, signed, little endian):

478	       This is a gain to be applied by the decoder.  It is 20*log10 of
479	       the factor to scale the decoder output by to achieve the desired
480	       playback volume, stored in a 16-bit, signed, two's complement
481	       fixed-point value with 8 fractional bits (i.e., Q7.8).  To apply
482	       the gain, a decoder could use

484	                sample *= pow(10, output_gain/(20.0*256)) ,

486	       where output_gain is the raw 16-bit value from the header.

488	       Virtually all players and media frameworks should apply this gain
489	       by default.  If a player chooses to apply any volume adjustment or
490	       gain modification, such as the R128_TRACK_GAIN (see Section 5.2)
491	       or a user-facing volume knob, the adjustment MUST be applied in
492	       addition to this output gain in order to achieve playback at the
493	       desired volume.

495	       An encoder SHOULD set this field to zero, and instead apply any
496	       gain prior to encoding, when this is possible and does not
497	       conflict with the user's wishes.  The output gain should only be
498	       nonzero when the gain is adjusted after encoding, or when the
499	       user wishes to adjust the gain for playback while preserving the
500	       ability to recover the original signal amplitude.

502	       Although the output gain has enormous range (+/- 128 dB, enough
503	       to amplify inaudible sounds to the threshold of physical pain),
504	       most applications can only reasonably use a small portion of this
505	       range around zero.  The large range serves in part to ensure that
506	       gain can always be losslessly transferred between OpusHead and
507	       R128_TRACK_GAIN (see below) without saturating.

509	   7.  *Channel Mapping Family* (8 bits, unsigned):

511	       This octet indicates the order and semantic meaning of the
512	       various channels encoded in each Ogg packet.

514	       Each possible value of this octet indicates a mapping family,
515	       which defines a set of allowed channel counts, and the ordered
516	       set of channel names for each allowed channel count.  The details
517	       are described in Section 5.1.1.

519	   8.  *Channel Mapping Table*: This table defines the mapping from
520	       encoded streams to output channels.  It is omitted when the
521	       channel mapping family is 0, but REQUIRED otherwise.  Its
522	       contents are specified in Section 5.1.1.
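
The playback sample rate procedure given under 'Input Sample Rate' can be sketched in Python as follows. The function name and the representation of hardware capabilities as a list of rates in Hz are assumptions of this example. Step 3 is interpreted here as choosing the next supported rate above the hardware's highest rate, which appears to be the intent.

```python
# Decoding rates supported by the reference decoder, in Hz.
OPUS_RATES = (8000, 12000, 16000, 24000, 48000)


def choose_decode_rate(hardware_rates):
    """Return (decode_rate, needs_resampling) for a playback device."""
    if not hardware_rates:
        raise ValueError("no hardware sample rates given")
    # Step 1: decode at 48 kHz if the hardware can play it.
    if 48000 in hardware_rates:
        return 48000, False
    highest = max(hardware_rates)
    # Step 2: the highest hardware rate is itself a supported rate.
    if highest in OPUS_RATES:
        return highest, False
    # Step 3: decode at the next supported rate above, then resample.
    if highest < 48000:
        return min(r for r in OPUS_RATES if r > highest), True
    # Step 4: decode at 48 kHz and resample.
    return 48000, True
```

With this interpretation, a device limited to 22.05 kHz decodes at 24 kHz and resamples, while a device limited to 44.1 kHz decodes at 48 kHz and resamples.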

524	   All fields in the ID headers are REQUIRED, except for the channel
525	   mapping table, which is omitted when the channel mapping family is 0.
526	   Implementations SHOULD reject ID headers which do not contain enough
527	   data for these fields, even if they contain a valid Magic Signature.
528	   Future versions of this specification, even backwards-compatible
529	   versions, might include additional fields in the ID header.  If an ID
530	   header has a compatible major version, but a larger minor version, an
531	   implementation MUST NOT reject it for containing additional data not
532	   specified here.  However, implementations MAY reject streams in which
533	   the ID header does not complete on the first page.
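
As an illustration of the layout in Figure 1, the following Python sketch parses the fixed 19-octet part of an ID header and converts the output gain into a linear scale factor. The function names, the dictionary keys, and the decision to raise `ValueError` on every failure are assumptions of this example, not part of the specification. The channel mapping table, when present, is returned unparsed.

```python
import struct

ID_HEADER_FORMAT = "<8sBBHIhB"  # little endian, no padding: 19 octets


def parse_id_header(packet):
    """Parse the fixed part of an ID header packet into a dict."""
    fixed_size = struct.calcsize(ID_HEADER_FORMAT)
    if len(packet) < fixed_size:
        raise ValueError("ID header too short")
    (magic, version, channels, pre_skip, input_rate,
     output_gain, family) = struct.unpack_from(ID_HEADER_FORMAT, packet)
    if magic != b"OpusHead":
        raise ValueError("missing OpusHead magic signature")
    if version > 15:
        # A version of 16 or greater signals an incompatible change.
        raise ValueError("incompatible version")
    if channels == 0:
        raise ValueError("channel count must not be zero")
    if family != 0 and len(packet) < fixed_size + 2:
        # At least the stream count and coupled count must be present.
        raise ValueError("channel mapping table missing")
    return {
        "version": version,
        "channels": channels,
        "pre_skip": pre_skip,
        "input_sample_rate": input_rate,
        "output_gain": output_gain,  # raw Q7.8 value in dB
        "mapping_family": family,
        "mapping_table": bytes(packet[fixed_size:]),
    }


def gain_factor(output_gain):
    """Linear factor for a raw Q7.8 output gain value."""
    return 10.0 ** (output_gain / (20.0 * 256))
```

Trailing data beyond the fields defined here is kept rather than rejected, in line with the requirement that larger minor versions may carry additional fields.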

535	5.1.1.  Channel Mapping

537	   An Ogg Opus stream allows mapping one number of Opus streams (N) to a
538	   possibly larger number of decoded channels (M+N) to yet another
539	   number of output channels (C), which might be larger or smaller than
540	   the number of decoded channels.  The order and meaning of these channels
541	   are defined by a channel mapping, which consists of the 'channel
542	   mapping family' octet and, for channel mapping families other than
543	   family 0, a channel mapping table, as illustrated in Figure 2.

545	      0                   1                   2                   3
546	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
547	                                                     +-+-+-+-+-+-+-+-+
548	                                                     | Stream Count  |
549	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
550	     | Coupled Count |              Channel Mapping...               :
551	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

553	                      Figure 2: Channel Mapping Table

555	   The fields in the channel mapping table have the following meaning:

557	   1.  *Stream Count* 'N' (8 bits, unsigned):

559	       This is the total number of streams encoded in each Ogg packet.
560	       This value is required to correctly parse the packed Opus packets
561	       inside an Ogg packet, as described in Section 3.  This value MUST
562	       NOT be zero, as without at least one Opus packet with a valid TOC
563	       sequence, a demuxer cannot recover the duration of an Ogg packet.

565	       For channel mapping family 0, this value defaults to 1, and is
566	       not coded.

568	   2.  *Coupled Stream Count* 'M' (8 bits, unsigned): This is the number
569	       of streams whose decoders should be configured to produce two
570	       channels.  This MUST be no larger than the total number of
571	       streams, N.

573	       Each packet in an Opus stream has an internal channel count of 1
574	       or 2, which can change from packet to packet.  This is selected
575	       by the encoder depending on the bitrate and the contents being
576	       encoded.  The original channel count of the encoder input is not
577	       preserved by the lossy compression.

579	       Regardless of the internal channel count, any Opus stream can be
580	       decoded as mono (a single channel) or stereo (two channels) by
581	       appropriate initialization of the decoder.  The 'coupled stream
582	       count' field indicates that the first M Opus decoders are to be
583	       initialized in stereo mode, and the remaining N-M decoders are to
584	       be initialized in mono mode.  The total number of decoded
585	       channels, (M+N), MUST be no larger than 255, as there is no way
586	       to index more channels than that in the channel mapping.

588	       For channel mapping family 0, this value defaults to C-1 (i.e., 0
589	       for mono and 1 for stereo), and is not coded.

591	   3.  *Channel Mapping* (8*C bits): This contains one octet per output
592	       channel, indicating which decoded channel should be used for each
593	       one.  Let 'index' be the value of this octet for a particular
594	       output channel.  This value MUST either be smaller than (M+N), or
595	       be the special value 255.  If 'index' is less than 2*M, the
596	       output MUST be taken from decoding stream ('index'/2) as stereo
597	       and selecting the left channel if 'index' is even, and the right
598	       channel if 'index' is odd.  If 'index' is 2*M or larger, the
599	       output MUST be taken from decoding stream ('index'-M) as mono.
600	       If 'index' is 255, the corresponding output channel MUST contain
601	       pure silence.

603	       The number of output channels, C, is not constrained to match the
604	       number of decoded channels (M+N).  A single index value MAY
605	       appear multiple times, i.e., the same decoded channel might be
606	       mapped to multiple output channels.  Some decoded channels might
607	       not be assigned to any output channel, as well.

609	       For channel mapping family 0, the first index defaults to 0, and
610	       if C==2, the second index defaults to 1.  Neither index is coded.
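
   The index rules above can be sketched in Python as follows.  This
   is an illustrative sketch, not a normative implementation; the
   function names are hypothetical, and a resolved channel is
   represented here as a (stream, channel-within-stream) pair, with
   None standing for pure silence.

```python
def resolve_output_channel(index, stream_count, coupled_count):
    """Map one channel mapping octet to (stream, channel) or None (silence)."""
    n, m = stream_count, coupled_count
    # N must be non-zero, M must not exceed N, and M+N must not exceed 255.
    if n == 0 or m > n or m + n > 255:
        raise ValueError("invalid stream count / coupled stream count")
    if index == 255:
        return None  # special value: output channel is pure silence
    if index >= m + n:
        raise ValueError("channel mapping index out of range")
    if index < 2 * m:
        # Stereo stream: even index is the left channel, odd is the right.
        return (index // 2, index % 2)
    # Mono stream: the streams after the first M coupled ones.
    return (index - m, 0)


def parse_channel_mapping_table(data, channel_count):
    """Parse stream count, coupled count, and C mapping octets from bytes."""
    if len(data) < 2 + channel_count:
        raise ValueError("channel mapping table is truncated")
    n, m = data[0], data[1]
    mapping = data[2:2 + channel_count]
    return [resolve_output_channel(i, n, m) for i in mapping]


def family0_defaults(channel_count):
    """Implicit (N, M, mapping) for family 0, where no table is coded."""
    if channel_count not in (1, 2):
        raise ValueError("family 0 allows only 1 or 2 channels")
    return 1, channel_count - 1, list(range(channel_count))
```

   For example, a six-channel table with N=4, M=2 and mapping octets
   0, 4, 1, 2, 3, 5 takes output channels 0 and 2 from the first
   stereo stream, channels 3 and 4 from the second stereo stream, and
   channels 1 and 5 from the two mono streams.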

612	   After producing the output channels, the channel mapping family
613	   determines the semantic meaning of each one.  Currently there are
614	   three defined mapping families, although more may be added:

616	   o  Family 0 (RTP mapping):

618	      Allowed numbers of channels: 1 or 2.

620	      *  1 channel: monophonic (mono).

622	      *  2 channels: stereo (left, right).

624	      *Special mapping*: This channel mapping value also indicates that
625	      the content consists of a single Opus stream that is stereo if
626	      and only if C==2, with stream index 0 mapped to channel 0, and (if
627	      stereo) stream index 1 mapped to channel 1.  When the 'channel
628	      mapping family' octet has this value, the channel mapping table
629	      MUST be omitted from the ID header packet.

631	   o  Family 1 (Vorbis channel order):

633	      Allowed numbers of channels: 1...8.
634	      Channel meanings depend on the number of channels.  See
635	      [vorbis-mapping] for the assignments from output channel number to
636	      specific speaker locations.

638	   o  Family 255 (no defined channel meaning):

640	      Allowed numbers of channels: 1...255.
641	      Channels are unidentified.  General-purpose players SHOULD NOT
642	      attempt to play these streams, and offline decoders MAY
643	      deinterleave the output into separate PCM files, one per channel.
644	      Decoders SHOULD NOT produce output for channels mapped to stream
645	      index 255 (pure silence) unless they have no other way to indicate
646	      the index of non-silent channels.

648	   The remaining channel mapping families (2...254) are reserved.  A
649	   decoder encountering a reserved channel mapping family value SHOULD
650	   act as though the value is 255.
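
   A decoder might check the channel count against the mapping family
   along the following lines.  This is a minimal sketch with
   hypothetical names; it follows the recommendation above by
   treating reserved families as family 255, which an implementation
   is not strictly required to do.

```python
# Largest channel count allowed by each defined mapping family.
MAX_CHANNELS = {0: 2, 1: 8, 255: 255}


def effective_family(family):
    """Return the family to act on; reserved values (2...254) act as 255."""
    if not 0 <= family <= 255:
        raise ValueError("channel mapping family is a single octet")
    return family if family in MAX_CHANNELS else 255


def channel_count_allowed(family, channel_count):
    """True if channel_count is in the allowed range for this family."""
    return 1 <= channel_count <= MAX_CHANNELS[effective_family(family)]
```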

652	   An Ogg Opus player MUST play any Ogg Opus stream with a channel
653	   mapping family of 0 or 1, even if the number of channels does not
654	   match the physically connected audio hardware.  Players SHOULD
655	   perform channel mixing to increase or reduce the number of channels
656	   as needed.

658	5.2.  Comment Header

660	      0                   1                   2                   3
661	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
662	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
663	     |      'O'      |      'p'      |      'u'      |      's'      |
664	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
665	     |      'T'      |      'a'      |      'g'      |      's'      |
666	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
667	     |                     Vendor String Length                      |
668	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
669	     |                                                               |
670	     :                        Vendor String...                       :
671	     |                                                               |
672	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
673	     |                   User Comment List Length                    |
674	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
675	     |                 User Comment #0 String Length                 |
676	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
677	     |                                                               |
678	     :                   User Comment #0 String...                   :
679	     |                                                               |
680	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
681	     |                 User Comment #1 String Length                 |
682	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
683	     :                                                               :

685	                      Figure 3: Comment Header Packet

687	   The comment header consists of a 64-bit magic signature, followed by
688	   data in the same format as the [vorbis-comment] header used in Ogg
689	   Vorbis (without the final "framing bit"), Ogg Theora, and Speex.

691	   1.  *Magic Signature*:

693	       This is an 8-octet (64-bit) field that allows codec
694	       identification and is human-readable.  It contains, in order, the
695	       magic numbers:

697	          0x4F 'O'

699	          0x70 'p'

701	          0x75 'u'

703	          0x73 's'
704	          0x54 'T'

706	          0x61 'a'

708	          0x67 'g'

710	          0x73 's'

712	       Starting with "Op" helps distinguish it from audio data packets,
713	       as this is an invalid TOC sequence.

715	   2.  *Vendor String Length* (32 bits, unsigned, little endian):

717	       This field gives the length of the following vendor string, in
718	       octets.  It MUST NOT indicate that the vendor string is longer
719	       than the rest of the packet.

721	   3.  *Vendor String* (variable length, UTF-8 vector):

723	       This is a simple human-readable tag for vendor information,
724	       encoded as a UTF-8 string [RFC3629].  No terminating NUL octet is
725	       required.

727	       This tag is intended to identify the codec encoder and
728	       encapsulation implementations, for tracing differences in
729	       technical behavior.  User-facing encoding applications can use
730	       the 'ENCODER' user comment tag to identify themselves.

732	   4.  *User Comment List Length* (32 bits, unsigned, little endian):

734	       This field indicates the number of user-supplied comments.  It
735	       MAY indicate there are zero user-supplied comments, in which case
736	       there are no additional fields in the packet.  It MUST NOT
737	       indicate that there are so many comments that the comment string
738	       lengths would require more data than is available in the rest of
739	       the packet.

741	   5.  *User Comment #i String Length* (32 bits, unsigned, little
742	       endian):

744	       This field gives the length of the following user comment string,
745	       in octets.  There is one for each user comment indicated by the
746	       'user comment list length' field.  It MUST NOT indicate that the
747	       string is longer than the rest of the packet.

749	   6.  *User Comment #i String* (variable length, UTF-8 vector):

751	       This field contains a single user comment string.  There is one
752	       for each user comment indicated by the 'user comment list length'
753	       field.

755	   The vendor string length and user comment list length are REQUIRED,
756	   and implementations SHOULD reject comment headers that do not contain
757	   enough data for these fields, or that do not contain enough data for
758	   the corresponding vendor string or user comments they describe.
759	   Making this check before allocating the associated memory to contain
760	   the data may help prevent a possible Denial-of-Service (DoS) attack
761	   from small comment headers that claim to contain strings longer than
762	   the entire packet or more user comments than could possibly fit
763	   in the packet.
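
   The checks described above might look like the following Python
   sketch, which validates every length field against the octets
   remaining in the packet before slicing or allocating anything.
   The function names are hypothetical, and rejecting with ValueError
   is an assumption of this sketch rather than a requirement.

```python
import struct

MAGIC = b"OpusTags"


def _read_u32(packet, pos):
    """Read a 32-bit unsigned little-endian integer, checking bounds."""
    if len(packet) - pos < 4:
        raise ValueError("comment header is truncated")
    return struct.unpack_from("<I", packet, pos)[0], pos + 4


def parse_comment_header(packet):
    """Return (vendor_string, [user_comment, ...]) from a comment header."""
    if packet[:8] != MAGIC:
        raise ValueError("missing 'OpusTags' magic signature")
    vendor_len, pos = _read_u32(packet, 8)
    if vendor_len > len(packet) - pos:
        raise ValueError("vendor string longer than rest of packet")
    vendor = packet[pos:pos + vendor_len].decode("utf-8")
    pos += vendor_len

    count, pos = _read_u32(packet, pos)
    # Every comment needs at least a 4-octet length field, so a count
    # larger than this cannot possibly fit in the remaining data.
    if count > (len(packet) - pos) // 4:
        raise ValueError("more user comments than could fit in packet")

    comments = []
    for _ in range(count):
        length, pos = _read_u32(packet, pos)
        if length > len(packet) - pos:
            raise ValueError("user comment longer than rest of packet")
        comments.append(packet[pos:pos + length].decode("utf-8"))
        pos += length
    return vendor, comments
```

   Because each length is compared with the remaining packet size
   first, a 16-octet header claiming a 4 GB vendor string is rejected
   without any large allocation.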

765	   The user comment strings follow the NAME=value format described by
766	   [vorbis-comment] with the same recommended tag names.  One new
767	   comment tag is introduced for Ogg Opus:

769	   R128_TRACK_GAIN=-573

771	   representing the volume shift needed to normalize the track's volume.
772	   The gain is a Q7.8 fixed point number in dB, as in the ID header's
773	   'output gain' field.  This tag is similar to the
774	   REPLAYGAIN_TRACK_GAIN tag in Vorbis [replay-gain], except that the
775	   normal volume reference is the [EBU-R128] standard.

777	   An Ogg Opus file MUST NOT have more than one such tag, and if present
778	   its value MUST be an integer from -32768 to 32767, inclusive,
779	   represented in ASCII with no whitespace.  If present, it MUST
780	   correctly represent the R128 normalization gain relative to the
781	   'output gain' field specified in the ID header.  If a player chooses
782	   to make use of the R128_TRACK_GAIN tag, it MUST be applied _in
783	   addition_ to the 'output gain' value.  If an encoder wishes to use
784	   R128 normalization, and the output gain is not otherwise constrained
785	   or specified, the encoder SHOULD write the R128 gain into the 'output
786	   gain' field and store a tag containing "R128_TRACK_GAIN=0".  That is,
787	   it should assume that by default tools will respect the 'output gain'
788	   field, and not the comment tag.  If a tool modifies the ID header's
789	   'output gain' field, it MUST also update or remove the
790	   R128_TRACK_GAIN comment tag.
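
   As a worked example, the tag value -573 is a Q7.8 number, so it
   represents -573/256, or about -2.24 dB.  The sketch below parses
   and range-checks the tag and adds it to the 'output gain' value.
   The function names are hypothetical.  Matching the tag name
   without regard to case follows [vorbis-comment]; accepting only an
   optional leading '-' sign is an assumption of this sketch, since
   the text above says only that the value is an ASCII integer.

```python
import re


def parse_r128_track_gain(comment):
    """Return the R128_TRACK_GAIN value of a 'NAME=value' string, in dB."""
    name, sep, value = comment.partition("=")
    if not sep or name.upper() != "R128_TRACK_GAIN":
        raise ValueError("not an R128_TRACK_GAIN comment")
    # ASCII digits only, no whitespace; optional '-' is an assumption.
    if not re.fullmatch(r"-?[0-9]+", value):
        raise ValueError("gain is not a plain ASCII integer")
    raw = int(value)
    if not -32768 <= raw <= 32767:
        raise ValueError("gain outside the range -32768 to 32767")
    return raw / 256.0  # Q7.8 fixed point: 8 fractional bits


def total_gain_db(output_gain_q78, r128_gain_db):
    """Track gain is applied in addition to the ID header's output gain."""
    return output_gain_q78 / 256.0 + r128_gain_db
```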

792	   To avoid confusion with multiple normalization schemes, an Opus
793	   comment header SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN,
794	   REPLAYGAIN_TRACK_PEAK, REPLAYGAIN_ALBUM_GAIN, or
795	   REPLAYGAIN_ALBUM_PEAK tags.

797	   There is no Opus comment tag corresponding to REPLAYGAIN_ALBUM_GAIN.
798	   That information should instead be stored in the ID header's 'output
799	   gain' field.

801	6.  Packet Size Limits

803	   Technically valid Opus packets can be arbitrarily large due to the
804	   padding format, although the amount of non-padding data they can
805	   contain is bounded.  These packets might be spread over a similarly
806	   enormous number of Ogg pages.  Encoders SHOULD use no more padding
807	   than required to make a variable bitrate (VBR) stream constant
808	   bitrate (CBR).  Decoders SHOULD avoid attempting to allocate
809	   excessive amounts of memory when presented with a very large packet.
810	   The presence of an extremely large packet in the stream could
811	   indicate a memory exhaustion attack or stream corruption.  Decoders
812	   SHOULD reject a packet that is too large to process, and display a
813	   warning message.

815	   In an Ogg Opus stream, the largest possible valid packet that does
816	   not use padding has a size of (61,298*N - 2) octets, or about 60 kB
817	   per Opus stream.  With 255 streams, this is 15,630,988 octets
818	   (14.9 MB) and can span up to 61,298 Ogg pages, all but one of which
819	   will have a granule position of -1.  This is of course a very extreme
820	   packet, consisting of 255 streams, each containing 120 ms of audio
821	   encoded as 2.5 ms frames, each frame using the maximum possible
822	   number of octets (1275) and stored in the least efficient manner
823	   allowed (a VBR code 3 Opus packet).  Even in such a packet, most of
824	   the data will be zeros, as 2.5 ms frames, which are required to run
825	   in the MDCT mode, cannot actually use all 1275 octets.  The largest
826	   packet consisting of entirely useful data is (15,326*N - 2) octets,
827	   or about 15 kB per stream.  This corresponds to 120 ms of audio
828	   encoded as 10 ms frames in either LP or Hybrid mode, but at a data
829	   rate of over 1 Mbps, which makes little sense for the quality
830	   achieved.  A more reasonable limit is (7,664*N - 2) octets, or about
831	   7.5 kB per stream.  This corresponds to 120 ms of audio encoded as
832	   20 ms stereo MDCT-mode frames, with a total bitrate just under
833	   511 kbps (not counting the Ogg encapsulation overhead).  With N=8,
834	   the maximum number of channels currently defined by mapping family 1,
835	   this gives a maximum packet size of 61,310 octets, or just under
836	   60 kB.  This is still quite conservative, as it assumes each output
837	   channel is taken from one decoded channel of a stereo packet.  An
838	   implementation could reasonably choose any of these numbers for its
839	   internal limits.
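
   The figures in this section can be reproduced with a few lines of
   Python.  This is only a sketch of the arithmetic; the per-stream
   constants are the ones quoted above, and the names and the choice
   of default limit are hypothetical.

```python
# Per-stream octet bounds quoted in this section.
MAX_UNPADDED = 61298      # 120 ms of 2.5 ms frames, VBR code 3
MAX_USEFUL = 15326        # 120 ms of 10 ms LP or Hybrid frames
MAX_REASONABLE = 7664     # 120 ms of 20 ms stereo MDCT-mode frames


def max_packet_octets(per_stream, stream_count):
    """Largest Ogg packet for N streams: per_stream*N - 2 octets."""
    return per_stream * stream_count - 2


def bitrate_bps(octets, duration_ms=120):
    """Bitrate of one stream carrying `octets` in `duration_ms` of audio."""
    return octets * 8 * 1000 / duration_ms


def packet_within_limit(size, stream_count, per_stream=MAX_REASONABLE):
    """True if a packet of `size` octets is within the chosen limit."""
    return size <= max_packet_octets(per_stream, stream_count)
```

   With these definitions, max_packet_octets(MAX_UNPADDED, 255) gives
   15,630,988 octets, max_packet_octets(MAX_REASONABLE, 8) gives
   61,310 octets, and bitrate_bps(MAX_REASONABLE) is just under
   511 kbps.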

841	7.  Security Considerations

843	   Implementations of the Opus codec need to take appropriate security
844	   considerations into account, as outlined in [RFC4732].  This is just
845	   as much a problem for the container as it is for the codec itself.
846	   It is extremely important for the decoder to be robust against
847	   malicious payloads.  Malicious payloads must not cause the decoder to
848	   overrun its allocated memory or to take an excessive amount of
849	   resources to decode.  Although problems in encoders are typically
850	   rarer, the same applies to the encoder.  Malicious audio streams must
851	   not cause the encoder to misbehave because this would allow an
852	   attacker to attack transcoding gateways.

854	   Like most other container formats, Ogg Opus files should not be used
855	   with insecure ciphers or cipher modes that are vulnerable to known-
856	   plaintext attacks.  Elements such as the Ogg page capture pattern and
857	   the magic signatures in the ID header and the comment header all have
858	   easily predictable values, in addition to various elements of the
859	   codec data itself.

861	8.  Content Type

863	   An "Ogg Opus file" consists of one or more sequentially multiplexed
864	   segments, each containing exactly one Ogg Opus stream.  The
865	   RECOMMENDED mime-type for Ogg Opus files is "audio/ogg".  When Opus
866	   is concurrently multiplexed with other streams in an Ogg container,
867	   one SHOULD use one of the "audio/ogg", "video/ogg", or "application/
868	   ogg" mime-types, as defined in [RFC5334].

870	   If more specificity is desired, one MAY indicate the presence of Opus
871	   streams using the codecs parameter defined in [RFC6381], e.g.,

873	   audio/ogg; codecs=opus

875	   for an Ogg Opus file.

877	   The RECOMMENDED filename extension for Ogg Opus files is '.opus'.

879	9.  IANA Considerations

881	   This document has no actions for IANA.

883	10.  Acknowledgments

885	   Thanks to Ralph Giles, Greg Maxwell, Christopher "Monty" Montgomery,
886	   and Jean-Marc Valin for their valuable contributions to this
887	   document.  Additional thanks to Andrew D'Addesio, Ralph Giles, Greg
888	   Maxwell, and Vincent Penquerc'h for their feedback based on early
889	   implementations.

891	11.  Copying Conditions

893	   The authors agree to grant third parties the irrevocable right to
894	   copy, use, and distribute the work, with or without modification, in
895	   any medium, without royalty, provided that, unless separate
896	   permission is granted, redistributed modified works do not contain
897	   misleading author, version, name of work, or endorsement information.

899	12.  References

901	12.1.  Normative References

903	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
904	              Requirement Levels", BCP 14, RFC 2119, March 1997.

906	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
907	              10646", STD 63, RFC 3629, November 2003.

909	   [RFC3533]  Pfeiffer, S., "The Ogg Encapsulation Format Version 0",
910	              RFC 3533, May 2003.

912	   [RFC5334]  Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media
913	              Types", RFC 5334, September 2008.

915	   [RFC6381]  Gellens, R., Singer, D., and P. Frojdh, "The 'Codecs' and
916	              'Profiles' Parameters for "Bucket" Media Types", RFC 6381,
917	              August 2011.

919	   [RFCOpus]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the
920	              Opus Audio Codec", RFC XXXX.

922	   [EBU-R128]
923	              "Loudness Recommendation EBU R128",
924	              <http://tech.ebu.ch/loudness>.

926	   [vorbis-comment]
927	              Montgomery, C., "Ogg Vorbis I Format Specification:
928	              Comment Field and Header Specification",
929	              <http://www.xiph.org/vorbis/doc/v-comment.html>.

931	   [vorbis-mapping]
932	              Montgomery, C., "The Vorbis I Specification, Section 4.3.9
933	              Output Channel Order", <http://www.xiph.org/vorbis/doc/
934	              Vorbis_I_spec.html#x1-800004.3.9>.

936	12.2.  Informative References

938	   [RFC4732]  Handley, M., Rescorla, E., and IAB, "Internet Denial-of-
939	              Service Considerations", RFC 4732, December 2006.

941	   [replay-gain]
942	              Parker, C. and M. Leese, "VorbisComment: Replay Gain",
943	              <http://wiki.xiph.org/VorbisComment#Replay_Gain>.

945	   [seeking]  Pfeiffer, S., Parker, C., and G. Maxwell, "Granulepos
946	              Encoding and How Seeking Really Works",
947	              <http://wiki.xiph.org/Seeking>.

949	   [vorbis-trim]
950	              Montgomery, C., "The Vorbis I Specification, Appendix A
951	              Embedding Vorbis into an Ogg stream", <http://xiph.org/
952	              vorbis/doc/Vorbis_I_spec.html#x1-130000A.2>.

954	Authors' Addresses

956	   Timothy B. Terriberry
957	   Mozilla Corporation
958	   650 Castro Street
959	   Mountain View, CA  94041
960	   USA

962	   Phone: +1 650 903-0800
963	   Email: tterribe@xiph.org

965	   Ron Lee
966	   Voicetronix
967	   246 Pulteney Street, Level 1
968	   Adelaide, SA  5000
969	   Australia

971	   Phone: +61 8 8232 9112
972	   Email: ron@debian.org

974	   Ralph Giles
975	   Mozilla Corporation
976	   163 West Hastings Street
977	   Vancouver, BC  V6B 1H5
978	   Canada

980	   Phone: +1 604 778 1540
981	   Email: giles@xiph.org