idnits 2.17.1 

draft-ietf-codec-oggopus-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (February 7, 2014) is 3723 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 1295

  ** Downref: Normative reference to an Informational RFC: RFC 3533

  -- Possible downref: Non-RFC (?) normative reference: ref. 'EBU-R128'

  -- Obsolete informational reference (is this intentional?): RFC 6982
     (Obsoleted by RFC 7942)


     Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	codec                                                      T. Terriberry
3	Internet-Draft                                       Mozilla Corporation
4	Intended status: Standards Track                                  R. Lee
5	Expires: August 11, 2014                                     Voicetronix
6	                                                                R. Giles
7	                                                     Mozilla Corporation
8	                                                        February 7, 2014

10	               Ogg Encapsulation for the Opus Audio Codec
11	                      draft-ietf-codec-oggopus-03

13	Abstract

15	   This document defines the Ogg encapsulation for the Opus interactive
16	   speech and audio codec.  This allows data encoded in the Opus format
17	   to be stored in an Ogg logical bitstream.  Ogg encapsulation provides
18	   Opus with a long-term storage format supporting all of the essential
19	   features, including metadata, fast and accurate seeking, corruption
20	   detection, recapture after errors, low overhead, and the ability to
21	   multiplex Opus with other codecs (including video) with minimal
22	   buffering.  It also provides a live streamable format, capable of
23	   delivery over a reliable stream-oriented transport, without requiring
24	   all the data, or even the total length of the data, up-front, in a
25	   form that is identical to the on-disk storage format.

27	Status of This Memo

29	   This Internet-Draft is submitted in full conformance with the
30	   provisions of BCP 78 and BCP 79.

32	   Internet-Drafts are working documents of the Internet Engineering
33	   Task Force (IETF).  Note that other groups may also distribute
34	   working documents as Internet-Drafts.  The list of current Internet-
35	   Drafts is at http://datatracker.ietf.org/drafts/current/.

37	   Internet-Drafts are draft documents valid for a maximum of six months
38	   and may be updated, replaced, or obsoleted by other documents at any
39	   time.  It is inappropriate to use Internet-Drafts as reference
40	   material or to cite them other than as "work in progress."

42	   This Internet-Draft will expire on August 11, 2014.

44	Copyright Notice

46	   Copyright (c) 2014 IETF Trust and the persons identified as the
47	   document authors.  All rights reserved.

49	   This document is subject to BCP 78 and the IETF Trust's Legal
50	   Provisions Relating to IETF Documents
51	   (http://trustee.ietf.org/license-info) in effect on the date of
52	   publication of this document.  Please review these documents
53	   carefully, as they describe your rights and restrictions with respect
54	   to this document.  Code Components extracted from this document must
55	   include Simplified BSD License text as described in Section 4.e of
56	   the Trust Legal Provisions and are provided without warranty as
57	   described in the Simplified BSD License.

59	Table of Contents

61	   1.  Introduction
62	   2.  Terminology
63	   3.  Packet Organization
64	   4.  Granule Position
65	     4.1.  Repairing Gaps in Real-time Streams
66	     4.2.  Pre-skip
67	     4.3.  PCM Sample Position
68	     4.4.  End Trimming
69	     4.5.  Restrictions on the Initial Granule Position
70	     4.6.  Seeking and Pre-roll
71	   5.  Header Packets
72	     5.1.  Identification Header
73	       5.1.1.  Channel Mapping
74	     5.2.  Comment Header
75	   6.  Packet Size Limits
76	   7.  Encoder Guidelines
77	     7.1.  LPC Extrapolation
78	     7.2.  Continuous Chaining
79	   8.  Implementation Status
80	   9.  Security Considerations
81	   10. Content Type
82	   11. IANA Considerations
83	   12. Acknowledgments
84	   13. Copying Conditions
85	   14. References
86	     14.1.  Normative References
87	     14.2.  Informative References
88	     14.3.  URIs
89	   Authors' Addresses

91	1.  Introduction

93	   The IETF Opus codec is a low-latency audio codec optimized for both
94	   voice and general-purpose audio.  See [RFC6716] for technical
95	   details.  This document defines the encapsulation of Opus in a
96	   continuous, logical Ogg bitstream [RFC3533].

98	   Ogg bitstreams are made up of a series of 'pages', each of which
99	   contains data from one or more 'packets'.  Pages are the fundamental
100	   unit of multiplexing in an Ogg stream.  Each page is associated with
101	   a particular logical stream and contains a capture pattern and
102	   checksum, flags to mark the beginning and end of the logical stream,
103	   and a 'granule position' that represents an absolute position in the
104	   stream, to aid seeking.  A single page can contain up to 65,025
105	   octets of packet data from up to 255 different packets.  Packets may
106	   be split arbitrarily across pages, and continued from one page to the
107	   next (allowing packets much larger than would fit on a single page).
108	   Each page contains 'lacing values' that indicate how the data is
109	   partitioned into packets, allowing a demuxer to recover the packet
110	   boundaries without examining the encoded data.  A packet is said to
111	   'complete' on a page when the page contains the final lacing value
112	   corresponding to that packet.
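
   As an illustration of how lacing values delimit packets, the
   following Python sketch walks the lacing values of one page.  It
   relies on the Ogg rule from [RFC3533] that a lacing value of 255
   means the packet continues in the next segment, while any smaller
   value ends the packet.  The function name is hypothetical, and the
   sketch ignores the 'continued packet' flag: if the page starts with
   a continued packet, the first length returned covers only the part
   carried on this page.

```python
def packet_lengths(lacing_values):
    """Return (completed_lengths, partial_length) for one page.

    completed_lengths lists the octet count of each run of segments that
    ends on this page.  partial_length is the size of a trailing run that
    continues onto the next page, or None if the last packet completes.
    """
    if len(lacing_values) > 255:
        raise ValueError("a page holds at most 255 lacing values")
    completed = []
    run = 0
    in_run = False
    for value in lacing_values:
        if not 0 <= value <= 255:
            raise ValueError("each lacing value is a single octet")
        run += value
        in_run = True
        if value < 255:
            # A value below 255 is the final lacing value of a packet.
            completed.append(run)
            run = 0
            in_run = False
    return completed, (run if in_run else None)
```

   For example, the lacing values 255, 255, 10, 0, 255 describe a
   520-octet packet, a zero-octet packet, and 255 octets of a packet
   that continues on the next page.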

114	   This encapsulation defines the required contents of the packet data,
115	   including the necessary headers, the organization of those packets
116	   into a logical stream, and the interpretation of the codec-specific
117	   granule position field.  It does not attempt to describe or specify
118	   the existing Ogg container format.  Readers unfamiliar with the basic
119	   concepts mentioned above are encouraged to review the details in
120	   [RFC3533].

122	2.  Terminology

124	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
125	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
126	   "OPTIONAL" in this document are to be interpreted as described in
127	   [RFC2119].

129	   Implementations that fail to satisfy one or more "MUST" requirements
130	   are considered non-compliant.  Implementations that satisfy all
131	   "MUST" requirements, but fail to satisfy one or more "SHOULD"
132	   requirements are said to be "conditionally compliant".  All other
133	   implementations are "unconditionally compliant".

135	3.  Packet Organization

137	   An Opus stream is organized as follows.

139	   There are two mandatory header packets.  The granule position of the
140	   pages on which these packets complete MUST be zero.

142	   The first packet in the logical Ogg bitstream MUST contain the
143	   identification (ID) header, which uniquely identifies a stream as
144	   Opus audio.  The format of this header is defined in Section 5.1.  It
145	   MUST be placed alone (without any other packet data) on the first
146	   page of the logical Ogg bitstream, and MUST complete on that page.
147	   This page MUST have its 'beginning of stream' flag set.

149	   The second packet in the logical Ogg bitstream MUST contain the
150	   comment header, which contains user-supplied metadata.  The format of
151	   this header is defined in Section 5.2.  It MAY span one or more
152	   pages, beginning on the second page of the logical stream.  However
153	   many pages it spans, the comment header packet MUST finish the page
154	   on which it completes.

156	   All subsequent pages are audio data pages, and the Ogg packets they
157	   contain are audio data packets.  Each audio data packet contains one
158	   Opus packet for each of N different streams, where N is typically one
159	   for mono or stereo, but may be greater than one for multichannel
160	   audio.  The value N is specified in the ID header (see
161	   Section 5.1.1), and is fixed over the entire length of the logical
162	   Ogg bitstream.

164	   The first N-1 Opus packets, if any, are packed one after another into
165	   the Ogg packet, using the self-delimiting framing from Appendix B of
166	   [RFC6716].  The remaining Opus packet is packed at the end of the Ogg
167	   packet using the regular, undelimited framing from Section 3 of
168	   [RFC6716].  All of the Opus packets in a single Ogg packet MUST be
169	   constrained to have the same duration.  A decoder SHOULD treat any
170	   Opus packet whose duration is different from that of the first Opus
171	   packet in an Ogg packet as if it were an Opus packet with an illegal
172	   TOC sequence.

174	   The coding mode (SILK, Hybrid, or CELT), audio bandwidth, channel
175	   count, duration (frame size), and number of frames per packet, are
176	   indicated in the TOC (table of contents) in the first byte of each
177	   Opus packet, as described in Section 3.1 of [RFC6716].  The
178	   combination of mode, audio bandwidth, and frame size is referred to
179	   as the configuration of an Opus packet.
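
   The TOC fields can be unpacked with a few shifts and masks.  The
   Python sketch below assumes the layout and configuration table from
   Section 3.1 of [RFC6716]: the top five bits select the
   configuration, the next bit is the stereo flag, and the low two bits
   are the frame count code.  The names Toc and parse_toc are
   hypothetical.

```python
from typing import NamedTuple


class Toc(NamedTuple):
    mode: str        # "SILK", "Hybrid", or "CELT"
    bandwidth: str   # "NB", "MB", "WB", "SWB", or "FB"
    frame_ms: float  # duration of one frame in milliseconds
    stereo: bool
    code: int        # frame count code, 0..3


def parse_toc(toc_byte):
    """Decode the first byte of an Opus packet."""
    if not 0 <= toc_byte <= 255:
        raise ValueError("TOC is a single octet")
    config = toc_byte >> 3
    stereo = bool(toc_byte & 0x04)
    code = toc_byte & 0x03
    if config < 12:
        mode = "SILK"
        bandwidth = ("NB", "MB", "WB")[config // 4]
        frame_ms = (10, 20, 40, 60)[config % 4]
    elif config < 16:
        mode = "Hybrid"
        bandwidth = ("SWB", "FB")[(config - 12) // 2]
        frame_ms = (10, 20)[config % 2]
    else:
        mode = "CELT"
        bandwidth = ("NB", "WB", "SWB", "FB")[(config - 16) // 4]
        frame_ms = (2.5, 5, 10, 20)[config % 4]
    return Toc(mode, bandwidth, frame_ms, stereo, code)
```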

181	   The first audio data page SHOULD NOT have the 'continued packet' flag
182	   set (which would indicate the first audio data packet is continued
183	   from a previous page).  Packets MUST be placed into Ogg pages in
184	   order until the end of stream.  Audio packets MAY span page
185	   boundaries.  A decoder MUST treat a zero-octet audio data packet as
186	   if it were an Opus packet with an illegal TOC sequence.  The last
187	   page SHOULD have the 'end of stream' flag set, but implementations
188	   should be prepared to deal with truncated streams that do not have a
189	   page marked 'end of stream'.  The final packet on the last page
190	   SHOULD NOT be a continued packet, i.e., the final lacing value should
191	   be less than 255.  There MUST NOT be any more pages in an Opus
192	   logical bitstream after a page marked 'end of stream'.

194	4.  Granule Position

196	   The granule position of an audio data page encodes the total number
197	   of PCM samples in the stream up to and including the last fully-
198	   decodable sample from the last packet completed on that page.  A page
199	   that is entirely spanned by a single packet (that completes on a
200	   subsequent page) has no granule position, and the granule position
201	   field MUST be set to the special value '-1' in two's complement.

203	   The granule position of an audio data page is in units of PCM audio
204	   samples at a fixed rate of 48 kHz (per channel; a stereo stream's
205	   granule position does not increment at twice the speed of a mono
206	   stream).  It is possible to run an Opus decoder at other sampling
207	   rates, but the value in the granule position field always counts
208	   samples assuming a 48 kHz decoding rate, and the rest of this
209	   specification makes the same assumption.

211	   The duration of an Opus packet may be any multiple of 2.5 ms, up to a
212	   maximum of 120 ms.  This duration is encoded in the TOC sequence at
213	   the beginning of each packet.  The number of samples returned by a
214	   decoder corresponds to this duration exactly, even for the first few
215	   packets.  For example, a 20 ms packet fed to a decoder running at
216	   48 kHz will always return 960 samples.  A demuxer can parse the TOC
217	   sequence at the beginning of each Ogg packet to work backwards or
218	   forwards from a packet with a known granule position (i.e., the last
219	   packet completed on some page) in order to assign granule positions
220	   to every packet, or even every individual sample.  The one exception
221	   is the last page in the stream, as described below.
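
   A demuxer's duration calculation might look like the following
   Python sketch.  It assumes the frame sizes and frame count codes
   from Section 3 of [RFC6716] and counts samples at 48 kHz, as the
   granule position does.  The function names are hypothetical, and
   only the fields needed for the duration are examined; the rest of
   the packet is not validated.

```python
def frame_samples(config):
    """Samples per frame at 48 kHz for a TOC configuration (0..31)."""
    if config < 12:                       # SILK: 10, 20, 40, 60 ms
        return (480, 960, 1920, 2880)[config % 4]
    if config < 16:                       # Hybrid: 10, 20 ms
        return (480, 960)[config % 2]
    return (120, 240, 480, 960)[config % 4]  # CELT: 2.5, 5, 10, 20 ms


def packet_samples(packet):
    """Duration of one Opus packet in 48 kHz samples."""
    if len(packet) < 1:
        raise ValueError("zero-octet packet")
    toc = packet[0]
    code = toc & 0x03
    if code == 0:
        frames = 1
    elif code in (1, 2):
        frames = 2
    else:
        if len(packet) < 2:
            raise ValueError("code 3 packet lacks a frame count byte")
        frames = packet[1] & 0x3F
        if frames == 0:
            raise ValueError("code 3 packet with zero frames")
    samples = frames * frame_samples(toc >> 3)
    if samples > 5760:
        raise ValueError("packet exceeds 120 ms")
    return samples
```

   A single 20 ms frame yields 960 samples.  The two octets "Op", read
   as a TOC sequence, would describe 48 frames of 20 ms each, which
   exceeds the 120 ms limit and is rejected.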

223	   All other pages with completed packets after the first MUST have a
224	   granule position equal to the number of samples contained in packets
225	   that complete on that page plus the granule position of the most
226	   recent page with completed packets.  This guarantees that a demuxer
227	   can assign individual packets the same granule position when working
228	   forwards as when working backwards.  For this to work, there cannot
229	   be any gaps.
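
   The continuity rule can be checked mechanically.  The Python sketch
   below is one possible check, under the assumption that the caller
   has already computed the sample count of every packet that completes
   on each page.  The function name and the (granule, samples) pair
   representation are hypothetical, and the final page of the stream,
   which is the exception noted above, would need separate handling.

```python
def first_discontinuity(first_granule, later_pages):
    """Return the index of the first page that breaks continuity.

    first_granule is the granule position of the first audio data page
    with a completed packet.  later_pages is a list of
    (granule_position, [samples per completed packet]) pairs for the
    pages that follow it.  Returns None if every page is consistent.
    """
    previous = first_granule
    for index, (granule, samples) in enumerate(later_pages):
        if not samples:
            # No packet completes here, so the page has no granule
            # position (the field holds -1) and is skipped.
            continue
        if granule != previous + sum(samples):
            return index
        previous = granule
    return None
```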

231	4.1.  Repairing Gaps in Real-time Streams

233	   In order to support capturing a real-time stream that has lost or not
234	   transmitted packets, a muxer SHOULD emit packets that explicitly
235	   request the use of Packet Loss Concealment (PLC) in place of the
236	   missing packets.  Only gaps that are a multiple of 2.5 ms are
237	   repairable, as these are the only durations that can be created by
238	   packet loss or discontinuous transmission.  Muxers need not handle
239	   other gap sizes.  Creating the necessary packets involves
240	   synthesizing a TOC byte (defined in Section 3.1 of [RFC6716])--and
241	   whatever additional internal framing is needed--to indicate the
242	   packet duration for each stream.  The actual length of each missing
243	   Opus frame inside the packet is zero bytes, as defined in
244	   Section 3.2.1 of [RFC6716].

246	   Zero-byte frames MAY be packed into packets using any of codes 0, 1,
247	   2, or 3.  When successive frames have the same configuration, the
248	   higher code packings reduce overhead.  Likewise, if the TOC
249	   configuration matches, the muxer MAY further combine the empty frames
250	   with previous or subsequent non-zero-length frames (using code 2 or
251	   VBR code 3).

253	   [RFC6716] does not impose any requirements on the PLC, but this
254	   section outlines choices that are expected to have a positive
255	   influence on most PLC implementations, including the reference
256	   implementation.  Synthesized TOC bytes SHOULD maintain the same mode,
257	   audio bandwidth, channel count, and frame size as the previous packet
258	   (if any).  This is the simplest and usually the most well-tested case
259	   for the PLC to handle and it covers all losses that do not include a
260	   configuration switch, as defined in Section 4.5 of [RFC6716].

262	   When a previous packet is available, keeping the audio bandwidth and
263	   channel count the same allows the PLC to provide maximum continuity
264	   in the concealment data it generates.  However, if the size of the
265	   gap is not a multiple of the most recent frame size, then the frame
266	   size will have to change for at least some frames.  Such changes
267	   SHOULD be delayed as long as possible to simplify things for PLC
268	   implementations.

270	   As an example, a 95 ms gap could be encoded as nineteen 5 ms frames
271	   in two bytes with a single CBR code 3 packet.  If the previous frame
272	   size was 20 ms, using four 20 ms frames followed by three 5 ms frames
273	   requires 4 bytes (plus an extra byte of Ogg lacing overhead), but
274	   allows the PLC to use its well-tested steady-state behavior for as
275	   long as possible.  The total bitrate of the latter approach,
276	   including Ogg overhead, is about 0.4 kbps, so the impact on file size
277	   is minimal.
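
   The packets in this example are small enough to build by hand.  The
   Python sketch below constructs a CBR code 3 packet of zero-byte
   frames, assuming the TOC layout and frame count byte from Section 3
   of [RFC6716] (VBR flag in the top bit, padding flag in the next bit,
   frame count in the low six bits).  The function names are
   hypothetical, and the sketch covers a single stream only.

```python
def frame_units(config):
    """Frame duration of a TOC configuration in units of 2.5 ms."""
    if config < 12:                 # SILK: 10, 20, 40, 60 ms
        return (4, 8, 16, 24)[config % 4]
    if config < 16:                 # Hybrid: 10, 20 ms
        return (4, 8)[config % 2]
    return (1, 2, 4, 8)[config % 4]  # CELT: 2.5, 5, 10, 20 ms


def plc_packet(config, stereo, frame_count):
    """Build a CBR code 3 packet holding frame_count zero-byte frames."""
    if not 0 <= config <= 31:
        raise ValueError("configuration must be in 0..31")
    if frame_count < 1:
        raise ValueError("at least one frame is required")
    if frame_count * frame_units(config) > 48:
        raise ValueError("packet would exceed 120 ms")
    toc = (config << 3) | (0x04 if stereo else 0x00) | 0x03
    # Frame count byte: VBR flag clear (CBR), padding flag clear.
    return bytes([toc, frame_count])
```

   With full-band CELT configurations (29 for 5 ms frames, 31 for 20 ms
   frames), plc_packet(29, False, 19) gives the two-byte single-packet
   encoding, while plc_packet(31, False, 4) followed by
   plc_packet(29, False, 3) gives the four-byte alternative.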

279	   Changing modes is discouraged, since this causes some decoder
280	   implementations to reset their PLC state.  However, SILK and Hybrid
281	   mode frames cannot fill gaps that are not a multiple of 10 ms.  If
282	   switching to CELT mode is needed to match the gap size, a muxer
283	   SHOULD do so at the end of the gap to allow the PLC to function for
284	   as long as possible.

286	   In the example above, if the previous frame was a 20 ms SILK mode
287	   frame, the better solution is to synthesize a packet describing four
288	   20 ms SILK frames, followed by a packet with a single 10 ms SILK
289	   frame, and finally a packet with a 5 ms CELT frame, to fill the 95 ms
290	   gap.  This also requires four bytes to describe the synthesized
291	   packet data (two bytes for a CBR code 3 and one byte each for two
292	   code 0 packets) but three bytes of Ogg lacing overhead are required
293	   to mark the packet boundaries.  At 0.6 kbps, this is still a minimal
294	   bitrate impact over a naive, low-quality solution.

296	   Since medium-band audio is an option only in the SILK mode, wideband
297	   frames SHOULD be generated if switching from that configuration to
298	   CELT mode, to ensure that any PLC implementation which does try to
299	   migrate state between the modes will be able to preserve all of the
300	   available audio bandwidth.

302	4.2.  Pre-skip

304	   There is some amount of latency introduced during the decoding
305	   process, to allow for overlap in the CELT mode, stereo mixing in the
306	   SILK mode, and resampling.  The encoder will also introduce latency
307	   (though the exact amount is not specified).  Therefore, the first few
308	   samples produced by the decoder do not correspond to real input
309	   audio, but are instead composed of padding inserted by the encoder to
310	   compensate for this latency.  These samples need to be stored and
311	   decoded, as Opus is an asymptotically convergent predictive codec,
312	   meaning the decoded contents of each frame depend on the recent
313	   history of decoder inputs.  However, a decoder will want to skip
314	   these samples after decoding them.

316	   A 'pre-skip' field in the ID header (see Section 5.1) signals the
317	   number of samples which SHOULD be skipped (decoded but discarded) at
318	   the beginning of the stream.  This provides sufficient history to the
319	   decoder so that it has already converged before the stream's output
320	   begins.  It may also be used to perform sample-accurate cropping of
321	   existing encoded streams.  This amount need not be a multiple of
322	   2.5 ms, may be smaller than a single packet, or may span the contents
323	   of several packets.

325	4.3.  PCM Sample Position

327	   The PCM sample position is determined from the granule position using
328	   the formula

330	         'PCM sample position' = 'granule position' - 'pre-skip' .

332	   For example, if the granule position of the first audio data page is
333	   59,971, and the pre-skip is 11,971, then the PCM sample position of
334	   the last decoded sample from that page is 48,000.

336	   This can be converted into a playback time using the formula

338	                                   'PCM sample position'
339	                 'playback time' = --------------------- .
340	                                          48000.0

342	   The initial PCM sample position before any samples are played is
343	   normally '0'.  In this case, the PCM sample position of the first
344	   audio sample to be played starts at '1', because it marks the time on
345	   the clock _after_ that sample has been played, and a stream that is
346	   exactly one second long has a final PCM sample position of '48000',
347	   as in the example here.
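
   The two formulas translate directly into code.  The following
   Python sketch uses hypothetical function names and reproduces the
   example values given in this section.

```python
def pcm_sample_position(granule_position, pre_skip):
    """PCM sample position of the last decoded sample on a page."""
    return granule_position - pre_skip


def playback_time(pcm_position):
    """Playback time in seconds at the fixed 48 kHz granule rate."""
    return pcm_position / 48000.0


position = pcm_sample_position(59971, 11971)
seconds = playback_time(position)
```

   Here position is 48000 and seconds is 1.0, matching the one-second
   stream described above.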

349	   Vorbis streams use a granule position smaller than the number of
350	   audio samples contained in the first audio data page to indicate that
351	   some of those samples must be trimmed from the output (see
352	   [vorbis-trim]).  However, to do so, Vorbis requires that the first
353	   audio data page contains exactly two packets, in order to allow the
354	   decoder to perform PCM position adjustments before needing to return
355	   any PCM data.  Opus uses the pre-skip mechanism for this purpose
356	   instead, since the encoder may introduce more than a single packet's
357	   worth of latency, and since very large packets in streams with a very
358	   large number of channels might not fit on a single page.

360	4.4.  End Trimming

362	   The page with the 'end of stream' flag set MAY have a granule
363	   position that indicates the page contains less audio data than would
364	   normally be returned by decoding up through the final packet.  This
365	   is used to end the stream somewhere other than an even frame
366	   boundary.  The granule position of the most recent audio data page
367	   with completed packets is used to make this determination, or '0' is
368	   used if there were no previous audio data pages with a completed
369	   packet.  The difference between these granule positions indicates how
370	   many samples to keep after decoding the packets that completed on the
371	   final page.  The remaining samples are discarded.  The number of
372	   discarded samples SHOULD be no larger than the number decoded from
373	   the last packet.
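
   As a sketch of this calculation, the Python function below returns
   how many samples to keep and how many to discard on the final page.
   The function name is hypothetical.  Clamping to the number of
   decoded samples is an assumption of this sketch for the case where
   the granule position implies more samples than were decoded.

```python
def trim_final_page(final_granule, previous_granule, decoded_samples):
    """Return (keep, discard) for the page marked 'end of stream'.

    previous_granule is the granule position of the most recent audio
    data page with completed packets, or 0 if there was none.
    decoded_samples is the number of samples decoded from the packets
    that complete on the final page.
    """
    keep = final_granule - previous_granule
    if keep < 0:
        raise ValueError("granule position moved backwards")
    keep = min(keep, decoded_samples)
    return keep, decoded_samples - keep
```

   For instance, if the previous page ended at 48000, the final page
   carries one 960-sample packet, and its granule position is 48500,
   then 500 samples are kept and 460 are discarded.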

375	4.5.  Restrictions on the Initial Granule Position

377	   The granule position of the first audio data page with a completed
378	   packet MAY be larger than the number of samples contained in packets
379	   that complete on that page; however, it MUST NOT be smaller, unless
380	   that page has the 'end of stream' flag set.  Allowing a granule
381	   position larger than the number of samples allows the beginning of a
382	   stream to be cropped or a live stream to be joined without rewriting
383	   the granule position of all the remaining pages.  This means that the
384	   PCM sample position just before the first sample to be played may be
385	   larger than '0'.  Synchronization when multiplexing with other
386	   logical streams still uses the PCM sample position relative to '0' to
387	   compute sample times.  This does not affect the behavior of pre-skip:
388	   exactly 'pre-skip' samples should be skipped from the beginning of
389	   the decoded output, even if the initial PCM sample position is
390	   greater than zero.

392	   On the other hand, a granule position that is smaller than the number
393	   of decoded samples prevents a demuxer from working backwards to
394	   assign each packet or each individual sample a valid granule
395	   position, since granule positions must be non-negative.  A decoder
396	   MUST reject as invalid any stream where the granule position is
397	   smaller than the number of samples contained in packets that complete
398	   on the first audio data page with a completed packet, unless that
399	   page has the 'end of stream' flag set.  It MAY defer this action
400	   until it decodes the last packet completed on that page.

402	   If that page has the 'end of stream' flag set, a demuxer MUST reject
403	   as invalid any stream where its granule position is smaller than the
404	   'pre-skip' amount.  This would indicate that more samples should be
405	   skipped from the initial decoded output than exist in the stream.  If
406	   the granule position is smaller than the number of decoded samples
407	   produced by the packets that complete on that page, then a demuxer
408	   MUST use an initial granule position of '0', and can work forwards
409	   from '0' to timestamp individual packets.  If the granule position is
410	   larger than the number of decoded samples available, then the demuxer
411	   MUST still work backwards as described above, even if the 'end of
412	   stream' flag is set, to determine the initial granule position, and
413	   thus the initial PCM sample position.  Both of these will be greater
414	   than '0' in this case.
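
   The rules in this section can be summarized in a single function.
   The Python sketch below is one reading of them, with a hypothetical
   function name.  It takes the granule position of the first audio
   data page with a completed packet, the number of samples in the
   packets that complete on it, the pre-skip, and the page's 'end of
   stream' flag.

```python
def initial_granule_position(granule, samples, pre_skip, end_of_stream):
    """Return the initial granule position or raise for invalid streams."""
    if end_of_stream:
        if granule < pre_skip:
            raise ValueError("pre-skip exceeds the samples in the stream")
        if granule < samples:
            # End trimming on a single-page stream: work forwards from 0.
            return 0
        return granule - samples
    if granule < samples:
        raise ValueError("initial granule position would be negative")
    # Work backwards from the page's granule position.
    return granule - samples
```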

416	4.6.  Seeking and Pre-roll

418	   Seeking in Ogg files is best performed using a bisection search for a
419	   page whose granule position corresponds to a PCM position at or
420	   before the seek target.  With appropriately weighted bisection,
421	   accurate seeking can be performed with just three or four bisections
422	   even in multi-gigabyte files.  See [seeking] for general
423	   implementation guidance.

425	   When seeking within an Ogg Opus stream, the decoder SHOULD start
426	   decoding (and discarding the output) at least 3840 samples (80 ms)
427	   prior to the seek target in order to ensure that the output audio is
428	   correct by the time it reaches the seek target.  This 'pre-roll' is
429	   separate from, and unrelated to, the 'pre-skip' used at the beginning
430	   of the stream.  If the point 80 ms prior to the seek target comes
431	   before the initial PCM sample position, the decoder SHOULD start
432	   decoding from the beginning of the stream, applying pre-skip as
433	   normal, regardless of whether the pre-skip is larger or smaller than
434	   80 ms, and then continue to discard the samples required to reach the
435	   seek target (if any).
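
   The pre-roll rule might be applied as in the following Python
   sketch, which works in PCM sample positions.  The function name and
   the returned pair are hypothetical.  Locating the page that contains
   the returned position is left to the bisection search.

```python
PRE_ROLL_SAMPLES = 3840  # 80 ms at 48 kHz


def decode_start(seek_target, initial_position=0):
    """Return (start_position, from_beginning) for a seek.

    When from_beginning is True, decoding starts at the beginning of the
    stream and pre-skip is applied as normal before discarding the
    samples that precede the seek target.
    """
    if seek_target < initial_position:
        raise ValueError("seek target precedes the start of the stream")
    start = seek_target - PRE_ROLL_SAMPLES
    if start < initial_position:
        return initial_position, True
    return start, False
```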

437	5.  Header Packets

439	   An Opus stream contains exactly two mandatory header packets: an
440	   identification header and a comment header.

442	5.1.  Identification Header

444	      0                   1                   2                   3
445	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
446	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
447	     |      'O'      |      'p'      |      'u'      |      's'      |
448	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
449	     |      'H'      |      'e'      |      'a'      |      'd'      |
450	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
451	     |  Version = 1  | Channel Count |           Pre-skip            |
452	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
453	     |                     Input Sample Rate (Hz)                    |
454	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
455	     |   Output Gain (Q7.8 in dB)    | Mapping Family|               |
456	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
457	     |                                                               |
458	     :               Optional Channel Mapping Table...               :
459	     |                                                               |
460	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

462	                        Figure 1: ID Header Packet

464	   The fields in the identification (ID) header have the following
465	   meaning:

467	   1.  *Magic Signature*:

469	       This is an 8-octet (64-bit) field that allows codec
470	       identification and is human-readable.  It contains, in order, the
471	       magic numbers:

473	          0x4F 'O'

475	          0x70 'p'

477	          0x75 'u'

479	          0x73 's'
480	          0x48 'H'

482	          0x65 'e'

484	          0x61 'a'

486	          0x64 'd'

488	       Starting with "Op" helps distinguish it from audio data packets,
489	       as this is an invalid TOC sequence.

491	   2.  *Version* (8 bits, unsigned):

493	       The version number MUST always be '1' for this version of the
494	       encapsulation specification.  Implementations SHOULD treat
495	       streams where the upper four bits of the version number match
496	       that of a recognized specification as backwards-compatible with
497	       that specification.  That is, the version number can be split
498	       into "major" and "minor" version sub-fields, with changes to the
499	       "minor" sub-field (in the lower four bits) signaling compatible
500	       changes.  For example, a decoder implementing this specification
501	       SHOULD accept any stream with a version number of '15' or less,
502	       and SHOULD assume any stream with a version number '16' or
503	       greater is incompatible.  The initial version '1' was chosen to
504	       keep implementations from relying on this octet as a null
505	       terminator for the "OpusHead" string.
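
       A decoder implementing this version could express the check as
       in the following Python sketch.  The function name is
       hypothetical.

```python
def is_compatible_version(version):
    """True if the upper four bits match those of version 1 (zero)."""
    if not 0 <= version <= 255:
        raise ValueError("version is a single octet")
    return (version >> 4) == (1 >> 4)
```

       This accepts versions up to 15 and rejects 16 and above.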

507	   3.  *Output Channel Count* 'C' (8 bits, unsigned):

509	       This is the number of output channels.  This might be different
510	       than the number of encoded channels, which can change on a
511	       packet-by-packet basis.  This value MUST NOT be zero.  The
512	       maximum allowable value depends on the channel mapping family,
513	       and might be as large as 255.  See Section 5.1.1 for details.

515	   4.  *Pre-skip* (16 bits, unsigned, little endian):

517	       This is the number of samples (at 48 kHz) to discard from the
518	       decoder output when starting playback, and also the number to
519	       subtract from a page's granule position to calculate its PCM
520	       sample position.  When cropping the beginning of existing Ogg
521	       Opus streams, a pre-skip of at least 3,840 samples (80 ms) is
522	       RECOMMENDED to ensure complete convergence in the decoder.

524	   5.  *Input Sample Rate* (32 bits, unsigned, little endian):

526	       This field is _not_ the sample rate to use for playback of the
527	       encoded data.

529	       Opus can switch between internal audio bandwidths of 4, 6, 8, 12,
530	       and 20 kHz.  Each packet in the stream may have a different audio
531	       bandwidth.  Regardless of the audio bandwidth, the reference
532	       decoder supports decoding any stream at a sample rate of 8, 12,
533	       16, 24, or 48 kHz.  The original sample rate of the encoder input
534	       is not preserved by the lossy compression.

536	       An Ogg Opus player SHOULD select the playback sample rate
537	       according to the following procedure:

539	       1.  If the hardware supports 48 kHz playback, decode at 48 kHz.

541	       2.  Otherwise, if the hardware's highest available sample rate is
542	           a rate the decoder supports, decode at this sample rate.

544	       3.  Otherwise, if the hardware's highest available sample rate is
545	           less than 48 kHz, decode at the next higher supported rate
546	           above this and resample.

548	       4.  Otherwise, decode at 48 kHz and resample.
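
	       The four steps above might be sketched in C as shown below.  This
	       is only an illustration: the function name, its parameters, and
	       the way the hardware capabilities are passed in are assumptions,
	       not part of any existing API.

```c
#include <stddef.h>

/* Sample rates (Hz) at which the reference decoder can decode, ascending. */
static const int opus_decode_rates[] = { 8000, 12000, 16000, 24000, 48000 };

/* Returns the rate to decode at.  *needs_resample is set to 1 when the
 * decoded audio must be resampled to reach a rate the hardware offers.
 * hw_supports_48k and hw_max_rate are hypothetical inputs describing the
 * playback hardware. */
int choose_decode_rate(int hw_supports_48k, int hw_max_rate,
                       int *needs_resample)
{
    size_t i;
    size_t n = sizeof(opus_decode_rates) / sizeof(opus_decode_rates[0]);

    *needs_resample = 0;
    if (hw_supports_48k)                       /* step 1 */
        return 48000;
    for (i = 0; i < n; i++)                    /* step 2 */
        if (opus_decode_rates[i] == hw_max_rate)
            return hw_max_rate;
    *needs_resample = 1;
    if (hw_max_rate < 48000)                   /* step 3 */
        for (i = 0; i < n; i++)
            if (opus_decode_rates[i] > hw_max_rate)
                return opus_decode_rates[i];
    return 48000;                              /* step 4 */
}
```

	       For example, hardware limited to 44.1 kHz falls through to step
	       3, so the stream is decoded at 48 kHz and resampled.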

550	       However, the 'Input Sample Rate' field allows the encoder to pass
551	       the sample rate of the original input stream as metadata.  This
552	       may be useful when the user requires the output sample rate to
553	       match the input sample rate.  For example, a non-player decoder
554	       writing PCM format samples to disk might choose to resample the
555	       output audio back to the original input sample rate to reduce
556	       surprise to the user, who might reasonably expect to get back a
557	       file with the same sample rate as the one they fed to the
558	       encoder.

560	       A value of zero indicates 'unspecified'.  Encoders SHOULD write
561	       the actual input sample rate or zero, but decoder implementations
562	       which do something with this field SHOULD take care to behave
563	       sanely if given crazy values (e.g., do not actually upsample the
564	       output to 10 MHz if requested).

566	   6.  *Output Gain* (16 bits, signed, little endian):

568	       This is a gain to be applied by the decoder.  It is 20*log10 of
569	       the factor to scale the decoder output by to achieve the desired
570	       playback volume, stored in a 16-bit, signed, two's complement
571	       fixed-point value with 8 fractional bits (i.e., Q7.8).

573	       To apply the gain, a decoder could use

575	                sample *= pow(10, output_gain/(20.0*256)) ,

577	       where output_gain is the raw 16-bit value from the header.

579	       Virtually all players and media frameworks should apply it by
580	       default.  If a player chooses to apply any volume adjustment or
581	       gain modification, such as the R128_TRACK_GAIN (see Section 5.2)
582	       or a user-facing volume knob, the adjustment MUST be applied in
583	       addition to this output gain in order to achieve playback at the
584	       desired volume.

586	       An encoder SHOULD set this field to zero, and instead apply any
587	       gain prior to encoding, when this is possible and does not
588	       conflict with the user's wishes.  The output gain should only be
589	       nonzero when the gain is adjusted after encoding, or when the
590	       user wishes to adjust the gain for playback while preserving the
591	       ability to recover the original signal amplitude.

593	       Although the output gain has enormous range (+/- 128 dB, enough
594	       to amplify inaudible sounds to the threshold of physical pain),
595	       most applications can only reasonably use a small portion of this
596	       range around zero.  The large range serves in part to ensure that
597	       gain can always be losslessly transferred between OpusHead and
598	       R128_TRACK_GAIN (see below) without saturating.
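
	       As a non-normative sketch, the raw field can be read from the
	       header and converted to decibels as follows.  Both function
	       names are hypothetical.  The result in dB is what the pow()
	       expression above divides by 20 to obtain the linear scale
	       factor.

```c
#include <stdint.h>

/* Reads the 16-bit, signed, little-endian output gain field.
 * p points at the first of its two octets. */
int16_t read_output_gain(const unsigned char *p)
{
    unsigned raw = (unsigned)p[0] | ((unsigned)p[1] << 8);

    /* Undo the two's complement encoding without relying on
     * implementation-defined narrowing conversions. */
    return (int16_t)(raw >= 0x8000u ? (int)raw - 65536 : (int)raw);
}

/* Converts the Q7.8 fixed-point value to decibels. */
double output_gain_db(int16_t output_gain)
{
    return output_gain / 256.0;
}
```

	       For instance, the octets 0xC3 0xFD decode to -573, which is
	       -573/256, or about -2.24 dB.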

600	   7.  *Channel Mapping Family* (8 bits, unsigned):

602	       This octet indicates the order and semantic meaning of the
603	       various channels encoded in each Ogg packet.

605	       Each possible value of this octet indicates a mapping family,
606	       which defines a set of allowed channel counts, and the ordered
607	       set of channel names for each allowed channel count.  The details
608	       are described in Section 5.1.1.

610	   8.  *Channel Mapping Table*: This table defines the mapping from
611	       encoded streams to output channels.  It is omitted when the
612	       channel mapping family is 0, but REQUIRED otherwise.  Its
613	       contents are specified in Section 5.1.1.

615	   All fields in the ID headers are REQUIRED, except for the channel
616	   mapping table, which is omitted when the channel mapping family is 0.
617	   Implementations SHOULD reject ID headers which do not contain enough
618	   data for these fields, even if they contain a valid Magic Signature.
619	   Future versions of this specification, even backwards-compatible
620	   versions, might include additional fields in the ID header.  If an ID
621	   header has a compatible major version, but a larger minor version, an
622	   implementation MUST NOT reject it for containing additional data not
623	   specified here.  However, implementations MAY reject streams in which
624	   the ID header does not complete on the first page.
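
	   The following C sketch illustrates one way a demuxer might parse
	   and validate an ID header under the rules above.  It is not a
	   reference implementation: the opus_head structure and the
	   parse_opus_head() function are hypothetical names, the version
	   test relies on the version field description earlier in this
	   section, and the table checks anticipate Section 5.1.1.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parsed ID header fields (illustrative layout, not a library type). */
typedef struct {
    uint8_t  version;
    uint8_t  channels;          /* output channel count, C */
    uint16_t pre_skip;          /* samples at 48 kHz */
    uint32_t input_sample_rate; /* metadata only; 0 means unspecified */
    int16_t  output_gain;       /* Q7.8, in dB */
    uint8_t  mapping_family;
    uint8_t  stream_count;      /* N */
    uint8_t  coupled_count;     /* M */
    uint8_t  mapping[255];      /* one index per output channel */
} opus_head;

/* Returns 0 on success, -1 if the packet is not an acceptable ID header. */
int parse_opus_head(const unsigned char *p, size_t len, opus_head *h)
{
    unsigned i, raw_gain;

    /* Fixed part: 8 (magic) + 1 (version) + 1 (C) + 2 (pre-skip)
     * + 4 (input rate) + 2 (gain) + 1 (mapping family) = 19 octets. */
    if (len < 19 || memcmp(p, "OpusHead", 8) != 0)
        return -1;
    h->version = p[8];
    if (h->version > 15)            /* incompatible version */
        return -1;
    h->channels = p[9];
    if (h->channels == 0)
        return -1;
    h->pre_skip = (uint16_t)(p[10] | (p[11] << 8));
    h->input_sample_rate = (uint32_t)p[12] | ((uint32_t)p[13] << 8) |
                           ((uint32_t)p[14] << 16) | ((uint32_t)p[15] << 24);
    raw_gain = (unsigned)p[16] | ((unsigned)p[17] << 8);
    h->output_gain = (int16_t)(raw_gain >= 0x8000u ? (int)raw_gain - 65536
                                                   : (int)raw_gain);
    h->mapping_family = p[18];

    if (h->mapping_family == 0) {
        /* No table is coded: one stream, stereo if and only if C == 2. */
        if (h->channels > 2)
            return -1;
        h->stream_count = 1;
        h->coupled_count = (uint8_t)(h->channels - 1);
        h->mapping[0] = 0;
        h->mapping[1] = 1;
        return 0;
    }
    if (h->mapping_family == 1 && h->channels > 8)
        return -1;
    /* Table: 1 octet N + 1 octet M + C mapping octets. */
    if (len < 21u + h->channels)
        return -1;
    h->stream_count = p[19];
    h->coupled_count = p[20];
    if (h->stream_count == 0 || h->coupled_count > h->stream_count ||
        h->stream_count + h->coupled_count > 255)
        return -1;
    for (i = 0; i < h->channels; i++) {
        unsigned idx = p[21 + i];
        if (idx != 255 &&
            idx >= (unsigned)(h->stream_count + h->coupled_count))
            return -1;
        h->mapping[i] = (uint8_t)idx;
    }
    return 0;
}
```

	   Note that the sketch only requires a minimum length, so a header
	   carrying additional trailing data is still accepted.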

626	5.1.1.  Channel Mapping

628	   An Ogg Opus stream allows mapping one number of Opus streams (N) to a
629	   possibly larger number of decoded channels (M+N) to yet another
630	   number of output channels (C), which might be larger or smaller than
631	   the number of decoded channels.  The order and meaning of these
632	   channels are defined by a channel mapping, which consists of the
633	   'channel mapping family' octet and, for channel mapping families
634	   other than family 0, a channel mapping table, as illustrated in
635	   Figure 2.

637	      0                   1                   2                   3
638	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
639	                                                     +-+-+-+-+-+-+-+-+
640	                                                     | Stream Count  |
641	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
642	     | Coupled Count |              Channel Mapping...               :
643	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

645	                      Figure 2: Channel Mapping Table

647	   The fields in the channel mapping table have the following meaning:

649	   1.  *Stream Count* 'N' (8 bits, unsigned):

651	       This is the total number of streams encoded in each Ogg packet.
652	       This value is required to correctly parse the packed Opus packets
653	       inside an Ogg packet, as described in Section 3.  This value MUST
654	       NOT be zero, as without at least one Opus packet with a valid TOC
655	       sequence, a demuxer cannot recover the duration of an Ogg packet.

657	       For channel mapping family 0, this value defaults to 1, and is
658	       not coded.

660	   2.  *Coupled Stream Count* 'M' (8 bits, unsigned): This is the number
661	       of streams whose decoders should be configured to produce two
662	       channels.  This MUST be no larger than the total number of
663	       streams, N.

665	       Each packet in an Opus stream has an internal channel count of 1
666	       or 2, which can change from packet to packet.  This is selected
667	       by the encoder depending on the bitrate and the audio being
668	       encoded.  The original channel count of the encoder input is not
669	       preserved by the lossy compression.

671	       Regardless of the internal channel count, any Opus stream can be
672	       decoded as mono (a single channel) or stereo (two channels) by
673	       appropriate initialization of the decoder.  The 'coupled stream
674	       count' field indicates that the first M Opus decoders are to be
675	       initialized for stereo output, and the remaining N-M decoders are
676	       to be initialized for mono only.  The total number of decoded
677	       channels, (M+N), MUST be no larger than 255, as there is no way
678	       to index more channels than that in the channel mapping.

680	       For channel mapping family 0, this value defaults to C-1 (i.e., 0
681	       for mono and 1 for stereo), and is not coded.

683	   3.  *Channel Mapping* (8*C bits): This contains one octet per output
684	       channel, indicating which decoded channel should be used for each
685	       one.  Let 'index' be the value of this octet for a particular
686	       output channel.  This value MUST either be smaller than (M+N), or
687	       be the special value 255.  If 'index' is less than 2*M, the
688	       output MUST be taken from decoding stream ('index'/2) as stereo
689	       and selecting the left channel if 'index' is even, and the right
690	       channel if 'index' is odd.  If 'index' is 2*M or larger, the
691	       output MUST be taken from decoding stream ('index'-M) as mono.
692	       If 'index' is 255, the corresponding output channel MUST contain
693	       pure silence.

695	       The number of output channels, C, is not constrained to match the
696	       number of decoded channels (M+N).  A single index value MAY
697	       appear multiple times, i.e., the same decoded channel might be
698	       mapped to multiple output channels.  Some decoded channels might
699	       not be assigned to any output channel, as well.

701	       For channel mapping family 0, the first index defaults to 0, and
702	       if C==2, the second index defaults to 1.  Neither index is coded.
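
	       The index rules above can be illustrated with the following C
	       sketch.  The decoded_source type and the function name are
	       hypothetical, and stream -1 is used here merely as a marker for
	       a silent output channel.

```c
/* Where an output channel takes its samples from. */
typedef struct {
    int stream;   /* Opus stream number, or -1 for pure silence */
    int channel;  /* 0 = mono or left, 1 = right */
} decoded_source;

/* Resolves one channel mapping octet for a table with n_streams (N)
 * streams, of which the first n_coupled (M) are decoded as stereo.
 * Returns 0 on success, -1 if the table or the index is invalid. */
int resolve_mapping_index(unsigned index, unsigned n_streams,
                          unsigned n_coupled, decoded_source *out)
{
    if (n_streams == 0 || n_coupled > n_streams ||
        n_streams + n_coupled > 255)
        return -1;
    if (index == 255) {                    /* silent output channel */
        out->stream = -1;
        out->channel = 0;
        return 0;
    }
    if (index >= n_streams + n_coupled)
        return -1;
    if (index < 2 * n_coupled) {           /* coupled (stereo) stream */
        out->stream = (int)(index / 2);
        out->channel = (int)(index & 1);   /* even = left, odd = right */
    } else {                               /* mono stream */
        out->stream = (int)(index - n_coupled);
        out->channel = 0;
    }
    return 0;
}
```

	       With N=4 and M=2, for example, index 3 selects the right channel
	       of stream 1, and index 4 selects mono stream 2.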

704	   After producing the output channels, the channel mapping family
705	   determines the semantic meaning of each one.  Currently there are
706	   three defined mapping families, although more may be added.

708	5.1.1.1.  Channel Mapping Family 0

710	   Allowed numbers of channels: 1 or 2.  RTP mapping.

712	   o  1 channel: monophonic (mono).

714	   o  2 channels: stereo (left, right).

716	   *Special mapping*: This channel mapping value also indicates that the
717	   content consists of a single Opus stream that is stereo if and only
718	   if C==2, with stream index 0 mapped to output channel 0 (mono, or
719	   left channel) and stream index 1 mapped to output channel 1 (right
720	   channel) if stereo.  When the 'channel mapping family' octet has this
721	   value, the channel mapping table MUST be omitted from the ID header
722	   packet.

724	5.1.1.2.  Channel Mapping Family 1

726	   Allowed numbers of channels: 1...8.  Vorbis channel order.

728	   Each channel is assigned to a speaker location in a conventional
729	   surround arrangement.  Specific locations depend on the number of
730	   channels, and are given below in order of the corresponding channel
731	   indices.

733	   o  1 channel: monophonic (mono).

735	   o  2 channels: stereo (left, right).

737	   o  3 channels: linear surround (left, center, right).

739	   o  4 channels: quadraphonic (front left, front right, rear left,
740	      rear right).

742	   o  5 channels: 5.0 surround (front left, front center, front right,
743	      rear left, rear right).

745	   o  6 channels: 5.1 surround (front left, front center, front right,
746	      rear left, rear right, LFE).

748	   o  7 channels: 6.1 surround (front left, front center, front right,
749	      side left, side right, rear center, LFE).

751	   o  8 channels: 7.1 surround (front left, front center, front right,
752	      side left, side right, rear left, rear right, LFE).

754	   This set of surround options and speaker location orderings is the
755	   same as those used by the Vorbis codec [vorbis-mapping].  The
756	   ordering is different from the one used by the WAVE
757	   [wave-multichannel] and FLAC [flac] formats, so correct ordering
758	   requires permutation of the output channels when decoding to or
759	   encoding from those formats.  'LFE' here refers to a Low Frequency
760	   Effects channel, often mapped to a subwoofer with no particular
761	   spatial position.  Implementations SHOULD identify 'side' or 'rear'
762	   speaker locations with 'surround' and 'back' as appropriate when
763	   interfacing with audio formats or systems which prefer that
764	   terminology.

765	5.1.1.3.  Channel Mapping Family 255

767	   Allowed numbers of channels: 1...255.  No defined channel meaning.

769	   Channels are unidentified.  General-purpose players SHOULD NOT
770	   attempt to play these streams, and offline decoders MAY deinterleave
771	   the output into separate PCM files, one per channel.  Decoders SHOULD
772	   NOT produce output for channels mapped to stream index 255 (pure
773	   silence) unless they have no other way to indicate the index of non-
774	   silent channels.

776	5.1.1.4.  Undefined Channel Mappings

778	   The remaining channel mapping families (2...254) are reserved.  A
779	   decoder encountering a reserved channel mapping family value SHOULD
780	   act as though the value is 255.

782	5.1.1.5.  Downmixing

784	   An Ogg Opus player MUST play any Ogg Opus stream with a channel
785	   mapping family of 0 or 1, even if the number of channels does not
786	   match the physically connected audio hardware.  Players SHOULD
787	   perform channel mixing to increase or reduce the number of channels
788	   as needed.

790	   Implementations MAY use the following matrices to implement
791	   downmixing from multichannel files using Channel Mapping Family 1
792	   (Section 5.1.1.2), which are known to give acceptable results for
793	   stereo.  Matrices for 3 and 4 channels are normalized so each
794	   coefficient row sums to 1 to avoid clipping.  For 5 or more channels
795	   they are normalized to 2 as a compromise between clipping and dynamic
796	   range reduction.

798	   In these matrices the front left and front right channels are
799	   generally passed through directly.  When a surround channel is split
800	   between both the left and right stereo channels, coefficients are
801	   chosen so their squares sum to 1, which helps preserve the perceived
802	   intensity.  Rear channels are mixed more diffusely or attenuated to
803	   maintain focus on the front channels.

805	   L output = ( 0.585786 * left + 0.414214 * center                    )
806	   R output = (                   0.414214 * center + 0.585786 * right )

808	      Exact coefficient values are 1 and 1/sqrt(2), multiplied by
809	                1/(1 + 1/sqrt(2)) for normalization.

811	      Figure 3: Stereo downmix matrix for the linear surround channel
812	                                  mapping

814	       /          \   /                                     \ / FL \
815	       | L output |   | 0.422650 0.000000 0.366025 0.211325 | | FR |
816	       | R output | = | 0.000000 0.422650 0.211325 0.366025 | | RL |
817	       \          /   \                                     / \ RR /

819	     Exact coefficient values are 1, sqrt(3)/2 and 1/2, multiplied by
820	              1/(1 + sqrt(3)/2 + 1/2) for normalization.

822	   Figure 4: Stereo downmix matrix for the quadraphonic channel mapping

824	                                                               / FL \
825	      /   \   /                                              \ | FC |
826	      | L |   | 0.650802 0.460186 0.000000 0.563611 0.325401 | | FR |
827	      | R | = | 0.000000 0.460186 0.650802 0.325401 0.563611 | | RL |
828	      \   /   \                                              / | RR |
829	                                                               \    /

831	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
832	   multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2) for normalization.

834	       Figure 5: Stereo downmix matrix for the 5.0 surround mapping

835	                                                                   /FL \
836	   / \   /                                                       \ |FC |
837	   |L|   | 0.529067 0.374107 0.000000 0.458186 0.264534 0.374107 | |FR |
838	   |R| = | 0.000000 0.374107 0.529067 0.264534 0.458186 0.374107 | |RL |
839	   \ /   \                                                       / |RR |
840	                                                                   \LFE/

842	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
843	     multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + 1/sqrt(2)) for
844	                              normalization.

846	       Figure 6: Stereo downmix matrix for the 5.1 surround mapping
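
	   As a non-normative illustration, the matrix in Figure 6 could be
	   applied to interleaved floating-point PCM as sketched below.  The
	   function name and the interleaved float sample layout are
	   assumptions of this example.  The coefficients are copied from the
	   figure.

```c
#include <stddef.h>

/* Rows: L and R outputs.  Columns: FL, FC, FR, RL, RR, LFE, i.e., the
 * channel order of mapping family 1 for six channels (Figure 6). */
static const float downmix_5_1[2][6] = {
    { 0.529067f, 0.374107f, 0.000000f, 0.458186f, 0.264534f, 0.374107f },
    { 0.000000f, 0.374107f, 0.529067f, 0.264534f, 0.458186f, 0.374107f }
};

/* Downmixes 'frames' frames of interleaved 6-channel audio in 'in' to
 * interleaved stereo in 'out'.  The buffers must not overlap. */
void downmix_5_1_to_stereo(const float *in, float *out, size_t frames)
{
    size_t f;
    int o, c;

    for (f = 0; f < frames; f++) {
        for (o = 0; o < 2; o++) {
            float acc = 0.0f;
            for (c = 0; c < 6; c++)
                acc += downmix_5_1[o][c] * in[f * 6 + c];
            out[f * 2 + o] = acc;
        }
    }
}
```

	   Because each row sums to 2 rather than 1, full-scale input on all
	   six channels can exceed full scale after downmixing, so a caller
	   might still need to guard against clipping.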

848	     /                                                                \
849	     | 0.455310 0.321953 0.000000 0.394310 0.227655 0.278819 0.321953 |
850	     | 0.000000 0.321953 0.455310 0.227655 0.394310 0.278819 0.321953 |
851	     \                                                                /

853	      Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2, 1/2 and
854	   sqrt(3)/2/sqrt(2), multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2
855	   + sqrt(3)/2/sqrt(2) + 1/sqrt(2)) for normalization.  The coefficients
856	   are in the same order as in Section 5.1.1.2, and the matrices above.

858	       Figure 7: Stereo downmix matrix for the 6.1 surround mapping

860	    /                                                                 \
861	    | .388631 .274804 .000000 .336565 .194316 .336565 .194316 .274804 |
862	    | .000000 .274804 .388631 .194316 .336565 .194316 .336565 .274804 |
863	    \                                                                 /

865	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
866	     multiplied by 2/(2 + 2/sqrt(2) + sqrt(3)) for normalization.  The
867	     coefficients are in the same order as in Section 5.1.1.2, and the
868	                             matrices above.

870	       Figure 8: Stereo downmix matrix for the 7.1 surround mapping

872	5.2.  Comment Header

873	      0                   1                   2                   3
874	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
875	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
876	     |      'O'      |      'p'      |      'u'      |      's'      |
877	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
878	     |      'T'      |      'a'      |      'g'      |      's'      |
879	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
880	     |                     Vendor String Length                      |
881	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
882	     |                                                               |
883	     :                        Vendor String...                       :
884	     |                                                               |
885	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
886	     |                   User Comment List Length                    |
887	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
888	     |                 User Comment #0 String Length                 |
889	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
890	     |                                                               |
891	     :                   User Comment #0 String...                   :
892	     |                                                               |
893	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
894	     |                 User Comment #1 String Length                 |
895	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
896	     :                                                               :

898	                      Figure 9: Comment Header Packet

900	   The comment header consists of a 64-bit magic signature, followed by
901	   data in the same format as the [vorbis-comment] header used in Ogg
902	   Vorbis, except (like Ogg Theora and Speex) the final "framing bit"
903	   specified in the Vorbis spec is not present.

905	   1.  *Magic Signature*:

907	       This is an 8-octet (64-bit) field that allows codec
908	       identification and is human-readable.  It contains, in order, the
909	       magic numbers:

911	          0x4F 'O'

913	          0x70 'p'

915	          0x75 'u'

917	          0x73 's'

919	          0x54 'T'

920	          0x61 'a'

922	          0x67 'g'

924	          0x73 's'

926	       Starting with "Op" helps distinguish it from audio data packets,
927	       as this is an invalid TOC sequence.

929	   2.  *Vendor String Length* (32 bits, unsigned, little endian):

931	       This field gives the length of the following vendor string, in
932	       octets.  It MUST NOT indicate that the vendor string is longer
933	       than the rest of the packet.

935	   3.  *Vendor String* (variable length, UTF-8 vector):

937	       This is a simple human-readable tag for vendor information,
938	       encoded as a UTF-8 string [RFC3629].  No terminating null octet
939	       is required.

941	       This tag is intended to identify the codec encoder and
942	       encapsulation implementations, for tracing differences in
943	       technical behavior.  User-facing encoding applications can use
944	       the 'ENCODER' user comment tag to identify themselves.

946	   4.  *User Comment List Length* (32 bits, unsigned, little endian):

948	       This field indicates the number of user-supplied comments.  It
949	       MAY indicate there are zero user-supplied comments, in which case
950	       there are no additional fields in the packet.  It MUST NOT
951	       indicate that there are so many comments that the comment string
952	       lengths would require more data than is available in the rest of
953	       the packet.

955	   5.  *User Comment #i String Length* (32 bits, unsigned, little
956	       endian):

958	       This field gives the length of the following user comment string,
959	       in octets.  There is one for each user comment indicated by the
960	       'user comment list length' field.  It MUST NOT indicate that the
961	       string is longer than the rest of the packet.

963	   6.  *User Comment #i String* (variable length, UTF-8 vector):

965	       This field contains a single user comment string.  There is one
966	       for each user comment indicated by the 'user comment list length'
967	       field.

969	   The vendor string length and user comment list length are REQUIRED,
970	   and implementations SHOULD reject comment headers that do not contain
971	   enough data for these fields, or that do not contain enough data for
972	   the corresponding vendor string or user comments they describe.
973	   Making this check before allocating the associated memory to contain
974	   the data helps prevent a possible Denial-of-Service (DoS) attack from
975	   small comment headers that claim to contain strings longer than the
976	   entire packet or more user comments than could possibly fit in
977	   the packet.
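
	   One possible way to perform these checks before allocating any
	   memory is sketched below in C.  The function names are
	   hypothetical, and the sketch only validates lengths.  It does not
	   check UTF-8 validity or the NAME=value structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit, unsigned, little-endian value. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walks a comment header packet of 'len' octets without allocating.
 * Returns the number of user comments, or -1 if any length field
 * claims more data than the packet contains. */
long validate_opus_tags(const unsigned char *p, size_t len)
{
    size_t pos = 8;
    uint32_t n, i, count;

    /* 8 (magic) + 4 (vendor length) + 4 (comment list length). */
    if (len < 16 || memcmp(p, "OpusTags", 8) != 0)
        return -1;
    n = read_le32(p + pos);
    pos += 4;
    if (n > len - pos)                  /* vendor string too long */
        return -1;
    pos += n;
    if (len - pos < 4)
        return -1;
    count = read_le32(p + pos);
    pos += 4;
    /* Every comment needs at least its own 4-octet length field. */
    if (count > (len - pos) / 4)
        return -1;
    for (i = 0; i < count; i++) {
        if (len - pos < 4)
            return -1;
        n = read_le32(p + pos);
        pos += 4;
        if (n > len - pos)              /* comment string too long */
            return -1;
        pos += n;
    }
    return (long)count;
}
```

	   Only after such a pass succeeds would an implementation allocate
	   storage for the vendor string and the user comments.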

979	   The user comment strings follow the NAME=value format described by
980	   [vorbis-comment] with the same recommended tag names.

982	   One new comment tag is introduced for Ogg Opus:

984	   R128_TRACK_GAIN=-573

986	   representing the volume shift needed to normalize the track's volume.
987	   The gain is a Q7.8 fixed point number in dB, as in the ID header's
988	   'output gain' field.

990	   This tag is similar to the REPLAYGAIN_TRACK_GAIN tag in
991	   Vorbis [replay-gain], except that the normal volume reference is the
992	   [EBU-R128] standard.

994	   An Ogg Opus file MUST NOT have more than one such tag, and if present
995	   its value MUST be an integer from -32768 to 32767, inclusive,
996	   represented in ASCII with no whitespace.  If present, it MUST
997	   correctly represent the R128 normalization gain relative to the
998	   'output gain' field specified in the ID header.  If a player chooses
999	   to make use of the R128_TRACK_GAIN tag, it MUST be applied _in
1000	   addition_ to the 'output gain' value.  If an encoder wishes to use
1001	   R128 normalization, and the output gain is not otherwise constrained
1002	   or specified, the encoder SHOULD write the R128 gain into the 'output
1003	   gain' field and store a tag containing "R128_TRACK_GAIN=0".  That is,
1004	   it should assume that by default tools will respect the 'output gain'
1005	   field, and not the comment tag.  If a tool modifies the ID header's
1006	   'output gain' field, it MUST also update or remove the
1007	   R128_TRACK_GAIN comment tag.
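
	   A player that chooses to honor the tag might parse and combine it
	   with the output gain as in the following C sketch.  The function
	   names are hypothetical, and rejecting a leading '+' sign is an
	   assumption of this example, since the text above only requires an
	   ASCII integer with no whitespace.

```c
#include <stddef.h>

/* Parses the value part of an R128_TRACK_GAIN comment ('len' octets,
 * not null-terminated).  Returns 0 and stores the Q7.8 value in *gain
 * on success, or -1 if the value is not an integer in -32768..32767. */
int parse_r128_track_gain(const char *value, size_t len, int *gain)
{
    size_t i = 0;
    long v = 0;
    int negative = 0;

    if (len > 0 && value[0] == '-') {
        negative = 1;
        i = 1;
    }
    if (i == len)                       /* empty, or a lone '-' */
        return -1;
    for (; i < len; i++) {
        if (value[i] < '0' || value[i] > '9')
            return -1;                  /* also rejects whitespace */
        v = v * 10 + (value[i] - '0');
        if (v > 32768)
            return -1;
    }
    if (negative)
        v = -v;
    if (v > 32767)
        return -1;
    *gain = (int)v;
    return 0;
}

/* Total gain in dB when the tag is applied in addition to the ID
 * header's output gain; both inputs are Q7.8 values. */
double total_gain_db(int output_gain, int r128_track_gain)
{
    return (output_gain + r128_track_gain) / 256.0;
}
```

	   For the example tag value of -573 and an output gain of zero, the
	   total gain is -573/256, or about -2.24 dB.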

1009	   To avoid confusion with multiple normalization schemes, an Opus
1010	   comment header SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN,
1011	   REPLAYGAIN_TRACK_PEAK, REPLAYGAIN_ALBUM_GAIN, or
1012	   REPLAYGAIN_ALBUM_PEAK tags.

1014	   There is no Opus comment tag corresponding to REPLAYGAIN_ALBUM_GAIN.
1015	   That information should instead be stored in the ID header's 'output
1016	   gain' field.

1018	6.  Packet Size Limits

1020	   Technically valid Opus packets can be arbitrarily large due to the
1021	   padding format, although the amount of non-padding data they can
1022	   contain is bounded.  These packets might be spread over a similarly
1023	   enormous number of Ogg pages.  Encoders SHOULD use no more padding
1024	   than required to make a variable bitrate (VBR) stream constant
1025	   bitrate (CBR).  Decoders SHOULD avoid attempting to allocate
1026	   excessive amounts of memory when presented with a very large packet.
1027	   The presence of an extremely large packet in the stream could
1028	   indicate a memory exhaustion attack or stream corruption.  Decoders
1029	   SHOULD reject a packet that is too large to process, and display a
1030	   warning message.

1032	   In an Ogg Opus stream, the largest possible valid packet that does
1033	   not use padding has a size of (61,298*N - 2) octets, or about 60 kB
1034	   per Opus stream.  With 255 streams, this is 15,630,988 octets
1035	   (14.9 MB) and can span up to 61,298 Ogg pages, all but one of which
1036	   will have a granule position of -1.  This is of course a very extreme
1037	   packet, consisting of 255 streams, each containing 120 ms of audio
1038	   encoded as 2.5 ms frames, each frame using the maximum possible
1039	   number of octets (1275) and stored in the least efficient manner
1040	   allowed (a VBR code 3 Opus packet).  Even in such a packet, most of
1041	   the data will be zeros as 2.5 ms frames cannot actually use all
1042	   1275 octets.  The largest packet consisting of entirely useful data
1043	   is (15,326*N - 2) octets, or about 15 kB per stream.  This
1044	   corresponds to 120 ms of audio encoded as 10 ms frames in either SILK
1045	   or Hybrid mode, but at a data rate of over 1 Mbps, which makes little
1046	   sense for the quality achieved.  A more reasonable limit is
1047	   (7,664*N - 2) octets, or about 7.5 kB per stream.  This corresponds
1048	   to 120 ms of audio encoded as 20 ms stereo CELT mode frames, with a
1049	   total bitrate just under 511 kbps (not counting the Ogg encapsulation
1050	   overhead).  With N=8, the maximum number of channels currently
1051	   defined by mapping family 1, this gives a maximum packet size of
1052	   61,310 octets, or just under 60 kB.  This is still quite
1053	   conservative, as it assumes each output channel is taken from one
1054	   decoded channel of a stereo packet.  An implementation could
1055	   reasonably choose any of these numbers for its internal limits.
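
	   The limits quoted above all follow the same pattern, which the
	   following C sketch makes explicit.  The macro and function names
	   are hypothetical.  The per-stream constants are the ones given in
	   the preceding paragraph.

```c
#include <stddef.h>

/* Per-stream octet counts taken from the text above. */
#define OGG_OPUS_OCTETS_NO_PADDING 61298u  /* largest without padding */
#define OGG_OPUS_OCTETS_USEFUL     15326u  /* largest all-useful data */
#define OGG_OPUS_OCTETS_REASONABLE  7664u  /* 20 ms stereo CELT frames */

/* Returns (per_stream_octets * N - 2), or 0 if the arguments are out
 * of range.  n_streams is the stream count N from the ID header. */
size_t ogg_opus_packet_limit(size_t per_stream_octets, unsigned n_streams)
{
    if (per_stream_octets == 0 || n_streams == 0 || n_streams > 255)
        return 0;
    return per_stream_octets * n_streams - 2;
}
```

	   A decoder could compute such a limit once from the ID header and
	   compare each incoming packet length against it before buffering
	   the packet.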

1057	7.  Encoder Guidelines

1059	   When encoding Opus files, Ogg encoders should take into account the
1060	   algorithmic delay of the Opus encoder.

1062	   In encoders derived from the reference implementation, the number of
1063	   delay samples can be queried with:

1065	    opus_encoder_ctl(encoder_state, OPUS_GET_LOOKAHEAD(&delay_samples));

1067	   To achieve good quality in the very first samples of a stream, the
1068	   Ogg encoder MAY use linear predictive coding (LPC) extrapolation
1069	   [linear-prediction] to generate at least 120 extra samples at the
1070	   beginning to avoid the Opus encoder having to encode a discontinuous
1071	   signal.  For an input file containing 'length' samples, the Ogg
1072	   encoder SHOULD set the pre-skip header value to
1073	   delay_samples+extra_samples, encode at least
1074	   length+delay_samples+extra_samples samples, and set the granulepos of
1075	   the last page to length+delay_samples+extra_samples.  This ensures
1076	   that the encoded file has the same duration as the original, with no
1077	   time offset.  The best way to pad the end of the stream is to also
1078	   use LPC extrapolation, but zero-padding is also acceptable.
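
	   The bookkeeping described above might look as follows in C.  This
	   is a sketch under stated assumptions: the type and function names
	   are hypothetical, and 'length' is assumed to be expressed in
	   samples at 48 kHz.

```c
#include <stdint.h>

typedef struct {
    uint16_t pre_skip;            /* value for the ID header */
    int64_t  min_encoded_samples; /* encode at least this many samples */
    int64_t  final_granulepos;    /* granulepos of the last page */
} ogg_opus_plan;

/* length: input length in samples; delay_samples: encoder lookahead;
 * extra_samples: samples prepended by extrapolation (may be 0).
 * Returns 0 on success, or -1 if an argument is negative or the
 * pre-skip would not fit in the 16-bit header field. */
int ogg_opus_plan_stream(int64_t length, int32_t delay_samples,
                         int32_t extra_samples, ogg_opus_plan *plan)
{
    int64_t skip = (int64_t)delay_samples + extra_samples;

    if (length < 0 || delay_samples < 0 || extra_samples < 0 ||
        skip > 65535)
        return -1;
    plan->pre_skip = (uint16_t)skip;
    plan->min_encoded_samples = length + skip;
    plan->final_granulepos = length + skip;
    return 0;
}
```

	   Since the final granulepos minus the pre-skip equals 'length', the
	   decoded stream has the same duration as the input.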

1080	7.1.  LPC Extrapolation

1082	   The first step in LPC extrapolation is to compute linear prediction
1083	   coefficients [lpc-sample].  When extending the end of the signal,
1084	   order-N (typically with N ranging from 8 to 40) LPC analysis is
1085	   performed on a window near the end of the signal.  The last N samples
1086	   are used as memory to an infinite impulse response (IIR) filter.

1088	   The filter is then applied on a zero input to extrapolate the end of
1089	   the signal.  Let a(k) be the kth LPC coefficient and x(n) be the nth
1090	   sample of the signal.  Each new sample past the end of the signal is
1091	   computed as:

1093	                                  N
1094	                                 ---
1095	                          x(n) = \   a(k)*x(n-k)
1096	                                 /
1097	                                 ---
1098	                                 k=1

1100	   The process is repeated independently for each channel.  It is
1101	   possible to extend the beginning of the signal by applying the same
1102	   process backward in time.  When extending the beginning of the
1103	   signal, it is best to apply a "fade in" to the extrapolated signal,
1104	   e.g., by multiplying it by a half-Hanning window [hanning].
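
	   The recurrence above can be sketched in C as follows for a single
	   channel.  The function name is hypothetical, and the LPC
	   coefficients are assumed to have been computed beforehand by the
	   caller.  The sketch covers only the forward extension past the end
	   of the signal.

```c
#include <stddef.h>

/* Extends a signal by 'extra' samples using the recurrence
 *   x(n) = sum over k = 1..order of a(k) * x(n-k).
 * x must hold 'len' valid samples followed by room for 'extra' more.
 * a[k-1] holds the coefficient a(k).
 * Returns 0 on success, -1 if fewer than 'order' samples are available
 * to serve as filter memory. */
int lpc_extrapolate(float *x, size_t len, const float *a, size_t order,
                    size_t extra)
{
    size_t n, k;

    if (len < order)
        return -1;
    for (n = len; n < len + extra; n++) {
        float acc = 0.0f;
        for (k = 1; k <= order; k++)
            acc += a[k - 1] * x[n - k];
        x[n] = acc;   /* each new sample feeds later predictions */
    }
    return 0;
}
```

	   With a single coefficient a(1) = 0.5 and a last sample of 1.0, the
	   extrapolated samples are 0.5, 0.25, and so on, i.e., a decaying
	   continuation rather than an abrupt drop to zero.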

1106	7.2.  Continuous Chaining

1108	   In some applications, such as Internet radio, it is desirable to cut
1109	   a long stream into smaller chains, e.g., so the comment header can be
1110	   updated.  This can be done simply by separating the input streams
1111	   into segments and encoding each segment independently.  The drawback
1112	   of this approach is that it creates a small discontinuity at the
1113	   boundary due to the lossy nature of Opus.  An encoder MAY avoid this
1114	   discontinuity by using the following procedure:

1116	   1.  Encode the last frame of the first segment as an independent
1117	       frame by turning off all forms of inter-frame prediction.  De-
1118	       emphasis is allowed.

1120	   2.  Set the granulepos of the last page to a point near the end of
1121	       the last frame.

1123	   3.  Begin the second segment with a copy of the last frame of the
1124	       first segment.

1126	   4.  Set the pre-skip value of the second stream in such a way as to
1127	       properly join the two streams.

1129	   5.  Continue the encoding process normally from there, without any
1130	       reset to the encoder.

1132	   In encoders derived from the reference implementation, inter-frame
1133	   prediction can be turned off by calling:

1135	     opus_encoder_ctl(encoder_state, OPUS_SET_PREDICTION_DISABLED(1));

1137	   Prediction should be enabled again before resuming normal encoding,
1138	   even after a reset.

1140	8.  Implementation Status

1142	   A brief summary of major implementations of this draft is available
1143	   at [1], along with their status.

1145	   [Note to RFC Editor: please remove this entire section before final
1146	   publication per [RFC6982].]

1148	9.  Security Considerations

1150	   Implementations of the Opus codec need to take appropriate security
1151	   considerations into account, as outlined in [RFC4732].  This is just
1152	   as much a problem for the container as it is for the codec itself.
1153	   It is extremely important for the decoder to be robust against
1154	   malicious payloads.  Malicious payloads must not cause the decoder to
1155	   overrun its allocated memory or to take an excessive amount of
1156	   resources to decode.  Although problems in encoders are typically
1157	   rarer, the same applies to the encoder.  Malicious audio streams must
1158	   not cause the encoder to misbehave because this would allow an
1159	   attacker to attack transcoding gateways.

1161	   Like most other container formats, Ogg Opus files should not be used
1162	   with insecure ciphers or cipher modes that are vulnerable to known-
1163	   plaintext attacks.  Elements such as the Ogg page capture pattern and
1164	   the magic signatures in the ID header and the comment header all have
1165	   easily predictable values, in addition to various elements of the
1166	   codec data itself.
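
   As an illustration of both points above, the following C sketch
   tests whether a buffer begins with an Ogg page whose payload starts
   with the ID header magic signature, while never reading past the end
   of the buffer.  The function name looks_like_ogg_opus and the two
   constants are hypothetical; the offsets follow the page layout in
   [RFC3533].  It is a minimal sketch, not a validator: it does not
   check the page CRC or the lacing values.

```c
#include <stddef.h>
#include <string.h>

/* Page layout per RFC 3533: 27 fixed header bytes, then the segment table. */
#define OGG_FIXED_HEADER_LEN 27
#define OGG_NSEGS_OFFSET     26  /* offset of the page_segments count */

/* Hypothetical helper: returns 1 if buf starts with an Ogg page whose
   payload begins with "OpusHead", 0 otherwise.
   Every read is bounds-checked against len first. */
int looks_like_ogg_opus(const unsigned char *buf, size_t len)
{
    size_t payload_off;

    if (buf == NULL || len < OGG_FIXED_HEADER_LEN)
        return 0;
    if (memcmp(buf, "OggS", 4) != 0)   /* Ogg page capture pattern */
        return 0;
    if (buf[4] != 0)                   /* stream structure version */
        return 0;

    /* The payload starts after the fixed header and the segment table. */
    payload_off = OGG_FIXED_HEADER_LEN + (size_t)buf[OGG_NSEGS_OFFSET];
    if (len < payload_off || len - payload_off < 8)
        return 0;                      /* truncated input: reject, do not read */

    return memcmp(buf + payload_off, "OpusHead", 8) == 0;
}
```

   That a check this short can recognize the start of a stream is the
   reason these bytes count as known plaintext for an attacker.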

1168	10.  Content Type

1170	   An "Ogg Opus file" consists of one or more sequentially multiplexed
1171	   segments, each containing exactly one Ogg Opus stream.  The
1172	   RECOMMENDED mime-type for Ogg Opus files is "audio/ogg".

1174	   If more specificity is desired, one MAY indicate the presence of Opus
1175	   streams using the codecs parameter defined in [RFC6381], e.g.,

1177	                            audio/ogg; codecs=opus

1179	   for an Ogg Opus file.

1181	   The RECOMMENDED filename extension for Ogg Opus files is '.opus'.

1183	   When Opus is concurrently multiplexed with other streams in an Ogg
1184	   container, one SHOULD use one of the "audio/ogg", "video/ogg", or
1185	   "application/ogg" mime-types, as defined in [RFC5334].  Such streams
1186	   are not strictly "Ogg Opus files" as described above, since they
1187	   contain more than a single Opus stream per sequentially multiplexed
1188	   segment, e.g., video or multiple audio tracks.  In such cases the
1189	   '.opus' filename extension is NOT RECOMMENDED.
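
   The recommendations in this section can be sketched in C as follows.
   The struct segment_summary and all three function names are
   hypothetical and exist only for illustration.  The sketch assumes
   the caller has already counted the streams in each sequentially
   multiplexed segment.  For anything other than an Ogg Opus file it
   returns NULL, leaving the choice among the [RFC5334] types to the
   caller.

```c
#include <stddef.h>

/* Hypothetical summary of one sequentially multiplexed segment. */
struct segment_summary {
    unsigned opus_streams;   /* Opus streams found in the segment */
    unsigned other_streams;  /* any other streams (video, other audio) */
};

/* Returns 1 if every segment holds exactly one Opus stream and nothing
   else, i.e. the file is an "Ogg Opus file" as defined above. */
int is_ogg_opus_file(const struct segment_summary *segs, size_t count)
{
    size_t i;

    if (segs == NULL || count == 0)
        return 0;
    for (i = 0; i < count; i++) {
        if (segs[i].opus_streams != 1 || segs[i].other_streams != 0)
            return 0;
    }
    return 1;
}

/* Media type with the optional codecs parameter, or NULL when the file
   is not an Ogg Opus file and the caller must pick a type itself. */
const char *suggested_media_type(const struct segment_summary *segs,
                                 size_t count)
{
    return is_ogg_opus_file(segs, count) ? "audio/ogg; codecs=opus" : NULL;
}

/* ".opus" for an Ogg Opus file, NULL where it is NOT RECOMMENDED. */
const char *suggested_extension(const struct segment_summary *segs,
                                size_t count)
{
    return is_ogg_opus_file(segs, count) ? ".opus" : NULL;
}
```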

1191	11.  IANA Considerations

1193	   This document has no actions for IANA.

1195	12.  Acknowledgments

1197	   Thanks to Greg Maxwell, Christopher "Monty" Montgomery, and Jean-Marc
1198	   Valin for their valuable contributions to this document.  Additional
1199	   thanks to Andrew D'Addesio, Greg Maxwell, and Vincent Penquerc'h for
1200	   their feedback based on early implementations.

1202	13.  Copying Conditions

1204	   The authors agree to grant third parties the irrevocable right to
1205	   copy, use, and distribute the work, with or without modification, in
1206	   any medium, without royalty, provided that, unless separate
1207	   permission is granted, redistributed modified works do not contain
1208	   misleading author, version, name of work, or endorsement information.

1210	14.  References

1212	14.1.  Normative References

1214	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1215	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1217	   [RFC3533]  Pfeiffer, S., "The Ogg Encapsulation Format Version 0",
1218	              RFC 3533, May 2003.

1220	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1221	              10646", STD 63, RFC 3629, November 2003.

1223	   [RFC5334]  Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media
1224	              Types", RFC 5334, September 2008.

1226	   [RFC6381]  Gellens, R., Singer, D., and P. Frojdh, "The 'Codecs' and
1227	              'Profiles' Parameters for "Bucket" Media Types", RFC 6381,
1228	              August 2011.

1230	   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the
1231	              Opus Audio Codec", RFC 6716, September 2012.

1233	   [EBU-R128]
1234	              EBU Technical Committee, "Loudness Recommendation EBU
1235	              R128", August 2011, <https://tech.ebu.ch/loudness>.

1237	   [vorbis-comment]
1238	              Montgomery, C., "Ogg Vorbis I Format Specification:
1239	              Comment Field and Header Specification", July 2002,
1240	              <https://www.xiph.org/vorbis/doc/v-comment.html>.

1242	14.2.  Informative References

1244	   [RFC4732]  Handley, M., Rescorla, E., and IAB, "Internet Denial-of-
1245	              Service Considerations", RFC 4732, December 2006.

1247	   [RFC6982]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
1248	              Code: The Implementation Status Section", RFC 6982, July
1249	              2013.

1251	   [flac]     Coalson, J., "FLAC - Free Lossless Audio Codec Format
1252	              Description", January 2008, <https://xiph.org/flac/
1253	              format.html>.

1255	   [hanning]  Wikipedia, "Hann window", May 2013, <https://
1256	              en.wikipedia.org/wiki/
1257	              Hamming_function#Hann_.28Hanning.29_window>.

1259	   [linear-prediction]
1260	              Wikipedia, "Linear Predictive Coding", January 2014,
1261	              <https://en.wikipedia.org/wiki/Linear_predictive_coding>.

1263	   [lpc-sample]
1264	              Degener, J. and C. Bormann, "Autocorrelation LPC coeff
1265	              generation algorithm (Vorbis source code)", November 1994,
1266	              <https://svn.xiph.org/trunk/vorbis/lib/lpc.c>.

1268	   [replay-gain]
1269	              Parker, C. and M. Leese, "VorbisComment: Replay Gain",
1270	              June 2009, <https://wiki.xiph.org/
1271	              VorbisComment#Replay_Gain>.

1273	   [seeking]  Pfeiffer, S., Parker, C., and G. Maxwell, "Granulepos
1274	              Encoding and How Seeking Really Works", May 2012,
1275	              <https://wiki.xiph.org/Seeking>.

1277	   [vorbis-mapping]
1278	              Montgomery, C., "The Vorbis I Specification, Section 4.3.9
1279	              Output Channel Order", January 2010, <https://www.xiph.org
1280	              /vorbis/doc/Vorbis_I_spec.html#x1-800004.3.9>.

1282	   [vorbis-trim]
1283	              Montgomery, C., "The Vorbis I Specification, Appendix A:
1284	              Embedding Vorbis into an Ogg stream", November 2008,
1285	              <https://xiph.org/vorbis/doc/
1286	              Vorbis_I_spec.html#x1-130000A.2>.

1288	   [wave-multichannel]
1289	              Microsoft Corporation, "Multiple Channel Audio Data and
1290	              WAVE Files", March 2007, <http://msdn.microsoft.com/en-us/
1291	              windows/hardware/gg463006.aspx>.

1293	14.3.  URIs

1295	   [1] https://wiki.xiph.org/OggOpusImplementation

1297	Authors' Addresses

1299	   Timothy B. Terriberry
1300	   Mozilla Corporation
1301	   650 Castro Street
1302	   Mountain View, CA  94041
1303	   USA

1305	   Phone: +1 650 903-0800
1306	   Email: tterribe@xiph.org

1308	   Ron Lee
1309	   Voicetronix
1310	   246 Pulteney Street, Level 1
1311	   Adelaide, SA  5000
1312	   Australia

1314	   Phone: +61 8 8232 9112
1315	   Email: ron@debian.org

1317	   Ralph Giles
1318	   Mozilla Corporation
1319	   163 West Hastings Street
1320	   Vancouver, BC  V6B 1H5
1321	   Canada

1323	   Phone: +1 778 785 1540
1324	   Email: giles@xiph.org