idnits 2.17.1 

draft-ietf-codec-oggopus-11.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document updates RFC5334, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

     (Using the creation date from RFC5334, updated by this document, for
     RFC5378 checks: 2007-12-03)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (January 28, 2016) is 3011 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 1453

  -- Looks like a reference, but probably isn't: '8' on line 1289

  == Missing Reference: 'RFCXXXX' is mentioned on line 1323, but not defined

  ** Downref: Normative reference to an Informational RFC: RFC 3533

  ** Downref: Normative reference to an Informational RFC: RFC 4732

  ** Obsolete normative reference: RFC 5226 (Obsoleted by RFC 8126)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'EBU-R128'

  -- Obsolete informational reference (is this intentional?): RFC 6982
     (Obsoleted by RFC 7942)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	codec                                                      T. Terriberry
3	Internet-Draft                                       Mozilla Corporation
4	Updates: 5334 (if approved)                                       R. Lee
5	Intended status: Standards Track                             Voicetronix
6	Expires: July 31, 2016                                          R. Giles
7	                                                     Mozilla Corporation
8	                                                        January 28, 2016

10	               Ogg Encapsulation for the Opus Audio Codec
11	                      draft-ietf-codec-oggopus-11

13	Abstract

15	   This document defines the Ogg encapsulation for the Opus interactive
16	   speech and audio codec.  This allows data encoded in the Opus format
17	   to be stored in an Ogg logical bitstream.

19	Status of This Memo

21	   This Internet-Draft is submitted in full conformance with the
22	   provisions of BCP 78 and BCP 79.

24	   Internet-Drafts are working documents of the Internet Engineering
25	   Task Force (IETF).  Note that other groups may also distribute
26	   working documents as Internet-Drafts.  The list of current Internet-
27	   Drafts is at http://datatracker.ietf.org/drafts/current/.

29	   Internet-Drafts are draft documents valid for a maximum of six months
30	   and may be updated, replaced, or obsoleted by other documents at any
31	   time.  It is inappropriate to use Internet-Drafts as reference
32	   material or to cite them other than as "work in progress."

34	   This Internet-Draft will expire on July 31, 2016.

36	Copyright Notice

38	   Copyright (c) 2016 IETF Trust and the persons identified as the
39	   document authors.  All rights reserved.

41	   This document is subject to BCP 78 and the IETF Trust's Legal
42	   Provisions Relating to IETF Documents
43	   (http://trustee.ietf.org/license-info) in effect on the date of
44	   publication of this document.  Please review these documents
45	   carefully, as they describe your rights and restrictions with respect
46	   to this document.  Code Components extracted from this document must
47	   include Simplified BSD License text as described in Section 4.e of
48	   the Trust Legal Provisions and are provided without warranty as
49	   described in the Simplified BSD License.

51	Table of Contents

53	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
54	   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
55	   3.  Packet Organization . . . . . . . . . . . . . . . . . . . . .   3
56	   4.  Granule Position  . . . . . . . . . . . . . . . . . . . . . .   5
57	     4.1.  Repairing Gaps in Real-time Streams . . . . . . . . . . .   6
58	     4.2.  Pre-skip  . . . . . . . . . . . . . . . . . . . . . . . .   7
59	     4.3.  PCM Sample Position . . . . . . . . . . . . . . . . . . .   8
60	     4.4.  End Trimming  . . . . . . . . . . . . . . . . . . . . . .   9
61	     4.5.  Restrictions on the Initial Granule Position  . . . . . .   9
62	     4.6.  Seeking and Pre-roll  . . . . . . . . . . . . . . . . . .  10
63	   5.  Header Packets  . . . . . . . . . . . . . . . . . . . . . . .  11
64	     5.1.  Identification Header . . . . . . . . . . . . . . . . . .  11
65	       5.1.1.  Channel Mapping . . . . . . . . . . . . . . . . . . .  15
66	     5.2.  Comment Header  . . . . . . . . . . . . . . . . . . . . .  20
67	       5.2.1.  Tag Definitions . . . . . . . . . . . . . . . . . . .  23
68	   6.  Packet Size Limits  . . . . . . . . . . . . . . . . . . . . .  25
69	   7.  Encoder Guidelines  . . . . . . . . . . . . . . . . . . . . .  26
70	     7.1.  LPC Extrapolation . . . . . . . . . . . . . . . . . . . .  26
71	     7.2.  Continuous Chaining . . . . . . . . . . . . . . . . . . .  27
72	   8.  Implementation Status . . . . . . . . . . . . . . . . . . . .  27
73	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  28
74	   10. Content Type  . . . . . . . . . . . . . . . . . . . . . . . .  28
75	   11. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  29
76	   12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . .  29
77	   13. RFC Editor Notes  . . . . . . . . . . . . . . . . . . . . . .  30
78	   14. References  . . . . . . . . . . . . . . . . . . . . . . . . .  30
79	     14.1.  Normative References . . . . . . . . . . . . . . . . . .  30
80	     14.2.  Informative References . . . . . . . . . . . . . . . . .  31
81	     14.3.  URIs . . . . . . . . . . . . . . . . . . . . . . . . . .  32
82	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  32

84	1.  Introduction

86	   The IETF Opus codec is a low-latency audio codec optimized for both
87	   voice and general-purpose audio.  See [RFC6716] for technical
88	   details.  This document defines the encapsulation of Opus in a
89	   continuous, logical Ogg bitstream [RFC3533].  Ogg encapsulation
90	   provides Opus with a long-term storage format supporting all of the
91	   essential features, including metadata, fast and accurate seeking,
92	   corruption detection, recapture after errors, low overhead, and the
93	   ability to multiplex Opus with other codecs (including video) with
94	   minimal buffering.  It also provides a live streamable format,
95	   capable of delivery over a reliable stream-oriented transport,
96	   without requiring all the data, or even the total length of the data,
97	   up-front, in a form that is identical to the on-disk storage format.

99	   Ogg bitstreams are made up of a series of 'pages', each of which
100	   contains data from one or more 'packets'.  Pages are the fundamental
101	   unit of multiplexing in an Ogg stream.  Each page is associated with
102	   a particular logical stream and contains a capture pattern and
103	   checksum, flags to mark the beginning and end of the logical stream,
104	   and a 'granule position' that represents an absolute position in the
105	   stream, to aid seeking.  A single page can contain up to 65,025
106	   octets of packet data from up to 255 different packets.  Packets can
107	   be split arbitrarily across pages, and continued from one page to the
108	   next (allowing packets much larger than would fit on a single page).
109	   Each page contains 'lacing values' that indicate how the data is
110	   partitioned into packets, allowing a demultiplexer (demuxer) to
111	   recover the packet boundaries without examining the encoded data.  A
112	   packet is said to 'complete' on a page when the page contains the
113	   final lacing value corresponding to that packet.
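
   As a minimal sketch of how a demuxer might recover packet boundaries
   from lacing values alone, the following Python function splits the
   payload of one page.  The function name and its return convention are
   illustrative assumptions, not part of this specification.  If the
   page has the 'continued packet' flag set, the first piece returned is
   only the tail of a packet begun on an earlier page.

```python
def split_packets(lacing_values, payload):
    """Split one page's payload into packet pieces using its lacing values.

    Returns (completed, trailing). `completed` lists the pieces whose final
    lacing value is on this page. `trailing` holds the data of a packet that
    continues onto the next page (empty if there is none).
    """
    if len(lacing_values) > 255:
        raise ValueError("a page carries at most 255 lacing values")
    if any(not 0 <= v <= 255 for v in lacing_values):
        raise ValueError("lacing values are single octets")
    if sum(lacing_values) != len(payload):
        raise ValueError("lacing values do not match the payload length")

    completed = []
    current = bytearray()
    offset = 0
    for value in lacing_values:
        current += payload[offset:offset + value]
        offset += value
        if value < 255:
            # Any value below 255 (including 0) ends the current packet.
            completed.append(bytes(current))
            current = bytearray()
    # Data left over here belongs to a packet whose last lacing value
    # on this page was 255, so it continues on the next page.
    return completed, bytes(current)
```

   A packet whose length is an exact multiple of 255 octets is
   terminated by a lacing value of 0, which the loop above handles
   without a special case.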

115	   This encapsulation defines the contents of the packet data, including
116	   the necessary headers, the organization of those packets into a
117	   logical stream, and the interpretation of the codec-specific granule
118	   position field.  It does not attempt to describe or specify the
119	   existing Ogg container format.  Readers unfamiliar with the basic
120	   concepts mentioned above are encouraged to review the details in
121	   [RFC3533].

123	2.  Terminology

125	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
126	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
127	   "OPTIONAL" in this document are to be interpreted as described in
128	   [RFC2119].

130	3.  Packet Organization

132	   An Ogg Opus stream is organized as follows.

134	   There are two mandatory header packets.  The first packet in the
135	   logical Ogg bitstream MUST contain the identification (ID) header,
136	   which uniquely identifies a stream as Opus audio.  The format of this
137	   header is defined in Section 5.1.  It is placed alone (without any
138	   other packet data) on the first page of the logical Ogg bitstream,
139	   and completes on that page.  This page has its 'beginning of stream'
140	   flag set.

142	   The second packet in the logical Ogg bitstream MUST contain the
143	   comment header, which contains user-supplied metadata.  The format of
144	   this header is defined in Section 5.2.  It MAY span multiple pages,
145	   beginning on the second page of the logical stream.  However many
146	   pages it spans, the comment header packet MUST finish the page on
147	   which it completes.

149	   All subsequent pages are audio data pages, and the Ogg packets they
150	   contain are audio data packets.  Each audio data packet contains one
151	   Opus packet for each of N different streams, where N is typically one
152	   for mono or stereo, but MAY be greater than one for multichannel
153	   audio.  The value N is specified in the ID header (see
154	   Section 5.1.1), and is fixed over the entire length of the logical
155	   Ogg bitstream.

157	   The first (N - 1) Opus packets, if any, are packed one after another
158	   into the Ogg packet, using the self-delimiting framing from
159	   Appendix B of [RFC6716].  The remaining Opus packet is packed at the
160	   end of the Ogg packet using the regular, undelimited framing from
161	   Section 3 of [RFC6716].  All of the Opus packets in a single Ogg
162	   packet MUST be constrained to have the same duration.  An
163	   implementation of this specification SHOULD treat any Opus packet
164	   whose duration is different from that of the first Opus packet in an
165	   Ogg packet as if it were a malformed Opus packet with an invalid
166	   Table Of Contents (TOC) sequence.

168	   The TOC sequence at the beginning of each Opus packet indicates the
169	   coding mode, audio bandwidth, channel count, duration (frame size),
170	   and number of frames per packet, as described in Section 3.1
171	   of [RFC6716].  The coding mode is one of SILK, Hybrid, or Constrained
172	   Energy Lapped Transform (CELT).  The combination of coding mode,
173	   audio bandwidth, and frame size is referred to as the configuration
174	   of an Opus packet.

176	   Packets are placed into Ogg pages in order until the end of stream.
177	   Audio data packets might span page boundaries.  The first audio data
178	   page could have the 'continued packet' flag set (indicating the first
179	   audio data packet is continued from a previous page) if, for example,
180	   it was a live stream joined mid-broadcast, with the headers pasted on
181	   the front.  A demuxer SHOULD NOT attempt to decode the data for the
182	   first packet on a page with the 'continued packet' flag set if the
183	   previous page with packet data does not end in a continued packet
184	   (i.e., did not end with a lacing value of 255) or if the page
185	   sequence numbers are not consecutive, unless the demuxer has some
186	   special knowledge that would allow it to interpret this data despite
187	   the missing pieces.  An implementation MUST treat a zero-octet audio
188	   data packet as if it were a malformed Opus packet as described in
189	   Section 3.4 of [RFC6716].

191	   A logical stream ends with a page with the 'end of stream' flag set,
192	   but implementations need to be prepared to deal with truncated
193	   streams that do not have a page marked 'end of stream'.  There is no
194	   reason for the final packet on the last page to be a continued
195	   packet, i.e., for the final lacing value to be 255.  However,
196	   demuxers might encounter such streams, possibly as the result of a
197	   transfer that did not complete or of corruption.  A demuxer SHOULD
198	   NOT attempt to decode the data from a packet that continues onto a
199	   subsequent page (i.e., when the page ends with a lacing value of 255)
200	   if the next page with packet data does not have the 'continued
201	   packet' flag set or does not exist, or if the page sequence numbers
202	   are not consecutive, unless the demuxer has some special knowledge
203	   that would allow it to interpret this data despite the missing
204	   pieces.  There MUST NOT be any more pages in an Opus logical
205	   bitstream after a page marked 'end of stream'.

207	4.  Granule Position

209	   The granule position MUST be zero for the ID header page and the page
210	   where the comment header completes.  That is, the first page in the
211	   logical stream, and the last header page before the first audio data
212	   page both have a granule position of zero.

214	   The granule position of an audio data page encodes the total number
215	   of PCM samples in the stream up to and including the last fully-
216	   decodable sample from the last packet completed on that page.  The
217	   granule position of the first audio data page will usually be larger
218	   than zero, as described in Section 4.5.

220	   A page that is entirely spanned by a single packet (that completes on
221	   a subsequent page) has no granule position, and the granule position
222	   field is set to the special value '-1' in two's complement.

224	   The granule position of an audio data page is in units of PCM audio
225	   samples at a fixed rate of 48 kHz (per channel; a stereo stream's
226	   granule position does not increment at twice the speed of a mono
227	   stream).  It is possible to run an Opus decoder at other sampling
228	   rates, but all of them evenly divide 48 kHz.  Therefore, the value in
229	   the granule position field always counts samples assuming a 48 kHz
230	   decoding rate, and the rest of this specification makes the same
231	   assumption.

233	   The duration of an Opus packet as defined in [RFC6716] can be any
234	   multiple of 2.5 ms, up to a maximum of 120 ms.  This duration is
235	   encoded in the TOC sequence at the beginning of each packet.  The
236	   number of samples returned by a decoder corresponds to this duration
237	   exactly, even for the first few packets.  For example, a 20 ms packet
238	   fed to a decoder running at 48 kHz will always return 960 samples.  A
239	   demuxer can parse the TOC sequence at the beginning of each Ogg
240	   packet to work backwards or forwards from a packet with a known
241	   granule position (i.e., the last packet completed on some page) in
242	   order to assign granule positions to every packet, or even every
243	   individual sample.  The one exception is the last page in the stream,
244	   as described below.
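
   The duration lookup described above can be sketched as follows.  The
   function name is hypothetical; the bit layout and the frame sizes
   follow the TOC byte definition in Section 3.1 of [RFC6716], and the
   frame count byte of a code 3 packet follows Section 3.2.5 of that
   document.  For packets carrying several streams, only the first Opus
   packet's TOC is examined, since all streams share one duration.

```python
def packet_samples(packet):
    """Return the duration of an Opus packet in samples at 48 kHz."""
    if len(packet) < 1:
        raise ValueError("a zero-octet packet has no TOC byte")
    toc = packet[0]
    config = toc >> 3              # top five bits: configuration number
    if config >= 16:               # CELT: 2.5, 5, 10, or 20 ms
        frame = 120 << (config & 3)
    elif config >= 12:             # Hybrid: 10 or 20 ms
        frame = 480 << (config & 1)
    else:                          # SILK: 10, 20, 40, or 60 ms
        frame = (480, 960, 1920, 2880)[config & 3]

    code = toc & 3                 # low two bits: frame count code
    if code == 0:
        count = 1
    elif code in (1, 2):
        count = 2
    else:
        if len(packet) < 2:
            raise ValueError("code 3 packet lacks a frame count byte")
        count = packet[1] & 0x3F   # low six bits: number of frames
        if count == 0:
            raise ValueError("code 3 packet with zero frames")

    total = frame * count
    if total > 5760:               # 120 ms at 48 kHz
        raise ValueError("packet duration exceeds 120 ms")
    return total
```

   For instance, a one-byte packet with TOC byte 0x78 (Hybrid fullband,
   20 ms, code 0) yields 960 samples, matching the 20 ms example above.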

246	   All other pages with completed packets after the first MUST have a
247	   granule position equal to the number of samples contained in packets
248	   that complete on that page plus the granule position of the most
249	   recent page with completed packets.  This guarantees that a demuxer
250	   can assign individual packets the same granule position when working
251	   forwards as when working backwards.  For this to work, there cannot
252	   be any gaps.
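
   A demuxer or validator could check this rule roughly as sketched
   below.  The input representation (a list of granule position and
   per-packet sample counts for each page) and the function name are
   assumptions made for illustration.  The sketch skips pages whose
   granule position is -1 and expects the caller to leave out the final
   page, which is the one exception noted above.

```python
def first_granule_mismatch(pages):
    """Find the first page whose granule position breaks the running sum.

    `pages` is a list of (granule_position, sample_counts) tuples, where
    sample_counts lists the durations (in 48 kHz samples) of the packets
    that complete on that page. The first entry is taken as the reference.
    Returns the index of the first offending page, or None if all agree.
    """
    if not pages:
        return None
    previous = pages[0][0]
    for index, (granule, sample_counts) in enumerate(pages[1:], start=1):
        if granule == -1:
            # No packet completes on this page, so it has no position.
            continue
        expected = previous + sum(sample_counts)
        if granule != expected:
            return index
        previous = granule
    return None
```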

254	4.1.  Repairing Gaps in Real-time Streams

256	   In order to support capturing a real-time stream that has lost or not
257	   transmitted packets, a multiplexer (muxer) SHOULD emit packets that
258	   explicitly request the use of Packet Loss Concealment (PLC) in place
259	   of the missing packets.  Implementations that fail to do so still
260	   MUST NOT increment the granule position for a page by anything other
261	   than the number of samples contained in packets that actually
262	   complete on that page.

264	   Only gaps that are a multiple of 2.5 ms are repairable, as these are
265	   the only durations that can be created by packet loss or
266	   discontinuous transmission.  Muxers need not handle other gap sizes.
267	   Creating the necessary packets involves synthesizing a TOC byte
268	   (defined in Section 3.1 of [RFC6716])--and whatever additional
269	   internal framing is needed--to indicate the packet duration for each
270	   stream.  The actual length of each missing Opus frame inside the
271	   packet is zero bytes, as defined in Section 3.2.1 of [RFC6716].

273	   Zero-byte frames MAY be packed into packets using any of codes 0, 1,
274	   2, or 3.  When successive frames have the same configuration, the
275	   higher code packings reduce overhead.  Likewise, if the TOC
276	   configuration matches, the muxer MAY further combine the empty frames
277	   with previous or subsequent non-zero-length frames (using code 2 or
278	   VBR code 3).

280	   [RFC6716] does not impose any requirements on the PLC, but this
281	   section outlines choices that are expected to have a positive
282	   influence on most PLC implementations, including the reference
283	   implementation.  Synthesized TOC sequences SHOULD maintain the same
284	   mode, audio bandwidth, channel count, and frame size as the previous
285	   packet (if any).  This is the simplest and usually the most well-
286	   tested case for the PLC to handle and it covers all losses that do
287	   not include a configuration switch, as defined in Section 4.5
288	   of [RFC6716].

290	   When a previous packet is available, keeping the audio bandwidth and
291	   channel count the same allows the PLC to provide maximum continuity
292	   in the concealment data it generates.  However, if the size of the
293	   gap is not a multiple of the most recent frame size, then the frame
294	   size will have to change for at least some frames.  Such changes
295	   SHOULD be delayed as long as possible to simplify things for PLC
296	   implementations.

298	   As an example, a 95 ms gap could be encoded as nineteen 5 ms frames
299	   in two bytes with a single CBR code 3 packet.  If the previous frame
300	   size was 20 ms, using four 20 ms frames followed by three 5 ms frames
301	   requires 4 bytes (plus an extra byte of Ogg lacing overhead), but
302	   allows the PLC to use its well-tested steady state behavior for as
303	   long as possible.  The total bitrate of the latter approach,
304	   including Ogg overhead, is about 0.4 kbps, so the impact on file size
305	   is minimal.

307	   Changing modes is discouraged, since this causes some decoder
308	   implementations to reset their PLC state.  However, SILK and Hybrid
309	   mode frames cannot fill gaps that are not a multiple of 10 ms.  If
310	   switching to CELT mode is needed to match the gap size, a muxer
311	   SHOULD do so at the end of the gap to allow the PLC to function for
312	   as long as possible.

314	   In the example above, if the previous frame was a 20 ms SILK mode
315	   frame, the better solution is to synthesize a packet describing four
316	   20 ms SILK frames, followed by a packet with a single 10 ms SILK
317	   frame, and finally a packet with a 5 ms CELT frame, to fill the 95 ms
318	   gap.  This also requires four bytes to describe the synthesized
319	   packet data (two bytes for a CBR code 3 and one byte each for two
320	   code 0 packets) but three bytes of Ogg lacing overhead are needed to
321	   mark the packet boundaries.  At 0.6 kbps, this is still a minimal
322	   bitrate impact over a naive, low-quality solution.
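
   The packets in this example can be synthesized with very little
   code.  The sketch below is illustrative only: the function name is
   hypothetical, and the configuration numbers used in the usage note
   (9 and 8 for 20 ms and 10 ms wideband SILK, 21 for 5 ms wideband
   CELT) are taken from the table in Section 3.1 of [RFC6716] on the
   assumption that the previous packet was wideband.

```python
def plc_packet(config, frame_count, stereo=False):
    """Build an Opus packet made only of zero-byte frames.

    `config` is the 5-bit configuration number from the TOC byte.
    The caller is responsible for keeping frame_count times the frame
    size at or below 120 ms.
    """
    if not 0 <= config <= 31:
        raise ValueError("config is a 5-bit value")
    if not 1 <= frame_count <= 48:
        raise ValueError("a packet holds between 1 and 48 frames")
    toc = (config << 3) | (int(stereo) << 2)
    if frame_count == 1:
        # Code 0: a single frame, which here has zero bytes.
        return bytes([toc])
    # Code 3 with the VBR and padding bits clear (CBR, no padding).
    # With no remaining bytes, every frame has a length of zero.
    return bytes([toc | 3, frame_count])


# Four 20 ms SILK frames, one 10 ms SILK frame, one 5 ms CELT frame.
gap_packets = [plc_packet(9, 4), plc_packet(8, 1), plc_packet(21, 1)]
```

   The three packets total four bytes.  Adding one lacing value per
   packet gives 7 bytes over 95 ms, or roughly 0.59 kbps, which is
   consistent with the figure quoted above.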

324	   Since medium-band audio is an option only in the SILK mode, wideband
325	   frames SHOULD be generated if switching from that configuration to
326	   CELT mode, to ensure that any PLC implementation which does try to
327	   migrate state between the modes will be able to preserve all of the
328	   available audio bandwidth.

330	4.2.  Pre-skip

332	   There is some amount of latency introduced during the decoding
333	   process, to allow for overlap in the CELT mode, stereo mixing in the
334	   SILK mode, and resampling.  The encoder might have introduced
335	   additional latency through its own resampling and analysis (though
336	   the exact amount is not specified).  Therefore, the first few samples
337	   produced by the decoder do not correspond to real input audio, but
338	   are instead composed of padding inserted by the encoder to compensate
339	   for this latency.  These samples need to be stored and decoded, as
340	   Opus is an asymptotically convergent predictive codec, meaning the
341	   decoded contents of each frame depend on the recent history of
342	   decoder inputs.  However, a player will want to skip these samples
343	   after decoding them.

345	   A 'pre-skip' field in the ID header (see Section 5.1) signals the
346	   number of samples that SHOULD be skipped (decoded but discarded) at
347	   the beginning of the stream, though some specific applications might
348	   have a reason for looking at that data.  This amount need not be a
349	   multiple of 2.5 ms, MAY be smaller than a single packet, or MAY span
350	   the contents of several packets.  These samples are not valid audio.

352	   For example, if the first Opus frame uses the CELT mode, it will
353	   always produce 120 samples of windowed overlap-add data.  However,
354	   the overlap data is initially all zeros (since there is no prior
355	   frame), meaning this cannot, in general, accurately represent the
356	   original audio.  The SILK mode requires additional delay to account
357	   for its analysis and resampling latency.  The encoder delays the
358	   original audio to avoid this problem.

360	   The pre-skip field MAY also be used to perform sample-accurate
361	   cropping of already encoded streams.  In this case, a value of at
362	   least 3840 samples (80 ms) provides sufficient history to the decoder
363	   that it will have converged before the stream's output begins.

365	4.3.  PCM Sample Position

367	   The PCM sample position is determined from the granule position using
368	   the formula

370	         'PCM sample position' = 'granule position' - 'pre-skip' .

372	   For example, if the granule position of the first audio data page is
373	   59,971, and the pre-skip is 11,971, then the PCM sample position of
374	   the last decoded sample from that page is 48,000.

376	   This can be converted into a playback time using the formula

378	                                   'PCM sample position'
379	                 'playback time' = --------------------- .
380	                                          48000.0

382	   The initial PCM sample position before any samples are played is
383	   normally '0'.  In this case, the PCM sample position of the first
384	   audio sample to be played starts at '1', because it marks the time on
385	   the clock _after_ that sample has been played, and a stream that is
386	   exactly one second long has a final PCM sample position of '48000',
387	   as in the example here.
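
   The two formulas above translate directly into code.  The function
   names below are illustrative and not defined by this specification.

```python
def pcm_sample_position(granule_position, pre_skip):
    """PCM sample position = granule position - pre-skip."""
    return granule_position - pre_skip


def playback_time(granule_position, pre_skip):
    """Playback time in seconds, at the fixed 48 kHz granule rate."""
    return pcm_sample_position(granule_position, pre_skip) / 48000.0


# The example from the text: granule position 59,971 and pre-skip 11,971.
position = pcm_sample_position(59971, 11971)
seconds = playback_time(59971, 11971)
```

   With the values from the example, 'position' is 48000 and 'seconds'
   is 1.0.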

389	   Vorbis streams use a granule position smaller than the number of
390	   audio samples contained in the first audio data page to indicate that
391	   some of those samples are trimmed from the output (see
392	   [vorbis-trim]).  However, to do so, Vorbis requires that the first
393	   audio data page contains exactly two packets, in order to allow the
394	   decoder to perform PCM position adjustments before needing to return
395	   any PCM data.  Opus uses the pre-skip mechanism for this purpose
396	   instead, since the encoder might introduce more than a single
397	   packet's worth of latency, and since very large packets in streams
398	   with a very large number of channels might not fit on a single page.

400	4.4.  End Trimming

402	   The page with the 'end of stream' flag set MAY have a granule
403	   position that indicates the page contains less audio data than would
404	   normally be returned by decoding up through the final packet.  This
405	   is used to end the stream somewhere other than an even frame
406	   boundary.  The granule position of the most recent audio data page
407	   with completed packets is used to make this determination, or '0' is
408	   used if there were no previous audio data pages with a completed
409	   packet.  The difference between these granule positions indicates how
410	   many samples to keep after decoding the packets that completed on the
411	   final page.  The remaining samples are discarded.  The number of
412	   discarded samples SHOULD be no larger than the number decoded from
413	   the last packet.
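
   A minimal sketch of the end-trimming computation follows.  The
   function name and the choice to clamp the result to the number of
   decoded samples are assumptions made for illustration.

```python
def end_trim(final_granule, previous_granule, decoded_samples):
    """Return (keep, discard) for the packets completed on the final page.

    `previous_granule` is the granule position of the most recent audio
    data page with completed packets, or 0 if there was none.
    `decoded_samples` is the number of samples decoded from the packets
    that complete on the final page.
    """
    keep = final_granule - previous_granule
    if keep < 0:
        raise ValueError("granule position went backwards")
    # Never keep more samples than were actually decoded.
    keep = min(keep, decoded_samples)
    return keep, decoded_samples - keep
```

   For example, if the previous page ended at 48,000, the final page
   holds one 960-sample packet, and the final granule position is
   48,500, then 500 samples are kept and 460 are discarded.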

415	4.5.  Restrictions on the Initial Granule Position

417	   The granule position of the first audio data page with a completed
418	   packet MAY be larger than the number of samples contained in packets
419	   that complete on that page; however, it MUST NOT be smaller, unless
420	   that page has the 'end of stream' flag set.  Allowing a granule
421	   position larger than the number of samples allows the beginning of a
422	   stream to be cropped or a live stream to be joined without rewriting
423	   the granule position of all the remaining pages.  This means that the
424	   PCM sample position just before the first sample to be played MAY be
425	   larger than '0'.  Synchronization when multiplexing with other
426	   logical streams still uses the PCM sample position relative to '0' to
427	   compute sample times.  This does not affect the behavior of pre-skip:
428	   exactly 'pre-skip' samples SHOULD be skipped from the beginning of
429	   the decoded output, even if the initial PCM sample position is
430	   greater than zero.

432	   On the other hand, a granule position that is smaller than the number
433	   of decoded samples prevents a demuxer from working backwards to
434	   assign each packet or each individual sample a valid granule
435	   position, since granule positions are non-negative.  An
436	   implementation MUST treat any stream as invalid if the granule
437	   position is smaller than the number of samples contained in packets
438	   that complete on the first audio data page with a completed packet,
439	   unless that page has the 'end of stream' flag set.  It MAY defer this
440	   action until it decodes the last packet completed on that page.

442	   If that page has the 'end of stream' flag set, a demuxer MUST treat
443	   any stream as invalid if its granule position is smaller than the
444	   'pre-skip' amount.  This would indicate that there are more samples
445	   to be skipped from the initial decoded output than exist in the
446	   stream.  If the granule position is smaller than the number of
447	   decoded samples produced by the packets that complete on that page,
448	   then a demuxer MUST use an initial granule position of '0', and can
449	   work forwards from '0' to timestamp individual packets.  If the
450	   granule position is larger than the number of decoded samples
451	   available, then the demuxer MUST still work backwards as described
452	   above, even if the 'end of stream' flag is set, to determine the
453	   initial granule position, and thus the initial PCM sample position.
454	   Both of these will be greater than '0' in this case.
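
   The rules in this section can be collected into one function, as
   sketched below.  The function and parameter names are hypothetical.
   'page_samples' stands for the number of decoded samples in the
   packets that complete on the first audio data page with a completed
   packet.

```python
def initial_granule_position(page_granule, page_samples, pre_skip,
                             end_of_stream):
    """Return the initial granule position, or raise for invalid streams."""
    if not end_of_stream:
        if page_granule < page_samples:
            # Working backwards would give a negative granule position.
            raise ValueError("granule position smaller than sample count")
        return page_granule - page_samples

    # The first audio data page is also the last page of the stream.
    if page_granule < pre_skip:
        raise ValueError("fewer samples in the stream than the pre-skip")
    if page_granule < page_samples:
        # End trimming applies: start at 0 and work forwards.
        return 0
    # Otherwise work backwards, exactly as for a page without the flag.
    return page_granule - page_samples
```

   Subtracting the pre-skip from the returned value gives the initial
   PCM sample position.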

456	4.6.  Seeking and Pre-roll

458	   Seeking in Ogg files is best performed using a bisection search for a
459	   page whose granule position corresponds to a PCM position at or
460	   before the seek target.  With appropriately weighted bisection,
461	   accurate seeking can be performed in just one or two bisections on
462	   average, even in multi-gigabyte files.  See [seeking] for an example
463	   of general implementation guidance.

465	   When seeking within an Ogg Opus stream, an implementation SHOULD
466	   start decoding (and discarding the output) at least 3840 samples
467	   (80 ms) prior to the seek target in order to ensure that the output
468	   audio is correct by the time it reaches the seek target.  This 'pre-
469	   roll' is separate from, and unrelated to, the 'pre-skip' used at the
470	   beginning of the stream.  If the point 80 ms prior to the seek target
471	   comes before the initial PCM sample position, an implementation
472	   SHOULD start decoding from the beginning of the stream, applying pre-
473	   skip as normal, regardless of whether the pre-skip is larger or
474	   smaller than 80 ms, and then continue to discard samples to reach the
475	   seek target (if any).
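
   The pre-roll decision can be sketched as below.  The function name
   and return convention are illustrative assumptions.  Both arguments
   are PCM sample positions at 48 kHz.

```python
PRE_ROLL_SAMPLES = 3840  # 80 ms at 48 kHz


def decode_start(seek_target, initial_position):
    """Decide where decoding should begin for a seek.

    Returns (from_beginning, position). When from_beginning is True, the
    decoder starts at the beginning of the stream and applies pre-skip as
    normal; otherwise it starts at or before `position` and discards
    output until it reaches `seek_target`.
    """
    start = seek_target - PRE_ROLL_SAMPLES
    if start < initial_position:
        return True, initial_position
    return False, start
```

   In practice a player would locate, by bisection, the latest page
   whose position is at or before the returned 'position' and begin
   decoding there.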

477	5.  Header Packets

479	   An Ogg Opus logical stream contains exactly two mandatory header
480	   packets: an identification header and a comment header.

482	5.1.  Identification Header

484	      0                   1                   2                   3
485	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
486	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
487	     |      'O'      |      'p'      |      'u'      |      's'      |
488	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
489	     |      'H'      |      'e'      |      'a'      |      'd'      |
490	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
491	     |  Version = 1  | Channel Count |           Pre-skip            |
492	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
493	     |                     Input Sample Rate (Hz)                    |
494	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
495	     |   Output Gain (Q7.8 in dB)    | Mapping Family|               |
496	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
497	     |                                                               |
498	     :               Optional Channel Mapping Table...               :
499	     |                                                               |
500	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

502	                        Figure 1: ID Header Packet

504	   The fields in the identification (ID) header have the following
505	   meaning:

507	   1.  Magic Signature:

509	       This is an 8-octet (64-bit) field that allows codec
510	       identification and is human-readable.  It contains, in order, the
511	       magic numbers:

513	          0x4F 'O'

515	          0x70 'p'

517	          0x75 'u'

519	          0x73 's'

521	          0x48 'H'

523	          0x65 'e'

524	          0x61 'a'

526	          0x64 'd'

528	       Starting with "Op" helps distinguish it from audio data packets,
529	       as this is an invalid TOC sequence.

531	   2.  Version (8 bits, unsigned):

533	       The version number MUST always be '1' for this version of the
534	       encapsulation specification.  Implementations SHOULD treat
535	       streams where the upper four bits of the version number match
536	       that of a recognized specification as backwards-compatible with
537	       that specification.  That is, the version number can be split
538	       into "major" and "minor" version sub-fields, with changes to the
539	       "minor" sub-field (in the lower four bits) signaling compatible
540	       changes.  For example, an implementation of this specification
541	       SHOULD accept any stream with a version number of '15' or less,
542	       and SHOULD assume any stream with a version number '16' or
543	       greater is incompatible.  The initial version '1' was chosen to
544	       keep implementations from relying on this octet as a null
545	       terminator for the "OpusHead" string.

547	   3.  Output Channel Count 'C' (8 bits, unsigned):

549	       This is the number of output channels.  This might be different
550	       from the number of encoded channels, which can change on a
551	       packet-by-packet basis.  This value MUST NOT be zero.  The
552	       maximum allowable value depends on the channel mapping family,
553	       and might be as large as 255.  See Section 5.1.1 for details.

555	   4.  Pre-skip (16 bits, unsigned, little endian):

557	       This is the number of samples (at 48 kHz) to discard from the
558	       decoder output when starting playback, and also the number to
559	       subtract from a page's granule position to calculate its PCM
560	       sample position.  When cropping the beginning of existing Ogg
561	       Opus streams, a pre-skip of at least 3,840 samples (80 ms) is
562	       RECOMMENDED to ensure complete convergence in the decoder.

564	   5.  Input Sample Rate (32 bits, unsigned, little endian):

566	       This is the sample rate of the original input (before encoding),
567	       in Hz.  This field is _not_ the sample rate to use for playback
568	       of the encoded data.

570	       Opus can switch between internal audio bandwidths of 4, 6, 8, 12,
571	       and 20 kHz.  Each packet in the stream can have a different audio
572	       bandwidth.  Regardless of the audio bandwidth, the reference
573	       decoder supports decoding any stream at a sample rate of 8, 12,
574	       16, 24, or 48 kHz.  The original sample rate of the audio passed
575	       to the encoder is not preserved by the lossy compression.

577	       An Ogg Opus player SHOULD select the playback sample rate
578	       according to the following procedure:

580	       1.  If the hardware supports 48 kHz playback, decode at 48 kHz.

582	       2.  Otherwise, if the hardware's highest available sample rate is
583	           a supported rate, decode at this sample rate.

585	       3.  Otherwise, if the hardware's highest available sample rate is
586	           less than 48 kHz, decode at the next higher Opus supported
587	           rate above the highest available hardware rate and resample.

589	       4.  Otherwise, decode at 48 kHz and resample.

591	       However, the 'Input Sample Rate' field allows the muxer to pass
592	       the sample rate of the original input stream as metadata.  This
593	       is useful when the user requires the output sample rate to match
594	       the input sample rate.  For example, when not playing the output,
595	       an implementation writing PCM format samples to disk might choose
596	       to resample the audio back to the original input sample rate to
597	       reduce surprise to the user, who might reasonably expect to get
598	       back a file with the same sample rate.

600	       A value of zero indicates 'unspecified'.  Muxers SHOULD write the
601	       actual input sample rate or zero, but implementations that use
602	       this field SHOULD take care to behave sensibly if given
603	       implausible values (e.g., do not actually upsample the output to
604	       10 MHz if requested).  Implementations SHOULD support input
605	       sample rates between 8 kHz and 192 kHz (inclusive).  Rates
606	       outside this range MAY be ignored by falling back to the default
607	       rate of 48 kHz instead.

609	   6.  Output Gain (16 bits, signed, little endian):

611	       This is a gain to be applied when decoding.  It is 20*log10 of
612	       the factor by which to scale the decoder output to achieve the
613	       desired playback volume, stored in a 16-bit, signed, two's
614	       complement fixed-point value with 8 fractional bits (i.e., Q7.8).

616	       To apply the gain, an implementation could use

618	                sample *= pow(10, output_gain/(20.0*256)) ,

620	       where output_gain is the raw 16-bit value from the header.

622	       Players and media frameworks SHOULD apply it by default.  If a
623	       player chooses to apply any volume adjustment or gain
624	       modification, such as the R128_TRACK_GAIN (see Section 5.2), the
625	       adjustment MUST be applied in addition to this output gain in
626	       order to achieve playback at the normalized volume.

628	       A muxer SHOULD set this field to zero, and instead apply any gain
629	       prior to encoding, when this is possible and does not conflict
630	       with the user's wishes.  A nonzero output gain indicates the gain
631	       was adjusted after encoding, or that a user wished to adjust the
632	       gain for playback while preserving the ability to recover the
633	       original signal amplitude.

635	       Although the output gain has enormous range (+/- 128 dB, enough
636	       to amplify inaudible sounds to the threshold of physical pain),
637	       most applications can only reasonably use a small portion of this
638	       range around zero.  The large range serves in part to ensure that
639	       gain can always be losslessly transferred between OpusHead and
640	       R128 gain tags (see Section 5.2.1) without saturating.

642	   7.  Channel Mapping Family (8 bits, unsigned):

644	       This octet indicates the order and semantic meaning of the output
645	       channels.

647	       Each currently specified value of this octet indicates a mapping
648	       family, which defines a set of allowed channel counts, and the
649	       ordered set of channel names for each allowed channel count.  The
650	       details are described in Section 5.1.1.

652	   8.  Channel Mapping Table: This table defines the mapping from
653	       encoded streams to output channels.  Its contents are specified
654	       in Section 5.1.1.
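
   The four-step playback rate procedure given under 'Input Sample
   Rate' can be sketched as follows.  This is only an illustration:
   the function name and the way hardware capabilities are passed in
   (a highest available rate plus a flag for 48 kHz support) are
   assumptions, not part of this specification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper.  'hw_max_hz' is the highest sample rate the
 * playback hardware offers and 'hw_has_48k' says whether 48 kHz is
 * available.  Returns the rate to decode at; '*resample' is set to 1
 * when the decoded audio must then be resampled for the hardware. */
int32_t choose_decode_rate(int32_t hw_max_hz, int hw_has_48k, int *resample)
{
    /* Sample rates supported by the reference decoder. */
    static const int32_t rates[] = { 8000, 12000, 16000, 24000, 48000 };
    const size_t n = sizeof rates / sizeof rates[0];
    size_t i;

    *resample = 0;
    if (hw_has_48k)
        return 48000;                       /* step 1 */
    for (i = 0; i < n; i++)
        if (rates[i] == hw_max_hz)
            return hw_max_hz;               /* step 2 */
    *resample = 1;
    if (hw_max_hz < 48000)
        for (i = 0; i < n; i++)
            if (rates[i] > hw_max_hz)
                return rates[i];            /* step 3 */
    return 48000;                           /* step 4 */
}
```

   For example, hardware limited to 44.1 kHz would decode at 48 kHz and
   resample, while hardware limited to 16 kHz would decode at 16 kHz
   directly.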

656	   All fields in the ID header are REQUIRED, except for the channel
657	   mapping table, which MUST be omitted when the channel mapping family
658	   is 0, but is REQUIRED otherwise.  Implementations SHOULD treat a
659	   stream as invalid if it contains an ID header that does not have
660	   enough data for these fields, even if it contains a valid Magic
661	   Signature.  Future versions of this specification, even backwards-
662	   compatible versions, might include additional fields in the ID
663	   header.  If an ID header has a compatible major version, but a larger
664	   minor version, an implementation MUST NOT treat it as invalid for
665	   containing additional data not specified here, provided it still
666	   completes on the first page.
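
   A minimal sketch of reading the fixed part of the ID header is shown
   below, assuming the whole packet is already in memory.  The struct
   and function names are hypothetical.  It checks the magic signature,
   the major version, the nonzero channel count, and that enough data
   is present; it deliberately accepts trailing data so that compatible
   minor versions are not rejected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size part of the ID header; field names are illustrative. */
typedef struct {
    uint8_t  version;
    uint8_t  channel_count;
    uint16_t pre_skip;           /* samples at 48 kHz */
    uint32_t input_sample_rate;  /* Hz, 0 = unspecified */
    int16_t  output_gain;        /* Q7.8, in dB */
    uint8_t  mapping_family;
} OpusHeadSketch;

/* Returns 0 on success, -1 if the packet is not an acceptable ID header. */
int parse_opus_head(const uint8_t *p, size_t len, OpusHeadSketch *h)
{
    int gain;

    /* 8 (magic) + 1 + 1 + 2 + 4 + 2 + 1 = 19 octets are always needed. */
    if (len < 19 || memcmp(p, "OpusHead", 8) != 0)
        return -1;
    h->version = p[8];
    if ((h->version & 0xF0) != 0)      /* upper four bits: major version */
        return -1;
    h->channel_count = p[9];
    if (h->channel_count == 0)
        return -1;
    h->pre_skip = (uint16_t)(p[10] | (p[11] << 8));
    h->input_sample_rate = (uint32_t)p[12] | ((uint32_t)p[13] << 8) |
                           ((uint32_t)p[14] << 16) | ((uint32_t)p[15] << 24);
    gain = p[16] | (p[17] << 8);       /* two's complement, little endian */
    if (gain >= 0x8000)
        gain -= 0x10000;
    h->output_gain = (int16_t)gain;
    h->mapping_family = p[18];
    /* Families other than 0 carry a mapping table: stream count,
     * coupled count, and one octet per output channel. */
    if (h->mapping_family != 0 && len < 19 + 2 + (size_t)h->channel_count)
        return -1;
    return 0;
}
```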

668	5.1.1.  Channel Mapping

670	   An Ogg Opus stream maps some number of Opus streams (N) to a
671	   possibly larger number of decoded channels (M + N), which are in
672	   turn mapped to some number of output channels (C) that might be
673	   larger or smaller than the number of decoded channels.  The order
674	   and meaning of these channels are defined by a channel mapping,
675	   which consists of the 'channel mapping family' octet and, for
676	   channel mapping families other than family 0, a channel mapping
677	   table, as illustrated in Figure 2.

679	      0                   1                   2                   3
680	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
681	                                                     +-+-+-+-+-+-+-+-+
682	                                                     | Stream Count  |
683	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
684	     | Coupled Count |              Channel Mapping...               :
685	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

687	                      Figure 2: Channel Mapping Table

689	   The fields in the channel mapping table have the following meaning:

691	   1.  Stream Count 'N' (8 bits, unsigned):

693	       This is the total number of streams encoded in each Ogg packet.
694	       This value is necessary to correctly parse the packed Opus
695	       packets inside an Ogg packet, as described in Section 3.  This
696	       value MUST NOT be zero, as without at least one Opus packet with
697	       a valid TOC sequence, a demuxer cannot recover the duration of an
698	       Ogg packet.

700	       For channel mapping family 0, this value defaults to 1, and is
701	       not coded.

703	   2.  Coupled Stream Count 'M' (8 bits, unsigned): This is the number
704	       of streams whose decoders are to be configured to produce two
705	       channels (stereo).  This MUST be no larger than the total number
706	       of streams, N.

708	       Each packet in an Opus stream has an internal channel count of 1
709	       or 2, which can change from packet to packet.  This is selected
710	       by the encoder depending on the bitrate and the audio being
711	       encoded.  The original channel count of the audio passed to the
712	       encoder is not necessarily preserved by the lossy compression.

714	       Regardless of the internal channel count, any Opus stream can be
715	       decoded as mono (a single channel) or stereo (two channels) by
716	       appropriate initialization of the decoder.  The 'coupled stream
717	       count' field indicates that the decoders for the first M Opus
718	       streams are to be initialized for stereo (two-channel) output,
719	       and the remaining (N - M) decoders are to be initialized for mono
720	       (a single channel) only.  The total number of decoded channels,
721	       (M + N), MUST be no larger than 255, as there is no way to index
722	       more channels than that in the channel mapping.

724	       For channel mapping family 0, this value defaults to (C - 1)
725	       (i.e., 0 for mono and 1 for stereo), and is not coded.

727	   3.  Channel Mapping (8*C bits): This contains one octet per output
728	       channel, indicating which decoded channel is to be used for each
729	       one.  Let 'index' be the value of this octet for a particular
730	       output channel.  This value MUST either be smaller than (M + N),
731	       or be the special value 255.  If 'index' is less than 2*M, the
732	       output MUST be taken from decoding stream ('index'/2) as stereo
733	       and selecting the left channel if 'index' is even, and the right
734	       channel if 'index' is odd.  If 'index' is 2*M or larger, but less
735	       than 255, the output MUST be taken from decoding stream
736	       ('index' - M) as mono.  If 'index' is 255, the corresponding
737	       output channel MUST contain pure silence.

739	       The number of output channels, C, is not constrained to match the
740	       number of decoded channels (M + N).  A single index value MAY
741	       appear multiple times, i.e., the same decoded channel might be
742	       mapped to multiple output channels.  Some decoded channels might
743	       not be assigned to any output channel, as well.

745	       For channel mapping family 0, the first index defaults to 0, and
746	       if C == 2, the second index defaults to 1.  Neither index is
747	       coded.
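
   The index rules for the channel mapping can be sketched as a small
   lookup.  The struct and function names here are hypothetical; the
   logic follows the three cases described above (coupled stream, mono
   stream, silence).

```c
#include <stdint.h>

/* Where one output channel takes its samples from (illustrative). */
typedef struct {
    int silent;   /* nonzero: output channel is pure silence */
    int stream;   /* index of the Opus stream to decode, -1 if silent */
    int stereo;   /* nonzero: that stream is decoded as stereo */
    int channel;  /* 0 = left (or mono), 1 = right */
} ChannelSource;

/* 'index' is one octet of the channel mapping, 'n' the stream count N,
 * and 'm' the coupled stream count M.
 * Returns 0 on success, -1 if 'index' is not valid for this N and M. */
int resolve_channel(uint8_t index, int n, int m, ChannelSource *src)
{
    src->silent = 0;
    src->stream = -1;
    src->stereo = 0;
    src->channel = 0;
    if (index == 255) {              /* special value: silence */
        src->silent = 1;
        return 0;
    }
    if (index >= m + n)              /* must be smaller than (M + N) */
        return -1;
    if (index < 2 * m) {             /* one half of a coupled stream */
        src->stream = index / 2;
        src->stereo = 1;
        src->channel = index & 1;    /* even = left, odd = right */
    } else {                         /* a mono stream */
        src->stream = index - m;
    }
    return 0;
}
```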

749	   After producing the output channels, the channel mapping family
750	   determines the semantic meaning of each one.  There are three defined
751	   mapping families in this specification.

753	5.1.1.1.  Channel Mapping Family 0

755	   Allowed numbers of channels: 1 or 2.  RTP mapping.  This is the same
756	   channel interpretation as [RFC7587].

758	   o  1 channel: monophonic (mono).

760	   o  2 channels: stereo (left, right).

762	   Special mapping: This channel mapping value also indicates that the
763	   content consists of a single Opus stream that is stereo if and only
764	   if C == 2, with stream index 0 mapped to output channel 0 (mono, or
765	   left channel) and stream index 1 mapped to output channel 1 (right
766	   channel) if stereo.  When the 'channel mapping family' octet has this
767	   value, the channel mapping table MUST be omitted from the ID header
768	   packet.

770	5.1.1.2.  Channel Mapping Family 1

772	   Allowed numbers of channels: 1...8.  Vorbis channel order (see
773	   below).

775	   Each channel is assigned to a speaker location in a conventional
776	   surround arrangement.  Specific locations depend on the number of
777	   channels, and are given below in order of the corresponding channel
778	   indices.

780	   o  1 channel: monophonic (mono).

782	   o  2 channels: stereo (left, right).

784	   o  3 channels: linear surround (left, center, right).

786	   o  4 channels: quadraphonic (front left, front right, rear left,
787	      rear right).

789	   o  5 channels: 5.0 surround (front left, front center, front right,
790	      rear left, rear right).

792	   o  6 channels: 5.1 surround (front left, front center, front right,
793	      rear left, rear right, LFE).

795	   o  7 channels: 6.1 surround (front left, front center, front right,
796	      side left, side right, rear center, LFE).

798	   o  8 channels: 7.1 surround (front left, front center, front right,
799	      side left, side right, rear left, rear right, LFE).

801	   This set of surround options and speaker location orderings is the
802	   same as those used by the Vorbis codec [vorbis-mapping].  The
803	   ordering is different from the one used by the WAVE
804	   [wave-multichannel] and Free Lossless Audio Codec (FLAC) [flac]
805	   formats, so correct ordering requires permutation of the output
806	   channels when decoding to or encoding from those formats.  'LFE' here
807	   refers to a Low Frequency Effects channel, often mapped to a
808	   subwoofer with no particular spatial position.  Implementations
809	   SHOULD identify 'side' or 'rear' speaker locations with 'surround'
810	   and 'back' as appropriate when interfacing with audio formats or
811	   systems which prefer that terminology.
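
   As an example of the permutation mentioned above, the sketch below
   reorders one 5.1 frame from the order used here to WAVE order.  It
   assumes the usual WAVE 5.1 layout of front left, front right, front
   center, LFE, back left, back right; the table and function names
   are hypothetical.

```c
/* wave_from_vorbis_51[i] is the index, in the channel order of this
 * section (FL, FC, FR, RL, RR, LFE), of the channel that belongs at
 * position i in the assumed WAVE order (FL, FR, FC, LFE, BL, BR). */
static const int wave_from_vorbis_51[6] = { 0, 2, 1, 5, 3, 4 };

/* Copies one frame of six samples into WAVE channel order. */
void reorder_51_frame_to_wave(const float in[6], float out[6])
{
    int i;
    for (i = 0; i < 6; i++)
        out[i] = in[wave_from_vorbis_51[i]];
}
```

   Other channel counts need their own tables, and the inverse
   permutation applies when encoding from such a format.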

813	5.1.1.3.  Channel Mapping Family 255

815	   Allowed numbers of channels: 1...255.  No defined channel meaning.

817	   Channels are unidentified.  General-purpose players SHOULD NOT
818	   attempt to play these streams.  Offline implementations MAY
819	   deinterleave the output into separate PCM files, one per channel.
820	   Implementations SHOULD NOT produce output for channels mapped to
821	   stream index 255 (pure silence) unless they have no other way to
822	   indicate the index of non-silent channels.

824	5.1.1.4.  Undefined Channel Mappings

826	   The remaining channel mapping families (2...254) are reserved.  A
827	   demuxer implementation encountering a reserved channel mapping family
828	   value SHOULD act as though the value is 255.
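
   The allowed channel counts per mapping family, including the
   fallback for reserved families, could be checked along these lines.
   The function names are hypothetical.

```c
/* Largest allowed output channel count for a channel mapping family.
 * Reserved families (2...254) are treated like family 255. */
int max_channels_for_family(int family)
{
    if (family == 0)
        return 2;
    if (family == 1)
        return 8;
    return 255;
}

/* Nonzero if 'channels' is an allowed output channel count C. */
int channel_count_valid(int family, int channels)
{
    return channels >= 1 && channels <= max_channels_for_family(family);
}
```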

830	5.1.1.5.  Downmixing

832	   An Ogg Opus player MUST support any valid channel mapping with a
833	   channel mapping family of 0 or 1, even if the number of channels does
834	   not match the physically connected audio hardware.  Players SHOULD
835	   perform channel mixing to increase or reduce the number of channels
836	   as needed.

838	   Implementations MAY use the following matrices to implement
839	   downmixing from multichannel files using Channel Mapping Family 1
840	   (Section 5.1.1.2), which are known to give acceptable results for
841	   stereo.  Matrices for 3 and 4 channels are normalized so each
842	   coefficient row sums to 1 to avoid clipping.  For 5 or more channels
843	   they are normalized to 2 as a compromise between clipping and dynamic
844	   range reduction.

846	   In these matrices the front left and front right channels are
847	   generally passed through directly.  When a surround channel is split
848	   between both the left and right stereo channels, coefficients are
849	   chosen so their squares sum to 1, which helps preserve the perceived
850	   intensity.  Rear channels are mixed more diffusely or attenuated to
851	   maintain focus on the front channels.

853	   L output = ( 0.585786 * left + 0.414214 * center                    )
854	   R output = (                   0.414214 * center + 0.585786 * right )

856	      Exact coefficient values are 1 and 1/sqrt(2), multiplied by
857	               1/(1 + 1/sqrt(2)) for normalization.

859	      Figure 3: Stereo downmix matrix for the linear surround channel
860	                                  mapping

862	       /          \   /                                     \ / FL \
863	       | L output |   | 0.422650 0.000000 0.366025 0.211325 | | FR |
864	       | R output | = | 0.000000 0.422650 0.211325 0.366025 | | RL |
865	       \          /   \                                     / \ RR /

867	     Exact coefficient values are 1, sqrt(3)/2 and 1/2, multiplied by
868	                1/(1 + sqrt(3)/2 + 1/2) for normalization.

870	   Figure 4: Stereo downmix matrix for the quadraphonic channel mapping

872	                                                               / FL \
873	      /   \   /                                              \ | FC |
874	      | L |   | 0.650802 0.460186 0.000000 0.563611 0.325401 | | FR |
875	      | R | = | 0.000000 0.460186 0.650802 0.325401 0.563611 | | RL |
876	      \   /   \                                              / | RR |
877	                                                               \    /

879	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
880	   multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2) for normalization.

882	       Figure 5: Stereo downmix matrix for the 5.0 surround mapping

883	                                                                   /FL \
884	   / \   /                                                       \ |FC |
885	   |L|   | 0.529067 0.374107 0.000000 0.458186 0.264534 0.374107 | |FR |
886	   |R| = | 0.000000 0.374107 0.529067 0.264534 0.458186 0.374107 | |RL |
887	   \ /   \                                                       / |RR |
888	                                                                   \LFE/

890	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
891	     multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + 1/sqrt(2)) for
892	                              normalization.

894	       Figure 6: Stereo downmix matrix for the 5.1 surround mapping

896	     /                                                                \
897	     | 0.455310 0.321953 0.000000 0.394310 0.227655 0.278819 0.321953 |
898	     | 0.000000 0.321953 0.455310 0.227655 0.394310 0.278819 0.321953 |
899	     \                                                                /

901	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2, 1/2 and
902	   sqrt(3)/2/sqrt(2), multiplied by 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 +
903	    sqrt(3)/2/sqrt(2) + 1/sqrt(2)) for normalization.  The coefficients
904	   are in the same order as in Section 5.1.1.2, and the matrices above.

906	       Figure 7: Stereo downmix matrix for the 6.1 surround mapping

908	    /                                                                 \
909	    | .388631 .274804 .000000 .336565 .194316 .336565 .194316 .274804 |
910	    | .000000 .274804 .388631 .194316 .336565 .194316 .336565 .274804 |
911	    \                                                                 /

913	       Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2,
914	     multiplied by 2/(2 + 2/sqrt(2) + sqrt(3)) for normalization.  The
915	     coefficients are in the same order as in Section 5.1.1.2, and the
916	                              matrices above.

918	       Figure 8: Stereo downmix matrix for the 7.1 surround mapping
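
   To show how one of these matrices might be applied, the sketch below
   downmixes a single 5.1 frame to stereo using the exact coefficient
   values of Figure 6 rather than the rounded decimals.  The function
   name is hypothetical, and the two irrational constants are written
   out numerically so that no math library is needed.

```c
/* Exact coefficient values of Figure 6, before normalization. */
#define COEF_INV_SQRT2 0.70710678118654752440   /* 1/sqrt(2) */
#define COEF_SQRT3_2   0.86602540378443864676   /* sqrt(3)/2 */

/* 'in' holds one frame in the order FL, FC, FR, RL, RR, LFE;
 * 'out' receives the left and right stereo samples. */
void downmix_51_to_stereo(const float in[6], float out[2])
{
    /* 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + 1/sqrt(2)) */
    const double norm = 2.0 / (1.0 + COEF_INV_SQRT2 + COEF_SQRT3_2
                               + 0.5 + COEF_INV_SQRT2);
    const double fl = in[0], fc = in[1], fr = in[2];
    const double rl = in[3], rr = in[4], lfe = in[5];

    out[0] = (float)(norm * (fl + COEF_INV_SQRT2 * fc + COEF_SQRT3_2 * rl
                             + 0.5 * rr + COEF_INV_SQRT2 * lfe));
    out[1] = (float)(norm * (fr + COEF_INV_SQRT2 * fc + 0.5 * rl
                             + COEF_SQRT3_2 * rr + COEF_INV_SQRT2 * lfe));
}
```

   Feeding a frame with only the front left channel at full scale
   yields a left output of about 0.529067, matching the first entry of
   Figure 6.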

920	5.2.  Comment Header

921	      0                   1                   2                   3
922	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
923	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
924	     |      'O'      |      'p'      |      'u'      |      's'      |
925	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
926	     |      'T'      |      'a'      |      'g'      |      's'      |
927	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
928	     |                     Vendor String Length                      |
929	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
930	     |                                                               |
931	     :                        Vendor String...                       :
932	     |                                                               |
933	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
934	     |                   User Comment List Length                    |
935	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
936	     |                 User Comment #0 String Length                 |
937	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
938	     |                                                               |
939	     :                   User Comment #0 String...                   :
940	     |                                                               |
941	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
942	     |                 User Comment #1 String Length                 |
943	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
944	     :                                                               :

946	                      Figure 9: Comment Header Packet

948	   The comment header consists of a 64-bit magic signature, followed by
949	   data in the same format as the [vorbis-comment] header used in Ogg
950	   Vorbis, except (like Ogg Theora and Speex) the final "framing bit"
951	   specified in the Vorbis spec is not present.

953	   1.  Magic Signature:

955	       This is an 8-octet (64-bit) field that allows codec
956	       identification and is human-readable.  It contains, in order, the
957	       magic numbers:

959	          0x4F 'O'

961	          0x70 'p'

963	          0x75 'u'

965	          0x73 's'

967	          0x54 'T'

968	          0x61 'a'

970	          0x67 'g'

972	          0x73 's'

974	       Starting with "Op" helps distinguish it from audio data packets,
975	       as this is an invalid TOC sequence.

977	   2.  Vendor String Length (32 bits, unsigned, little endian):

979	       This field gives the length of the following vendor string, in
980	       octets.  It MUST NOT indicate that the vendor string is longer
981	       than the rest of the packet.

983	   3.  Vendor String (variable length, UTF-8 vector):

985	       This is a simple human-readable tag for vendor information,
986	       encoded as a UTF-8 string [RFC3629].  No terminating null octet
987	       is necessary.

989	       This tag is intended to identify the codec encoder and
990	       encapsulation implementations, for tracing differences in
991	       technical behavior.  User-facing applications can use the
992	       'ENCODER' user comment tag to identify themselves.

994	   4.  User Comment List Length (32 bits, unsigned, little endian):

996	       This field indicates the number of user-supplied comments.  It
997	       MAY indicate there are zero user-supplied comments, in which case
998	       there are no additional fields in the packet.  It MUST NOT
999	       indicate that there are so many comments that the comment string
1000	       lengths would require more data than is available in the rest of
1001	       the packet.

1003	   5.  User Comment #i String Length (32 bits, unsigned, little endian):

1005	       This field gives the length of the following user comment string,
1006	       in octets.  There is one for each user comment indicated by the
1007	       'user comment list length' field.  It MUST NOT indicate that the
1008	       string is longer than the rest of the packet.

1010	   6.  User Comment #i String (variable length, UTF-8 vector):

1012	       This field contains a single user comment string.  There is one
1013	       for each user comment indicated by the 'user comment list length'
1014	       field.

1016	   The vendor string length and user comment list length are REQUIRED,
1017	   and implementations SHOULD treat a stream as invalid if it contains a
1018	   comment header that does not have enough data for these fields, or
1019	   that does not contain enough data for the corresponding vendor string
1020	   or user comments they describe.  Making this check before allocating
1021	   the associated memory to contain the data helps prevent a possible
1022	   Denial-of-Service (DoS) attack from small comment headers that claim
1023	   to contain strings longer than the entire packet or more user
1024	   comments than could possibly fit in the packet.
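
   One way to perform this check before allocating anything is to walk
   the length fields of the packet in place, as in the sketch below.
   The function names are hypothetical, and the packet is assumed to be
   fully assembled in memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit unsigned little-endian value. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Validates the length fields of a comment header without allocating.
 * Returns the number of user comments, or -1 if the header is invalid. */
long check_opus_tags(const uint8_t *p, size_t len)
{
    size_t pos = 8;
    uint32_t n, count, i;

    /* Magic signature plus the two REQUIRED length fields. */
    if (len < 16 || memcmp(p, "OpusTags", 8) != 0)
        return -1;
    n = read_le32(p + pos);
    pos += 4;
    if (n > len - pos)                 /* vendor string overruns packet */
        return -1;
    pos += n;
    if (len - pos < 4)
        return -1;
    count = read_le32(p + pos);
    pos += 4;
    /* Every comment needs at least its own 4-octet length field. */
    if (count > (len - pos) / 4)
        return -1;
    for (i = 0; i < count; i++) {
        if (len - pos < 4)
            return -1;
        n = read_le32(p + pos);
        pos += 4;
        if (n > len - pos)             /* comment string overruns packet */
            return -1;
        pos += n;
    }
    return (long)count;
}
```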

1026	   Immediately following the user comment list, the comment header MAY
1027	   contain zero-padding or other binary data which is not specified
1028	   here.  If the least-significant bit of the first byte of this data is
1029	   1, then editors SHOULD preserve the contents of this data when
1030	   updating the tags, but if this bit is 0, all such data MAY be treated
1031	   as padding, and truncated or discarded as desired.  This allows
1032	   informal experimentation with the format of this binary data until it
1033	   can be specified later.

1035	   The comment header can be arbitrarily large and might be spread over
1036	   a large number of Ogg pages.  Implementations MUST avoid attempting
1037	   to allocate excessive amounts of memory when presented with a very
1038	   large comment header.  To accomplish this, implementations MAY treat
1039	   a stream as invalid if it has a comment header larger than
1040	   125,829,120 octets, and MAY ignore individual comments that are not
1041	   fully contained within the first 61,440 octets of the comment header.

1043	5.2.1.  Tag Definitions

1045	   The user comment strings follow the NAME=value format described by
1046	   [vorbis-comment] with the same recommended tag names: ARTIST, TITLE,
1047	   DATE, ALBUM, and so on.

1049	   Two new comment tags are introduced here:

1051	   First, an optional gain for track normalization:

1053	   R128_TRACK_GAIN=-573

1055	   representing the volume shift needed to normalize the track's volume
1056	   during isolated playback, in random shuffle, and so on.  The gain is
1057	   a Q7.8 fixed-point number in dB, as in the ID header's 'output gain'
1058	   field; the value above is -573/256, or about -2.24 dB.  This tag is
1059	   similar to the REPLAYGAIN_TRACK_GAIN tag in Vorbis [replay-gain],
1060	   except that the normal volume reference is the [EBU-R128] standard.

1062	   Second, an optional gain for album normalization:

1064	   R128_ALBUM_GAIN=111

1066	   representing the volume shift needed to normalize the overall volume
1067	   when played as part of a particular collection of tracks.  The gain
1068	   is also a Q7.8 fixed-point number in dB, as in the ID header's
1069	   'output gain' field.

1071	   An Ogg Opus stream MUST NOT have more than one of each of these tags,
1072	   and if present their values MUST be an integer from -32768 to 32767,
1073	   inclusive, represented in ASCII as a base 10 number with no
1074	   whitespace.  A leading '+' or '-' character is valid.  Leading zeros
1075	   are also permitted, but the value MUST be represented by no more than
1076	   6 characters.  Other non-digit characters MUST NOT be present.
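
   The value syntax above is simple enough to check by hand.  The
   following sketch, with a hypothetical function name, accepts an
   optional sign followed by decimal digits, at most 6 characters in
   total, and rejects anything outside the 16-bit signed range.

```c
#include <stddef.h>
#include <string.h>

/* Parses the value part of an R128_TRACK_GAIN or R128_ALBUM_GAIN tag.
 * Returns 0 and stores the Q7.8 gain in '*gain' on success, or -1 if
 * the value does not follow the rules above. */
int parse_r128_gain(const char *value, int *gain)
{
    size_t len = strlen(value);
    size_t i = 0;
    long v = 0;
    int negative = 0;

    if (len == 0 || len > 6)
        return -1;
    if (value[0] == '+' || value[0] == '-') {
        negative = (value[0] == '-');
        i = 1;
    }
    if (i == len)                        /* a sign with no digits */
        return -1;
    for (; i < len; i++) {
        if (value[i] < '0' || value[i] > '9')
            return -1;                   /* whitespace or other character */
        v = v * 10 + (value[i] - '0');   /* at most 6 digits: no overflow */
    }
    if (negative)
        v = -v;
    if (v < -32768 || v > 32767)
        return -1;
    *gain = (int)v;
    return 0;
}
```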

1078	   If present, R128_TRACK_GAIN and R128_ALBUM_GAIN MUST correctly
1079	   represent the R128 normalization gain relative to the 'output gain'
1080	   field specified in the ID header.  If a player chooses to make use of
1081	   the R128_TRACK_GAIN tag or the R128_ALBUM_GAIN tag, it MUST apply
1082	   those gains _in addition_ to the 'output gain' value.  If a tool
1083	   modifies the ID header's 'output gain' field, it MUST also update or
1084	   remove the R128_TRACK_GAIN and R128_ALBUM_GAIN comment tags if
1085	   present.  A muxer SHOULD place the gain it wants other tools to use
1086	   by default into the 'output gain' field, and not the comment tag.

1088	   To avoid confusion with multiple normalization schemes, an Opus
1089	   comment header SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN,
1090	   REPLAYGAIN_TRACK_PEAK, REPLAYGAIN_ALBUM_GAIN, or
1091	   REPLAYGAIN_ALBUM_PEAK tags, unless they are only to be used in some
1092	   context where there is guaranteed to be no such confusion.
1093	   [EBU-R128] normalization is preferred to the earlier REPLAYGAIN
1094	   schemes because of its clear definition and adoption by industry.
1095	   Peak normalizations are difficult to calculate reliably for lossy
1096	   codecs because of variation in excursion heights due to decoder
1097	   differences.  In the authors' investigations they were not applied
1098	   consistently or broadly enough to merit inclusion here.

1100	6.  Packet Size Limits

1102	   Technically, valid Opus packets can be arbitrarily large due to the
1103	   padding format, although the amount of non-padding data they can
1104	   contain is bounded.  These packets might be spread over a similarly
1105	   enormous number of Ogg pages.  When encoding, implementations SHOULD
1106	   limit the use of padding in audio data packets to no more than is
1107	   necessary to make a variable bitrate (VBR) stream constant bitrate
1108	   (CBR), unless they have no reasonable way to determine what is
1109	   necessary.  Demuxers SHOULD treat audio data packets as invalid
1110	   (treat them as if they were malformed Opus packets with an invalid
1111	   TOC sequence) if they are larger than 61,440 octets per Opus stream,
1112	   unless they have a specific reason for allowing extra padding.  Such
1113	   packets necessarily contain more padding than needed to make a stream
1114	   CBR.  Demuxers MUST avoid attempting to allocate excessive amounts of
1115	   memory when presented with a very large packet.  Demuxers MAY treat
1116	   audio data packets as invalid or partially process them if they are
1117	   larger than 61,440 octets in an Ogg Opus stream with channel mapping
1118	   families 0 or 1.  Demuxers MAY treat audio data packets as invalid or
1119	   partially process them in any Ogg Opus stream if the packet is larger
1120	   than 61,440 octets and also larger than 7,680 octets per Opus stream.
1121	   The presence of an extremely large packet in the stream could
1122	   indicate a memory exhaustion attack or stream corruption.
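
   The two MAY rules above for demuxers can be combined into a single
   predicate, sketched here with a hypothetical function name.  Whether
   to actually reject such packets remains an implementation choice.

```c
#include <stddef.h>

/* Nonzero if a demuxer is permitted to treat an audio data packet of
 * 'packet_size' octets as invalid (or to process it only partially),
 * given the channel mapping family and the stream count N. */
int packet_may_be_rejected(size_t packet_size, int family, int n_streams)
{
    if (packet_size <= 61440)
        return 0;
    if (family == 0 || family == 1)
        return 1;
    /* Any other family: also larger than 7,680 octets per stream. */
    return packet_size > (size_t)7680 * (size_t)n_streams;
}
```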

1124	   In an Ogg Opus stream, the largest possible valid packet that does
1125	   not use padding has a size of (61,298*N - 2) octets.  With
1126	   255 streams, this is 15,630,988 octets and can span up to 61,298 Ogg
1127	   pages, all but one of which will have a granule position of -1.  This
1128	   is of course a very extreme packet, consisting of 255 streams, each
1129	   containing 120 ms of audio encoded as 2.5 ms frames, each frame using
1130	   the maximum possible number of octets (1275) and stored in the least
1131	   efficient manner allowed (a VBR code 3 Opus packet).  Even in such a
1132	   packet, most of the data will be zeros as 2.5 ms frames cannot
1133	   actually use all 1275 octets.

1135	   The largest packet consisting of entirely useful data is
1136	   (15,326*N - 2) octets.  This corresponds to 120 ms of audio encoded
1137	   as 10 ms frames in either SILK or Hybrid mode, but at a data rate of
1138	   over 1 Mbps, which makes little sense for the quality achieved.

1140	   A more reasonable limit is (7,664*N - 2) octets.  This corresponds to
1141	   120 ms of audio encoded as 20 ms stereo CELT mode frames, with a
1142	   total bitrate just under 511 kbps (not counting the Ogg encapsulation
1143	   overhead).  For channel mapping family 1, N=8 provides a reasonable
1144	   upper bound, as it allows for each of the 8 possible output channels
1145	   to be decoded from a separate stereo Opus stream.  This gives a size
1146	   of 61,310 octets, which is rounded up to a multiple of 1,024 octets
1147	   to yield the audio data packet size of 61,440 octets that any
1148	   implementation is expected to be able to process successfully.
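
   The size formulas in this section can be checked with a few lines of
   arithmetic.  The sketch below only restates the formulas given in the
   text; the function names are hypothetical, and the bitrate helper
   assumes each packet carries 120 ms of audio, as in the examples
   above.

```c
#include <stdint.h>

/* Largest valid packet that uses no padding: (61,298*N - 2) octets. */
uint32_t max_unpadded_packet(uint32_t n)
{
    return 61298u * n - 2u;
}

/* Largest packet consisting entirely of useful data: (15,326*N - 2). */
uint32_t max_useful_packet(uint32_t n)
{
    return 15326u * n - 2u;
}

/* The more reasonable limit: (7,664*N - 2) octets. */
uint32_t reasonable_packet(uint32_t n)
{
    return 7664u * n - 2u;
}

/* Round a size up to the next multiple of 1,024 octets. */
uint32_t round_up_1024(uint32_t octets)
{
    return (octets + 1023u) / 1024u * 1024u;
}

/* Bitrate in bits per second of `octets` sent every `duration_ms`. */
uint64_t packet_bitrate_bps(uint32_t octets, uint32_t duration_ms)
{
    return (uint64_t)octets * 8u * 1000u / duration_ms;
}
```

   With N=8, reasonable_packet() gives 61,310 octets, which
   round_up_1024() turns into 61,440.  A single stream at 7,664 octets
   per 120 ms works out to roughly 510.9 kbps.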

1150	7.  Encoder Guidelines

1152	   When encoding Opus streams, Ogg muxers SHOULD take into account the
1153	   algorithmic delay of the Opus encoder.

1155	   In encoders derived from the reference implementation [RFC6716], the
1156	   algorithmic delay, in samples, can be queried with:

1158	    opus_encoder_ctl(encoder_state, OPUS_GET_LOOKAHEAD(&delay_samples));

1160	   To achieve good quality in the very first samples of a stream,
1161	   implementations MAY use linear predictive coding (LPC) extrapolation
1162	   to generate at least 120 extra samples at the beginning to avoid the
1163	   Opus encoder having to encode a discontinuous signal.  For more
1164	   information on linear prediction, see [linear-prediction].  For an
1165	   input file containing 'length' samples, the implementation SHOULD set
1166	   the pre-skip header value to (delay_samples + extra_samples), encode
1167	   at least (length + delay_samples + extra_samples) samples, and set
1168	   the granule position of the last page to
1169	   (length + delay_samples + extra_samples).  This ensures that the
1170	   encoded file has the same duration as the original, with no time
1171	   offset.  LPC extrapolation is also the best way to pad the end of
1172	   the stream, but zero-padding is acceptable.
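
   The header and granule position values described above can be
   computed as follows.  This is a sketch under stated assumptions: the
   struct and function names are hypothetical, all counts are taken to
   be samples at 48 kHz, and the range check reflects the 16-bit pre-
   skip field of the ID header.

```c
#include <stdint.h>

/* Hypothetical container for the values a muxer needs to write. */
struct ogg_opus_timing {
    uint16_t pre_skip;               /* pre-skip field of the ID header */
    uint64_t min_samples_to_encode;  /* lower bound on encoded samples */
    uint64_t last_granule_position;  /* granule position of last page */
};

/*
 * length:        number of samples in the input file
 * delay_samples: encoder delay, e.g. from OPUS_GET_LOOKAHEAD
 * extra_samples: samples generated by extrapolation at the start
 * Returns 0 on success, -1 if the pre-skip does not fit in 16 bits.
 */
int compute_ogg_opus_timing(uint64_t length,
                            uint32_t delay_samples,
                            uint32_t extra_samples,
                            struct ogg_opus_timing *out)
{
    const uint64_t skip = (uint64_t)delay_samples + extra_samples;

    if (skip > UINT16_MAX)
        return -1;

    out->pre_skip = (uint16_t)skip;
    out->min_samples_to_encode = length + skip;
    out->last_granule_position = length + skip;
    return 0;
}
```

   The last granule position minus the pre-skip equals 'length', which
   is what keeps the decoded duration equal to that of the original.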

1174	7.1.  LPC Extrapolation

1176	   The first step in LPC extrapolation is to compute linear prediction
1177	   coefficients [lpc-sample].  When extending the end of the signal,
1178	   order-N (typically with N ranging from 8 to 40) LPC analysis is
1179	   performed on a window near the end of the signal.  The last N samples
1180	   are used as memory to an infinite impulse response (IIR) filter.

1182	   The filter is then applied on a zero input to extrapolate the end of
1183	   the signal.  Let a(k) be the kth LPC coefficient and x(n) be the nth
1184	   sample of the signal.  Each new sample past the end of the signal is
1185	   computed as:

1187	                                  N
1188	                                 ---
1189	                          x(n) = \   a(k)*x(n-k)
1190	                                 /
1191	                                 ---
1192	                                 k=1

1194	   The process is repeated independently for each channel.  It is
1195	   possible to extend the beginning of the signal by applying the same
1196	   process backward in time.  When extending the beginning of the
1197	   signal, it is best to apply a "fade in" to the extrapolated signal,
1198	   e.g. by multiplying it by a half-Hanning window [hanning].
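
   The recursion above can be sketched in C for one channel.  This is an
   illustrative sketch, not the reference implementation: the function
   name and buffer layout are hypothetical, and the LPC coefficients are
   assumed to have been computed already, with a[0] holding a(1).

```c
#include <stddef.h>

/*
 * Extend the end of a single-channel signal by LPC extrapolation.
 * x:     buffer with room for len + extra samples; x[0..len-1] is input
 * len:   number of existing samples
 * extra: number of samples to generate past the end
 * a:     LPC coefficients, a[k-1] holds a(k) for k = 1..order
 * order: prediction order N
 */
void lpc_extrapolate(float *x, size_t len, size_t extra,
                     const float *a, size_t order)
{
    for (size_t n = len; n < len + extra; n++) {
        float sum = 0.0f;
        /* x(n) = sum over k of a(k)*x(n-k); stop at the buffer start. */
        for (size_t k = 1; k <= order && k <= n; k++)
            sum += a[k - 1] * x[n - k];
        x[n] = sum;
    }
}
```

   Each generated sample feeds back into later predictions, which is
   the IIR filter running on a zero input.  One way to extend the
   beginning is to run the same loop on a time-reversed copy of the
   signal, then apply the fade-in.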

1200	7.2.  Continuous Chaining

1202	   In some applications, such as Internet radio, it is desirable to cut
1203	   a long stream into smaller chains, e.g. so the comment header can be
1204	   updated.  This can be done simply by separating the input streams
1205	   into segments and encoding each segment independently.  The drawback
1206	   of this approach is that it creates a small discontinuity at the
1207	   boundary due to the lossy nature of Opus.  A muxer MAY avoid this
1208	   discontinuity by using the following procedure:

1210	   1.  Encode the last frame of the first segment as an independent
1211	       frame by turning off all forms of inter-frame prediction.  De-
1212	       emphasis is allowed.

1214	   2.  Set the granule position of the last page to a point near the end
1215	       of the last frame.

1217	   3.  Begin the second segment with a copy of the last frame of the
1218	       first segment.

1220	   4.  Set the pre-skip value of the second stream in such a way as to
1221	       properly join the two streams.

1223	   5.  Continue the encoding process normally from there, without any
1224	       reset to the encoder.

1226	   In encoders derived from the reference implementation, inter-frame
1227	   prediction can be turned off by calling:

1229	     opus_encoder_ctl(encoder_state, OPUS_SET_PREDICTION_DISABLED(1));

1231	   For best results, this implementation requires that prediction be
1232	   explicitly enabled again before resuming normal encoding, even after
1233	   a reset.

1235	8.  Implementation Status

1237	   A brief summary of major implementations of this draft is available
1238	   at [1], along with their status.

1240	   [Note to RFC Editor: please remove this entire section before final
1241	   publication per [RFC6982], along with its references.]

1243	9.  Security Considerations

1245	   Implementations of the Opus codec need to take appropriate security
1246	   considerations into account, as outlined in [RFC4732].  This is just
1247	   as much a problem for the container as it is for the codec itself.
1248	   Robustness against malicious payloads is extremely important.
1249	   Malicious payloads MUST NOT cause an implementation to overrun its
1250	   allocated memory or to take an excessive amount of resources to
1251	   decode.  Although problems in encoding applications are typically
1252	   rarer, the same applies to the muxer.  Malicious audio input streams
1253	   MUST NOT cause an implementation to overrun its allocated memory or
1254	   consume excessive resources because this would allow an attacker to
1255	   attack transcoding gateways.

1257	   Like most other container formats, Ogg Opus streams SHOULD NOT be
1258	   used with insecure ciphers or cipher modes that are vulnerable to
1259	   known-plaintext attacks.  Elements such as the Ogg page capture
1260	   pattern and the magic signatures in the ID header and the comment
1261	   header all have easily predictable values, in addition to various
1262	   elements of the codec data itself.

1264	10.  Content Type

1266	   An "Ogg Opus file" consists of one or more sequentially multiplexed
1267	   segments, each containing exactly one Ogg Opus stream.  The
1268	   RECOMMENDED mime-type for Ogg Opus files is "audio/ogg".

1270	   If more specificity is desired, one MAY indicate the presence of Opus
1271	   streams using the codecs parameter defined in [RFC6381] and
1272	   [RFC5334], e.g.,

1274	                            audio/ogg; codecs=opus

1276	   for an Ogg Opus file.

1278	   The RECOMMENDED filename extension for Ogg Opus files is '.opus'.

1280	   When Opus is concurrently multiplexed with other streams in an Ogg
1281	   container, one SHOULD use one of the "audio/ogg", "video/ogg", or
1282	   "application/ogg" mime-types, as defined in [RFC5334].  Such streams
1283	   are not strictly "Ogg Opus files" as described above, since they
1284	   contain more than a single Opus stream per sequentially multiplexed
1285	   segment, e.g. video or multiple audio tracks.  In such cases, the
1286	   '.opus' filename extension is NOT RECOMMENDED.

1288	   In either case, this document updates [RFC5334] to add 'opus' as a
1289	   codecs parameter value with char[8]: 'OpusHead' as Codec Identifier.
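
   A receiver can recognize the Codec Identifier by comparing the first
   eight octets of a logical stream's first packet against 'OpusHead'.
   The function below is a minimal sketch; its name and the idea of
   passing the raw packet are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the packet starts with the 8-octet Codec Identifier
 * 'OpusHead', 0 otherwise.
 */
int has_opus_codec_identifier(const unsigned char *packet, size_t len)
{
    return len >= 8 && memcmp(packet, "OpusHead", 8) == 0;
}
```

   A match only identifies the codec; the rest of the identification
   header still has to be validated.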

1291	11.  IANA Considerations

1293	   This document updates the IANA Media Types registry to add .opus as a
1294	   file extension for "audio/ogg", and to add itself as a reference
1295	   alongside [RFC5334] for "audio/ogg", "video/ogg", and "application/
1296	   ogg" Media Types.

1298	   This document defines a new registry "Opus Channel Mapping Families"
1299	   to indicate how the semantic meanings of the channels in a multi-
1300	   channel Opus stream are described.  IANA is requested to create a new
1301	   name space of "Opus Channel Mapping Families".  This will be a new
1302	   registry on the IANA Matrix, and not a subregistry of an existing
1303	   registry.  Modifications to this registry follow the "Specification
1304	   Required with Expert Review" registration policy as defined in
1305	   [RFC5226].  Each registry entry consists of a Channel Mapping Family
1306	   Number, which is specified in decimal in the range 0 to 255,
1307	   inclusive, and a Reference (or list of references).  Each Reference
1308	   must point to sufficient documentation to describe what information
1309	   is coded in the Opus identification header for this channel mapping
1310	   family, how a demuxer determines the Stream Count ('N') and Coupled
1311	   Stream Count ('M') from this information, and how it determines the
1312	   proper interpretation of each of the decoded channels.

1314	   This document defines three initial assignments for this registry.

1316	                   +-------+---------------------------+
1317	                   | Value | Reference                 |
1318	                   +-------+---------------------------+
1319	                   | 0     | [RFCXXXX] Section 5.1.1.1 |
1320	                   |       |                           |
1321	                   | 1     | [RFCXXXX] Section 5.1.1.2 |
1322	                   |       |                           |
1323	                   | 255   | [RFCXXXX] Section 5.1.1.3 |
1324	                   +-------+---------------------------+

1326	   The designated expert will determine if the Reference points to a
1327	   specification that meets the requirements for permanence and ready
1328	   availability laid out in [RFC5226] and that it specifies the
1329	   information described above with sufficient clarity to allow
1330	   interoperable implementations.

1332	12.  Acknowledgments

1334	   Thanks to Ben Campbell, Mark Harris, Greg Maxwell, Christopher
1335	   "Monty" Montgomery, Jean-Marc Valin, and Mo Zanaty for their valuable
1336	   contributions to this document.  Additional thanks to Andrew
1337	   D'Addesio, Greg Maxwell, and Vincent Penquerc'h for their feedback
1338	   based on early implementations.

1340	13.  RFC Editor Notes

1342	   In Section 11, "RFCXXXX" is to be replaced with the RFC number
1343	   assigned to this draft.

1345	   In the Copyright Notice at the start of the document, the following
1346	   paragraph is to be appended after the regular copyright notice text:

1348	   "The licenses granted by the IETF Trust to this RFC under Section 3.c
1349	   of the Trust Legal Provisions shall also include the right to extract
1350	   text from Sections 1 through 14 of this RFC and create derivative
1351	   works from these extracts, and to copy, publish, display, and
1352	   distribute such derivative works in any medium and for any purpose,
1353	   provided that no such derivative work shall be presented, displayed,
1354	   or published in a manner that states or implies that it is part of
1355	   this RFC or any other IETF Document."

1357	14.  References

1359	14.1.  Normative References

1361	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1362	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1364	   [RFC3533]  Pfeiffer, S., "The Ogg Encapsulation Format Version 0",
1365	              RFC 3533, May 2003.

1367	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1368	              10646", STD 63, RFC 3629, November 2003.

1370	   [RFC4732]  Handley, M., Rescorla, E., and IAB, "Internet Denial-of-
1371	              Service Considerations", RFC 4732, December 2006.

1373	   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
1374	              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
1375	              DOI 10.17487/RFC5226, May 2008,
1376	              <http://www.rfc-editor.org/info/rfc5226>.

1378	   [RFC5334]  Goncalves, I., Pfeiffer, S., and C. Montgomery, "Ogg Media
1379	              Types", RFC 5334, September 2008.

1381	   [RFC6381]  Gellens, R., Singer, D., and P. Frojdh, "The 'Codecs' and
1382	              'Profiles' Parameters for "Bucket" Media Types", RFC 6381,
1383	              August 2011.

1385	   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the
1386	              Opus Audio Codec", RFC 6716, September 2012.

1388	   [EBU-R128]
1389	              EBU Technical Committee, "Loudness Recommendation EBU
1390	              R128", August 2011, <https://tech.ebu.ch/loudness>.

1392	   [vorbis-comment]
1393	              Montgomery, C., "Ogg Vorbis I Format Specification:
1394	              Comment Field and Header Specification", July 2002,
1395	              <https://www.xiph.org/vorbis/doc/v-comment.html>.

1397	14.2.  Informative References

1399	   [RFC6982]  Sheffer, Y. and A. Farrel, "Improving Awareness of Running
1400	              Code: The Implementation Status Section", RFC 6982, July
1401	              2013.

1403	   [RFC7587]  Spittka, J., Vos, K., and JM. Valin, "RTP Payload Format
1404	              for the Opus Speech and Audio Codec", RFC 7587, DOI
1405	              10.17487/RFC7587, June 2015,
1406	              <http://www.rfc-editor.org/info/rfc7587>.

1408	   [flac]     Coalson, J., "FLAC - Free Lossless Audio Codec Format
1409	              Description", January 2008, <https://xiph.org/flac/
1410	              format.html>.

1412	   [hanning]  Wikipedia, "Hann window", May 2013,
1413	              <https://en.wikipedia.org/wiki/
1414	              Hamming_function#Hann_.28Hanning.29_window>.

1416	   [linear-prediction]
1417	              Wikipedia, "Linear Predictive Coding", January 2014,
1418	              <https://en.wikipedia.org/wiki/Linear_predictive_coding>.

1420	   [lpc-sample]
1421	              Degener, J. and C. Bormann, "Autocorrelation LPC coeff
1422	              generation algorithm (Vorbis source code)", November 1994,
1423	              <https://svn.xiph.org/trunk/vorbis/lib/lpc.c>.

1425	   [replay-gain]
1426	              Parker, C. and M. Leese, "VorbisComment: Replay Gain",
1427	              June 2009, <https://wiki.xiph.org/
1428	              VorbisComment#Replay_Gain>.

1430	   [seeking]  Pfeiffer, S., Parker, C., and G. Maxwell, "Granulepos
1431	              Encoding and How Seeking Really Works", May 2012,
1432	              <https://wiki.xiph.org/Seeking>.

1434	   [vorbis-mapping]
1435	              Montgomery, C., "The Vorbis I Specification, Section 4.3.9
1436	              Output Channel Order", January 2010,
1437	              <https://www.xiph.org/vorbis/doc/
1438	              Vorbis_I_spec.html#x1-810004.3.9>.

1440	   [vorbis-trim]
1441	              Montgomery, C., "The Vorbis I Specification, Appendix A:
1442	              Embedding Vorbis into an Ogg stream", November 2008,
1443	              <https://xiph.org/vorbis/doc/
1444	              Vorbis_I_spec.html#x1-132000A.2>.

1446	   [wave-multichannel]
1447	              Microsoft Corporation, "Multiple Channel Audio Data and
1448	              WAVE Files", March 2007, <http://msdn.microsoft.com/en-
1449	              us/windows/hardware/gg463006.aspx>.

1451	14.3.  URIs

1453	   [1] https://wiki.xiph.org/OggOpusImplementation

1455	Authors' Addresses

1457	   Timothy B. Terriberry
1458	   Mozilla Corporation
1459	   650 Castro Street
1460	   Mountain View, CA  94041
1461	   USA

1463	   Phone: +1 650 903-0800
1464	   Email: tterribe@xiph.org

1466	   Ron Lee
1467	   Voicetronix
1468	   246 Pulteney Street, Level 1
1469	   Adelaide, SA  5000
1470	   Australia

1472	   Phone: +61 8 8232 9112
1473	   Email: ron@debian.org

1474	   Ralph Giles
1475	   Mozilla Corporation
1476	   163 West Hastings Street
1477	   Vancouver, BC  V6B 1H5
1478	   Canada

1480	   Phone: +1 778 785 1540
1481	   Email: giles@xiph.org