Internet Engineering Task Force Kretschmer-AT&T/Basso-AT&T INTERNET DRAFT Civanlar-AT&T/Quackenbush-AT&T File:draft-ietf-avt-rtp-mpeg2aac-02.txt Snyder-AT&T July 20, 2001 Expires: January 20, 2002 RTP Payload Format for MPEG-2 and MPEG-4 AAC Streams STATUS OF THIS MEMO This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document describes a payload format for transporting MPEG-2/MPEG-4 AAC encoded data using RTP. MPEG-2/MPEG-4 AAC is a recent standard from ISO/IEC [1] [2] [3] for coding multi-channel audio data. This payload format increases the packet loss resilience of AAC coded audio transport above that of 'RTP Payload Format for MPEG1/MPEG2 Video (RFC 2250)' [7] by incorporating AAC properties into the payload format. Supported features comprise fragmentation, interleaving, grouping, repair information and a predictability vector. The MPEG-2/MPEG-4 AAC bitstream format is not backwards compatible with other MPEG-2 audio formats (e.g. MP3). Several services provided by RTP are beneficial for MPEG-2/MPEG-4 AAC encoded data transport over the Internet. Additionally, the use of RTP allows for the synchronization of MPEG-2/MPEG-4 AAC with other real-time streams. In this version of the draft: - The fragmentation section has been revised to allow for the fragmentation of elements (CPE, etc.) to better support transmission over small MTU links Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 1] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 1. Introduction The ISO/IEC MPEG-2/MPEG-4 Advanced Audio Coding (AAC) [1] [2] [3] technology delivers CD-like or better multichannel audio quality at rates around 64 kBit/s per channel. It has a flexible bitstream syntax that supports from 1 to 48 audio channels, up to 16 subwoofer channels and up to 16 embedded data channels. AAC supports a wide range of sampling frequencies (from 16 kHz to 96 kHz) and an extremely wide range of bitrates. AAC can support applications ranging from professional or home theater sound systems to Internet music broadcast systems. The syntax of MPEG-2 AAC compressed data streams is identical to that of MPEG-4 AAC main, AAC LC and AAC SSR General Audio compressed data streams, so that MPEG-2 AAC is fully forward compatible with MPEG-4 AAC. Both MPEG-2 AAC and MPEG-4 AAC provide the same level of compression performance. However, the semantics of MPEG-4 AAC is different in one small respect, precluding a full backward compatibility of MPEG-4 AAC to MPEG-2 AAC. Benefits of using RTP for MPEG-2/MPEG-4 AAC stream transport include: i. Providing increased packet loss resilience based on application layer framing. ii. The ability to synchronize AAC streams with other RTP payloads iii. Monitoring MPEG-2/MPEG-4 AAC delivery performance through RTCP iv. Combining MPEG-2/MPEG-4 AAC and other real-time data streams received from multiple end-systems into a set of consolidated streams through RTP mixers v. Converting data types, etc. through the use of RTP translators. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [5]. 1.1 Overview of MPEG-2/MPEG-4 AAC AAC combines the coding efficiencies of a high resolution filter bank, a powerful model of audio perception, backward-adaptive prediction, joint channel coding, and Huffman coding to achieve high-quality signal compression. In 1998 the MPEG Audio subgroup tested the family of MPEG audio coders (see http://www.tnt.uni-hannover.de/project/mpeg/ audio/public/w2006.pdf). The test results indicate that for a stereo signal, AAC at 96 kBit/s has audio quality comparable to MPEG-2 Layer 3 ("mp3") at 128 kBit/s. AAC is a block oriented, variable rate coding algorithm. An AAC encoder takes 1024 samples per channel at a time (a 'block') as input and the compressed representation is variable in size. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 2] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 Rate control can be used at the encoder to generate a constant-rate bitstream. Each block of AAC compressed bits is called a "raw data block", and can be decoded "stand-alone", that is, without information from prior raw data blocks. This feature is particularly useful for the delivery of AAC over lossy packet networks since the loss of a packet does not directly affect the decodability of the adjacent packets. 1.2 Bitstream Syntax The syntax of an AAC bitstream is as follows: => => [] where indicates the AAC bitstream, indicates intermediate tokens, indicates terminal tokens and [] indicates one or more occurrence. is a token that indicates the end of a raw_data_block and is a variable length token that forces the total length of a raw_data_block to be an integral number of bytes. In general, intermediate tokens are not an integral number of bytes in length. The tokens are a string of bits of variable length, and they can be any of the following: a single audio channel a stereo presentation (2 channels) a mechanism for multi-channel compression a special effects channel "user data" a mechanism for describing the bitstream content a mechanism to use bits (for constant rate channels) The can occur several times in a single raw_data_block. For example, the raw_data_block for a 5.1 surround sound signal would be: ... corresponding to the center, left and right, left surround and right surround and effects channels. Occurances of the are dis-ambiguated by means of a unique 4-bit id inside the . Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 3] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 2. Issues covered by this Payload Format 2.1 Repair Information to reconstruct lost AAC Frames (Unequal FEC) A smart AAC decoder can mitigate the effects of lost packets using techniques such as interpolation in the spectral domain. However if the raw_data_block in a packet is perceptually significant and also highly unpredictable (e.g. the onset of a cymbal crash) then the sender may choose to add repair information associated with that raw_data_block. This form of unequal FEC allows the encoder/sender to protect a stream depending on known loss characteristics and/or frame predictability. A given repair information block (AAC DATA chunk with TYPE > 0) is typically associated with a raw_data_block (TYPE = 0). The association between the raw_data_block and the repair information is obtained by means of the SEQ field. Repair Information as defined here is a valid AAC raw_data_block. As an example, the Repair Information can be a highly compressed monophonic version of a subset of the signal being transmitted. An AAC stereo signal coded at a sampling rate of 44100 samples/s and a bit rate of 96 kBit/s corresponds to an average raw_data_block size of 279 bytes. A RepairData version of that block, compressed to 16 kBit/s would be 46 bytes in length. Given that perceptually critical blocks might occur only once per 100 or more blocks, the average rate increase associated with this type of RepairData can be very low. Generally, the Repair Information for a given AAC frame X SHALL be carried by a different RTP packet then the one that carries X. Generally, the Repair Information MUST be computed at the same sampling rate as the stream being repaired. The usage of the Repair Information is similar to the one proposed in [6]. The OPTIONAL Repair Information MAY be provided for every frame. RepairData can be generated in many ways including using two encoders, decoding followed by coding or processing the original bitstream. 2.2 Fragmentation of AAC Frames It is desirable to limit the size of an AAC frame to less than the path-MTU. If this is not possible, the frame can be fragmented across several RTP packets. Fragmentation SHOULD occur at boundaries. If further fragmentation is needed MAY be fragmented, as well. In that case the decoder must be able to handle partial . An RTP packet contains either an integer number of complete AAC frames or fragments of a single AAC frame. Subsequent RTP packets containing subsequent fragments of an AAC frame have a much simpler header that is just two bytes long. They can be identified by the F-bit set to 1. The S-bit signals the first or only fragment of an . The same ELEMENT ID is shared by all fragments belonging to the same . ELEMENT ID zero is assigned to the fragments of the first in the frame and is increased by one for each following . FBITS indicates the number of unused bits in the first byte Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 4] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 and LBITS the number of unused bits in the last byte. FBITS and/or LBITS must be set to zero if the fragment starts and/or ends at a byte-aligned boundary. The length of a fragment can be determined from the total length of the packet excluding the headers. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |X|X|X|F|S|ELEMNT ID|FBITS|LBITS|AAC FRAGMENT | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 2.3 Predictability of AAC Frames AAC frame predictability allows adaptive handling of packet losses and/or bandwidth constraints by signalling the need for an action a receiver may take when an associated AAC frame is lost. Every AAC frame will be assigned to one of the following three predictability classes: - 0: not predictable - 1: one side predictable (either L-predictable or R-predictable) - 2: two side predictable (- 3: reserved) An AAC frame that belongs to class 0 cannot be easily concealed using any other AAC frame(s) in the bitstream. An AAC frame that belongs to class 1 can be predicted either from previous (R-predictable) or following (L-predictable) AAC frame but not from both. An AAC frame that belongs to class 2 can be predicted from the preceding or following AAC frame or from both. Predictability information is coded for every RTP AAC packet in the Predictability Quantifier (PQ) which is 2 bits in length. For a given RTP packet such PQs are organized in a predictability vector which represents a sliding window of PQs, starting with the current packet's PQ followed by preceding packets' PQs. 2.4 Grouping and Interleaving of AAC Frames It is often desirable to group an integer number of AAC frames. The predictability of such an RTP packet is the predictability of the AAC frame in the RTP packet which is least predictable. AAC frames belonging to the same predictability class MAY be grouped into one RTP packet. Note that if frames of different predictabilities are grouped much of the usefulness of the predictability information is lost. The sequence numbers SEQ of the AAC DATA chunks are used to restore the proper order on the receiver side. Grouping AAC frames into a single RTP packet is OPTIONAL. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 5] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 2.5 Example RTP Packet Sequence The example below shows a sequence of AAC frames (a...p) and their assigned predictability classes. +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 1 | 2 | 2 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ The AAC frames MAY be grouped according to their predictability. R(x) is the RepairData information sent within the RTP packet: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |a g j|b h k|c i o| d f | e | l n | m | p | | +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ | | |R(e) | | |R(m) | | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 3. RTP AAC Payload Format The AAC specific RTP payload consists of a 8, 32, 64 or 96 bit header, and a variable number of AAC DATA chunks. The type of those chunks is identified by TYPE field. The LENGTH field specifies the length of a chunk in bytes and SEQ is a sequence number which allows grouping, interleaving and association of Repair Info with the frame it repairs. The header contains a vector of Predictability Quantizers (PQ) which specify the packets' predictability classes, and a set of control bits. The PVS field specifies if the header contains 12, 28 or 44 PQs. At the beginning of a session, if fewer packets have been transmitted/ received than there are PQs in the header then the extra PQs are invalid and MUST be set to 0 (on the sender side) and MUST be ignored (on the receiver side). If a sender provides a predictability vector but does not provide frame predictability information it MUST set all PQs to 0. A client can ignore the information provided by PQs since PQs are not required for decoding AAC frames. PQs can be used to decide when to ask for retransmission of lost packets. PQs can also provide hints which help a PQ-aware decoder to improve the audio quality when concealing lost packets. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 6] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |PVS|M|F| MBZ |PRD VECTOR (PVS > 0) | Header +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |PRD VECTOR (PVS > 1) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |PRD VECTOR (PVS > 2) | +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ |TYPE |SEQ |LENGTH (if M==1) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| |AAC DATA 1 | AAC | | Data | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |TYPE |SEQ |LENGTH | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| | |AAC DATA 2 | |-+-+-+-+-+-+-+-+ | | . | | . | | . | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| |TYPE |SEQ |LENGTH | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| |AAC DATA N | | | | | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ PRD VECTOR: Predictability vector. It contains either 12, 28 or 44 Predictability Quantifiers (PQ). The size of a PQ element is 2 bits. The first PQ refers to the predictability class of the current packet. The following PQs refer to the most recent previous packets. Thus, the vector looks like this: {PQ(t), PQ(t-1), PQ(t-2)...} The predictability class of a packet is that of the least predictable AAC frame that is contained in the packet. PVS: Predictability Vector Size. Specifies the number of 32bit words used for the Predictability Vector. The first 32bit word contains the flags field. Hence, only the lower significant 24bits belong to the vector. If PVS is set to 0 the predictability vector field does not exist, and the TYPE field is contiguous with the MBZ field. M: If M is set, then the payload contains more than one AAC frame. Hence, a LENGTH field for the first frame MUST be present. If M is clear, then only one AAC frame is present, and no LENGTH field is present. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 7] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 F: Fragmented Frame. MBZ: Must be set to 0. TYPE: The type of AAC DATA. This field specifies if the AAC DATA is an original AAC frame or contains some form of Error Correction data. The following types are defined for now: 0: Original AAC frame 1: Identical to original frame but sent for redundancy 2: Same AAC configuration but encoded using less bits (Repair Information) (all three types are valid AAC frames) SEQ: The sequence number enumerates AAC frames at the stream level. It may be used to support interleaving or association of Repair Information with TYPE==0 AAC frames, etc. LENGTH: The length of the AAC Data in bytes. AAC DATA: The actual AAC data chunk. This is either a valid AAC frame (TYPE = 0) or Repair Information belonging to a valid AAC frame (TYPE > 0). 3.1 RTP Header Fields Usage: The RTP header fields are used as follows: Payload Type (PT): It is expected that the RTP profile for a particular class of applications will assign a payload type for this encoding, or alternatively a payload type in the dynamic range shall be chosen. Marker (M) bit: Set to one to mark the last fragment (or only fragment) of an AAC frame. Extension (X) bit: Defined by the RTP profile used. Timestamp (TS): 32-bit timestamp representing the sampling time of the first sample of the first AAC frame in the packet. The clock frequency MUST be set to the sample rate of the encoded audio data and is conveyed out-of-band (i.e. through SDP [8]). If N > 1 frames are present in a RTP packet the TS of the frames 2...N can be calculated by computing the sequence number difference between those frames and the first frame since the sample rate and the number of samples per frame are fixed and known. All packets that make up a fragmented AAC frame MUST use the same TS. Timestamps start at a random value to improve security. SSRC: set as described in RFC1889 [2]. CC and CSRC fields are used as described in RFC 1889 [2]. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 8] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 RTCP SHOULD be used as defined in RFC 1889 [2] 4. Security Considerations RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [2]. This implies that confidentiality of the media streams is achieved by encryption. Because the data compression used with this payload format is applied end-to-end, encryption may be performed on the compressed data so there is no conflict between the two operations. This payload type does not exhibit any significant non-uniformity in the receiver side computational complexity for packet processing to cause a potential denial-of-service threat. 5. Intellectual Property Disclosure A US patent application has been filed on the usage and computation of predictability information for transmission over lossy channels. 6. References [1] ISO/IEC 13818-7 Advanced Audio Coding (AAC). [2] ISO/IEC 14496-3:1999 "Information technology - Coding of audio-visual objects - Part 3: Audio," December, 1999. [3] ISO/IEC 14496-3:1999 / AMD1:2000. [4] Schulzrinne, Casner, Frederick, Jacobson RTP: A Transport Protocol for Real Time Applications RFC 1889, Internet Engineering Task Force, January 1996. [5] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, RFC 2119, March 1997. [6] Perkins C., Kouvelas I., Hodson O., Hardman V., Bolot J.C, Vega-Garcia A., Fosse-Parisis S. "RTP Payload for Redundant Audio Data", RFC 2198, Internet Engineering Task Force, September 1997. [7] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar RTP Payload Format for MPEG1/MPEG2 Video RFC 2250, Internet Engineering Task Force, January 1998. [8] M. Handley, V. Jacobson, SDP: Session Description Protocol RFC 2327, Internet Engineering Task Force, April 1998. Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 9] INTERNET-DRAFT RTP Payload Format for MPEG AAC Streams Jan 2002 7. Authors' Addresses Mathias Kretschmer AT&T Labs - Research 180 Park Ave. Florham Park, NJ 07932 USA e-mail: mathias@research.att.com Andrea Basso AT&T Labs - Research 100 Schultz Drive Red Bank, NJ 07701 USA e-mail: basso@research.att.com M. Reha Civanlar AT&T Labs - Research 100 Schultz Drive Red Bank, NJ 07701 USA e-mail: civanlar@research.att.com Schuyler R. Quackenbush AT&T Labs - Research 180 Park Ave. Florham Park, NJ 07932 USA e-mail: srq@research.att.com James H. Snyder AT&T Labs - Research 180 Park Ave. Florham Park, NJ 07932 USA e-mail: jhs@research.att.com Kretschmer/Basso/Civanlar/Quackenbush/Snyder [Page 10]