Internet Engineering Task Force                         Basso-AT&T
Internet Draft                                          Civanlar-AT&T
                                                        Gentric-Philips
                                                        Herpel-Thomson
                                                        Lifshitz-Optibase
                                                        Lim-mp4cast
                                                        Perkins-ISI
                                                        Van Der Meer-Philips
                                                        February 2002
                                                        Expires August 2002

Document: draft-ietf-avt-mpeg4-multisl-04.txt

                 RTP Payload Format for MPEG-4 Streams

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This specification is a product of the Audio/Video Transport working
   group within the Internet Engineering Task Force and ISO/IEC MPEG-4
   ad hoc group on MPEG-4 over Internet. Comments are solicited and
   should be addressed to the working group's mailing list at
   avt@ietf.org and/or the authors.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   << Note for the RFC editor: XXXX should be replaced with this RFC
   number and YYYY replaced by the number given to the companion RFC
   which draft is: draft-ietf-avt-mpeg4-simple-**.txt. This document
   also contains a MIME type registration form that is intended to be
   taken as-is and therefore makes reference to this document, using
   the temporary placeholder: XXXX. >>

Gentric et al.            Expires August 2002                         1
RTP Payload Format for MPEG-4 Streams                     February 2002

Abstract

   This document describes a payload format for transporting MPEG-4
   encoded data using RTP.
   MPEG-4 is a recent standard from ISO/IEC for the coding of natural
   and synthetic audio-visual data. Several services provided by RTP
   are beneficial for MPEG-4 encoded data transport over the Internet.
   Additionally, the use of RTP makes it possible to synchronize MPEG-4
   data with other real-time data types.

Table of Contents

   1. Introduction
   1.1 Overview of MPEG-4 End-System Architecture
   1.2 The simplified MPEG-4 terminal model
   1.3 The complete MPEG-4 terminal model
   1.3.1 The Sync Layer and DMIF
   2. Analysis of the carriage of MPEG-4 over IP
   2.1 The Sync Layer point of view
   2.2 The Elementary Stream point of view
   2.3 How the two views reconcile
   2.4 Rationale for features
   2.5 Relation with RFC 3016
   3. Payload Format
   3.1 RTP Header Fields Usage
   3.2 RTP payload structure
   3.3 Payload Header Section structure
   3.3.1 Payload Header structure
   3.3.2 Fields of a Payload Header
   3.4 RSLHSection structure
   3.4.1 RSLH structure
   3.4.2 Removal of fields
   3.4.3 Mapping of OCR
   3.4.4 Degradation Priority
   3.5 Payload Section structure
   3.6 Interleaving
   3.6.1 Time stamp based interleaving (TSBI)
   3.6.2 Index based interleaving (IBI)
   3.6.3 SL streams that should not be interleaved
   3.7 Fragmentation Rules
   4. Types and names
   4.1 MIME type registration
   4.2 Concatenation of parameters
   4.3 Usage of SDP
   4.3.1 The a=fmtp keyword
   4.3.2 SDP example
   5. IANA considerations
   6. Other issues
   6.1 SL-packetized stream reconstruction
   6.2 Handling of scene description streams
   6.3 Overlap with RFC 3016
   6.4 Multiplexing
   7. Security considerations
   8. Acknowledgements
   9. References
   10. Authors' addresses
   APPENDIX: Examples of usage
   Appendix.1 RFC 3016 compatible MPEG-4 Video (no SL)
   Appendix.2 MPEG-4 Video with SL
   Appendix.3 Low delay MPEG-4 Audio (no SL)
   Appendix.4 Media delivery MPEG-4 Audio (no SL)
   Appendix.5 AAC with interleaving (no SL)
   Appendix.6 AAC with Index-based interleaving and SL

1. Introduction

   MPEG-4 is a recent standard from ISO/IEC for the coding of natural
   and synthetic audio-visual data in the form of audiovisual objects
   that are arranged into an audiovisual scene by means of a scene
   description [1][2][3][4]. This draft specifies an RTP [5] payload
   format for transporting MPEG-4 encoded data streams. It supplements
   RFC 3016 in that it can transport all MPEG-4 stream types while
   remaining compatible with RFC 3016 for the transport of MPEG-4
   video.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [6].

   The benefits of using RTP for MPEG-4 data stream transport include:

   i. Ability to synchronize MPEG-4 streams with other RTP payloads;
      one example is the transport and synchronization of MPEG-4 video
      associated with AMR audio in mobile networks.

   ii. Monitoring MPEG-4 delivery performance through RTCP.

   iii. Combining MPEG-4 and other real-time data streams received from
      multiple end-systems into a set of consolidated streams through
      RTP mixers.

   iv. Converting data types, etc. through the use of RTP translators.

1.1 Overview of MPEG-4 End-System Architecture

   Two types of terminals can use this specification. One case is a
   complete MPEG-4 terminal, i.e. a terminal implementing the MPEG-4
   system [1] specification and possibly also MPEG-4 video [2] and
   audio [3]. Another possibility is a terminal implementing only a
   part of this set of MPEG-4 specifications; one example is a terminal
   using MPEG-4 video [2] but not MPEG-4 systems, as in RFC 3016.

   This document is structured so as to be understandable from both
   points of view (with or without MPEG-4 systems). The target is also
   that services deployed for one type of terminal can be adapted for
   the other type with only a minor change in the session description,
   because the media formats are the same. Another key assumption is
   that the properties of streams of various types (video, audio, scene
   description) can be described with the same Elementary Stream model
   so that this same payload format can transport any MPEG-4 stream.

1.2 The simplified MPEG-4 terminal model

   In the simplified MPEG-4 model MPEG-4 systems [1] is not used.
   However the concept of Elementary Stream remains, by MPEG
   definition: "A consecutive flow of mono-media data from a single
   source entity to a single destination entity on the compression
   layer".

   Indeed both the MPEG-4 video [2] and MPEG-4 audio [3] documents
   describe how video and audio bit streams, respectively, are
   fragmented into pieces that are called Access Units, again by MPEG
   definition: "An individually accessible portion of data within an
   Elementary Stream. An access unit is the smallest data entity to
   which timing information can be attributed".

   Each Access Unit has by this definition a number of media
   independent basic properties:

   . Composition time stamp (CTS)
   . Framing
   . Possibly decoding time stamp (DTS)

   Furthermore both the video [2] and audio [3] specifications also
   define how Access Units (AU) shall themselves be fragmented, since
   in the spirit of Application Level Framing AUs should be fragmented
   in such a way that decoders can process the packets arriving
   immediately after a packet loss. In this case the signaling of
   Access Unit fragment boundaries is also required.

   In order to be understandable from this point of view this payload
   format is described in terms of Access Units (AU) and Access Unit
   fragments. This specification does not make reference to media
   specific properties (but for a few exceptions). Indeed it is the
   purpose of this specification to provide RTP transport for all media
   types in MPEG-4 in a generic fashion.

   In this mode of operation the RTP framework is used for transport of
   timing and synchronization, and protocols such as H.323, SIP, RTSP,
   etc. can be used for control.

1.3 The complete MPEG-4 terminal model

   Fig. 1 below shows the layered architecture of a terminal that
   implements the complete MPEG-4 systems model. The Compression Layer
   processes individual audio-visual media streams. The MPEG-4
   compression schemes are defined in the ISO/IEC specifications
   14496-2 [2] and 14496-3 [3]. The compression schemes in MPEG-4
   achieve efficient encoding over a bandwidth ranging from a few kbps
   to many Mbps. The audio-visual content compressed by this layer is
   organized into Elementary Streams (ESs).
   The MPEG-4 standard specifies MPEG-4 compliant streams. Within the
   constraint of this compliance the compression layer is unaware of a
   specific delivery technology, but it can be made to react to the
   characteristics of a particular delivery layer such as the path-MTU
   or loss characteristics. Also, some compressors can be designed to
   be delivery specific for implementation efficiency. In such cases
   the compressor may work in a non-optimal fashion with delivery
   technologies that are different from the one it is specifically
   designed to operate with.

   The hierarchical relations, location and properties of ESs in a
   presentation are described by a dynamic set of Object Descriptors
   (ODs). Each OD groups one or more ES Descriptors referring to a
   single content item (audio-visual object). Hence, multiple
   alternative or hierarchical representations of each content item are
   possible.

   ODs are themselves conveyed through one or more ESs. A complete set
   of ODs can be seen as an MPEG-4 resource or session description at a
   stream level. The resource description may itself be hierarchical,
   i.e. an ES conveying an OD may describe other ESs conveying other
   ODs.

   The session description is accompanied by a dynamic scene
   description, Binary Format for Scene (BIFS), again conveyed through
   one or more ESs. At this level, content is identified in terms of
   audio-visual objects. The spatio-temporal location of each object is
   defined by BIFS. The audio-visual content of those objects that are
   synthetic and static is also described by BIFS. Natural and animated
   synthetic objects may refer to an OD that points to one or more ESs
   that carry the coded representation of the object or its animation
   data.

   media aware        +-----------------------------------------+
   delivery unaware   |            COMPRESSION LAYER            |
   14496-2 Visual     |streams from as low as Kbps to multi-Mbps|
   14496-3 Audio      +-----------------------------------------+

                               Elementary Stream
   ===================================================Interface
                                    (ESI)
                      +-------------------------------------------+
   media and          |                SYNC LAYER                 |
   delivery unaware   | manages elementary streams, their synch-  |
   14496-1 Systems    | ronization and hierarchical relations     |
                      +-------------------------------------------+

                               DMIF Application
   ====================================================Interface
                                    (DAI)
                      +-------------------------------------------+
   delivery aware     |              DELIVERY LAYER               |
   media unaware      |provides transparent access to and delivery|
   14496-6 DMIF       | of content irrespective of delivery       |
                      | technology                                |
                      +-------------------------------------------+

          Figure 1: Conceptual MPEG-4 terminal architecture

   By conveying the session (or resource) description as well as the
   scene (or content composition) description through their own ESs, it
   is made possible to change portions of the content composition and
   the number and properties of media streams that carry the
   audio-visual content separately and dynamically at well known
   instants in time.

   One or more initial Scene Description streams and the corresponding
   OD stream are pointed to by an initial object descriptor (IOD). In
   this context the IOD needs to be made available to the receivers
   through some out-of-band means that are out of scope of this payload
   specification. However in the context of transport on IP networks it
   is defined in a separate document [9].

   The Compression Layer organizes the ESs in Access Units (AU), the
   smallest elements that can be attributed individual timestamps. The
   Access Unit concept defines the boundary between media specific
   processing and delivery specific processing.
   That is to say, transport should not depend on the nature of the
   media data but only on AU properties.

1.3.1 The Sync Layer and DMIF

   The Sync Layer (SL), which primarily provides the synchronization
   between streams, defines a homogeneous encapsulation of ESs carrying
   media or control data (ODs, BIFS). Integer or fractional AUs are
   then encapsulated in SL packets. All consecutive data from one
   stream is called an SL-packetized stream.

   The interface between the compression layer and the SL is called the
   Elementary Stream Interface (ESI). The ESI is informative, i.e. it
   is extremely useful in order to define concepts and mechanisms but
   does not have to be implemented.

   The Delivery Layer in MPEG-4 consists of the Delivery Multimedia
   Integration Framework defined in ISO/IEC 14496-6 [4]. This layer is
   media unaware but delivery technology aware. It provides transparent
   access to and delivery of content irrespective of the technologies
   used. The interface between the SL and DMIF is called the DMIF
   Application Interface (DAI). It offers content location independent
   procedures for establishing MPEG-4 sessions and access to transport
   channels. This payload format can be used as an instance of the
   MPEG-4 Delivery Layer but is otherwise not tied to DMIF.

   The ESs from the encoders are fed into the SL with indications of AU
   boundaries, random access points, desired composition time and the
   current time.

   The Sync Layer fragments the ESs into SL packets, each containing a
   header that encodes information conveyed through the ESI. If the AU
   is larger than an SL packet, subsequent packets containing remaining
   parts of the AU are generated with subset headers until the complete
   AU is packetized.

   One SL packet describes an Access Unit or a fragment thereof; the SL
   packet header contains extended timing and framing information; the
   SL packet payload contains the bit stream frame (AU) or fragment.
   For the complete list of features of the Sync Layer refer to the
   MPEG-4 systems specification [1].

   The syntax of the Sync Layer is configurable and can be adapted to
   the needs of the stream to be transported. This includes the
   possibility to select the presence or absence of individual syntax
   elements as well as configuration of their length in bits. The
   configuration for each individual stream is conveyed in a
   SLConfigDescriptor, which is an integral part of the ES Descriptor
   for this stream.

   The MPEG-4 SLConfigDescriptor, being configuration information, is
   not carried by the media stream itself but is rather transported via
   an ObjectDescriptor Stream encoded using the MPEG-4 Object
   Description framework. This can be done in a separate stream using
   this payload format (see section 6.2 for details). The
   SLConfigDescriptor MAY also be transported by other means (for
   example as a MIME parameter, see section 4.1).

   It is important to note that this draft could just as well have been
   written entirely in terms of SL packets instead of Access Units and
   Access Unit fragments. However this could have created confusion for
   implementers who only need basic properties and do not want to cope
   with the additional complexity of the Sync Layer. Instead this
   specification refers to the Sync Layer only when needed.

2. Analysis of the carriage of MPEG-4 over IP

   As explained above, when transporting MPEG-4 audio and video,
   applications may or may not require the use of MPEG-4 systems. To
   achieve the highest level of interoperability between all MPEG-4
   applications, it is desirable that (a) in both cases the same MPEG-4
   transport format can be used and that (b) receivers that have no
   MPEG-4 system knowledge can easily skip the MPEG-4 system specific
   information, if any.

   An example of an application not requiring MPEG-4 system is
   audio/video streaming from a single source.
   Examples of applications that would benefit from MPEG-4 system
   features are:

   . Audio/video streaming mixing RTP and non-RTP sources (e.g. local
     storage in the .mp4 interchange format)
   . Rich multimedia applications including 2D, 2.5D or 3D interactive
     scenes with multiple graphical/audio/video objects and/or a
     composition variable in time and/or according to a server-push
     and/or server-pull model.
   . Applications involving Digital Rights Management for some or all
     parts/streams in the content
   . Applications involving the use of advanced meta-data and the
     associated content management features as provided by the MPEG
     suite of relevant standards (MPEG-7 and MPEG-21).

2.1 The Sync Layer point of view

   RTP is well suited to transporting MPEG-4 audio and MPEG-4 video,
   but when using MPEG-4 systems a problem arises from the fact that
   both RTP and MPEG-4 systems contain a synchronization layer. In
   particular, the RTP header duplicates some of the information
   provided in SL packet headers, such as the composition timestamps
   (CTS) and Access Unit boundaries.

   To avoid unnecessary overhead and potential interoperability risks
   when transporting MPEG-4 systems, it is desirable to remove the
   redundancy between the SL packet header and the RTP packet header.
   To be independent of the use of MPEG-4 systems, synchronization can
   rely on the parameters provided in the RTP header. Another desired
   property is to have compatibility with RFC 3016 for MPEG-4 video
   transport.

   This is achieved in the following fashion (also depicted in figure
   5): In case SL headers are used, the redundant fields are removed
   from the SL header. The remaining information from the SL header, if
   any, is contained inside the RTP packet payload, together with the
   SL packet payload. Some of this information is also useful for
   transport over RTP when an MPEG-4 system is not used. For that
   reason this information is split into "general useful information"
   and "MPEG-4 systems only information". The "general useful
   information", hereinafter called Payload Header, is carried by a
   number of fields configurable using parameters defined in section
   4.1; all receivers MUST parse these fields. The "MPEG-4 systems only
   information", if any, is contained in an auxiliary header,
   hereinafter called Remaining SL Packet Header (RSLH), also
   configured using parameters (see section 4.1) and preceded by a
   length field, so that non-MPEG-4-system devices MAY skip this
   information.

                                         +------------+
                   extended framing and  | AU or AU   |
                   timing information    | fragment   |
                                         +------------+
                                  |             |
                                  |             |
                                  |             |
                                  |             |
                                  V             V
                           <----------SL Packet-------->
                           +---------------------------+
                           | SL Packet   | SL Packet   |
                           | Header      | Payload     |
                           +---------------------------+
                                  |             |
                                  |             |
         +-------------+----------+---+
         |             |              |         |
         V             V              V         V
   +-----------+ +-----------+ +-------------+ +-----------+
   |RTP Packet | | Payload   | | Remaining SL| | SL Packet |
   | Header    | | Header    | | Header      | | Payload   |
   +-----------+ +-----------+ +-------------+ +-----------+
                 <----RTP Packet Payload------------------->

     Figure 5: Mapping of ES into SL, then SL Packet into RTP packet

2.2 The Elementary Stream point of view

   Another way to see the mapping of Elementary Streams (i.e. Access
   Units or AU fragments) into RTP packets is depicted in Figure 6. In
   this view the "basic" timing and fragmentation information listed in
   section 1.2 is obtained directly at the codec interfaces and mapped
   into the RTP header or the RTP Payload Header.

   For example, this RTP payload format has been designed so that it is
   by default configured to be identical to RFC 3016 for the
   recommended MPEG-4 video configurations; specifically, in this case
   the Payload Header is empty. Hence receivers that comply with this
   payload specification can decode such RTP payload without knowledge
   about the Sync Layer (see the relevant examples in Appendix).

   In a similar fashion but with non-empty Payload Headers, MPEG-4
   audio (see Appendix 3 and 4 for examples) can be transported without
   explicit use of the Sync Layer.

                               +------------+
          basic framing and    | AU or AU   |
          timing information   | fragment   |
                               +------------+
                |                    |
                |                    |
         +-------------+             |
         |             |             |
         V             V             V
   +-----------+ +-----------+ +-----------+
   |RTP Packet | | Payload   | |           |
   | Header    | | Header    | | Payload   |
   +-----------+ +-----------+ +-----------+
                 <----RTP Packet Payload--->

      Figure 6: Direct mapping of Elementary Streams into RTP packet

2.3 How the two views reconcile

   A simple concept unifies these apparently antagonistic points of
   view: a terminal that does not implement the Sync Layer can skip
   (ignore) the Remaining SL Header, if present.

   There are also cases when an Elementary Stream is such that SL
   packets are reduced to the media (compressed) data (empty headers),
   and in that case implementations do not actually need to be aware of
   the Sync Layer at all. In these cases it is logically equivalent to
   say that the Sync Layer is not implemented or to say that the SL
   packet headers are completely empty (or fully map into the RTP
   headers). The Sync Layer can then be seen as a purely conceptual
   construction that does not have to be implemented at all. Examples
   are video transported as in RFC 3016 (see below) and some audio
   modes (see Appendix).

   The above described MPEG-4 system model also deals with session
   setup through Object Descriptors. In cases where the complete MPEG-4
   system framework is not used a replacement for this key
   functionality is required. In fact for simple (audio/video) systems
   only the knowledge of the decoder configuration is needed; we will
   see how this specification defines options so that decoder
   configuration can also be signaled without MPEG-4 system.

   In conclusion this payload format is intended to be capable of
   transporting data formatted according to the Sync Layer
   specification but is also useful without the Sync Layer, or when the
   Sync Layer is invisible, which is equivalent to not using it.

2.4 Rationale for features

   This payload format has a number of uncommon features that are best
   understood by first considering their rationale:

   . Genericity: The payload structure does not depend on the nature of
     the stream (audio, video, scene, etc). In this respect the
     apparent complexity of this specification should be compared to
     the complexity of the only alternative solution, which would have
     been the specification and implementation of many different RTP
     payload formats.

   . Variable geometry: this payload format is highly configurable,
     i.e. the structure of the RTP payload depends on MIME parameters;
     all the Payload Header components are optional and most of them
     have a configurable size. This is aligned with the Sync Layer
     definition and allows optimal efficiency in terms of payload size
     per packet.

   . Two packing styles (single and multiple): the rationale for
     transporting a single AU or AU fragment per RTP packet is
     simplicity; it is also the packing style for backward
     compatibility with RFC 3016. The rationale for transporting
     multiple AUs per RTP packet is efficiency, at the cost of
     sensitivity to losses.

   . Two interleaving methods: the rationale for interleaving is to
     enable various error concealment strategies in case of packet
     losses when packing several AUs or AU fragments per RTP packet.
     The need for two interleaving methods arises from the fact that
     the default one, based on time stamps, is the most efficient but
     does not work for all configurations. Another method, based on
     indexes, is therefore required.

   . The rationale for transporting multiple interleaved AU fragments
     per RTP packet is to benefit from advanced error resiliency
     properties of bit streams (such as MPEG-4 audio version 2).

2.5 Relation with RFC 3016

   The following set of figures displays the relationship between the
   MPEG-4 RTP payload formats; there are 4 MPEG-4-related RTP payload
   formats.

   The FlexMux is a separate issue [11] and need not be discussed here
   apart from the fact that it shares with this work the MPEG-4 Sync
   Layer as the interface into the MPEG-4 domain.

   RFC 3016 describes transport of MPEG-4 video and LATM (for speech
   and audio codecs).

   This specification defines transport of any MPEG-4 type of data,
   with or without the Sync Layer.

   RFC YYYY describes a subset of the configurations that this
   specification can handle.

   Figure 2 displays the situation for video; note that this
   specification is compatible with RFC 3016.

   Figure 3 displays the situation for audio; note the presence of the
   LATM multiplex, which makes RFC 3016 audio transport incompatible
   with this specification.

   Figure 4 displays the situation for other MPEG-4 streams, including
   BIFS, ODS, IPMP, etc.

   +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   |                                                               |
   |                         MPEG-4 Video                          |
   |                                                               | I
   |+++++++++++++++++++++++|                                       | S
   |                       |                                       | O
   |      Sync Layer       |                                       | /
   |                       |                                       | M
   |+++++++++++++++++++++++|                                       | P
   |            |          |                                       | E
   |  FlexMux   |          |                                       | G
   |            |          |    <- same RTP packet structure ->    |
   |++++++++++++|          +++++++++++++++++++++++++++|++++++++++++|***
   |            |                        |            |            |
   |  FlexMux   |        RFC XXXX        |  RFC YYYY  |  RFC 3016  | I
   |    RTP     |   MPEG-4 generic RTP   |            |    for     | E
   |  payload   |        payload          +++++++++++++   Video    | T
   |            |                                     |            | F
   +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

   Figure 2: Relationship of MPEG-4 RTP payload formats for the
   transport of video

   +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   |                                                               |
   |                         MPEG-4 Audio                          |
   |                                                               | I
   |+++++++++++++++++++++++|                                       | S
   |                       |                                       | O
   |      Sync Layer       |                                       | /
   |                       |                                       | M
   |+++++++++++++++++++++++|                          +++++++++++++| P
   |            |          |                          |            | E
   |  FlexMux   |          |                          |    LATM    | G
   |            |          |                          |            |
   |++++++++++++|          +++++++++++++++++++++++++++|++++++++++++|***
   |            |                        |            |            |
   |  FlexMux   |        RFC XXXX        |  RFC YYYY  |  RFC 3016  | I
   |    RTP     |   MPEG-4 generic RTP   |            |    for     | E
   |  payload   |        payload          +++++++++++++   Audio    | T
   |            |                                     |            | F
   +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

   Figure 3: Relationship of MPEG-4 RTP payload formats for the
   transport of audio

   ++++++++++++++++++++++++++++++++++++++++++++++++++++
   |                                                  |
   |                  MPEG-4 system                   |
   |                                                  | I
   |+++++++++++++++++++++++|                          | S
   |                       |                          | O
   |      Sync Layer       |                          | /
   |                       |                          | M
   |+++++++++++++++++++++++|                          | P
   |            |          |                          | E
   |  FlexMux   |          |                          | G
   |            |          |                          |
   |++++++++++++|          +++++++++++++++++++++++++++|***
   |            |                        |            |
   |  FlexMux   |        RFC XXXX        |  RFC YYYY  | I
   |    RTP     |   MPEG-4 generic RTP   |            | E
   |  payload   |        payload          ++++++++++++| T
   |            |                                     | F
   ++++++++++++++++++++++++++++++++++++++++++++++++++++

   Figure 4: Relationship of MPEG-4 RTP payload formats for the
   transport of MPEG-4 system streams (including BIFS, ODS, IPMP).

3. Payload Format

   One or more Access Units or Access Unit fragments (see section 3.7
   for fragmentation rules) are mapped into each RTP packet. Some of
   the information attached to these AUs or AU fragments is mapped onto
   the RTP header (see section 3.1); some forms an additional payload
   header. The resulting RTP payload is described in section 3.2; it is
   composed of 3 parts (see figure 5):

   . a Payload Header section (optional)
   . a RSLH (Remaining SL Header) section (optional)
   . a Payload Section.

   These are described respectively in sections 3.3, 3.4 and 3.5 of
   this memo.

   When transporting SL streams, SL Packet Headers are transformed into
   Remaining SL Headers (RSLH), with some fields extracted to be mapped
   in the RTP header and others extracted to be mapped in the
   corresponding Payload Header. The AU or AU fragment data (SL packet
   payload), i.e. Elementary Stream codec data, is unchanged.

   When transporting Elementary Streams there is no RSLH section.

   This payload format has two packing styles. The "Single" packing
   style is a packing style where a single AU or AU fragment is
   transported per RTP packet. The "Multiple" packing style is a
   packing style where possibly more than one AU or AU fragment is
   transported per RTP packet. The default packing style is the
   "Single" packing style.

   In the "Multiple" packing style, AUs or AU fragments MUST be in
   decoding order inside one RTP packet. Decoding order is defined by
   the relevant codec specification. Note that decoding order and
   presentation order may be different, typically for video streams
   containing B frames (see [2]). According to the MPEG-4 system model
   the decoding order may be quantified using decoding time stamps
   (DTS).

   RTP Packets SHOULD be sent in the decoding order. In case of
   interleaving the first AU or AU fragment of each RTP packet is used
   as reference, as in the following examples of RTP packets containing
   interleaved SL packets.

   This sequence is correct:    [0,2,4][1,3,5]
   This sequence is correct:    [0,3,6][1,2][4,5]
   This sequence is correct:    [0,3,6][1,4][2,5]
   This sequence is prohibited: [0,4,2][1,5,3]
   This sequence is prohibited: [1,3,5][0,2,4]
   This sequence is prohibited: [0,3,6][2,5][1,4]

   In the "Multiple" packing style the Payload Header and RSLH contain
   fields with relative values; they MUST have sufficient bits to
   encode the difference, i.e. senders MUST make sure that no field
   undergoes roll over inside one RTP packet. This may limit the number
   of SL packets inside one RTP packet and, when interleaving, may
   limit the interleaving period as detailed in section 3.6.

   The size and/or number of the payload(s) SHOULD be adjusted such
   that the resulting RTP packet is not larger than the path-MTU. To
   handle larger packets, this payload format relies on lower layers
   for fragmentation, which may not be desirable.

3.1 RTP Header Fields Usage

   Payload Type (PT): The assignment of an RTP payload type for this
   new packet format is outside the scope of this document, and will
   not be specified here. It is expected that the RTP profile for a
   particular class of applications will assign a payload type for this
   encoding, or if that is not done then a payload type in the dynamic
   range shall be chosen.

   Marker (M) bit: The M bit is set to 1 when all AU fragments in the
   RTP packet are Access Unit ends. Specifically the M bit is set to 0
   when the RTP packet contains one or more AU fragments that are not
   Access Unit ends, and the M bit is set to 1 for RTP packets that
   contain either:

   . A single complete Access Unit
   . The last fragment of an Access Unit
   . Several complete Access Units
   . Several last fragments of Access Units
   . A mix of complete Access Units and last fragments of Access Units

   Therefore for streams where all SL packets are complete Access Units
   the M bit is 1 for all RTP packets.
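
   As an illustration only, the M bit rule above can be sketched in
   Python. The AUFragment type and its is_au_end field are hypothetical
   names introduced for this sketch (they are not defined by this
   specification); is_au_end is assumed to be true for a complete
   Access Unit and for the last fragment of an Access Unit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AUFragment:
    """One AU or AU fragment carried in an RTP packet (hypothetical helper type)."""
    data: bytes
    is_au_end: bool  # True for a complete AU or for the last fragment of an AU


def marker_bit(fragments: Sequence[AUFragment]) -> int:
    """Return 1 only if every AU or AU fragment in the packet ends an Access Unit."""
    if not fragments:
        raise ValueError("an RTP packet carries at least one AU or AU fragment")
    return 1 if all(f.is_au_end for f in fragments) else 0
```

   With this sketch a packet holding only complete Access Units and
   last fragments yields 1, while a single non-final fragment anywhere
   in the packet yields 0.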
Note also that in terms of Sync Layer this means that the M bit is related to the accessUnitEndFlag. Extension (X) bit: Defined by the RTP profile used. Sequence Number: The RTP sequence number should be generated by the sender with a constant random offset. Timestamp: Set to a value corresponding to the compositionTimeStamp (CTS) of the first AU or AU fragment in the RTP packet. This mapping is established as follows: If CTS has less than 32 bits length, the RTP timestamp is generated to extend it out to 32 bits using the number of wraparounds. If CTS has more than 32 bits length, the RTP timestamp uses the 32 LSB of it. When using the Sync Layer the resolution of the timestamp (timeStampLength) is available from the SL configuration data and shall be used by receivers to reconstruct CTS with the original bit length. It is RECOMMENDED to use timeStampLength=32. When an RTP packet starts with a non-initial AU fragment, the timestamp of the initial fragment SHALL be used. For SL streams where CTS is never present the RTP packetizer SHOULD convey a reading of a local clock at the time the RTP packet is created. Note that since, according to RFC1889 [5, Section 5.1], timestamps are recommended to start at a random value, a receiver is not in the general case able to reconstruct the original MPEG-4 Time Stamps (CTS, DTS, OCR). This is not an issue for synchronization of multiple RTP streams. However, applications where streams from multiple sources are to be synchronized (for example one stream from local storage, another from a RTP streaming server) may have to transport out of band the random offset used to map CTS into RTP timestamp, which is not in the scope of this specification. Note also that since RTP devices may re-stamp the stream, all time stamps inside of the RTP payload (CTS and DTS in the Payload Header, OCR in RSLH) MUST be expressed as difference to the RTP time stamp. 
Since this subtraction may lead to negative values, the offset MUST be encoded as a two's complement signed integer in network octet order. Note these offsets (delta) typically require much fewer bits to be encoded than the original length. Nevertheless senders MUST make sure that these fields have enough bits to encode these differences.

When startCompositionTimeStamp is signaled in the SLConfigDescriptor the RTP time stamps MUST start with this value.

SSRC, CC and CSRC fields are used as described in RFC 1889 [5].

RTCP SHOULD be used as defined in RFC 1889 [5].

3.2 RTP payload structure

The packet payload structure consists of 3 octet-aligned sections.

The first section is the Payload Header Section and contains Payload Headers. Each Payload Header contains basic fragmentation and timing information (relative to the RTP timestamp) for one AU or AU fragment. The Payload Header structure is described in 3.3. In the "Single" packing style this section is empty by default.

The second section is the RSLH Section and contains Remaining SL Headers (RSLH). The RSLH structure is described in 3.4. By default this section is empty.

The last section (Payload Section) contains the AU or AU fragment codec bit stream fragments and is described in section 3.5. This section is never empty.

The Nth Payload Header in the Payload Header Section, the Nth RSLH in the RSLH Section and the Nth AU or AU fragment payload in the Payload Section correspond to the Nth AU or AU fragment transported by the RTP packet.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:            contributing source (CSRC) identifiers             :
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|                                                               |
|            Payload Header Section (octet aligned)             |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|                 RSLH Section (octet aligned)                  |
|                                                               |
|               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                                                               |
|                Payload Section (octet aligned)                |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 5: RTP packet for MPEG-4

3.3 Payload Header Section structure

If the Payload Header Section consumes a non-integer number of octets, up to 7 zero-valued padding bits MUST be inserted at the end in order to achieve octet-alignment.

In the "Single" packing style the Payload Header Section consists of a single Payload Header.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Payload Header (x bits)         : padding bits|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 6: Payload Header Section structure in "Single" packing style

In the "Multiple" packing style the Payload Header Section consists of a 2-octet field giving the size in bits (in network octet order) of the following block of bit-wise concatenated Payload Headers. This size excludes the padding bits, if any.
This size field is absent in the "Single" packing style not because it is not needed (which would be a minor gain) but for compatibility with RFC 3016. This size field is also absent when the value would always be zero because the Payload Header is always empty, which happens when a constant payload size is signaled using ConstantSize (see below).

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Payload Header section size  |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|         as many bit-wise concatenated Payload Headers         |
|           as AU or AU fragments in this RTP packet            |
|                                               +-+-+-+-+-+-+-+-+
|                                               : padding bits  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7: Payload Header Section structure in "Multiple" packing style

3.3.1 Payload Header structure

The Payload Header content depends on parameters (as described in section 4.1). By default it is empty for the "Single" packing style and, in the "Multiple" packing style, contains at least the PayloadSize field, except when ConstantSize is signaled. When all options are used the Payload Header structure and the relationship with the related parameters is given in table 1.

+===========================+=================================+
| Fields of Payload Header  | Number of bits (parameters)     |
+===========================+=================================+
| PayloadSize               | SizeLength                      |
+---------------------------+---------------------------------+
| Index                     | IndexLength                     |
+---------------------------+---------------------------------+
| IndexDelta                | IndexDeltaLength                |
+---------------------------+---------------------------------+
| CTSFlag                   | 1 If (CTSDeltaLength > 0)       |
+---------------------------+---------------------------------+
| CTSDelta                  | CTSDeltaLength If (CTSFlag==1)  |
+---------------------------+---------------------------------+
| DTSFlag                   | 1 If (DTSDeltaLength > 0)       |
+---------------------------+---------------------------------+
| DTSDelta                  | DTSDeltaLength If (DTSFlag==1)  |
+---------------------------+---------------------------------+

Table 1: Payload Header fields and parameters giving the sizes

In the general case a receiver can only discover the size of a Payload Header by parsing it, since for example the presence of CTSDelta is signaled by the value of CTSFlag.

3.3.2 Fields of a Payload Header

PayloadSize: Indicates the size in octets of the associated Payload, which can be found in the Payload Section of the RTP packet. The length in bits of this field is signaled by the SizeLength parameter (see section 4.1). There is one exception: in the "Multiple" packing style, when an RTP packet contains only one AU or AU fragment, the PayloadSize field SHALL contain the size of the entire corresponding AU. There are two reasons for this. Firstly, the size of the fragment is not needed when there is only one fragment in the RTP packet. Secondly, the full AU size makes it possible to detect whether a full Access Unit has been received after the loss of a packet carrying an M bit set to 1.
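
As an illustration of why a receiver has to parse each Payload Header to find its size, the following sketch walks a Payload Header Section in the "Multiple" packing style, following the field order of table 1. It is a non-normative sketch under stated assumptions: the BitReader class, the function names and the dictionary-based result are hypothetical, the 2-octet size field is assumed to be present (i.e. ConstantSize is not in use), and Index is read in the first Payload Header and IndexDelta in the following ones.

```python
class BitReader:
    """MSB-first bit reader over a bytes object (hypothetical helper)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value


def to_signed(value: int, nbits: int) -> int:
    """Interpret an nbits-wide field as a two's complement integer."""
    return value - (1 << nbits) if value >= 1 << (nbits - 1) else value


def parse_payload_header_section(payload: bytes, params: dict):
    """Return (list of Payload Headers, octets consumed by the section)."""
    reader = BitReader(payload)
    end = 16 + reader.read(16)  # size field counts bits, padding excluded
    headers = []
    while reader.pos < end:
        start = reader.pos
        header = {}
        if params.get("SizeLength", 0):
            header["PayloadSize"] = reader.read(params["SizeLength"])
        if not headers and params.get("IndexLength", 0):
            header["Index"] = reader.read(params["IndexLength"])
        if headers and params.get("IndexDeltaLength", 0):
            header["IndexDelta"] = reader.read(params["IndexDeltaLength"])
        if params.get("CTSDeltaLength", 0) and reader.read(1):  # CTSFlag
            n = params["CTSDeltaLength"]
            header["CTSDelta"] = to_signed(reader.read(n), n)
        if params.get("DTSDeltaLength", 0) and reader.read(1):  # DTSFlag
            header["DTSDelta"] = reader.read(params["DTSDeltaLength"])
        if reader.pos == start:
            raise ValueError("non-empty section but all field lengths are zero")
        headers.append(header)
    return headers, (end + 7) // 8  # skip padding up to the octet boundary
```

The second return value is the offset at which the next section (RSLH Section or Payload Section) starts within the RTP payload.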

Index, IndexDelta: Encodes the serial number of the associated AU or AU fragment. IndexDelta is useful for interleaving (see section 3.6). When transporting a SL stream, Index and IndexDelta SHALL be used to encode the packetSequenceNumber field of the SL Packet Header, if present.

Index is optional and, if present, appears in the first Payload Header of an RTP packet. The length in bits of the Index field is defined by the IndexLength parameter (see section 4.1).

IndexDelta is optional and, if present, appears for subsequent (non-first) Payload Headers of an RTP packet. The length in bits of the IndexDelta field is defined by the IndexDeltaLength parameter (see section 4.1).

Both Index and IndexDelta MUST be incremented so that 2 consecutive AU or AU fragments SHALL be distinguishable. One exception for Index is described in 3.6.1.

If the parameter IndexDeltaLength is defined, non-first AU or AU fragments inside an RTP packet have their serial number encoded as a difference (thus the name IndexDelta). IndexDelta MUST have sufficient bits to encode this difference. This difference is relative to the previous AU or AU fragment in the RTP packet according to (with i>=0):

      Serial number(0)   = Index(0)
      Serial number(i+1) = Serial number(i) + IndexDelta(i+1) + 1

If the parameter IndexDeltaLength is not defined the default value is zero and then the IndexDelta field is not present for non-first AU or AU fragments. Nevertheless receivers SHALL then apply the above formula with IndexDelta equal to zero. In other words by default the serial number is incremented by 1 for each AU or AU fragment in the RTP packet.

CTSFlag (1 bit): Indicates whether the CTSDelta field is present. A value of 1 indicates that the CTSDelta field is present, a value of 0 that it is not present.
If CTSDeltaLength is not zero, CTSFlag is present in all Payload Headers regardless of whether the AU fragment is an Access Unit start or not.

CTSDelta (CTSDeltaLength bits): Specifies the value of the CTS as a two's complement offset (delta) from the timestamp in the RTP header of the RTP packet. The length in bits of each CTSDelta field is specified by the CTSDeltaLength parameter (see section 4.1). CTSDelta MUST have sufficient bits to encode this difference.

The CTSDelta field is present if CTSFlag is 1. For the first Payload Header of each RTP packet CTSFlag is always 0, since the composition time stamp of the first AU or AU fragment in the RTP packet is mapped to the RTP time stamp. When using the Sync Layer the sender MUST remove the compositionTimeStamp from the RSLH.

Senders MUST close an RTP packet before adding an AU or AU fragment for which CTSDelta would roll over, since a roll-over would prevent the receiver from reconstructing the correct CTS. This can result in sub-optimal RTP packets (smaller than the MTU) depending on the MTU, the AU or AU fragment sizes and CTSDeltaLength.

DTSFlag (1 bit): Indicates whether the DTSDelta field is present. A value of 1 indicates that DTSDelta is present, a value of 0 that it is not present.

If DTSDeltaLength is not zero, DTSFlag is present in all Payload Headers regardless of whether the AU fragment is an Access Unit start or not. When transporting SL streams the receiver needs this flag in order to reconstruct the decodingTimeStampFlag of SL Packet Headers.

DTSDelta (DTSDeltaLength bits): Encodes (compositionTimeStamp - decodingTimeStamp) for the same AU or AU fragment (always positive). The length in bits of each DTSDelta field is specified by the DTSDeltaLength parameter (see section 4.1). Senders MUST make sure that DTSDeltaLength is large enough to encode the difference between CTS and DTS (otherwise the DTS computed by the receiver would be incorrect).
The DTSDelta field appears when DTSFlag is 1. The sender MUST always remove the decodingTimeStamp from the RSLH. If DTSDelta is zero, i.e. if decodingTimeStamp equals compositionTimeStamp, then DTSFlag MUST be set to 0 and no DTSDelta field SHALL be present.

3.4 RSLHSection structure

This section is present only when using the Sync Layer, and then only when the rules in the previous section have left remaining fields.

This section first consists of a field (RSLHSectionSize) giving the size in bits of the following block of bit-wise concatenated RSLHs (this size does not include padding bits).

If the section consumes a non-integer number of octets, up to 7 zero padding bits MUST be inserted at the end in order to achieve octet-alignment.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RSLHSectionSize (RSLHSectionSizeLength bits)| RSLH (variable  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 |
| number of bits)                                               |
|                                                   +-+-+-+-+-+-+
|                                                   |           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+           |
|                RSLH (variable number of bits)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              etc                              |
|              as many bit-wise concatenated RSLHs              |
|              as SL Packets in this RTP packet                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                RSLH (variable number of bits)                 |
|                                               +-+-+-+-+-+-+-+-+
|                                               : padding bits  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 8: RSLHSection structure

The length in bits of the RSLHSectionSize field is RSLHSectionSizeLength and is specified with a default value of zero indicating that the whole RSLHSection is absent. Note that for compatibility with RFC 3016 we need to be able to make the RSLHSection disappear completely, including the RSLHSectionSize field.
This is the reason why there is such a variable length with a zero default value indicating the absence of the RSLHSectionSize field.

+=================================+===============================+
| Fields of RSLHSection           | Number of bits                |
+=================================+===============================+
| RSLHSectionSize                 | RSLHSectionSizeLength         |
+---------------------------------+-------------------------------+
| all bit-wise concatenated RSLHs | RSLHSectionSize               |
+---------------------------------+-------------------------------+

Table 2: Sizes in bits inside RSLHSection

Parsing of the bit-wise concatenated RSLHs requires MPEG-4 system awareness. Specifically it requires understanding the MPEG-4 Sync Layer (SL) syntax and the modifications to this syntax described in the next section. However, thanks to the RSLHSectionSize field, non-MPEG-4-system receivers can skip this part by rounding up RSLHSectionSize/8 to the next integer number of octets. This means that receivers not implementing the Sync Layer can process streams containing Sync Layer specific items by simply ignoring the parts they would not be able to parse.

3.4.1 RSLH structure

RSLH is present only when using the Sync Layer, and then only when the rules in the previous section have left remaining fields.

A Remaining SL Packet Header (RSLH) is what remains of an SL header after modifications for mapping into this payload format.

The following modifications of the SL Packet Header MUST be applied. The other fields of the SL Packet Header MUST remain unchanged but are bit-shifted to fill in the gaps left by the operations specified below.

3.4.2 Removal of fields

The following SL Packet Header fields, if present, are removed since they are mapped either in the RTP header or in the corresponding Payload Header:

   . compositionTimeStampFlag
   . compositionTimeStamp
   . decodingTimeStampFlag
   . decodingTimeStamp
   . packetSequenceNumber
   . AccessUnitEndFlag (in "Single" packing style only)

The AccessUnitEndFlag, when present for a given stream, MUST be removed from every RSLH when using the "Single" packing style since it has the same meaning as the Marker bit (and for compatibility with RFC 3016).

However when using the "Multiple" packing style, AccessUnitEndFlag MUST NOT be removed since it is useful to signal individual AU ends.

3.4.3 Mapping of OCR

Furthermore if the SL Packet header contains an OCR, then this field is encoded in the RSLH as a two's complement difference (delta) exactly like a compositionTimeStamp or a decodingTimeStamp in the Payload Header. The length in bits of this difference is indicated by the OCRDeltaLength parameter (see section 4.1).

With this payload format OCRs MUST have the same clock frequency as Time Stamps.

If compositionTimeStamp is not present for a SL packet that has OCR then the OCR SHALL be encoded as a difference to the RTP time stamp.

3.4.4 Degradation Priority

For streams that use the optional degradationPriority field in the SL Packet Headers, only SL packets with the same degradation priority SHALL be transported by one RTP packet so that components may dispatch the RTP packets according to appropriate QoS or protection schemes. Furthermore only the first RSLH of one RTP packet SHALL contain the degradationPriority field since it would be otherwise redundant.

3.5 Payload Section structure

The Payload Section contains the concatenated AU or AU fragment Payloads. By definition AU or AU fragment Payloads are octet aligned.

For efficiency SL packets do not carry their own payload size. This is not an issue for RTP packets that contain a single SL Packet. However in the "Multiple" packing style the size of each AU or AU fragment payload MUST be available to the receiver.

If the AU or AU fragment payload size is constant for a stream, the size information SHOULD NOT be transported in the RTP packet.
However in that case it MUST be signaled using the ConstantSize parameter (see section 4.1).

If the AU or AU fragment payload size is variable then the size of each AU or AU fragment payload MUST be indicated in the corresponding Payload Header. In order to do so the Payload Header MUST contain a PayloadSize field. The number of bits on which this PayloadSize field is encoded MUST be indicated using the SizeLength parameter (see section 4.1).

The absence of either ConstantSize or SizeLength indicates the "Single" packing style, i.e. that a single AU or AU fragment is transported in each RTP packet for that stream.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         AU or AU fragment (variable number of octets)         |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |       AU or AU fragment       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                  (variable number of octets)                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              etc                              |
|       as many octet-wise concatenated AU or AU fragments      |
|       as required to finish RTP packet                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 9: Payload Section structure

3.6 Interleaving

SL Packets MAY be interleaved. Senders MAY perform interleaving. Receivers MUST support interleaving. Additional specifications MAY restrict this support by explicit signaling (see for example RFCYYYY).

Note for Sync Layer implementers: the AUSequenceNumber field of the SL Header MUST NOT be used for interleaving since firstly it may collide with the Scene Description Carousel usage described in section 6.2 and secondly it is not visible to receivers that do not implement the Sync Layer and would skip the RSLH section transporting AUSequenceNumber.

When interleaving of AU or AU fragments is used it SHALL be implemented using the IndexDelta fields of the Payload Header.
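
The use of IndexDelta for interleaving can be sketched with the serial number rule of section 3.3.2, applied to the example sequences given at the start of section 3. This is a hedged illustration only: the function names are hypothetical, Index is taken to be the serial number of the first AU or AU fragment of the packet, and the range check simply treats IndexDelta as an unsigned field of IndexDeltaLength bits.

```python
from typing import List, Tuple


def serial_numbers(index: int, index_deltas: List[int]) -> List[int]:
    """Receiver side: rebuild serial numbers inside one RTP packet."""
    serials = [index]
    for delta in index_deltas:
        serials.append(serials[-1] + delta + 1)
    return serials


def index_fields(serials: List[int], index_delta_length: int) -> Tuple[int, List[int]]:
    """Sender side: derive Index and the IndexDelta values for one packet.

    Raises ValueError when the packet is not in increasing order or when a
    difference does not fit in index_delta_length bits (roll-over).
    """
    deltas = []
    for previous, current in zip(serials, serials[1:]):
        delta = current - previous - 1
        if delta < 0 or delta >= (1 << index_delta_length):
            raise ValueError("IndexDelta cannot encode %d -> %d" % (previous, current))
        deltas.append(delta)
    return serials[0], deltas
```

For example, the packet [0,3,6] maps to Index 0 and two IndexDelta values of 2, while the prohibited packet [0,4,2] is rejected. With IndexDeltaLength equal to zero only consecutive serial numbers are accepted, which matches the default behaviour described in section 3.3.2.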
Senders MUST NOT make RTP packets for which IndexDelta rolls over. Therefore depending on the interleaving scheme (if any), the MTU and the AU or AU fragment sizes, senders wishing to make optimally sized RTP packets (i.e. close to the MTU) will need to set IndexDeltaLength to a sufficiently large value.

Senders SHOULD use non-zero values of IndexDeltaLength only for streams that exhibit interleaving, so that this can be interpreted by receivers as an indication that interleaving may be present.

There are, based on this, two ways for a receiver to implement de-interleaving:

   . Time-Stamp-Based-Interleaving (TSBI, see section 3.6.1) uses IndexDelta and timestamps.
   . Index-Based-Interleaving (IBI, see section 3.6.2) uses IndexDelta and Index.

This is signaled using MIME parameters as in the following table. Note that the need for two methods arises from two facts. Firstly, the time stamp based method is more economical and in basic cases (no multiple AU fragments, CTS always defined) simpler to implement. Secondly, this method unfortunately does not always work, as explained below.

==================================================================
|                | IndexDeltaLength = 0 | IndexDeltaLength != 0  |
------------------------------------------------------------------
| IndexLength=0  | no interleaving      | TSBI                   |
------------------------------------------------------------------
| IndexLength!=0 | no interleaving,     | Index=0   | Index!=0   |
|                | SL.packetSeqNum      |------------------------|
|                | transport            | TSBI      | IBI        |
==================================================================

3.6.1 Time stamp based interleaving (TSBI)

The conjunction of RTP time stamp, IndexDelta and CTS may allow a receiver to unambiguously re-order AU or AU fragments based on their time stamps (CTS).
This is possible and efficient for streams where only complete Access Units are transported and receivers can always compute the time stamp of each Access Unit.

In case of Access Units of constant duration (e.g. audio streams) the explicit presence of CTS in the Payload Header is not even required. Indeed then we have (i being the index of one AU in one RTP packet):

      CTS(0) = RTP-TS
      for (i >= 1): CTS(i) = CTS(i-1) + (IndexDelta(i)+1)*AU_duration

AU_duration, when constant, can be either signaled in SLConfig or be deduced from the decoder configuration (see the "Config" MIME parameter).

Senders MUST use either IndexLength=0 or set all Index values in all packets to zero so that receivers can detect this as an indication that de-interleaving SHOULD be performed using time stamps.

When using the Sync Layer and when interleaving, senders MUST use for SL.timeStampLength values large enough to prevent the CTS from rolling over more often than a packet loss burst length. Pre-existing SL streams that do not comply with this requirement cannot be interleaved using this payload format (or by using IBI as in 3.6.2).

3.6.2 Index based interleaving (IBI)

The timestamp-based interleaving algorithm described in the previous section does not work when a CTS cannot always be computed for all AU or AU fragments (for example after a packet loss). This happens:

   . If the AU duration is not constant (SL durationFlag = 0) and CTS is not signaled (SL useTimeStampsFlag = 0).
   . When interleaving AU fragments.

When interleaving, senders of such streams MUST use the index-based technique described in this section.

The conjunction of RTP sequence number, Index and IndexDelta can produce a quasi-unique identifier for each AU or AU fragment so that a receiver can unambiguously reconstruct the original order even in case of out-of-order packets, packet loss or duplication (see the pseudo code in 3.3.2 and 6.1).
Specifically the RTP sequence number is used to re-order packets and inside one RTP packet we have (with i>=0):

      Serial number(0)   = Index(0)
      Serial number(i+1) = Serial number(i) + IndexDelta(i+1) + 1

This requires, however, that IndexLength is not too small. For that reason senders when interleaving in this fashion MUST use for IndexLength values large enough to prevent Index from rolling over more often than a typical loss burst length. Pre-existing SL streams that do not comply with this requirement (specifically if SL.packetSeqNumLength is too small) cannot be interleaved using this payload format (or should use TSBI).

Receivers SHOULD interpret non-zero values in the Index field as an indication that de-interleaving can be performed using Index and IndexDelta but cannot be performed using timestamps.

3.6.3 SL streams that should not be interleaved

SL streams for which both SL.timeStampLength and SL.packetSeqNumLength are too small SHOULD NOT be interleaved with this payload format, the reason being that small values would cause a receiver to drop a large part of the stream in case of packet loss. The actual minimal length depends on network loss properties and on the expected quality of service.

3.7 Fragmentation Rules

MPEG-4 Access Units are the default fragments for MPEG-4 bitstreams and SHOULD be mapped directly into RTP packets of this format with two exceptions:

   - Access Units larger than the MTU
   - When using interleaving for better packet loss resilience.

This section gives rules to apply when performing Access Unit fragmentation. Let us first explain the context before describing the rules.

For error resilience purposes some MPEG-4 codecs define optional syntax of Access Unit fragments that are independently decodable. Examples are Video Packets for video and Error Sensitivity Categories (ESC) for audio.
This always corresponds to specific bitstream syntax, which is signaled in the DecoderSpecificInfo inside the DecoderConfig in SLConfig, and/or using the corresponding parameters as described in section 4.1. Thanks to that, decoders are aware whether encoders are operating in such a mode or not (however since this codec configuration is an opaque data block this is not explicitly signaled by this payload format). If not operating in such a mode it is obvious that the decoder has to skip packets after a loss until an Access Unit start is received. Similarly decoder implementations that do not implement robust decoding of Access Units fragments have to discard all packets after a packet loss until an Access Unit start is received. In the same way decoder implementations that do not implement re-synchronization at any Access Units start have to discard all packets after a packet loss until a Random Access Point Access Unit is received. These are all obvious things that a good implementation would do. However serious problems would arise for decoder implementations that try to restart decoding after a packet loss if independently decodable fragments are signaled (in the decoder configuration) but the fragments actually received are not independently decodable because the RTP sender has made RTP packets on different boundaries than the fragments provided by the encoder (so this issue applies to the interface between the encoder and the RTP sender and to the RTP sender component itself). Indeed the decoder has in general no way to detect such a faulty fragment (except for MPEG-4 video). For this reason the following rules must be applied: In the spirit of ALF this payload format should transport either complete Access Units or fragments of Access Units that are independently decodable. Specifically when a given codec has an independently decodable Access Unit fragments optional syntax this option SHOULD be used. 
Independently decodable Access Unit fragments SHOULD NOT be split across several RTP packets.

An MPEG-4 audio stream encoded using the ESC syntax MUST NOT split one ESC across 2 RTP packets.

When using MPEG-4 Video Packets, since all Video Packets start with a specific resynchronization marker that can be unambiguously detected, this rule is not needed. However it is strongly RECOMMENDED to always adapt the Video Packet size to fit the MTU. In any case a video AU or AU fragment start MUST always be aligned with either:

   . a VOP start.
   . a Video Packet start.
   . or a GOV followed by the first (or only) Video Packet of the following VOP.

4. Types and Names

This section describes the MIME types and names associated with this payload format. Section 4.1 registers the MIME types, as per RFC 2048.

This format may require additional information about the mapping to be made available to the receiver. This is done using parameters described in the next section. The absence of any of these fields is equivalent to a field set to the default value, which is always zero for numerical parameters. The absence of any such parameters resolves into a default "basic" configuration compatible with RFC3016 for MPEG-4 video.

In the MPEG-4 framework the SL stream configuration information is carried using the Object Descriptor. For compatibility with receivers that do not implement the full MPEG-4 system specification this information MAY also be signaled using parameters described here. When such information is present both in an Object Descriptor and as a parameter of this payload format it MUST be exactly the same.

For transport of MPEG-4 audio and video without the use of MPEG-4 systems, as well as to support non-MPEG-4 system receivers, it is also possible to transport information on the profile and level of the stream and on the decoder configuration. This is also described in the next section.
Finally this MIME type also defines a mode parameter and a profile parameter that are intended for derivations of this payload format. One such derivation is described in the companion RFC YYYY.

4.1 MIME type registration

MIME media type name: "video" or "audio" or "application"

"video" MUST be used for MPEG-4 Visual streams (i.e. video as defined in ISO/IEC 14496-2 (Streamtype = 4) and/or graphics as defined in ISO/IEC 14496-1 (Streamtype = 3)) or MPEG-4 Systems streams that convey information needed for an audio/visual presentation.

"audio" MUST be used for MPEG-4 Audio streams (ISO/IEC 14496-3 (Streamtype = 5)) or MPEG-4 Systems streams that convey information needed for an audio only presentation.

"application" MUST be used for MPEG-4 Systems streams (ISO/IEC 14496-1 (all other StreamType values)) that serve other purposes than audio/visual presentation, e.g. in some cases when MPEG-J streams are transmitted.

MIME subtype name: mpeg4-generic

Required parameters: none

Optional parameters:

mode: The mode in which this specification is used. This specification itself defines only the default mode (Mode=default). When the mode parameter is not present the default mode SHALL be assumed. In the default mode all parameters are OPTIONAL and as defined here. Other modes may be defined as needed in other RFCs. A mode MUST be a subset of this specification. Specifically when defining a mode care MUST be taken that an implementation of this specification can decode the payload format corresponding to this new mode. For this reason a mode MUST NOT specify new default values for MIME parameters, and MIME parameters MUST be present (unless they have the default value) even if this is redundant because the mode assigns fixed values.
A mode may define additionally that some MIME parameters are required instead of optional, that some MIME parameters have fixed values (or ranges), and that there are rules restricting the usage (for example RFCYYYY forbids the carriage of multiple AU fragments in the same RTP packet and, logically, uses only TSBI interleaving).

profile: The meaning of this parameter may be defined by a mode. This is meant to be used in order to define sub-configurations of a given mode, for example the maximum delay (and therefore the size of buffers) induced by the usage of interleaving. Implementations of this specification can ignore this parameter.

DTSDeltaLength: The number of bits on which the DTSDelta field is encoded in each Payload Header. The default value is zero and indicates the absence of DTSFlag and DTSDelta in the Payload Header (the stream does not transport decodingTimeStamps). A value larger than zero indicates that there is a DTSFlag in each Payload Header. Since decodingTimeStamp, if present, must be encoded as a difference to the RTP time stamp, the DTSDeltaLength parameter MUST be present in order to transport decodingTimeStamps with this payload format.

CTSDeltaLength: The number of bits on which the CTSDelta field is encoded. The default value is zero and indicates the absence of the CTSFlag and CTSDelta fields in the Payload Header. Non-zero values MUST NOT be signaled in the "Single" packing style. Since compositionTimeStamps, if present, must be encoded as a difference to the RTP time stamp, the CTSDeltaLength parameter MUST be present in order to transport compositionTimeStamps using this payload format (in the "Multiple" packing style). However CTSDeltaLength SHOULD be set to zero (or not signaled) for streams that have a constant Access Unit duration (which can be explicitly signaled using the DurationFlag and AccessUnitDuration field of SLConfigDescriptor).
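
The role of CTSDeltaLength and DTSDeltaLength can be illustrated with a short sketch of the time stamp arithmetic described in sections 3.1 and 3.3.2: CTSDelta is a two's complement offset from the RTP time stamp, and DTSDelta encodes CTS minus DTS. The function names are hypothetical, and reducing results modulo 2^32 is an assumption made here to keep values within the 32-bit RTP time stamp range; a receiver that restores the original SL time stamp length would apply its own reduction.

```python
def encode_delta(delta: int, nbits: int) -> int:
    """Sender side: two's complement field of nbits, refusing roll-over."""
    low, high = -(1 << (nbits - 1)), (1 << (nbits - 1)) - 1
    if not low <= delta <= high:
        raise OverflowError("delta %d does not fit in %d bits" % (delta, nbits))
    return delta & ((1 << nbits) - 1)


def decode_delta(field: int, nbits: int) -> int:
    """Receiver side: interpret an nbits-wide field as a signed offset."""
    return field - (1 << nbits) if field >= 1 << (nbits - 1) else field


def reconstruct_cts(rtp_timestamp: int, cts_delta_field: int, cts_delta_length: int) -> int:
    """CTS = RTP time stamp + CTSDelta (assumed here to wrap at 32 bits)."""
    return (rtp_timestamp + decode_delta(cts_delta_field, cts_delta_length)) % (1 << 32)


def reconstruct_dts(cts: int, dts_delta: int) -> int:
    """DTS = CTS - DTSDelta, since DTSDelta encodes (CTS - DTS)."""
    return (cts - dts_delta) % (1 << 32)
```

A sender would call encode_delta before adding an AU to a packet and start a new RTP packet when it raises, which is one way to honour the rule that the delta fields must not roll over.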
OCRDeltaLength:
   The number of bits on which the OCRDelta field is encoded in RSLH.
   The default value is zero and indicates the absence of OCR for
   this stream. Since objectClockReference, if present, must be
   encoded as a difference to the RTP time stamp, the OCRDeltaLength
   parameter MUST be present in order to transport
   objectClockReferences with this payload format.

SizeLength:
   The number of bits on which the PayloadSize field of a Payload
   Header is encoded. The default value is zero and indicates the
   "Single" packing style (unless ConstantSize is present).
   Simultaneous presence of this parameter and ConstantSize is
   illegal. Either the SizeLength or ConstantSize parameter MUST be
   present in order to signal the "Multiple" packing style of this
   payload format.

ConstantSize:
   The constant size in octets of each AU or AU fragment Payload for
   this stream. The default value is zero and indicates variable AU
   or AU fragment Payload size (or the "Single" packing style if
   SizeLength is absent). Simultaneous presence of this parameter and
   SizeLength is illegal. Either the SizeLength or ConstantSize
   parameter MUST be present in order to signal the "Multiple"
   packing style of this payload format. When ConstantSize is present
   the PayloadSize field of the Payload Header in the RTP packets
   MUST NOT be present.

IndexLength:
   The number of bits on which the Index is encoded in the first
   Payload Header of an RTP packet. The default value is zero and
   indicates the absence of Index and IndexDelta for all Payload
   Headers. Since SL.packetSequenceNumber, if present, must be mapped
   in the Payload Header, the IndexLength parameter MUST be present
   in order to transport SL.packetSequenceNumber with this payload
   format.

IndexDeltaLength:
   The number of bits on which the IndexDelta is encoded in any
   non-first Payload Header. The default value is zero and indicates
   that the serial number MUST be incremented by one for each AU or
   AU fragment in the RTP packet (see section 3.5). A non-zero
   IndexDeltaLength parameter MUST be present when using interleaving
   with this payload format.

RSLHSectionSizeLength:
   The number of bits that is used to encode the RSLHSectionSize
   field. The default value is zero and indicates the absence of the
   whole RSLHSection for all RTP packets of this stream.

SLConfigDescriptor:
   A base-64 encoding of the SLConfigDescriptor. This SHALL be the
   original SLConfigDescriptor and it SHALL be the same as the one
   transported by the OD framework, if any.

profile-level-id:
   A decimal representation of the MPEG-4 Profile Level indication
   value. For audio this parameter indicates which MPEG-4 Audio tool
   subsets are applied to encode the audio stream and is defined in
   ISO/IEC 14496-1 [1]. For video this parameter indicates which
   MPEG-4 Visual tool subsets are applied to encode the video stream
   and is defined in Table G-1 of ISO/IEC 14496-2 [2]. This parameter
   MAY be used in the capability exchange or session setup procedure
   to indicate the MPEG-4 Profile and Level combination of which the
   relevant MPEG-4 media codec is capable. If this parameter is not
   specified its default value is 1 (Simple Profile/Level 1) for
   video (for compatibility with RFC 3016) and otherwise 254 (0xFE
   being defined in ISO/IEC 14496-1 [1] as the generic default
   value).

config:
   A hexadecimal representation of an octet string that expresses the
   media payload configuration. Configuration data is mapped onto the
   octet string on an MSB-first basis. The first bit of the
   configuration data SHALL be located at the MSB of the first octet.
   In the last octet, zero-valued padding bits, if necessary, shall
   follow the configuration data.
   For audio streams, config is the audio object type specific
   decoder configuration data AudioSpecificConfig() as defined in
   ISO/IEC 14496-3 [3].
   For video this expresses the MPEG-4 Visual configuration
   information, as defined in subclause 6.2.1 Start codes of ISO/IEC
   14496-2 [2]. The configuration information indicated by this
   parameter SHALL be the same as the configuration information in
   the corresponding MPEG-4 Visual stream, except for
   first-half-vbv-occupancy and latter-half-vbv-occupancy, if they
   exist, which may vary in the repeated configuration information
   inside an MPEG-4 Visual stream (see 6.2.1 Start codes of ISO/IEC
   14496-2).

StreamType:
   The integer value that indicates the type of MPEG-4 stream that is
   carried; its coding corresponds to the values of the streamType as
   defined for the DecoderConfigDescriptor in ISO/IEC 14496-1.

Encoding considerations:
   System bitstreams MUST be generated according to MPEG-4 System
   specifications (ISO/IEC 14496-1). Video bitstreams MUST be
   generated according to MPEG-4 Visual specifications (ISO/IEC
   14496-2). Audio bitstreams MUST be generated according to MPEG-4
   Audio specifications (ISO/IEC 14496-3). If the Sync Layer is used,
   SL streams MUST be generated according to MPEG-4 Sync Layer
   specifications (ISO/IEC 14496-1 section 10); in that case the
   SLConfigDescriptor is required in order to read the RSLH parts of
   this format. These bitstreams are binary data and MUST be encoded
   for non-binary transport (for Email, the Base64 encoding is
   sufficient). This type is also defined for transfer via RTP. The
   RTP packets MUST be packetized according to the RTP payload format
   defined in RFC XXXX.

Security considerations:
   As in RFC XXXX.

Interoperability considerations:
   MPEG-4 provides a large and rich set of tools for the coding of
   visual objects. For effective implementation of the standard,
   subsets of the MPEG-4 tool sets have been provided for use in
   specific applications. These subsets, called "Profiles", limit the
   size of the tool set a decoder is required to implement. In order
   to restrict computational complexity, one or more "Levels" are set
   for each Profile. A Profile@Level combination allows:

   - A codec builder to implement only the needed subset of the
     standard, while maintaining interoperability with other MPEG-4
     devices included in the same combination, and
   - Checking whether MPEG-4 devices comply with the standard
     ('conformance testing').

   A stream SHALL be compliant with the MPEG-4 Profile@Level
   specified by the parameter "profile-level-id". Interoperability
   between a sender and a receiver may be achieved by specifying the
   parameter "profile-level-id" in MIME content, or by arranging in
   the capability exchange/announcement procedure to set this
   parameter mutually to the same value.

Published specification:
   The specifications for MPEG-4 streams are presented in ISO/IEC
   14496-1, 14496-2, and 14496-3. The RTP payload format is described
   in RFC XXXX.

Applications that use this media type:
   Multimedia streaming and conferencing tools.

Additional information: none

Magic number(s): none

File extension(s):
   None. A file format with the extension .mp4 has been defined for
   MPEG-4 content but is not directly correlated with this MIME type,
   whose sole purpose is RTP transport.

Macintosh File Type Code(s): none

Person & email address to contact for further information:
   Authors of RFC XXXX.

Intended usage: COMMON

Author/Change controller:
   Authors of RFC XXXX, IETF Audio/Video Transport working group.

4.2 Concatenation of parameters

Multiple parameters SHOULD be expressed as a MIME media type string,
in the form of a semicolon-separated list of parameter=value pairs
(see examples below).
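
As a rough illustration of this concatenation rule, the following
Python sketch builds and parses such a semicolon-separated list. The
helper names format_fmtp_params and parse_fmtp_params are
hypothetical and not part of this specification. The sketch assumes
that parameter names contain no "=" and that values contain no ";".

```python
def format_fmtp_params(params):
    """Join parameter=value pairs with semicolons, in insertion order."""
    return ";".join(f"{name}={value}" for name, value in params.items())


def parse_fmtp_params(text):
    """Split a semicolon-separated parameter list into a dict of strings.

    Empty items (for example after a trailing semicolon) are skipped.
    Only the first "=" separates name and value, so a value may itself
    contain "=" (as base-64 padding does).
    """
    params = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        params[name.strip()] = value.strip()
    return params


print(format_fmtp_params({"StreamType": 5, "SizeLength": 12}))
print(parse_fmtp_params("StreamType=4;DTSDeltaLength=4"))
```

Values are kept as strings by the parser; converting them to
integers is left to the caller because parameters such as config and
SLConfigDescriptor are not numeric.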
4.3 Usage of SDP

4.3.1 The a=fmtp keyword

It is assumed that one typical way to transport the above-described
parameters associated with this payload format is via an SDP [10]
message, for example transported to the client in reply to an RTSP
[13] DESCRIBE message or via SAP [14]. In that case the (a=fmtp)
keyword MUST be used as described in RFC 2327 [10, section 6]. The
syntax is then:

   a=fmtp:<format> <parameter name>=<value>

4.3.2 SDP example

The following is an example of SDP syntax for the description of a
session containing one MPEG-4 video stream, one MPEG-4 audio stream
and three MPEG-4 system streams, the first one being BIFS, the second
one an OD stream and the third one IPMP. All are transported using
this format and the AVP profile [12]. Note the usage of some MIME
parameters: all streams declare their StreamType; the video stream
uses DTS with DTSDelta encoded on 4 bits; the audio stream uses the
"Multiple" packing style with 12 bits to describe the size of each AU
or AU fragment payload. See the Appendix for more examples.

   o= ....
   i= ....
   c=IN IP4 123.234.71.112
   m=video 1034 RTP/AVP 97
   a=rtpmap:97 mpeg4-generic
   a=fmtp:97 StreamType=4;DTSDeltaLength=4
   m=audio 1810 RTP/AVP 98
   a=rtpmap:98 mpeg4-generic
   a=fmtp:98 StreamType=5;SizeLength=12;
   m=application 1234 RTP/AVP 99
   a=rtpmap:99 mpeg4-generic
   a=fmtp:99 StreamType=3
   m=application 1236 RTP/AVP 100
   a=rtpmap:100 mpeg4-generic
   a=fmtp:100 StreamType=1
   m=application 1238 RTP/AVP 101
   a=rtpmap:101 mpeg4-generic
   a=fmtp:101 StreamType=7

5. IANA Considerations

One new MIME subtype is to be registered, see Section 4.1.

6. Other issues

6.1 SL-packetized stream reconstruction

The purpose of this section is to document how a receiver can
reconstruct a valid SL-packetized stream. This reconstruction is
performed by reversing the payload structure rules (section 3). We
explicitly describe here the most complex transformations.

In the following let (i) be the index of SL packets inside one RTP
packet (starting at zero for each RTP packet), let SLPacketHeader.x
denote field x of the reconstructed SL packet header, let
PayloadHeader.x denote field x of the received PayloadHeader, etc.

SLPacketHeader.packetSequenceNumber is restored from
PayloadHeader.Index and PayloadHeader.IndexDelta using:

   if ( IndexLength == 0 ) { // or is absent
      if ( SLConfig.packetSeqNumLength == 0 ) {
         // this stream does not have SL packet sequence numbers
      } else {
         // illegal, normally the sender MUST map
         // SLPacketHeader.packetSequenceNumber in PayloadHeader
         // and set a relevant IndexLength value;
         // otherwise it is unfortunately impossible for the
         // receiver to reconstruct the correct sequence
      }
   } else { // IndexLength is not zero
      if ( SLConfig.packetSeqNumLength == 0 ) {
         // the original SL stream does not have SL packet
         // sequence numbers, typically the sender inserted them
         // in order to implement interleaving at the RTP level;
         // they must be ignored for SL stream reconstruction
      } else {
         if ( i == 0 ) { // first SL packet in RTP packet
            SLPacketHeader.packetSequenceNumber(0) =
               PayloadHeader.Index(0);
         } else { // remaining SL packets
            SLPacketHeader.packetSequenceNumber(i) =
               SLPacketHeader.packetSequenceNumber(i-1)
               + PayloadHeader.IndexDelta(i) + 1;
         }
      }
   }

All time stamps (CTS, DTS, OCR), when present, are restored from the
delta values. Time stamp flags (CTSFlag, DTSFlag) in PayloadHeader
are used to reconstruct respectively the compositionTimeStampFlag and
decodingTimeStampFlag of SLPacketHeader.
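
The sequence number rule above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function
name restore_packet_sequence_numbers is hypothetical, the Index and
IndexDelta values are assumed to be already parsed from the Payload
Headers, and wrap-around of the serial number is not handled.

```python
def restore_packet_sequence_numbers(index, index_deltas):
    """Return the serial numbers of all SL packets in one RTP packet.

    index        -- Index field of the first Payload Header
    index_deltas -- IndexDelta fields of the following Payload Headers,
                    in order (empty if the packet carries one SL packet)
    """
    numbers = [index]
    for delta in index_deltas:
        # Each following packet advances by IndexDelta + 1, so a delta
        # of zero means consecutive serial numbers.
        numbers.append(numbers[-1] + delta + 1)
    return numbers


# Three consecutive SL packets starting at serial number 7.
print(restore_packet_sequence_numbers(7, [0, 0]))
# Interleaved packets: every third serial number.
print(restore_packet_sequence_numbers(3, [2, 2]))
```

When SLConfig.packetSeqNumLength is zero, these numbers are only
used to undo interleaving at the RTP level and are not written into
the reconstructed SL packet headers.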
The function corrected(x) for the RTP time stamp transformation is
the mapping from 32 bits to SLConfig.timeStampLength, which may be
smaller or larger than 32 bits:

   if ( SLConfig.timeStampLength < 32 ) {
      // short SL time stamps
      corrected(x) = LSB(x); // only the timeStampLength LSBits of x
   } else if ( SLConfig.timeStampLength > 32 ) {
      // long SL time stamps; start with m = 0
      if ( x(i) < x(i-1) ) {
         // 32 bits RTPTS roll over has occurred
         m += 2^32;
      }
      corrected(x) = x + m;
   } else {
      // timeStampLength == 32, recommended value
      corrected(x) = x; // direct mapping
   }

The compositionTimeStamp is restored using:

   if ( CTSDeltaLength == 0 ) { // or CTSDeltaLength is absent
      // CTS is not transported for this RTP stream
      if ( i == 0 ) { // first SL packet in RTP packet
         if ( SLConfig.useTimeStamps == 1 ) {
            if ( SLPacketHeader.accessUnitStartFlag == 1 ) {
               SLPacketHeader.compositionTimeStampFlag(0) = 1;
               SLPacketHeader.compositionTimeStamp(0) =
                  corrected(RTP TimeStamp);
            } else {
               // ignore
            }
         } else {
            // empty
         }
      } else { // non-first SL packets in RTP packet
         if ( SLConfig.useTimeStamps == 1 ) {
            if ( SLPacketHeader.accessUnitStartFlag == 1 ) {
               SLPacketHeader.compositionTimeStampFlag(i) = 0;
            } else {
               // ignore
            }
         } else {
            // empty
         }
      }
   } else { // CTSDeltaLength is not zero
      // CTS is transported for this stream
      if ( SLConfig.useTimeStamps == 1 ) {
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) {
            SLPacketHeader.compositionTimeStampFlag(i) =
               PayloadHeader.CTSFlag(i);
            SLPacketHeader.compositionTimeStamp(i) =
               corrected(RTP TimeStamp) + PayloadHeader.CTSDelta(i);
         } else {
            // ignore CTSFlag (which must be zero)
         }
      } else {
         // this is strange and sub-optimal at best
         // a receiver should ignore this
      }
   }

The decodingTimeStamp is restored using:

   if ( DTSDeltaLength == 0 ) { // or DTSDeltaLength is absent
      // DTS is not transported for this stream
      if ( SLConfig.useTimeStamps == 1 ) {
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) {
            SLPacketHeader.decodingTimeStampFlag(i) = 0;
         } else {
            // ignore
         }
      } else {
         // empty
      }
   } else {
      // DTS is transported for this stream
      if ( SLConfig.useTimeStamps == 1 ) {
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) {
            SLPacketHeader.decodingTimeStampFlag(i) =
               PayloadHeader.DTSFlag(i);
            SLPacketHeader.decodingTimeStamp(i) =
               SLPacketHeader.compositionTimeStamp(i)
               - PayloadHeader.DTSDelta(i); // DTS <= CTS always
         } else {
            // ignore DTSFlag (which must be zero)
         }
      } else {
         // this is strange and sub-optimal at best
         // a receiver should ignore this
      }
   }

The objectClockReference is restored using:

   if ( OCRDeltaLength == 0 ) { // or OCRDeltaLength is absent
      // the RTP stream does not transport any OCR
      if ( SLConfig.OCRLength == 0 ) {
         // this stream does not have any OCR
      } else {
         // illegal, normally the sender MUST detect
         // OCRs, replace them with OCRDelta and set
         // a relevant OCRDeltaLength value
      }
   } else {
      if ( SLConfig.OCRLength == 0 ) {
         // this is strange and sub-optimal at best
         // a receiver should ignore this
      } else {
         SLPacketHeader.OCRflag(i) = RSLH.OCRFlag(i);
         if ( SLPacketHeader.OCRflag(i) == 1 ) {
            SLPacketHeader.objectClockReference(i) =
               corrected(RTP TimeStamp) + RSLH.OCRDelta(i);
         }
      }
   }

In the "Single" packing style the AccessUnitEndFlag, if needed, is
restored from the M bit, as follows:

   if ( SLConfig.useAccessUnitEndFlag == 0 ) {
      // this SL stream does not signal access unit ends
   } else {
      SLPacketHeader.AccessUnitEndFlag = M bit;
   }

In the "Multiple" packing style the AccessUnitEndFlag is untouched in
RSLH.

The other SL packet header fields SHALL remain as found in RSLH.

In the general case, reconstructing the original SL-packetized stream
requires SL-awareness.
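
The time stamp rules of this section can be illustrated with the
following Python sketch. It is not a definitive implementation: the
names TimestampCorrector and restore_cts_dts are hypothetical,
corrected() is assumed to be called once per received RTP packet in
order, and, like the pseudo-code, it treats any decrease of the RTP
time stamp as a 32-bit roll over.

```python
class TimestampCorrector:
    """Maps 32-bit RTP time stamps to SLConfig.timeStampLength bits."""

    def __init__(self, time_stamp_length):
        self.length = time_stamp_length
        self.offset = 0       # the "m" of the pseudo-code
        self.previous = None  # RTP time stamp of the previous packet

    def corrected(self, rtp_ts):
        if self.length < 32:
            # Short SL time stamps: keep only the low-order bits.
            return rtp_ts & ((1 << self.length) - 1)
        if self.length > 32:
            # Long SL time stamps: extend across 32-bit roll overs.
            if self.previous is not None and rtp_ts < self.previous:
                self.offset += 1 << 32
            self.previous = rtp_ts
            return rtp_ts + self.offset
        return rtp_ts  # 32 bits: direct mapping


def restore_cts_dts(corrected_ts, cts_delta=0, dts_delta=0):
    """Return (CTS, DTS) for one access unit start."""
    cts = corrected_ts + cts_delta
    dts = cts - dts_delta  # DTS <= CTS always
    return cts, dts


corrector = TimestampCorrector(33)
print(corrector.corrected(0xFFFFFFF0))
print(corrector.corrected(0x00000010))  # after a roll over
print(restore_cts_dts(9000, cts_delta=3000, dts_delta=3000))
```

A receiver that must tolerate packet reordering would need a more
careful roll over test than the simple comparison used here.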
However this payload format allows in all cases a receiver that does
not know about the SL syntax to reconstruct the semantics of
Elementary Streams for the following very useful features:

- Packet order (decoding order)
- Access Unit boundaries (using the M bit)
- Access Unit fragments (fragment boundaries using PayloadSize)
- Composition Time Stamps, according to:
     compositionTimeStamp(i) = RTP TimeStamp + CTSDelta(i);
- Decoding Time Stamps, according to:
     decodingTimeStamp(i) = compositionTimeStamp(i) - DTSDelta(i);
- Packet serial number, according to:
     if ( i == 0 ) { // first SL packet in RTP packet
        packet serial number(0) = Index(0);
     } else { // remaining SL packets
        packet serial number(i) =
           packet serial number(i-1) + IndexDelta(i) + 1;
     }

6.2 Handling of scene description streams

MPEG-4 introduces new stream types as described in section 1, namely
Object Descriptors and BIFS. In the following both OD and BIFS are
discussed on the same basis, i.e. as "scene description".

Considering scene description as a "stream-able" type of content is a
rather new concept and for that reason some specific comments are
needed.

Typically scene descriptions are encoded in such a way that
information loss would in the general case cripple the presentation
beyond any hope of repair by the receiver. This is acceptable for a
number of multimedia applications where the scene is first made
available via reliable channels to the client and then played. This
payload format is not primarily intended for this type of
application, for which download of MPEG-4 interchange (.mp4) files
would be typical. However this payload format can also be used in
that case. It is then RECOMMENDED that the RTP packets be transported
using TCP (for example inside RTSP as described in [13, section
10.12]) or any other reliable protocol.

On the other hand MPEG-4 has introduced the possibility to
dynamically change the scene description by sending animation
information (changes in parameters) and structural change information
(updates). Since this information has to be sent in a timely fashion,
MPEG-4 has defined a number of techniques in order to encode the
scene description in a manner that makes it behave similarly to other
temporal encoding schemes such as audio and video. This payload
format is intended for this usage.

Note that in many cases the application will consist of first the
reliable transmission of a static initial scene followed by the
streaming of animations and updates. For this reason the usage of
this payload format is attractive since it offers a unique solution.

Senders must be aware that suitable schemes should be used when scene
description streams transport sensitive configuration information.
For example, if the RTP packet transporting an OD-update command is
lost, the corresponding media stream will not be accessible by the
receiver. Redundancy is one possibility; it may be added by tools
layered above this payload format, e.g. packet-based FEC,
re-transmission, or similar tools. In such a case, the general
congestion control principles have to be observed.

Since BIFS and OD streams may be modified during the session with
update commands, there is a need to send both update commands and
full BIFS/OD refreshes. For that reason MPEG-4 defines Random Access
Points (RAP) for scene description streams (OD and BIFS) where by
definition a decoder can restart decoding, i.e. receives a "full
update" of the scene. This mechanism is called Scene and Object
Description Carousel. The AU Sequence Number field of the SL Packet
Header is used to support this behavior at the Sync Layer. When two
access units are sent consecutively with the same AU Sequence Number,
the second one is assumed to be a semantic repetition of the first.
If a receiver starts to listen in the middle of a session or has
detected losses, it can ignore all received AUs until such a RAP.

The periodicity of transmission of these RAPs should be chosen and
adjusted depending on the application and the network it is deployed
on; i.e. exactly as for Intra-coded frames for video, it is the
responsibility of the sender to make sure the periodicity of RAPs is
suitable.

6.3 Overlap with RFC 3016

This payload format has been designed to have a (large) overlap with
RFC 3016 [7]. The conditions for this overlap are:

Conditions for RFC 3016:

C1. MPEG-4 video elementary streams only
C2. There MUST be a single VOP or Video Packet per RTP packet (which
    is only recommended in RFC 3016)
C3. The decoder configuration MUST be signaled out-of-band either
    using the config MIME parameter or using the OD framework

Conditions for this payload format:

C4. No MIME parameters defined (or all set to zero), i.e. "Single"
    packing style with empty Payload Header and empty RSLH.
C5. Receivers MUST be ready to accept (and ignore) video
    configuration headers (e.g. VOSH, VO and VOL) and
    visual-object-sequence-end-code transported in-band.

Under conditions C2 and C4 the MPEG-4 video RTP packet structures are
identical.

Since C4 and C5 MUST be supported by implementations of this
specification, RTP streams are backward compatible between this
specification and RFC 3016 when RFC 3016 is used with conditions C1,
C2 and C3. Technically the most stringent condition is C2, but it is
also a condition that makes a lot of sense for many reasons, whatever
the application.
Furthermore the MIME parameters have been aligned; specifically the
parameters "config" and "profile-level-id" have the same name and
meaning in RFC 3016 and in this memo.

The remaining difference is therefore the MIME subtype name. It would
be desirable that specifications built upon this memo, and enforcing
the above minor usage restrictions of RFC 3016 in order to provide a
backward compatible solution, specify that receivers can interpret
the MIME subtype name "MP4V-ES" as being equivalent to MIME type
"video" with subtype name "mpeg4-generic" and vice versa.

In short this payload format is backward compatible with RFC 3016 for
video used in the recommended fashion.

6.4 Multiplexing

An advanced MPEG-4 session may involve a large number of objects,
possibly as many as a few hundred. Transporting each ES as an
individual RTP stream may not always be practical. Allocating and
controlling hundreds of destination addresses for each MPEG-4 session
may pose insurmountable session administration problems. The
input/output processing overhead at the end-points will also be
extremely high. Additionally, low delay transmission of low bitrate
data streams, e.g. facial animation parameters, results in extremely
high header overheads.

To solve these problems, MPEG-4 data transport requires a
multiplexing scheme that allows selective bundling of several ESs.
This is beyond the scope of the payload format defined here. MPEG-4's
Flexmux multiplexing scheme may be used for this purpose and a
specific RTP payload format is being developed [11]. Another approach
may be to develop a generic RTP multiplexing scheme usable for MPEG-4
data. The multiplexing scheme reported in [8] may be a candidate for
this approach.

For MPEG-4 applications, the multiplexing technique needs to address
the following requirements:

i.   The ESs multiplexed in one stream can change frequently during a
     session. Consequently, the coding type, individual packet size
     and temporal relationships between the multiplexed data units
     must be handled dynamically.

ii.  The multiplexing scheme should have a mechanism to determine the
     ES identifier (ES_ID) for each of the multiplexed packets. ES_ID
     is not a part of the SL header.

iii. In general, an SL packet does not contain information about its
     size. The multiplexing scheme should be able to delineate the
     multiplexed packets, whose lengths may vary from a few octets to
     close to the path-MTU.

7. Security Considerations

RTP packets using the payload format defined in this specification
are subject to the security considerations discussed in the RTP
specification [5]. This implies that confidentiality of the media
streams is achieved by encryption. Because the data compression used
with this payload format is applied end-to-end, encryption may be
performed on the compressed data so there is no conflict between the
two operations. The packet processing complexity of this payload type
(i.e. excluding media data processing) does not exhibit any
significant non-uniformity on the receiver side that could cause a
denial-of-service threat.

However, it is possible to inject non-compliant MPEG streams (Audio,
Video, and Systems) to overload the receiver/decoder's buffers, which
might compromise the functionality of the receiver or even crash it.
This is especially true for end-to-end systems like MPEG where the
buffer models are precisely defined.

MPEG-4 Systems supports stream types including commands that are
executed on the terminal, like OD commands, BIFS commands, etc., and
programmatic content like MPEG-J (Java(TM) Byte Code) and ECMAScript.
It is possible to use one or more of the above in a manner
non-compliant to MPEG to crash the receiver or make it temporarily
unavailable.
Authentication mechanisms can be used to validate the sender and the
data in order to prevent security problems due to non-compliant
malignant MPEG-4 streams.

A security model is defined for MPEG-4 Systems streams carrying
MPEG-J access units, which comprise Java(TM) classes and objects.
MPEG-J defines a set of Java APIs and a secure execution model.
MPEG-J content can call this set of APIs and Java(TM) methods from a
set of Java packages supported in the receiver within the defined
security model. According to this security model, downloaded byte
code is forbidden to load libraries, define native methods, start
programs, read or write files, or read system properties.

Receivers can implement intelligent filters to validate the buffer
requirements or parametric (OD, BIFS, etc.) or programmatic (MPEG-J,
ECMAScript) commands in the streams. However, this can increase the
complexity significantly.

8. Acknowledgements

This document evolved across several years through many revisions
thanks to contributions from a large number of people since it is
based on work within the IETF AVT working group and various ISO MPEG
working groups, especially the 4-on-IP ad-hoc group. The authors wish
to thank Olivier Avaro, Stephen Casner, Guido Fransceschini, Art
Howarth, Dave Mackie, Dave Singer, and Stephan Wenger for their
valuable comments and support. Attentive readers and early
implementers also found flaws and bugs; thank you all.

9. References

[1]  ISO/IEC 14496-1:2001 MPEG-4 Systems.
[2]  ISO/IEC 14496-2:2001 MPEG-4 Visual.
[3]  ISO/IEC 14496-3:2001 MPEG-4 Audio.
[4]  ISO/IEC 14496-6:2001 Delivery Multimedia Integration Framework.
[5]  H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: A
     Transport Protocol for Real Time Applications, RFC 1889,
     Internet Engineering Task Force, January 1996.
[6]  S. Bradner, Key words for use in RFCs to Indicate Requirement
     Levels, RFC 2119, Internet Engineering Task Force, March 1997.
[7]  Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, H. Kimata, RTP
     payload format for MPEG-4 Audio/Visual streams, RFC 3016,
     Internet Engineering Task Force.
[8]  B. Thompson, T. Koren, D. Wing, Tunneling multiplexed Compressed
     RTP ("TCRTP"), work in progress, draft-ietf-avt-tcrtp-04.txt,
     July 2001.
[9]  D. Singer, Y. Lim, A Framework for the delivery of MPEG-4 over
     IP-based Protocols, work in progress,
     draft-singer-mpeg4-ip-02.txt, May 2001.
[10] M. Handley, V. Jacobson, SDP: Session Description Protocol, RFC
     2327, Internet Engineering Task Force, April 1998.
[11] C. Roux & al, RTP Payload Format for MPEG-4 FlexMultiplexed
     Streams, work in progress,
     draft-curet-avt-rtp-mpeg4-flexmux-00.txt, February 2001.
[12] H. Schulzrinne, RTP Profile for Audio and Video Conferences with
     Minimal Control, RFC 1890, Internet Engineering Task Force,
     January 1996.
[13] H. Schulzrinne, A. Rao, R. Lanphier, Real Time Streaming
     Protocol, RFC 2326, Internet Engineering Task Force, April 1998.
[14] M. Handley, C. Perkins, E. Whelan, Session Announcement
     Protocol, RFC 2974, Internet Engineering Task Force, October
     2000.

10. Authors' Addresses

Andrea Basso
AT&T Labs Research
200 Laurel Avenue
Middletown, NJ 07748
USA
e-mail: basso@research.att.com

M. Reha Civanlar
AT&T Labs - Research
200 Laurel Ave. South, A5 4D04
Middletown, NJ 07748
USA
e-mail: civanlar@research.att.com

Philippe Gentric
Philips Digital Networks, MP4Net
51 rue Carnot
92156 Suresnes
France
e-mail: philippe.gentric@philips.com

Carsten Herpel
THOMSON multimedia
Karl-Wiechert-Allee 74
30625 Hannover
Germany
e-mail: herpelc@thmulti.com

Zvi Lifshitz
Optibase Ltd.
7 Shenkar St.
Herzliya 46120
Israel
e-mail: zvil@optibase.com

Young-Kwon Lim
net&tv Co., Ltd.
5th Floor Himart Building
1007-46 Sadang-Dong
Dongjak-Gu, Seoul, 156-090, Korea
e-mail: young@netntv.co.kr

Colin Perkins
USC Information Sciences Institute
3811 N. Fairfax Drive suite 200
Arlington, VA 22203
USA
e-mail: csp@isi.edu

Jan van der Meer
Philips Digital Networks
Building WDB-1
Prof Holstlaan 4
5656 AA Eindhoven
Netherlands
e-mail: jan.vandermeer@philips.com

APPENDIX: Examples of usage

This section describes a number of examples of how this payload
format can be used either with or without the Sync Layer. In all
examples the Sync Layer syntax is given (which shows how it may
become invisible in cases 1, 3, 4 and 5). A C++-like syntax called
SDL (Syntactic Description Language), defined in [1, section 14], is
used to economically describe MPEG-4 system data structures.

These examples assume that the (a=fmtp) SDP syntax is used to convey
the MIME parameters of the payload format.

Appendix.1 RFC 3016 compatible MPEG-4 Video (no SL)

This is an example of a video stream compatible with RFC 3016.

SLConfigDescriptor

In this example the SLConfigDescriptor is:

   class SLConfigDescriptor extends BaseDescriptor
      : bit(8) tag=SLConfigDescrTag {
      bit(8) predefined;
      if (predefined==0) {
         bit(1) useAccessUnitStartFlag;        = 0
         bit(1) useAccessUnitEndFlag;          = 1
         bit(1) useRandomAccessPointFlag;      = 0
         bit(1) hasRandomAccessUnitsOnlyFlag;  = 0
         bit(1) usePaddingFlag;                = 0
         bit(1) useTimeStampsFlag;             = 0
         bit(1) useIdleFlag;                   = 0
         bit(1) durationFlag;                  = 0
         bit(32) timeStampResolution;          = 0
         bit(32) OCRResolution;                = 0
         bit(8) timeStampLength;               = 0
         bit(8) OCRLength;                     = 0
         bit(8) AU_Length;                     = 0
         bit(8) instantBitrateLength;          = 0
         bit(4) degradationPriorityLength;     = 0
         bit(5) AU_seqNumLength;               = 0
         bit(5) packetSeqNumLength;            = 0
         bit(2) reserved=0b11;
      }
      if (durationFlag) {
         bit(32) timeScale;                    // NOT USED
         bit(16) accessUnitDuration;           // NOT USED
         bit(16) compositionUnitDuration;      // NOT USED
      }
      if (!useTimeStampsFlag) {
         bit(timeStampLength) startDecodingTimeStamp;    = 0
         bit(timeStampLength) startCompositionTimeStamp; = 0
      }
   }

SL Packet Header structure

With this configuration we have the following SL packet header
structure:

   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) {
      if (SL.useAccessUnitEndFlag) {
         bit(1) accessUnitEndFlag;             // 1 bit
      }
   }

In this case this payload format produces RTP packets that are
exactly conformant to RFC 3016, and the SL is reduced to a purely
logical construction that neither sender nor receiver needs to
implement.

Parameters

This configuration is the default one; no parameters are required.

RTP packet structure

Note that accessUnitEndFlag is mapped to the RTP header M bit.

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
   | Access Unit or AU fragment              | 1400 octets |
   +-----------------------------------------+-------------+

Overhead

In this example we have an RTP overhead of 40 octets for 1400 octets
of payload, i.e. 3 % overhead.

Appendix.2 MPEG-4 Video with SL

Let us consider the case of a 30 frames per second MPEG-4 video
stream whose bit rate is high enough that Access Units have to be
split into several SL packets (typically above 300 kb/s).
Let us assume also that the video codec generates in that case Video Packets suitable to fit in one SL packet i.e that the video codec is MTU aware and the MTU is 1500 octets. We assume furthermore that this stream contains B frames and that decodingTimeStamps are present. SLConfigDescriptor In this example the SLConfigDescriptor is: class SLConfigDescriptor extends BaseDescriptor : bit(8) tag=SLConfigDescrTag { bit(8) predefined; if (predefined==0) { bit(1) useAccessUnitStartFlag; = 1 bit(1) useAccessUnitEndFlag; = 0 bit(1) useRandomAccessPointFlag; = 1 bit(1) hasRandomAccessUnitsOnlyFlag; = 0 bit(1) usePaddingFlag; = 0 bit(1) useTimeStampsFlag; = 1 bit(1) useIdleFlag; = 0 bit(1) durationFlag; = 0 bit(32) timeStampResolution; = 30 bit(32) OCRResolution; = 0 bit(8) timeStampLength; = 32 bit(8) OCRLength; = 0 bit(8) AU_Length; = 0 bit(8) instantBitrateLength; = 0 bit(4) degradationPriorityLength; = 0 bit(5) AU_seqNumLength; = 0 bit(5) packetSeqNumLength; = 0 bit(2) reserved=0b11; } if (durationFlag) { bit(32) timeScale; // NOT USED bit(16) accessUnitDuration; // NOT USED bit(16) compositionUnitDuration; // NOT USED } if (!useTimeStampsFlag) { bit(timeStampLength) startDecodingTimeStamp; // NOT USED bit(timeStampLength) startCompositionTimeStamp; // NOT USED } } The useRandomAccessPointFlag is set so that the randomAccessPointFlag can indicate that the corresponding SL packet contains a GOV and the first Video Packet of an Intra coded frame. Gentric et al. 

SL Packet Header structure

With this configuration we have the following SL packet header
structure:

   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) {
      bit(1) accessUnitStartFlag;            // 1 bit
      if (accessUnitStartFlag) {
         bit(1) randomAccessPointFlag;       // 1 bit
         bit(1) decodingTimeStampFlag;       // 1 bit
         bit(1) compositionTimeStampFlag;    // 1 bit
         if (decodingTimeStampFlag) {
            bit(SL.timeStampLength) decodingTimeStamp;
         }
         if (compositionTimeStampFlag) {
            bit(SL.timeStampLength) compositionTimeStamp;
         }
      }
   }

Parameters

decodingTimeStamps are encoded on 32 bits, which is much more than
needed for a delta. Therefore the sender will use DTSDeltaLength to
signal that only 7 bits are used for the coding of relative DTS in
the RTP packet.

The RSLHSectionSize cannot exceed 4 bits, a value that can be encoded
on 3 bits; this is signaled by RSLHSectionSizeLength.

The resulting concatenated fmtp line is:

   a=fmtp: DTSDeltaLength=7;RSLHSectionSizeLength=3

RTP packet structure

Two cases can occur; for packets that transport first fragments of
Access Units we have:

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
   | DTSFlag = (1)                           | 1 bit       |
   +-----------------------------------------+-------------+
   | DTSDelta                                | 7 bits      |
   +-----------------------------------------+-------------+
   | bits to octet alignment                 | 0 bits      |
   +-----------------------------------------+-------------+
   | RSLHSectionSize = (100)                 | 3 bits      |
   +-----------------------------------------+-------------+
   | accessUnitStartFlag = (1)               | 1 bit       |
   +-----------------------------------------+-------------+
   | randomAccessPointFlag                   | 1 bit       |
   +-----------------------------------------+-------------+
   | decodingTimeStampFlag                   | 1 bit       |
   +-----------------------------------------+-------------+
   | compositionTimeStampFlag                | 1 bit       |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (0)           | 1 bit       |
   +-----------------------------------------+-------------+
   | SL packet payload                       | N octets    |
   +-----------------------------------------+-------------+

For packets that transport non-first fragments of Access Units we
have:

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
   | DTSFlag = (0)                           | 1 bit       |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (0000000)     | 7 bits      |
   +-----------------------------------------+-------------+
   | RSLHSectionSize = (001)                 | 3 bits      |
   +-----------------------------------------+-------------+
   | accessUnitStartFlag = (0)               | 1 bit       |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (0000)        | 4 bits      |
   +-----------------------------------------+-------------+
   | SL packet payload                       | N octets    |
   +-----------------------------------------+-------------+

Overhead estimation

In this example we have an RTP overhead of 40 + 2 octets for 1400
octets of payload, i.e. 3 % overhead.

Appendix.3 Low delay MPEG-4 Audio (no SL)

This example is for a low delay audio service. For this reason a
single Access Unit is transported in each RTP packet (in terms of
Sync Layer, each SL packet contains a complete Access Unit).

SLConfigDescriptor

Since CTS=DTS and Access Unit duration is constant, signaling of
MPEG-4 time stamps is not needed (the durationFlag of SLConfig is
set). We also assume here an audio Object Type for which all Access
Units are Random Access Points, which is signaled using the
hasRandomAccessUnitsOnlyFlag in the SLConfigDescriptor.
We assume furthermore a mode where the Access Unit size is constant
and equal to 5 octets (which is signaled with AU_Length).

In this example the SLConfigDescriptor is:

   class SLConfigDescriptor extends BaseDescriptor
      : bit(8) tag=SLConfigDescrTag {
      bit(8) predefined;
      if (predefined==0) {
         bit(1) useAccessUnitStartFlag;         = 0
         bit(1) useAccessUnitEndFlag;           = 0
         bit(1) useRandomAccessPointFlag;       = 0
         bit(1) hasRandomAccessUnitsOnlyFlag;   = 1
         bit(1) usePaddingFlag;                 = 0
         bit(1) useTimeStampsFlag;              = 0
         bit(1) useIdleFlag;                    = 0
         bit(1) durationFlag;                   = 1
                               // signals constant AU duration
         bit(32) timeStampResolution;           = 0
         bit(32) OCRResolution;                 = 0
         bit(8) timeStampLength;                = 0
         bit(8) OCRLength;                      = 0
         bit(8) AU_Length;                      = 5
         bit(8) instantBitrateLength;           = 0
         bit(4) degradationPriorityLength;      = 0
         bit(5) AU_seqNumLength;                = 0
         bit(5) packetSeqNumLength;             = 0
         bit(2) reserved=0b11;
      }
      if (durationFlag) {
         bit(32) timeScale;                     = 1000 // for milliseconds
         bit(16) accessUnitDuration;            = 10   // ms
         bit(16) compositionUnitDuration;       = 10   // ms
      }
      if (!useTimeStampsFlag) {
         bit(timeStampLength) startDecodingTimeStamp;    = 0
         bit(timeStampLength) startCompositionTimeStamp; = 0
      }
   }

SL packet header

With this configuration the SL packet header is empty. The Sync Layer
is reduced to a purely logical construction that neither sender nor
receiver needs to implement.

Parameters

No parameters are required.

RTP packet structure

Note that the RTP header M bit must be set to 1.

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
   | Access Unit                             | 5 octets    |
   +-----------------------------------------+-------------+

Overhead estimation

The overhead is extremely large i.e.
800 %, since 40 octets of headers are required to transport 5 octets
of data. Note however that RTP header compression would work well
since time stamp increments are constant.

Appendix.4 Media delivery MPEG-4 Audio (no SL)

This example is for a media delivery service where delay is not an
issue but efficiency is. In this case several Access Units are
transported in each RTP packet.

SLConfigDescriptor

Similar to previous example.

SL packet header

With this configuration the SL packet header is empty. The Sync Layer
is reduced to a purely logical construction that neither sender nor
receiver needs to implement.

Parameters

The absence of RSLHSectionSizeLength indicates that the RSLHSection
is empty. The size of SL Packets (which are all complete Access Units
in this case) is constant and is indicated with:

   a=fmtp: ConstantSize=5

This also indicates to the receiver that the "Multiple" packing style
will be used. The 2-octet field that would give the size of the
Payload Header Section is omitted, since in this case this field
always contains zero (the Payload Header Section is always empty due
to the absence of any other MIME parameter).

RTP packet structure

Note that the RTP header M bit is always set to 1, which indicates to
the receiver that only complete Access Units are transported.

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
   | Access Unit data                        | 5 octets    |
   +-----------------------------------------+-------------+
   | Access Unit data                        | 5 octets    |
   +-----------------------------------------+-------------+
   | etc, until MTU is reached                             |
   +-----------------------------------------+-------------+
   | Access Unit data                        | 5 octets    |
   +-----------------------------------------+-------------+

Overhead estimation

The overhead is 3 %, i.e. minimal.

Appendix.5 AAC with interleaving (no SL)

Let us consider AAC at 128 kb/s where each Access Unit is on average
320 octets. Interleaving is applied using a continuous interleaving
scheme (see table below) where 4 Access Units are used to construct
each RTP packet in order to match an MTU of 1500 octets. IndexDelta
is constant and equal to 2 (since +1 is automatically added); it is
encoded on 2 bits.

As explained in section 3.8 this is a time stamp based interleaving
(TSBI) scheme (IndexLength=0). Indeed, receivers know that each
payload is a complete Access Unit because all RTP packets have the M
bit set to 1. Therefore, since Access Unit duration is constant,
Access Unit timestamps can be computed from RTP timestamps and
IndexDelta values; this can be used for de-interleaving even in case
of losses.

Note that it is also possible to use IndexLength=2 so as to maintain
an octet alignment in the Payload Header portions; in this case
however the value of these two bits MUST be zero as stated in 3.8.1.
This solution is used in the companion RFC YYYY.

   +-----------------------------------------------------------------+
   | RTP packet   | RTP Timestamp  | AUs            | IndexDelta     |
   +-----------------------------------------------------------------+
   | 1            | CTS(AU1)       | 1              | -              |
   +-----------------------------------------------------------------+
   | 2            | CTS(AU2)       | 2, 5           | -, 2           |
   +-----------------------------------------------------------------+
   | 3            | CTS(AU3)       | 3, 6, 9        | -, 2, 2        |
   +-----------------------------------------------------------------+
   | 4            | CTS(AU4)       | 4, 7, 10, 13   | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 5            | CTS(AU8)       | 8, 11, 14, 17  | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 6            | CTS(AU12)      | 12, 15, 18, 21 | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 7            | CTS(AU16)      | 16, 19, 22, 25 | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 8            | CTS(AU20)      | 20, 23, 26, 29 | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 9            | CTS(AU24)      | 24, 27, 30, 33 | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | 10           | CTS(AU28)      | 28, 31, 34, 37 | -, 2, 2, 2     |
   +-----------------------------------------------------------------+
   | etc                                                             |
   +-----------------------------------------------------------------+

SLConfigDescriptor

Similar to previous example.

SL Packet Header

Similar to previous example (empty).
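
The interleaving pattern tabulated for this example, and the timestamp
recovery that TSBI relies on, can be illustrated with a short sketch.
This is not part of the payload format: the function names, the rule
used to pick the last Access Unit of each packet (4k - 3, chosen only
because it reproduces the table) and the Access Unit duration passed
as an argument are assumptions made for illustration.

```python
def interleaved_packets(num_packets, aus_per_packet=4, index_delta=2):
    """Return, for each RTP packet, the 1-based AU numbers it carries.

    Hypothetical helper: only checked against the example's values
    (4 AUs per packet, IndexDelta = 2).
    """
    stride = index_delta + 1  # the receiver adds 1 to each IndexDelta
    packets = []
    for k in range(1, num_packets + 1):
        # Last AU carried by packet k (1, 5, 9, 13, 17, ...).
        last = aus_per_packet * (k - 1) + 1
        candidates = [last - stride * j for j in range(aus_per_packet)]
        # During start-up, AU numbers below 1 do not exist and are dropped.
        packets.append(sorted(au for au in candidates if au >= 1))
    return packets


def au_timestamps(rtp_timestamp, index_deltas, au_duration):
    """Recover one timestamp per AU in a packet.

    rtp_timestamp is the timestamp of the first AU; index_deltas holds
    the IndexDelta of every following AU; au_duration is the constant
    AU duration expressed in RTP timestamp ticks (assumed known).
    """
    stamps = [rtp_timestamp]
    for delta in index_deltas:
        stamps.append(stamps[-1] + (delta + 1) * au_duration)
    return stamps


if __name__ == "__main__":
    for number, aus in enumerate(interleaved_packets(6), start=1):
        print(number, aus)
    # 1024 ticks per AU is an illustrative value only.
    print(au_timestamps(0, [2, 2, 2], 1024))
```

Because every timestamp is derived from the RTP timestamp of the
packet that carries the Access Unit, a receiver following this sketch
can place each Access Unit without reference to neighbouring packets,
which is why de-interleaving survives losses.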

Parameters

The resulting concatenated fmtp line is:

   a=fmtp: SizeLength=9; IndexDeltaLength=2;

RTP packet structure

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
     Payload Header Section
   +=========================================+=============+
   | PayloadHeaderSection size = (42)        | 2 octets    |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 9 bits      |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 9 bits      |
   +-----------------------------------------+-------------+
   | IndexDelta                              | 2 bits      |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 9 bits      |
   +-----------------------------------------+-------------+
   | IndexDelta                              | 2 bits      |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 9 bits      |
   +-----------------------------------------+-------------+
   | IndexDelta                              | 2 bits      |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (000000)      | 6 bits      |
   +-----------------------------------------+-------------+
     Payload Section
   +=========================================+=============+
   | AAC Access Unit                         | x octets    |
   +-----------------------------------------+-------------+
   | AAC Access Unit                         | x octets    |
   +-----------------------------------------+-------------+
   | AAC Access Unit                         | x octets    |
   +-----------------------------------------+-------------+
   | AAC Access Unit                         | x octets    |
   +-----------------------------------------+-------------+

Overhead estimation

The PayloadHeaderSection is 8 octets; in this example we therefore
have an RTP overhead of 40 + 8 octets for approximately 1400 octets
of payload, i.e. around 4 % overhead.
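
The 42-bit and 8-octet figures of this example follow from the field
lengths signaled in the fmtp line. A minimal sketch of that arithmetic
is given below; the function names are hypothetical, and the sketch
assumes that the first entry carries PayloadSize plus Index and every
later entry carries PayloadSize plus IndexDelta, as laid out in the
table of this example.

```python
def payload_header_section_bits(num_units, size_length,
                                index_length=0, index_delta_length=0):
    """Bits used by the per-unit entries of the Payload Header Section."""
    first_entry = size_length + index_length
    later_entries = (num_units - 1) * (size_length + index_delta_length)
    return first_entry + later_entries


def padded_octets(bits):
    """Octets needed to hold `bits` once padded to an octet boundary."""
    return (bits + 7) // 8


# Values of this example: 4 Access Units, SizeLength=9, IndexDeltaLength=2,
# and IndexLength=0 (so no Index field is present).
bits = payload_header_section_bits(4, size_length=9, index_delta_length=2)
padding = padded_octets(bits) * 8 - bits
total_octets = 2 + padded_octets(bits)  # 2 octets for the size field itself

if __name__ == "__main__":
    print(bits, padding, total_octets)
```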

Appendix.6 AAC with Index-based interleaving and SL

Let us consider AAC around 130 kb/s where each Access Unit is split
into 4 SL packets corresponding to Error Sensitivity Categories (ESC)
of at most 90 octets, for which interleaving is very useful in terms
of error resilience. We thus use an interleaving scheme where 15 SL
Packets (extracted from 15 consecutive Access Units) are used to
construct each RTP packet in order to match an MTU of 1500 octets.
Note that since ESC fragments are not octet aligned we also use the
paddingFlag and paddingBits features of the Sync Layer.

The interleaving sequence is 4 RTP packets and 350 ms long, which is
too long for conferencing but acceptable for Internet radio. Since
the sequence contains 60 SL packets, IndexLength is set to 16 bits so
as to provide a safe margin in case of long loss bursts. This also
indicates to the receiver that this is an Index-Based-Interleaving
scheme (and indeed CTS cannot be computed for SL packets that are not
AU starts, so TSBI would not work). 2 bits are enough for IndexDelta,
which is constant and equal to 3 (since +1 is automatically added).

Note that the 4th RTP packet in each sequence has its M bit set to 1
since it contains 15 SL packets transporting the end of 15
consecutive Access Units.

With this scheme a sender can, for example upon reception of RTCP
reports indicating high loss rates, choose to duplicate for each
interleaving sequence the first RTP packet, which contains the most
useful data in terms of ESC, or apply other error protection
techniques, with due care to congestion issues.

In this example we will also show several other SL features (OCR, AU
boundary flags, padding, as detailed below).

One feature demonstrated by this example is the degradation priority.
We assume degradation priority can take 4 different values, mapped to
Error Sensitivity Categories, and is encoded on 2 bits. This
interleaving scheme makes sure that only SL packets of identical
degradation priorities are grouped in the same RTP packet (3.6.3) and
that only the first RSLH of each RTP packet transports the
degradation priority.

We also assume that for each last SL packet of each RTP packet the
server inserts an OCR.

SLConfigDescriptor

In this example the SLConfigDescriptor is:

   class SLConfigDescriptor extends BaseDescriptor
      : bit(8) tag=SLConfigDescrTag {
      bit(8) predefined;
      if (predefined==0) {
         bit(1) useAccessUnitStartFlag;         = 1
         bit(1) useAccessUnitEndFlag;           = 1
         bit(1) useRandomAccessPointFlag;       = 0
         bit(1) hasRandomAccessUnitsOnlyFlag;   = 1
         bit(1) usePaddingFlag;                 = 1
                               // we need to signal padding bits
         bit(1) useTimeStampsFlag;              = 0
         bit(1) useIdleFlag;                    = 0
         bit(1) durationFlag;                   = 1
         bit(32) timeStampResolution;           = 0
         bit(32) OCRResolution;                 = 30
         bit(8) timeStampLength;                = 0
         bit(8) OCRLength;                      = 32
         bit(8) AU_Length;                      = 0
         bit(8) instantBitrateLength;           = 0
         bit(4) degradationPriorityLength;      = 2
         bit(5) AU_seqNumLength;                = 0
         bit(5) packetSeqNumLength;             = 6
         bit(2) reserved=0b11;
      }
      if (durationFlag) {
         bit(32) timeScale;                     = 1000  // milliseconds
         bit(16) accessUnitDuration;            = 23.22 // ms
         bit(16) compositionUnitDuration;       = 23.22 // ms
      }
      if (!useTimeStampsFlag) {
         bit(timeStampLength) startDecodingTimeStamp;    = 0
         bit(timeStampLength) startCompositionTimeStamp; = 0
      }
   }

SL Packet Header structure

With this configuration we have the following SL packet header
structure:

   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) {
      bit(1) accessUnitStartFlag;
      bit(1) accessUnitEndFlag;
      bit(1) OCRflag;
      bit(1) paddingFlag;
      if (paddingFlag)
         bit(3) paddingBits;
      bit(SL.packetSeqNumLength) packetSequenceNumber;
      bit(1) DegPrioflag;
      if (DegPrioflag) {
         bit(SL.degradationPriorityLength) degradationPriority;
      }
      if (OCRflag) {
         bit(SL.OCRLength) objectClockReference;
      }
   }

Parameters

The resulting concatenated fmtp line is:

   a=fmtp: SizeLength=7; RSLHSectionSizeLength=8; IndexLength=16; IndexDeltaLength=2; OCRDeltaLength=16

RTP packet structure

   +=========================================+=============+
   | Field                                   | size        |
   +=========================================+=============+
   | RTP header                              | -           |
   +-----------------------------------------+-------------+
     Payload Header Section
   +=========================================+=============+
   | Payload Header Section size = 149 bits  | 2 octets    |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 7 bits      |
   +-----------------------------------------+-------------+
   | Index                                   | 16 bits     |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 7 bits      |
   +-----------------------------------------+-------------+
   | IndexDelta = (11)                       | 2 bits      |
   +-----------------------------------------+-------------+
   | etc + 12 times 9 bits                                 |
   +-----------------------------------------+-------------+
   | PayloadSize                             | 7 bits      |
   +-----------------------------------------+-------------+
   | IndexDelta = (11)                       | 2 bits      |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (000)         | 3 bits      |
   +-----------------------------------------+-------------+
     RSLHSection
   +=========================================+=============+
   | RSLHSectionSize = (10000111)            | 8 bits      |
   +-----------------------------------------+-------------+
   | accessUnitStartFlag                     | 1 bit       |
   +-----------------------------------------+-------------+
   | accessUnitEndFlag                       | 1 bit       |
   +-----------------------------------------+-------------+
   | OCRFlag = (0)                           | 1 bit       |
   +-----------------------------------------+-------------+
   | paddingFlag = (1)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | paddingBits                             | 3 bits      |
   +-----------------------------------------+-------------+
   | DegPrioflag = (1)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | degradationPriority                     | 2 bits      |
   +-----------------------------------------+-------------+
   | accessUnitStartFlag                     | 1 bit       |
   +-----------------------------------------+-------------+
   | accessUnitEndFlag                       | 1 bit       |
   +-----------------------------------------+-------------+
   | OCRFlag = (0)                           | 1 bit       |
   +-----------------------------------------+-------------+
   | paddingFlag = (1)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | paddingBits                             | 3 bits      |
   +-----------------------------------------+-------------+
   | DegPrioflag = (0)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | etc + 12 times 8 bits                                 |
   +-----------------------------------------+-------------+
   | accessUnitStartFlag                     | 1 bit       |
   +-----------------------------------------+-------------+
   | accessUnitEndFlag                       | 1 bit       |
   +-----------------------------------------+-------------+
   | OCRFlag = (1)                           | 1 bit       |
   +-----------------------------------------+-------------+
   | OCRDelta                                | 16 bits     |
   +-----------------------------------------+-------------+
   | paddingFlag = (0)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | DegPrioflag = (0)                       | 1 bit       |
   +-----------------------------------------+-------------+
   | bits to octet alignment = (000)         | 3 bits      |
   +-----------------------------------------+-------------+
     Payload Section
   +=========================================+=============+
   | SL packet payload                       |max 90 octets|
   +-----------------------------------------+-------------+
   | etc + 13 SL packets                                   |
   +-----------------------------------------+-------------+
   | SL packet payload                       |max 90 octets|
   +-----------------------------------------+-------------+

Note that in the above table the last SL packet in the RTP packet has
a payload that is octet-aligned (at the end). When this happens
paddingFlag is set to zero and the paddingBits field is omitted.

Overhead estimation

The PayloadHeaderSection is 19 octets and the RSLHSection is 16
octets; in this example we therefore have an RTP overhead of 40 + 35
octets for 1350 octets of payload, i.e. around 6 % overhead.
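
The overhead figures quoted in these examples are plain ratios of
header octets to payload octets. The sketch below reproduces some of
them; the function name is hypothetical, and the split of the 40
octets into IPv4, UDP and fixed RTP headers is an assumption, since
the examples only state the total.

```python
# Assumed breakdown of the 40 octets: IPv4 (20) + UDP (8) + RTP fixed header (12).
LOWER_LAYER_OCTETS = 20 + 8 + 12


def overhead_percent(header_octets, payload_octets):
    """Header octets expressed as a percentage of payload octets."""
    return 100.0 * header_octets / payload_octets


if __name__ == "__main__":
    # No payload headers, 1400 octets of payload.
    print(round(overhead_percent(LOWER_LAYER_OCTETS, 1400), 1))
    # 19 + 16 octets of payload headers, 1350 octets of payload.
    print(round(overhead_percent(LOWER_LAYER_OCTETS + 35, 1350), 1))
    # A single 5-octet Access Unit per packet.
    print(round(overhead_percent(LOWER_LAYER_OCTETS, 5), 1))
```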