Internet Engineering Task Force                        J. van der Meer
Internet Draft                                     Philips Electronics
                                                             D. Mackie
                                                    Cisco Systems Inc.
                                                        V. Swaminathan
                                                 Sun Microsystems Inc.
                                                             D. Singer
                                                        Apple Computer
                                                            P. Gentric
                                                   Philips Electronics

                                                             June 2002
                                                 Expires December 2002

Document: draft-ietf-avt-mpeg4-simple-03.txt

Transport of MPEG-4 Elementary Streams

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts. Internet-Drafts are draft documents valid for a
maximum of six months and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other than as
"work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This specification is a product of the Audio/Video Transport working
group within the Internet Engineering Task Force. Comments are
solicited and should be addressed to the working group's mailing
list at avt@ietf.org and/or the authors.

<< Note for the RFC editor: xxxx should be replaced with the RFC
number that will be assigned.
>>

Abstract

The MPEG Committee (ISO/IEC JTC1/SC29 WG11) is a working group in
ISO that produced the MPEG-4 standard. MPEG defines tools to
compress content such as audio-visual information into elementary
streams. This specification defines a simple but generic RTP
payload format for transport of any non-multiplexed MPEG-4
elementary stream.

Table of Contents

1. Introduction
2. Carriage of MPEG-4 elementary streams over RTP
   2.1. Introduction
   2.2. MPEG Access Units
   2.3. Concatenation of Access Units
   2.4. Fragmentation of Access Units
   2.5. Interleaving
   2.6. Time stamp information
   2.7. Random Access Indication
   2.8. State indication of MPEG-4 system streams
   2.9. Carriage of auxiliary information
   2.10. MIME format parameters and configuring conditional fields
   2.11. Global structure of payload format
   2.12. Modes to transport MPEG-4 streams
   2.13. Alignment with RFC 3016
3. Payload format
   3.1. Usage of RTP header fields and RTCP
   3.2. RTP payload structure
      3.2.1. The AU Header Section
         3.2.1.1. The AU-header
      3.2.2. The Auxiliary Section
      3.2.3. The Access Unit Data Section
         3.2.3.1. Fragmentation
         3.2.3.2. Interleaving
         3.2.3.3. Constraints for interleaving
   3.3. Usage of this specification
      3.3.1. General
      3.3.2. The generic mode
      3.3.3. Constant bit rate CELP
      3.3.4. Variable bit rate CELP
      3.3.5. Low bit rate AAC
      3.3.6. High bit rate AAC
      3.3.7. Additional modes
4. IANA considerations
   4.1. MIME type registration
   4.2. Concatenation of parameters
   4.3. Usage of SDP
      4.3.1. The a=fmtp keyword
5. Security considerations
6. Acknowledgements
7. References
8. Author addresses
APPENDIX: Usage of this payload format
   A. Examples of delay analysis with interleave
      A.1 Group interleave
      A.2 Continuous interleave

1. Introduction

The MPEG Committee is Working Group 11 (WG11) in ISO/IEC JTC1 SC29
that specified the MPEG-1, MPEG-2 and, more recently, the MPEG-4
standards [1]. The MPEG-4 standard specifies compression of
audio-visual data into, for example, an audio or video elementary
stream.
In the MPEG-4 standard, these streams take the form of
audio-visual objects that may be arranged into an audio-visual scene
by means of a scene description. Each MPEG-4 elementary stream
consists of a sequence of Access Units; examples of an Access Unit
(AU) are an audio frame and a video picture.

This specification defines a general and configurable payload
structure to transport MPEG-4 elementary streams, in particular
MPEG-4 audio (including speech) streams, MPEG-4 video streams and
also MPEG-4 systems streams, such as BIFS (BInary Format for
Scenes), OCI (Object Content Information), OD (Object Descriptor)
and IPMP (Intellectual Property Management and Protection) streams.
The RTP payload defined in this document is simple to implement and
reasonably efficient. It allows for optional interleaving of Access
Units (such as audio frames) to increase error resiliency in case
of packet loss.

Though the RTP payload format defined in this document is capable
of transporting any MPEG-4 stream, more dedicated formats may exist,
such as RFC 3016 for transport of MPEG-4 video (part 2).

Configuration of the payload is provided to accommodate transport
of any MPEG-4 stream at any possible bit rate. However, for a
specific MPEG-4 elementary stream typically only very few
configurations are needed. So as to allow for the design of
simplified, but dedicated receivers, this specification requires
that specific modes are defined for transport of MPEG-4 streams.
This document defines modes for MPEG-4 CELP and AAC streams, as
well as a generic mode that can be used to transport any MPEG-4
stream. In the future, new RFCs are expected to specify additional
modes for transport of MPEG-4 streams.

The RTP payload format defined in this document specifies carriage
of system-related information that is often equivalent to the
information that may be contained in the MPEG-4 Sync Layer (SL).
This document does not prescribe how to transcode or map information
from the SL to fields defined in the RTP payload format. Such
processing, if any, is left to the discretion of the application.
However, to anticipate the need for transport of any additional
system-related information in future, an auxiliary field can be
configured that may carry any such data.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC 2119 [3].

2. Carriage of MPEG-4 elementary streams over RTP

2.1 Introduction

With this payload format a single MPEG-4 elementary stream can be
transported. Information on the type of MPEG-4 stream carried in
the payload is conveyed by MIME format parameters, for example in
an SDP [6] message or by other means. These MIME format parameters
specify the configuration of the payload. To allow for simplified
and dedicated receivers, a MIME format parameter is available
to signal a specific mode of using this payload. A mode definition
MAY include the type of MPEG-4 elementary stream as well as the
applied configuration, so as to avoid the need in receivers
to parse all MIME format parameters. The applied mode MUST be
signalled.

2.2 MPEG Access Units

For carriage of compressed audio-visual data MPEG defines Access
Units. An MPEG Access Unit (AU) is the smallest data entity to
which timing information is attributed. In case of audio an Access
Unit may represent an audio frame and in case of video a picture.
MPEG Access Units are by definition byte aligned. If for example an
audio frame is not byte aligned, up to 7 zero-padding bits MUST be
inserted at the end of the frame to achieve a byte-aligned Access
Unit. MPEG-4 decoders MUST be able to decode AUs in which such
padding is applied.
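
As an illustration of the padding rule above, the following minimal
Python sketch computes the zero-padding needed to byte-align a frame.
The function name pad_to_byte_boundary and the representation of bits
as a string of '0' and '1' characters are assumptions made for this
example only; neither is defined by this specification.

```python
def pad_to_byte_boundary(bits: str) -> str:
    """Append zero bits (at most 7) until the length is a multiple of 8.

    `bits` is a string of '0'/'1' characters; this representation is an
    assumption of the sketch, chosen only to keep the example readable.
    """
    padding = -len(bits) % 8          # 0..7 bits are needed
    return bits + "0" * padding


# A hypothetical 13-bit audio frame needs 3 padding bits.
frame = "1" * 13
access_unit = pad_to_byte_boundary(frame)
print(len(access_unit))               # 16
```

A frame that is already byte aligned is returned unchanged, so the
helper can be applied unconditionally.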

Consistent with the MPEG-4 specification, this document requires
that each MPEG-4 part 2 video Access Unit includes all the coded
data of a picture, any video stream headers that may precede the
coded picture data, and any video stream stuffing that may follow
it, up to, but not including the startcode indicating the start of
a new video stream or the next Access Unit.

2.3 Concatenation of Access Units

Frequently it is possible to carry multiple Access Units in one RTP
packet. This is particularly useful for audio; for example, when
AAC is used for encoding of a stereo signal at 64 kbits/sec, AAC
frames contain on average approximately 200 octets. On a LAN with a
1500 octet MTU this would allow on average 7 complete AAC frames to
be carried per RTP packet.

Access Units may have a fixed size in octets, but a variable size
is also possible. To facilitate parsing in case of multiple
concatenated AUs in one RTP packet, the size of each AU is made
known to the receiver. When concatenating in case of a constant AU
size, this size is communicated "out of band" through a MIME format
parameter. When concatenating in case of variable size AUs, the RTP
payload carries "in band" an AU size field for each contained AU.
In combination with the RTP payload length the size information
allows the RTP payload to be split by the receiver back into the
individual AUs.

To simplify the implementation of RTP receivers, it is required
that when multiple AUs are carried in an RTP packet, each AU MUST
be complete, i.e. the number of AUs in an RTP packet MUST be
integral.

2.4 Fragmentation of Access Units

MPEG allows for very large Access Units. Since most IP networks
have significantly smaller MTU sizes, this payload format allows
for the fragmentation of an Access Unit over multiple RTP packets
so as to avoid IP layer fragmentation.
To simplify the implementation of RTP receivers, an RTP packet
SHALL either carry one or more complete Access Units or a single
fragment of one Access Unit.

2.5 Interleaving

When an RTP packet carries a contiguous sequence of Access Units,
the loss of such a packet can result in a "decoding gap" for the
user. One method to alleviate this problem is to allow for the
Access Units to be interleaved in the RTP packets. For a modest
cost in latency and implementation complexity, significant error
resiliency to packet loss can be achieved.

To support optional interleaving of Access Units, this payload
format allows for index information to be sent for each Access Unit.
The RTP sender is free to choose the interleaving pattern without
propagating this information to the receiver(s). Indeed the sender
could dynamically adjust the interleaving pattern based on the
Access Unit size, error rates, etc. The RTP receiver does not need
to know the interleaving pattern used; it only needs to extract the
index information of the Access Unit and insert the Access Unit
into the appropriate sequence in the rendering queue. An example of
interleaving is given below.

Assume that an RTP packet contains 3 AUs, and that the AUs are
numbered 1, 2, 3, 4, etc. If an interleaving group length of 9 is
chosen, then RTP packet(i) contains the following AU(n):

   RTP packet(1): AU(1), AU(4), AU(7)
   RTP packet(2): AU(2), AU(5), AU(8)
   RTP packet(3): AU(3), AU(6), AU(9)
   RTP packet(4): AU(10), AU(13), AU(16)
   RTP packet(5): AU(11), AU(14), AU(17)
   Etc.

2.6 Time stamp information

The RTP time stamp MUST carry the sampling instance of the first AU
(fragment) in the RTP packet. When multiple AUs are carried within
an RTP packet, the time stamps of subsequent AUs can be calculated
if the frame period of each AU is known.
For audio and video this is possible if the frame rate is constant.
However, in some cases it is not possible to make such a
calculation, for example for variable frame rate video and for
MPEG-4 BIFS streams carrying composition information. To support
such cases, this payload format can be configured to carry a time
stamp in the RTP payload for each contained Access Unit. A time
stamp MAY be conveyed in the RTP payload only for non-first AUs in
the RTP packet, and SHALL NOT be conveyed for the first AU
(fragment), as the time stamp for the latter is carried by the RTP
time stamp.

MPEG-4 defines two types of time stamps, the composition time stamp
(CTS) and the decoding time stamp (DTS). The CTS represents the
sampling instance of an AU, and hence the CTS is equivalent to the
RTP time stamp. The DTS may be used only in MPEG-4 video streams
that use bi-directional coding, i.e. when pictures are predicted in
both forward and backward direction by using either a reference
picture in the past, or a reference picture in the future. The DTS
cannot be carried in the RTP header. In some cases the DTS can be
derived from the RTP time stamp using frame rate information; this
requires deep parsing in the video stream, which may be considered
objectionable. But if the video frame rate is variable, the required
information may not even be present in the video stream. For both
reasons, the capability has been defined to optionally carry the
DTS in the RTP payload for each contained Access Unit.

Since RTP time stamps may be re-stamped by RTP devices, each time
stamp contained in the RTP payload is coded differentially, the CTS
from the RTP time stamp, and the DTS from the CTS, so as to avoid
extensive parsing by re-stamping devices.
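
The differential coding described above can be sketched as follows in
Python. This is a minimal illustration, not a normative procedure: it
assumes that both deltas are two's complement values of a configured
bit width (as the CTS-delta and DTS-delta definitions in section
3.2.1.1 state) and that results wrap modulo 2^32 like the RTP time
stamp. The names sign_extend, recover_timestamps, cts_delta_bits and
dts_delta_bits are hypothetical and chosen for this example only.

```python
RTP_TS_MODULUS = 1 << 32   # RTP time stamps are 32-bit and wrap around


def sign_extend(raw: int, bits: int) -> int:
    """Interpret the unsigned `bits`-wide value `raw` as two's complement."""
    sign_bit = 1 << (bits - 1)
    return (raw ^ sign_bit) - sign_bit


def recover_timestamps(rtp_timestamp: int,
                       cts_delta_raw: int, cts_delta_bits: int,
                       dts_delta_raw: int, dts_delta_bits: int):
    """Return (CTS, DTS) for one AU, in RTP clock ticks.

    The CTS is coded as an offset from the RTP time stamp and the DTS
    as an offset from the CTS, so a device that re-stamps the RTP
    header does not have to touch the payload.
    """
    cts = (rtp_timestamp + sign_extend(cts_delta_raw, cts_delta_bits)) % RTP_TS_MODULUS
    dts = (cts + sign_extend(dts_delta_raw, dts_delta_bits)) % RTP_TS_MODULUS
    return cts, dts


# Hypothetical values on a 90 kHz clock: the AU is composed 3000 ticks
# after the packet's RTP time stamp and decoded 3000 ticks before that.
cts, dts = recover_timestamps(90000, 3000, 16, (1 << 16) - 3000, 16)
print(cts, dts)   # 93000 90000
```

Because only offsets are carried in the payload, replacing the RTP
time stamp shifts both recovered values by the same amount.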

2.7 Random access indication

Random access to the content of MPEG-4 elementary streams may be
possible at some but not all Access Units. To signal Access Units
where random access is possible, a random access point flag can
optionally be carried in the RTP payload for each contained Access
Unit.

2.8 State indication of MPEG-4 system streams

ISO/IEC 14496-1 defines states for MPEG-4 system streams. So as to
convey state information when transporting MPEG-4 system streams,
this payload format allows for the optional carriage in the RTP
payload of the stream state for each contained Access Unit. The
indication of stream states is particularly useful when repeating
AUs according to the carousel mechanism defined in ISO/IEC 14496-1.

2.9 Carriage of auxiliary information

This payload format defines a specific field to carry auxiliary
data. The auxiliary data field is preceded by a field that specifies
the length of the auxiliary data, so as to facilitate skipping of
the data without parsing it. The coding of the auxiliary data is not
defined in this document, but is left to the discretion of
applications. Receivers that have knowledge of the auxiliary data
MAY decode the auxiliary data, but receivers without knowledge of
such data MUST skip the auxiliary data field.

2.10 MIME format parameters and configuring conditional fields

To support the features described in the previous sections several
fields are defined for carriage in the RTP payload. However, their
use strongly depends on the type of MPEG-4 elementary stream that
is carried. Sometimes a specific field is needed with a certain
length, while in other cases such a field is not needed at all. To
be efficient in either case, the fields to support these features
are configurable by means of MIME format parameters.
In general, a MIME format parameter defines the presence and length
of the associated field. A length of zero indicates absence of the
field. As a consequence, parsing of the payload requires knowledge
of MIME format parameters. The MIME format parameters are conveyed
to the receiver via SDP [6] messages or through other means.

2.11 Global structure of payload format

The RTP payload, following the RTP header, contains three
byte-aligned data sections, of which the first two MAY be empty. See
figure 1.

   +---------+-----------+-----------+---------------+
   | RTP     | AU Header | Auxiliary | Access Unit   |
   | Header  | Section   | Section   | Data Section  |
   +---------+-----------+-----------+---------------+

             <----------RTP Packet Payload----------->

   Figure 1: Data sections within an RTP packet

The first data section is the AU (Access Unit) Header Section, which
contains one or more AU-headers; however, each AU-header MAY be
empty, in which case the entire AU Header Section is empty. The
second section is the Auxiliary Section, containing auxiliary data;
this section MAY also be configured empty. The third section is the
Access Unit Data Section, containing either a single fragment of
one Access Unit or one or more complete Access Units. The Access
Unit Data Section is never empty.

2.12 Modes to transport MPEG-4 streams

While it is possible to build fully configurable receivers capable
of receiving any MPEG-4 stream, this specification also allows for
the design of simplified, but dedicated receivers, that are capable
for example of receiving only one type of MPEG-4 stream. This
is achieved by requiring that specific modes be defined for using
this specification. Each mode may define constraints for transport
of one or more types of MPEG-4 streams, for instance on the payload
configuration.

The applied mode MUST be signalled.
Signalling the mode is particularly important for receivers that
are only capable of decoding one or more specific modes. Such
receivers need to determine whether the applied mode is supported,
so as to avoid problems with processing of payloads that are beyond
the capabilities of the receiver.

In this document several modes are defined for transport of MPEG-4
CELP and AAC streams, as well as a generic mode that can be used
for any MPEG-4 stream. In the future, new RFCs are expected to
specify additional modes of using this specification. New modes can
be defined as deemed appropriate, typically by specifications that
are hierarchically higher than this payload format. However, each
mode MUST be in full compliance with this specification.

2.13 Alignment with RFC 3016

This payload can be configured to be nearly identical to the
payload format defined in RFC 3016 [5] for the MPEG-4 video
configurations recommended in RFC 3016. Hence, receivers that
comply with RFC 3016 can decode such an RTP payload, provided that
additional packets containing video decoder configuration (VO,
VOL, VOSH) are inserted in the stream, as required by RFC 3016.
Conversely, receivers that comply with the specification in this
document SHOULD be able to decode payloads, names and parameters
defined for MPEG-4 video in RFC 3016. In this respect it is
strongly recommended to implement the ability to ignore "in band"
video decoder configuration packets in the RFC 3016 payload.

Note that the "out of band" availability of the video decoder
configuration is optional in RFC 3016. To achieve maximum
interoperability with the RTP payload format defined in this
document, applications that use RFC 3016 to transport MPEG-4 video
(part 2) are recommended to make the video decoder configuration
available as a MIME parameter.

3. Payload Format

3.1 Usage of RTP Header Fields and RTCP

Payload Type (PT): The assignment of an RTP payload type for this
RTP packet format is outside the scope of this document, and will
not be specified here. It is expected that the RTP profile for a
particular class of applications will assign a payload type for
this encoding, or if that is not done, then a payload type in the
dynamic range shall be chosen.

Marker (M) bit: The M bit is set to 1 to indicate that the RTP
packet payload includes the end of each Access Unit of which data
is contained in this RTP packet. As the payload either carries one
or more complete Access Units or a single fragment of an Access
Unit, the M bit is always set to 1, except when the packet carries
a single fragment of an Access Unit that is not the last one.

Extension (X) bit: Defined by the RTP profile used.

Sequence Number: The RTP sequence number SHOULD be generated by
the sender with a constant random offset.

Timestamp: Indicates the sampling instance of the first AU
contained in the RTP payload. This sampling instance is equivalent
to the CTS in the MPEG-4 time domain. When using SDP the clock rate
of the RTP time stamp MUST be expressed using the "rtpmap"
attribute. If an MPEG-4 audio stream is transported, the rate SHOULD
be set to the same value as the sampling rate of the audio stream.
If an MPEG-4 video stream is transported, it is RECOMMENDED to set
the rate to 90 kHz.
In all cases, the sender SHALL make sure that RTP time stamps
are identical only if the RTP time stamp refers to fragments of the
same Access Unit.
According to RFC 1889 [2] (section 5.1), RTP time stamps are
recommended to start at a random value for security reasons. This
is not an issue for synchronization of multiple RTP streams.
However, in applications where streams from multiple sources are to
be synchronized (for example one stream from local storage, another
from an RTP streaming server), synchronization may become impossible.
To also enable synchronization in such cases, it may be necessary to
provide the required relationship between time stamps for obtaining
synchronization by out of band means. The format of such information
as well as methods to convey such information are beyond the scope
of this specification.

SSRC: set as described in RFC 1889 [2].

CC and CSRC fields are used as described in RFC 1889 [2].

RTCP SHOULD be used as defined in RFC 1889 [2].

3.2 RTP Payload Structure

3.2.1 The AU Header Section

When present, the AU Header Section consists of the AU-headers-length
field, followed by a number of AU-headers. See figure 2.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+
   |AU-headers-length|AU-header|AU-header|      |AU-header|padding|
   |                 |   (1)   |   (2)   |      |   (n)   | bits  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+-+

   Figure 2: The AU Header Section

The AU-headers are configured using MIME format parameters and MAY
be empty. If the AU-header is configured empty, the
AU-headers-length field SHALL NOT be present and consequently the
AU Header Section is empty. If the AU-header is not configured
empty, then the AU-headers-length is a two-octet field that
specifies the length in bits of the immediately following
AU-headers, excluding the padding bits.

Each AU-header is associated with a single Access Unit (fragment)
contained in the Access Unit Data Section in the same RTP packet.
For each contained Access Unit (fragment) there is exactly one
AU-header. Within the AU Header Section, the AU-headers are
bit-wise concatenated in the order in which the Access Units are
contained in the Access Unit Data Section.
Hence, the n-th AU-header refers to the n-th AU (fragment). If the
concatenated AU-headers consume a non-integer number of octets, up
to 7 zero-padding bits MUST be inserted at the end in order to
achieve byte-alignment of the AU Header Section.

3.2.1.1 The AU-header

The AU-header contains the fields given in figure 3. The length in
bits of these fields, with the exception of the CTS-flag, the
DTS-flag and the RAP-flag fields, is defined by MIME format
parameters; see section 4.1. If a MIME format parameter has the
default value of zero, then the associated field is not present.

   +---------------------------------------+
   |     AU-size                           |
   +---------------------------------------+
   |     AU-Index / AU-Index-delta         |
   +---------------------------------------+
   |     CTS-flag                          |
   +---------------------------------------+
   |     CTS-delta                         |
   +---------------------------------------+
   |     DTS-flag                          |
   +---------------------------------------+
   |     DTS-delta                         |
   +---------------------------------------+
   |     RAP-flag                          |
   +---------------------------------------+
   |     Stream-state                      |
   +---------------------------------------+

   Figure 3: The fields in the AU-header. If used, the AU-Index field
             only occurs in the first AU-header within an AU Header
             Section; in any other AU-header the AU-Index-delta field
             occurs instead.

AU-size: Indicates the size in octets of the associated Access Unit
   in the Access Unit Data Section in the same RTP packet. When
   the AU-size is associated with an AU fragment, the AU size
   indicates the size of the entire AU and not the size of the
   fragment. This can be exploited to determine whether a packet
   contains an entire AU or a fragment, which is particularly
   useful after losing a packet carrying the last fragment of an
   AU.

AU-Index: Indicates the serial number of the associated Access Unit
   (fragment).
   For each (in decoding order) consecutive AU or AU
   fragment, the serial number is incremented by 1. When
   present, the AU-Index field occurs in the first AU-header in
   the AU Header Section, but MUST NOT occur in any subsequent
   (non-first) AU-header in that Section. To encode the serial
   number in any such non-first AU-header, the AU-Index-delta
   field is used. If each AU-Index field is coded with the value
   0, the serial number of the AU (fragment) is not specified,
   and in that case receivers MAY ignore the AU-Index field.

AU-Index-delta: The AU-Index-delta field is an unsigned integer
   that specifies the serial number of the associated AU as the
   difference with respect to the serial number of the previous
   Access Unit. Hence, for the n-th (n>1) AU the serial number
   is found from:

      AU-Index(n) = AU-Index(n-1) + AU-Index-delta(n) + 1

   If the AU-Index field is present in the first AU-header in
   the AU Header Section, then the AU-Index-delta field MUST be
   present in any subsequent (non-first) AU-header. When the
   AU-Index-delta is coded with the value 0, it indicates that
   the Access Units are consecutive in decoding order. An
   AU-Index-delta value larger than 0 signals that interleaving
   is applied.

CTS-flag: Indicates whether the CTS-delta field is present.
   A value of 1 indicates that the field is present, a value
   of 0 that it is not present.
   The CTS-flag field MUST be present in each AU-header if the
   length of the CTS-delta field is signalled to be larger than
   zero. In that case, the CTS-flag field MUST have the value 0
   in the first AU-header and MAY have the value 1 in all
   non-first AU-headers. The CTS-flag field SHOULD be 0 for
   any non-first fragment of an Access Unit.

CTS-delta: Encodes the CTS by specifying the value of CTS as a 2's
   complement offset (delta) from the time stamp in the RTP
   header of this RTP packet.
      The CTS MUST use the same clock rate as the time stamp in the
      RTP header.

   DTS-flag: Indicates whether the DTS-delta field is present. A value
      of 1 indicates that DTS-delta is present, a value of 0 that
      it is not present.
      The DTS-flag field MUST be present in each AU-header if the
      length of the DTS-delta field is signalled to be larger than
      zero. The DTS-flag field SHOULD be 0 for any non-first
      fragment of an Access Unit.

   DTS-delta: Specifies the value of the DTS as a 2's complement
      offset (delta) from the CTS. The DTS MUST use the
      same clock rate as the time stamp in the RTP header.

   RAP-flag: Indicates when set to 1 that the associated Access Unit
      provides a random access point to the content of the stream.
      If an Access Unit is fragmented, the RAP flag, if present,
      MUST be set to 0 for each non-first fragment of the AU.

   Stream-state: Specifies the state of the stream for the AU of an
      MPEG-4 system stream. For states of MPEG-4 system streams see
      ISO/IEC 14496-1. The stream state is set either to 0 or to 1.
      A change of the stream state value (either from 1 to 0 or from
      0 to 1) indicates another state of the stream. At an AU that
      provides a random access point, as signalled by the RAP-flag,
      a change in the stream state MUST occur, unless the AU is a
      repeated random access point. Hence, receivers MAY ignore AUs
      with the RAP-flag set to 1 if the stream state does not
      change. Receivers that don't ignore a repeated random access
      point SHOULD take care that such processing does not disrupt
      the decoding process.
      Note: no relation is required between stream-states of
      different streams.

   If present, the fields MUST occur in the mutual order given in
   figure 3.
   In the general case a receiver can only discover the size of an
   AU-header by parsing it, since the presence of the CTS-delta and
   DTS-delta fields is signalled by the value of the CTS-flag and
   DTS-flag, respectively.

3.2.2 The Auxiliary Section

   The Auxiliary Section consists of the auxiliary-data-size field
   followed by the auxiliary-data field. Receivers MAY (but are not
   required to) parse the auxiliary-data field; to facilitate skipping
   of the auxiliary-data field by receivers, the auxiliary-data-size
   field indicates the length in bits of the auxiliary-data. If the
   concatenation of the auxiliary-data-size and the auxiliary-data
   fields consumes a non-integer number of octets, up to 7 zero padding
   bits MUST be inserted immediately after the auxiliary data in order
   to achieve byte-alignment. See figure 4.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+
   | auxiliary-data-size   | auxiliary-data       |padding bits |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- .. -+-+-+-+-+-+-+-+-+

   Figure 4: The fields in the Auxiliary Section

   The length in bits of the auxiliary-data-size field is configurable
   by a MIME format parameter; see section 4.1. The default length of
   zero indicates that the entire Auxiliary Section is absent.

   auxiliary-data-size: specifies the length in bits of the immediately
      following auxiliary-data field;

   auxiliary-data: the auxiliary-data field contains data of a format
      not defined by this specification.

3.2.3 The Access Unit Data Section

   The Access Unit Data Section contains an integer number of complete
   Access Units or a single fragment of one AU. The Access Unit Data
   Section is never empty. If data of more than one Access Unit is
   present, then the AUs are concatenated into a contiguous string
   of octets. See figure 5. The AUs inside the Access Unit Data
   Section MUST be in decoding order.
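   The field order of section 3.2.1.1 and the byte-alignment rule for
   the AU Header Section can be illustrated with a small parser. The
   following is a minimal sketch, not a reference implementation: the
   names AuHeaderConfig, BitReader and parse_au_header_section are
   hypothetical, the configuration values are assumed to come from the
   MIME format parameters of section 4.1, and truncated input is not
   handled.

```python
from dataclasses import dataclass


@dataclass
class AuHeaderConfig:
    """Hypothetical holder for the MIME format parameters (0 = absent)."""
    size_length: int = 0
    index_length: int = 0
    index_delta_length: int = 0
    cts_delta_length: int = 0
    dts_delta_length: int = 0
    random_access_indication: int = 0
    stream_state_indication: int = 0

    def is_empty(self):
        # With every parameter at its default, the AU Header Section
        # (including AU-headers-length) is absent.
        return not any(vars(self).values())


class BitReader:
    """Reads unsigned or 2's complement values MSB-first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            bit = (self.data[self.pos // 8] >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def read_signed(self, nbits):
        raw = self.read(nbits)
        return raw - (1 << nbits) if raw >> (nbits - 1) else raw


def parse_au_header_section(payload, cfg):
    """Return (list of AU-header dicts, section length in octets)."""
    if cfg.is_empty():
        return [], 0
    reader = BitReader(payload)
    end = 16 + reader.read(16)  # AU-headers-length counts bits
    headers = []
    while reader.pos < end:
        start = reader.pos
        header = {}
        if cfg.size_length:
            header["au_size"] = reader.read(cfg.size_length)
        if cfg.index_length:
            if not headers:  # AU-Index only in the first AU-header
                header["au_index"] = reader.read(cfg.index_length)
            else:
                header["au_index_delta"] = reader.read(
                    cfg.index_delta_length)
        # The flag bit is only read when the delta length is non-zero.
        if cfg.cts_delta_length and reader.read(1):
            header["cts_delta"] = reader.read_signed(cfg.cts_delta_length)
        if cfg.dts_delta_length and reader.read(1):
            header["dts_delta"] = reader.read_signed(cfg.dts_delta_length)
        if cfg.random_access_indication:
            header["rap_flag"] = reader.read(1)
        if cfg.stream_state_indication:
            header["stream_state"] = reader.read(1)
        if reader.pos == start:
            raise ValueError("configuration yields a zero-length AU-header")
        headers.append(header)
    # Skip up to 7 padding bits to reach the next octet boundary.
    return headers, (end + 7) // 8
```

   The returned octet count tells the caller where the next section of
   the payload starts. Because the CTS-delta and DTS-delta fields are
   conditional on their flags, the sketch cannot precompute a header
   size and has to walk the bits, as the text above notes.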

   The size and number of Access Units SHOULD be adjusted such that
   the resulting RTP packet is not larger than the path MTU. To handle
   larger packets, this payload format relies on lower layers for
   fragmentation, which may not be desirable.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |AU(1)                                                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-                            |
   |                                                               |
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |AU(2)                                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                            -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            | AU(n)                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               |
   |-+-+-+-+-+-+-+-+

   Figure 5: Access Unit Data Section; each AU is byte aligned.

   When multiple Access Units are carried, the size of each AU MUST be
   made available to the receiver. If the AU size is variable then the
   size of each AU MUST be indicated in the AU-size field of the
   corresponding AU-header. However, if the AU size is constant for a
   stream, this mechanism SHOULD NOT be used, but instead the fixed
   size SHOULD be signalled by the MIME format parameter
   "ConstantSize"; see section 4.1.

   The absence of both AU-size in the AU-header and the ConstantSize
   MIME format parameter indicates carriage of a single AU (fragment),
   i.e. that a single Access Unit (fragment) is transported in each
   RTP packet for that stream.

3.2.3.1 Fragmentation

   A packet SHALL carry either one or more Access Units, or a single
   fragment of an Access Unit. Fragments of the same Access Unit have
   the same time stamp but different RTP sequence numbers. The marker
   bit in the RTP header is 1 on the last fragment of an Access Unit,
   and 0 on all other fragments.

3.2.3.2 Interleaving

   Access Units MAY be interleaved. Senders MAY perform interleaving.
   Receivers MUST support interleaving.
   When interleaving of Access Units is used it SHALL be implemented
   using the AU-Index and AU-Index-delta fields in the AU-header.

   Based on the RTP sequence number, the RTP time stamp, the AU-Index
   and the AU-Index-delta, a receiver can unambiguously reconstruct
   the original order even in case of out-of-order packets, packet
   loss or duplication. Note that for this purpose the AU-Index is
   redundant when the RTP time stamp and the AU-Index-delta values are
   sufficient for placing the AUs correctly in time. In such cases
   receivers MAY ignore the AU-Index value and senders MAY code the
   AU-Index field with the value 0, but only if they code each AU-Index
   field with that value.

   When interleaving is applied, a de-interleave buffer is needed in
   receivers to put the Access Units in their correct logical
   consecutive decoding order. This requires the computation of the
   time stamp for each Access Unit. In case of a fixed time duration
   per Access Unit, the time stamp of the i-th access unit in an RTP
   packet with RTP time stamp T is calculated as follows:

      Timestamp[0] = T
      Timestamp[i, i > 0] = T + (Sum(for k=1 to i of
                            (AU-Index-delta[k] + 1)))
                            * access-unit-duration

   When AU-Index-delta is always 0, this reduces to
   T + i * (access-unit-duration). This is the non-interleaved case,
   where the frames are consecutive in decoding order. Note that the
   AU-Index field (present for the first Access Unit) is not needed in
   this calculation. Hence in cases where the access-unit-duration has
   a fixed and known value, the AU-Index does not need to provide index
   information and can be coded with the value 0. See also the
   semantics of the AU-Index field in 3.2.1.1.

   When an RTP packet arrives (after any reordering has been done),
   receivers may 'flush' all Access Units from the interleave buffer
   which have a time stamp strictly less than the time stamp of the
   arriving packet.
   Similarly, the first Access Unit of every arriving packet can always
   be flushed (as no following packet can provide an earlier Access
   Unit), together with any already received Access Units that are
   consecutive with it. Access Units should also be flushed in time to
   be played; this can be important if there is loss before
   end-of-stream, before a silence interval, or before a large
   drop-out.

3.2.3.3 Constraints for interleaving

   The size of the packets should be suitably chosen to be appropriate
   to both the path MTU and the duration and capacity of the receiver's
   de-interleave buffer. The maximum packet size for a session should
   be chosen not to exceed the path MTU.

   In order to control receiver latency and mitigate the effects of
   loss, there are profile-based limits on the size of the packet.
   This is expressed as a duration: it is calculated from the duration
   of the Access Units contained within a packet. Note that this
   duration is NOT the difference between the time stamps of the first
   and last Access Unit in a packet.

   No matter what interleaving scheme is used, the scheme must be
   analyzed to calculate the minimum number of frames a receiver has
   to buffer in order to de-interleave.

   Three profiles are defined to constrain the latency when
   interleaving. The applied profile is signalled by the MIME format
   parameter "Profile", indicating the decimal number of the profile.
   The maximum de-interleave buffer required at the receiver can be
   determined if the maximum packet duration is known. The maximum
   packet duration for the three profiles shall not exceed:

      Profile 0 -- 200 milliseconds
      Profile 1 -- 500 milliseconds
      Profile 2 -- 1500 milliseconds

   When interleaving is applied, the applied profile MUST be signalled
   by the MIME format parameter "Profile"; see section 4.1.
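   The time stamp formula of section 3.2.3.2 and the profile limits
   above can be worked through in a few lines. This is a sketch under
   stated assumptions: the function names are hypothetical, the
   Access Unit duration is fixed and expressed in RTP clock ticks, and
   the packet duration is taken to be the sum of the durations of the
   Access Units in the packet, which is one reading of the text above.

```python
# Maximum packet duration per profile, in milliseconds (section 3.2.3.3).
PROFILE_MAX_PACKET_DURATION_MS = {0: 200, 1: 500, 2: 1500}


def au_timestamps(rtp_timestamp, index_deltas, au_duration):
    """Time stamps of all AUs in one packet.

    index_deltas holds the AU-Index-delta values of the second and
    later AU-headers; au_duration is in RTP clock ticks.
    """
    stamps = [rtp_timestamp]
    offset = 0
    for delta in index_deltas:
        offset += delta + 1  # running sum of (AU-Index-delta[k] + 1)
        # RTP time stamps are 32 bits wide, so wrap modulo 2**32.
        stamps.append((rtp_timestamp + offset * au_duration) % 2**32)
    return stamps


def packet_duration_ms(au_count, au_duration, clock_rate):
    """Assumed packet duration: sum of the AU durations, in ms."""
    return 1000.0 * au_count * au_duration / clock_rate


def within_profile(profile, au_count, au_duration, clock_rate):
    limit = PROFILE_MAX_PACKET_DURATION_MS[profile]
    return packet_duration_ms(au_count, au_duration, clock_rate) <= limit
```

   For example, with T = 1000, an Access Unit duration of 1024 ticks
   and AU-Index-delta values of 2 and 2, the three Access Units get
   the time stamps 1000, 4072 and 7144. With all deltas equal to 0 the
   result is T + i * 1024, the non-interleaved case.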

   Note that for low bit-rate material, this duration limit may make
   packets shorter than the MTU size.

3.3 Usage of this specification

3.3.1 General

   Usage of this specification requires definition of a mode. A mode
   defines how to use this specification, as deemed appropriate.
   Senders MUST signal the applied mode via the MIME format parameter
   "Mode". This specification defines a generic mode that can be used
   for any MPEG-4 stream, as well as specific modes for transport of
   MPEG-4 CELP and MPEG-4 AAC streams, defined in ISO/IEC 14496-3.

   In any mode compliant to this specification the same requirements
   apply for the rtpmap attributes. The general form of an rtpmap
   attribute is:

      a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding
         parameters>]

   For audio streams, <encoding parameters> specifies the number of
   audio channels: 2 for stereo material (see RFC 2327) and 1 for
   mono. Provided no additional parameters are needed, this parameter
   may be omitted for mono material, hence its default value is 1.

3.3.2 The generic mode

   The generic mode can be used for any MPEG-4 stream. In this mode
   no mode-specific constraints are applied; hence, in the generic
   mode the full flexibility of this specification can be exploited.
   The generic mode is signalled by mode=generic.

   An example is given below for transport of a BIFS stream. In this
   example carriage of multiple BIFS Access Units is allowed in one
   RTP packet. The AU-header contains the AU-size field, the CTS-flag
   and, if the CTS flag is set to 1, the CTS-delta field. The number
   of bits of the AU-size and the CTS-delta fields is 13 and 16,
   respectively. The AU-header also contains the RAP-flag and the
   Stream-state, both of 1 bit. This results in an AU-header with a
   total size of two or four octets per BIFS AU. The RTP time stamp
   uses a 1 kHz clock. Note that the media type name is video,
   because the BIFS stream is part of an audiovisual presentation.
   For conventions on media type names see section 4.1.

   In detail:

   m=video 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/1000
   a=fmtp:96 streamtype=3; profile-level-id=257; mode=generic;
   ObjectType=2; config=BIFSConfiguration(); SizeLength=13;
   CTSDeltaLength=16; RandomAccessIndication=1;
   StreamStateIndication=1

   Note that BIFSConfiguration() is defined in ISO/IEC 14496-1; for
   the description of MIME parameters see section 4.1.

3.3.3 Constant bit-rate CELP

   This mode is signalled by mode=CELP-cbr. In this mode one or more
   fixed size CELP frames can be transported in one RTP packet; there
   is no support for interleaving. The RTP payload consists of one or
   more concatenated CELP frames, each of the same size. Both the AU
   Header Section and the Auxiliary Section are empty.

   The MIME format parameter ConstantSize MUST be provided to specify
   the length of each CELP frame.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=CELP-cbr; config=
   AudioSpecificConfig(); ConstantSize=xxx;

   The AudioSpecificConfig(), defined in ISO/IEC 14496-3, specifies
   that the audio stream type is CELP. For the description of MIME
   parameters see section 4.1.

3.3.4 Variable bit-rate CELP

   This mode is signalled by mode=CELP-vbr. With this mode one or
   more variable size CELP frames can be transported in one RTP packet
   with optional interleaving. As the largest possible frame size in
   this mode is greater than the maximum CELP frame size, there is no
   support for fragmentation of CELP frames.

   In this mode the RTP payload consists of the AU Header Section,
   followed by one or more concatenated CELP frames. The Auxiliary
   Section is empty.
   For each CELP frame contained in the payload there is a one octet
   AU-header in the AU Header Section to provide:
   (a) the size of each CELP frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each CELP frame.
   Transport of CELP frames requires that the AU-size field is coded
   with 6 bits. In this mode therefore 6 bits are allocated to the
   AU-size field, and 2 bits to the AU-Index(-delta) field. Each
   AU-Index field MUST be coded with the value 0. In the AU Header
   Section, the concatenated AU-headers are preceded by the 16-bit
   AU-headers-length field, as specified in 3.2.1.

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=CELP-vbr; config=
   AudioSpecificConfig(); SizeLength=6; IndexLength=2;
   IndexDeltaLength=2; Profile=1

   The AudioSpecificConfig(), defined in ISO/IEC 14496-3, specifies
   that the audio stream type is CELP. For the description of MIME
   parameters see section 4.1.

3.3.5 Low bit-rate AAC

   This mode is signalled by mode=AAC-lbr. This mode supports transport
   of one or more variable size AAC frames with optional support for
   interleaving and fragmenting. The maximum size of an AAC frame
   (fragment) in this mode is 63 octets.

   The payload configuration in this mode is the same as in the
   variable bit-rate CELP mode as defined in 3.3.4. The RTP payload
   consists of the AU Header Section, followed by concatenated AAC
   frames. The Auxiliary Section is empty.
   For each AAC frame contained in the payload the one octet AU-header
   provides:
   (a) the size of each AAC frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each AAC frame.
   In the AU-header, the AU-size is coded with 6 bits and the
   AU-Index(-delta) with 2 bits; the AU-Index field MUST have the
   value 0 in each AU-header.
   In the AU Header Section, the concatenated AU-headers are preceded
   by the 16-bit AU-headers-length field, as specified in 3.2.1.

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=AAC-lbr; config=
   AudioSpecificConfig(); SizeLength=6; IndexLength=2;
   IndexDeltaLength=2; Profile=1

   The AudioSpecificConfig(), defined in ISO/IEC 14496-3, specifies
   that the audio stream type is AAC. For the description of MIME
   parameters see section 4.1.

3.3.6 High bit-rate AAC

   This mode is signalled by mode=AAC-hbr. This mode supports transport
   of one or more large variable size AAC frames in one RTP packet with
   optional support for interleaving and fragmenting. The maximum size
   of an AAC frame (fragment) in this mode is 8191 octets.

   In this mode the RTP payload consists of the AU Header Section,
   followed by one or more concatenated AAC frames. The Auxiliary
   Section is empty. For each AAC frame contained in the payload there
   is an AU-header in the AU Header Section to provide:
   (a) the size of each AAC frame in the payload and
   (b) index information for computing the sequence (and hence timing)
       of each AAC frame.
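   The audio modes of sections 3.3.4 and 3.3.5 use an AU-header that
   consists only of the AU-size followed by the AU-Index(-delta), with
   the concatenated AU-headers preceded by the 16-bit AU-headers-length
   field. A sender-side sketch of that layout is given below. The
   function name build_au_header_section is hypothetical, and the
   sketch assumes that IndexLength and IndexDeltaLength are equal, as
   they are in the examples of this document.

```python
def build_au_header_section(au_sizes, index_deltas, size_bits, index_bits):
    """Pack an AU Header Section for a mode with AU-size + AU-Index(-delta).

    au_sizes:     size in octets of each AU in the packet
    index_deltas: AU-Index-delta for the second and later AUs
    size_bits:    SizeLength
    index_bits:   IndexLength, assumed equal to IndexDeltaLength
    """
    bits = 0
    nbits = 0
    for n, size in enumerate(au_sizes):
        # The AU-Index in the first AU-header is coded with the value 0.
        index = 0 if n == 0 else index_deltas[n - 1]
        if size >= 1 << size_bits:
            raise ValueError("AU size does not fit in the AU-size field")
        if index >= 1 << index_bits:
            raise ValueError("index value does not fit in its field")
        bits = (bits << size_bits) | size
        bits = (bits << index_bits) | index
        nbits += size_bits + index_bits
    pad = -nbits % 8  # up to 7 zero-padding bits for byte-alignment
    body = (bits << pad).to_bytes((nbits + pad) // 8, "big")
    # AU-headers-length is the length in bits, excluding the padding.
    return nbits.to_bytes(2, "big") + body
```

   With SizeLength=6 and an index field of 2 bits, two frames of 20
   and 63 octets with an AU-Index-delta of 1 produce the four octets
   00 10 50 FD: a length of 16 bits followed by two one-octet
   AU-headers.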

   To code the maximum size of an AAC frame requires 13 bits. Therefore
   in this configuration 13 bits are allocated to the AU-size, and
   3 bits to the AU-Index(-delta) field. Thus each AU-header has a size
   of 2 octets. Each AU-Index field MUST be coded with the value 0. In
   the AU Header Section, the concatenated AU-headers are preceded by
   the 16-bit AU-headers-length field, as specified in 3.2.1.

   In addition to the required MIME format parameters, the following
   parameters MUST be present: SizeLength, IndexLength, and
   IndexDeltaLength.
   When interleaving is applied (AU-Index-delta coded with a value
   larger than 0), the parameter Profile MUST also be present.

   For example:

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 mpeg4-generic/44100/2
   a=fmtp:96 streamtype=5; profile-level-id=15; mode=AAC-hbr;
   config=AudioSpecificConfig(); SizeLength=13; IndexLength=3;
   IndexDeltaLength=3; Profile=1

   The AudioSpecificConfig(), defined in ISO/IEC 14496-3, specifies
   that the audio stream type is AAC. For the description of MIME
   parameters see section 4.1.

3.3.7 Additional modes

   This specification only defines the modes specified in sections
   3.3.2 up to 3.3.6. Additional modes are expected to be defined in
   future RFCs. Each additional mode MUST be in full compliance with
   this specification.

   When defining a new mode, care MUST be taken that an implementation
   of all features of this specification can decode the payload format
   corresponding to this new mode. For this reason a mode MUST NOT
   specify new default values for MIME parameters. In particular, MIME
   parameters that configure the RTP payload MUST be present (unless
   they have the default value), even if their presence is redundant
   because the mode assigns a fixed value to a parameter.
   A mode may additionally define that some MIME parameters are
   required instead of optional, that some MIME parameters have fixed
   values (or ranges), and that there are rules restricting the usage.

4. IANA considerations

   This section describes the MIME types and names associated with
   this payload format. Section 4.1 registers the MIME types, as per
   RFC 2048.

   This format may require additional information about the mapping to
   be made available to the receiver. This is done using parameters
   also described in the next section.

4.1 MIME type registration

   MIME media type name: "video" or "audio" or "application"

   "video" MUST be used for MPEG-4 Visual streams (ISO/IEC 14496-2)
   or MPEG-4 Systems streams (ISO/IEC 14496-1) that convey information
   needed for an audio/visual presentation.

   "audio" MUST be used for MPEG-4 Audio streams (ISO/IEC 14496-3)
   or MPEG-4 Systems streams that convey information needed for an
   audio-only presentation.

   "application" MUST be used for MPEG-4 Systems streams (ISO/IEC
   14496-1) that serve purposes other than audio/visual presentation,
   e.g. in some cases when MPEG-J streams are transmitted.

   Depending on the required payload configuration, MIME format
   parameters need to be available to the receiver. This is done using
   the parameters described in the next section. There are required
   and optional parameters.

   Optional parameters are of two types: general parameters and
   configuration parameters. The configuration parameters are used to
   configure the fields in the AU Header Section and in the Auxiliary
   Section. The absence of any configuration parameter is equivalent to
   the associated field set to its default value, which is always zero.
   The absence of all configuration parameters resolves into a default
   "basic" configuration with an empty AU Header Section and an empty
   Auxiliary Section in each RTP packet.

   MIME subtype name: mpeg4-generic

   Required parameters:

   MIME format parameters are not case dependent; however for clarity
   both upper and lower case are used in the names of the parameters
   described in this specification.

   StreamType:
      The integer value that indicates the type of MPEG-4 stream that
      is carried; its coding corresponds to the values of the
      streamType as defined in Table 9 (streamType Values) in
      ISO/IEC 14496-1. Note that the StreamType allows signalling of
      an MPEG-7 stream; this RTP payload format is not designed to
      carry an MPEG-7 stream, and may not be suitable for transport of
      MPEG-7 streams.

   Profile-level-id:
      A decimal representation of the MPEG-4 Profile Level indication.
      This parameter MUST be used in the capability exchange or
      session set-up procedure to indicate the MPEG-4 Profile and Level
      combination of which the relevant MPEG-4 media codec is capable.
      For MPEG-4 Audio streams, this parameter is the decimal value
         from Table 5 (audioProfileLevelIndication Values) in ISO/IEC
         14496-1, indicating which MPEG-4 Audio tool subsets are
         required to decode the audio stream.
      For MPEG-4 Visual streams, this parameter is the decimal value
         from Table G-1 (FLC table for profile and level indication of
         ISO/IEC 14496-2), indicating which MPEG-4 Visual tool subsets
         are required to decode the visual stream.
      For BIFS streams, this parameter is the decimal value that is
         obtained from (SPLI + 256*GPLI), where:
         SPLI is the decimal value from Table 4 in ISO/IEC 14496-1 with
         the applied sceneProfileLevelIndication;
         GPLI is the decimal value from Table 7 in ISO/IEC 14496-1 with
         the applied graphicsProfileLevelIndication.
      For MPEG-J streams, this parameter is the decimal value from
         table 13 (MPEGJProfileLevelIndication) in ISO/IEC 14496-1,
         indicating the profile and level of the MPEG-J stream.
      For OD streams, this parameter is the decimal value from table 3
         (ODProfileLevelIndication) in ISO/IEC 14496-1, indicating the
         profile and level of the OD stream.
      For IPMP streams, this parameter has either the decimal value 0,
         indicating an unspecified profile and level, or a value larger
         than zero, indicating an MPEG-4 IPMP profile and level as
         defined in a future MPEG-4 specification.
      For Clock Reference streams and Object Content Info streams, this
         parameter has the decimal value zero, indicating that profile
         and level information is conveyed through the OD framework.

   Config:
      A hexadecimal representation of an octet string that expresses
      the media payload configuration. Configuration data is mapped
      onto the hexadecimal octet string on an MSB-first basis. The
      first bit of the configuration data SHALL be located at the MSB
      of the first octet. In the last octet, if necessary to achieve
      byte alignment, up to 7 zero-valued padding bits shall follow
      the configuration data.
      For MPEG-4 Audio streams, config is the audio object type
         specific decoder configuration data AudioSpecificConfig() as
         defined in ISO/IEC 14496-3. For Structured Audio, the
         AudioSpecificConfig() may be conveyed by other means, not
         defined by this specification.
         If the AudioSpecificConfig() is conveyed by other means for
         Structured Audio, then the config MUST be a quoted empty
         hexadecimal octet string, as follows: config="".
         Note that a future mode of using this RTP payload format for
         Structured Audio may define such other means.
      For MPEG-4 Visual streams, config is the MPEG-4 Visual
         configuration information as defined in subclause 6.2.1 Start
         codes of ISO/IEC 14496-2. The configuration information
         indicated by this parameter SHALL be the same as the
         configuration information in the corresponding MPEG-4 Visual
         stream, except for first-half-vbv-occupancy and
         latter-half-vbv-occupancy, if they exist, which may vary in
         the repeated configuration information inside an MPEG-4
         Visual stream (See 6.2.1 Start codes of ISO/IEC 14496-2).
      For BIFS streams, this is the BIFSConfig() information as defined
         in ISO/IEC 14496-1. For version 1, BIFSConfig is defined in
         section 9.3.5.2, and for version 2 in section 9.3.5.3. The
         MIME format parameter ObjectType signals the version of
         BIFSConfig.
      For IPMP streams, this is either a quoted empty hexadecimal octet
         string, indicating the absence of any decoder configuration
         information (config=""), or the IPMPConfiguration() as
         defined in a future MPEG-4 IPMP specification.
      For Object Content Info (OCI) streams, this is the
         OCIDecoderConfiguration() information of the OCI stream, as
         defined in section 8.4.2.4 in ISO/IEC 14496-1.
      For OD streams, Clock Reference streams and MPEG-J streams, this
         is a quoted empty hexadecimal octet string (config=""), as
         no information on the decoder configuration is required.

   Mode:
      The mode in which this specification is used. The following modes
      can be signalled:
         mode=generic,
         mode=CELP-cbr,
         mode=CELP-vbr,
         mode=AAC-lbr and
         mode=AAC-hbr.
      Other modes are expected to be defined in future RFCs. See also
      section 3.3.7.

   Optional general parameters:

   ObjectType:
      The decimal value from Table 8 in ISO/IEC 14496-1, indicating
      the value of the objectTypeIndication of the transported stream.
      For BIFS streams this parameter MUST be present to signal the
      version of BIFSConfiguration(). Note that the ObjectType MAY
      signal a non-MPEG-4 stream, and that the RTP payload format
      defined in this document may not be suitable to carry a stream
      that is not defined by MPEG-4.

   ConstantSize:
      The constant size in octets of each Access Unit for this stream.
      Simultaneous presence of ConstantSize and the SizeLength
      parameter is not permitted.

   Profile:
      The decimal representation of the applied profile to constrain
      the latency when interleaving; see section 3.2.3.3. Absence of
      this parameter signals that the profile is not specified.

   Optional configuration parameters:

   SizeLength:
      The number of bits on which the AU-size field is encoded in the
      AU-header. Simultaneous presence of SizeLength and the
      ConstantSize parameter is not permitted.

   IndexLength:
      The number of bits on which the AU-Index is encoded in the first
      AU-header. The default value of zero indicates the absence of
      the AU-Index and AU-Index-delta fields in each AU-header.

   IndexDeltaLength:
      The number of bits on which the AU-Index-delta field is encoded
      in any non-first AU-header.

   CTSDeltaLength:
      The number of bits on which the CTS-delta field is encoded in
      the AU-header.

   DTSDeltaLength:
      The number of bits on which the DTS-delta field is encoded in
      the AU-header.

   RandomAccessIndication:
      A decimal value of zero or one, indicating whether the RAP-flag
      is present in the AU-header.
      The decimal value of one indicates presence of the RAP-flag, the
      default value zero its absence.

   StreamStateIndication:
      A decimal value of zero or one, indicating whether the
      Stream-state field is present in the AU-header. The decimal
      value of one indicates presence of the Stream-state field, the
      default value zero its absence.

   AuxiliaryDataSizeLength:
      The number of bits that is used to encode the auxiliary-data-size
      field.

   Applications MAY use more parameters, in addition to those defined
   above. Receivers MUST tolerate the presence of such additional
   parameters, but these parameters SHALL NOT impact the decoding of
   receivers that comply with this specification.

   Encoding considerations:
   System bitstreams MUST be generated according to MPEG-4 Systems
   specifications (ISO/IEC 14496-1). Video bitstreams MUST be generated
   according to MPEG-4 Visual specifications (ISO/IEC 14496-2). Audio
   bitstreams MUST be generated according to MPEG-4 Audio
   specifications (ISO/IEC 14496-3). The RTP packets MUST be packetized
   according to the RTP payload format defined in RFC xxxx.

   Security considerations:
   As defined in section 5 of RFC xxxx.

   Interoperability considerations:
   MPEG-4 provides a large and rich set of tools for the coding of
   visual objects. For effective implementation of the standard,
   subsets of the MPEG-4 tool sets have been provided for use in
   specific applications. These subsets, called 'Profiles', limit the
   size of the tool set a decoder is required to implement. In order to
   restrict computational complexity, one or more 'Levels' are set for
   each Profile. A Profile@Level combination allows:
   . a codec builder to implement only the subset of the standard he
     needs, while maintaining interworking with other MPEG-4 devices
     that implement the same combination, and
   . checking whether MPEG-4 devices comply with the standard
     ('conformance testing').

   A stream SHALL be compliant with the MPEG-4 Profile@Level specified
   by the parameter "profile-level-id". Interoperability between a
   sender and a receiver is achieved by specifying the parameter
   "profile-level-id" in MIME content. In the capability exchange /
   announcement procedure this parameter may mutually be set to the
   same value.

   Published specification:
   The specifications for MPEG-4 streams are presented in ISO/IEC
   14496-1, 14496-2, and 14496-3. The RTP payload format is described
   in RFC xxxx.

   Applications which use this media type:
   Multimedia streaming and conferencing tools, Internet messaging and
   Email applications.

   Additional information: none

   Magic number(s): none

   File extension(s):
   None. A file format with the extension .mp4 has been defined for
   MPEG-4 content but is not directly correlated with this MIME type
   for which the sole purpose is RTP transport.

   Macintosh File Type Code(s): none

   Person & email address to contact for further information:
   Authors of RFC xxxx, IETF Audio/Video Transport working group.

   Intended usage: COMMON

   Author/Change controller:
   Authors of RFC xxxx, IETF Audio/Video Transport working group.

4.2 Concatenation of parameters

   Multiple parameters SHOULD be expressed as a MIME media type string,
   in the form of a semicolon-separated list of parameter=value pairs
   (for parameter usage examples see sections 3.3.2 up to 3.3.6).

4.3 Usage of SDP

4.3.1 The a=fmtp keyword

   It is assumed that one typical way to transport the above-described
   parameters associated with this payload format is via an SDP message
   [6], for example transported to the client in reply to an RTSP
   DESCRIBE or via SAP.
   In that case the (a=fmtp) keyword MUST be used as described in
   RFC 2327 [6], section 6, the syntax being then:

      a=fmtp:<format> <parameter name>=<value>[; <parameter name>=<value>]

5. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [2]. This implies that confidentiality of the media
   streams is achieved by encryption. Because the data compression used
   with this payload format is applied end-to-end, encryption may be
   performed on the compressed data so there is no conflict between the
   two operations. The packet processing complexity of this payload
   type (i.e. excluding media data processing) does not exhibit any
   significant non-uniformity on the receiver side that could cause a
   denial-of-service threat.

   However, it is possible to inject non-compliant MPEG streams (Audio,
   Video, and Systems) to overload the receiver/decoder's buffers,
   which might compromise the functionality of the receiver or even
   crash it. This is especially true for end-to-end systems like MPEG
   where the buffer models are precisely defined.

   MPEG-4 Systems supports stream types including commands that are
   executed on the terminal, like OD commands, BIFS commands, etc., and
   programmatic content like MPEG-J (Java(TM) Byte Code) and
   ECMAScript. It is possible to use one or more of the above in a
   manner non-compliant to MPEG to crash the receiver or make it
   temporarily unavailable.

   Senders SHOULD ensure that packet loss does not cause severe
   problems in application execution when the packet carries OD
   commands, BIFS commands, or programmatic content such as MPEG-J and
   ECMAScript. For example, the reliability can be improved by
   re-transmission, or by using the carousel mechanism as defined by
   MPEG in ISO/IEC 14496-1, while observing the general congestion
   control principles.
   When such measures are deemed insufficient, applications SHOULD,
   instead of this payload format, use more reliable means to transport
   the information, for example by applying an FEC scheme for RTP (such
   as in RFC 2733), or by using RTP over TCP (such as in RFC 2326,
   section 10.12), while giving due consideration to congestion
   control. For a general description of methods to repair streaming
   media see RFC 2354.

   Authentication mechanisms can be used to validate the sender and
   the data to prevent security problems due to non-compliant malicious
   MPEG-4 streams.

   In ISO/IEC 14496-1 a security model is defined for MPEG-4 Systems
   streams carrying MPEG-J access units which comprise Java(TM) classes
   and objects. MPEG-J defines a set of Java APIs and a secure
   execution model. MPEG-J content can call this set of APIs and
   Java(TM) methods from a set of Java packages supported in the
   receiver within the defined security model. According to this
   security model, downloaded byte code is forbidden to load libraries,
   define native methods, start programs, read or write files, or read
   system properties.

   Receivers can implement intelligent filters to validate the buffer
   requirements or parametric (OD, BIFS, etc.) or programmatic (MPEG-J,
   ECMAScript) commands in the streams. However, this can increase the
   complexity significantly.

6. Acknowledgements

   This document evolved through several revisions thanks to
   contributions by people from the ISMA forum, from the IETF AVT
   Working Group and from the 4-on-IP ad-hoc group within MPEG. The
   authors wish to thank all the people involved, and in particular
   John Lazzaro, Alex MacAulay, Bill May, Colin Perkins, Dorairaj V and
   Stephan Wenger for their valuable comments and support.

7. References

   [1] ISO/IEC International Standard 14496 (MPEG-4), "Information
       technology - Coding of audio-visual objects", January 2000.

   [2] Schulzrinne, Casner, Frederick, Jacobson, "RTP: A Transport
       Protocol for Real-Time Applications", RFC 1889, Internet
       Engineering Task Force, January 1996.

   [3] S. Bradner, "Key words for use in RFCs to Indicate Requirement
       Levels", RFC 2119, March 1997.

   [4] D. Hoffman, G. Fernando, V. Goyal, M. Civanlar, "RTP payload
       format for MPEG1/MPEG2 Video", RFC 2250, January 1998.

   [5] Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, H. Kimata, "RTP
       payload format for MPEG-4 Audio/Visual streams", RFC 3016.

   [6] Handley, Jacobson, "SDP: Session Description Protocol",
       RFC 2327, Internet Engineering Task Force, April 1998.

8. Authors' Addresses

   Jan van der Meer
   Philips Digital Networks
   Cederlaan 4
   5600 JB Eindhoven
   Netherlands
   Email: jan.vandermeer@philips.com

   David Mackie
   Cisco Systems Inc.
   170 West Tasman Dr.
   San Jose, CA 95134
   Email: dmackie@cisco.com

   Viswanathan Swaminathan
   Sun Microsystems Inc.
   901 San Antonio Road, M/S UMPK15-214
   Palo Alto, CA 94303
   Email: viswanathan.swaminathan@sun.com

   David Singer
   Apple Computer, Inc.
   One Infinite Loop, MS:302-3MT
   Cupertino CA 95014
   Email: singer@apple.com

   Philippe Gentric
   Philips Digital Networks, MP4Net
   51 rue Carnot
   92156 Suresnes
   France
   Email: philippe.gentric@philips.com

Full Copyright Statement

   "Copyright (C) The Internet Society (date). All Rights Reserved.
   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain
   it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   MUST be followed, or as required to translate it into languages
   other than English.

APPENDIX: Usage of this payload format

Appendix A. Examples of delay analysis with interleave

A.1 Group interleave

   An example of regular interleave is when packets are formed into
   groups. If the number of packets in a group is N, then, for example,
   packet 0 could contain frames 0, N, 2N, and so on, while packet 1
   could contain frames 1, 1+N, 1+2N, and so on. The AU-Index field is
   used to document the sequence of the packet within the group (or the
   first frame in the packet, which is the same thing in this scheme),
   and all the AU-Index-delta fields contain N-1.

   Because each subsequent frame in the packet has a higher time stamp
   than the preceding frame, receivers can tell when a new interleave
   group is starting, by noting that the computed time stamp of the
   first frame in a packet is later than any previously computed time
   stamp. In that case the time stamps of all frames contained in the
   packet are higher than any previously computed time stamp, and
   hence interleaving with any previously received frame is not
   possible.
   In conclusion, a new group has been started.

   If the group size is 3, then packets can be formed as follows:

   Packet   Time stamp   Frame Numbers   AU-Index, AU-Index-delta
     0        T[0]        0,  3,  6        0, 2, 2
     1        T[1]        1,  4,  7        0, 2, 2
     2        T[2]        2,  5,  8        0, 2, 2
     3        T[9]        9, 12, 15        0, 2, 2

   In this case, the receiver would have to buffer at least 4 frames
   from packets 0 and 1, and can flush all frames when packet 2
   arrives. (Frame 0 can be flushed as packet 0 arrives, since it is
   the earliest frame we hold, and likewise frame 1 from packet 1; we
   are therefore holding 3, 4, 6, 7 until packet 2 arrives.)

   If there is loss, then the receiver may wait longer than is strictly
   necessary before it emits frames. For example, say packet 1 is lost
   from the above example. Packet 0 allows frame 0 to be emitted, and
   then packet 2 arrives, allowing us to notice the loss of frame 1,
   and emit frames 2 and 3. Then it is not until the arrival of packet
   3 (which has a time stamp beyond the times of all the frames seen so
   far) that we can finish dealing with the loss, even though the
   first group has, in fact, ended. (This is in contrast to schemes
   which signal the group size explicitly; if the receiver knows that
   this is packet 3 of 3, then even if 2 of 3 is missing, it can
   de-interleave this group without waiting for the next one to start.)

   In the above example the AU-Index is coded with the value 0, as
   required for the modes defined in this document. To reconstruct the
   original order, the RTP time stamp and the AU-Index-delta are used.
   See also 3.2.3.2.

   Another example of forming packets with group interleave is given
   below. In this example the packets are formed such that the loss of
   two consecutive RTP packets does not cause the loss of two
   consecutive audio frames.
   Note that in this example the RTP time stamps of packets 3 and 4 are
   earlier than the RTP time stamps of packets 1 and 2.

   Packet   Time stamp   Frame Numbers      AU-Index, AU-Index-delta
     0        T[0]        0,  5, 10, 15       0, 4, 4, 4
     1        T[2]        2,  7, 12, 17       0, 4, 4, 4
     2        T[4]        4,  9, 14, 19       0, 4, 4, 4
     3        T[1]        1,  6, 11, 16       0, 4, 4, 4
     4        T[3]        3,  8, 13, 18       0, 4, 4, 4

     5        T[20]      20, 25, 30, 35       0, 4, 4, 4
   and so on ..

A.2 Continuous interleave

   In continuous interleave, once the scheme is 'primed', the number of
   frames in a packet exceeds the 'stride' (the distance between them).
   This shortens the buffering needed, smooths the data-flow, and gives
   slightly larger packets (and thus lower overhead) for the same
   interleave. For example, here is a continuous interleave also over
   a stride of 3 frames, but with 4 frames per packet, for a run of 20
   frames. This shows both how the scheme 'starts up' and how it
   finishes.

   Packet   Time stamp   Frame Numbers   AU-Index, AU-Index-delta
     0        T[0]         0               0
     1        T[1]         1  4            0  2
     2        T[2]         2  5  8         0  2  2
     3        T[3]         3  6  9 12      0  2  2  2
     4        T[7]         7 10 13 16      0  2  2  2
     5        T[11]       11 14 17 20      0  2  2  2
     6        T[15]       15 18            0  2
     7        T[19]       19               0

   In this case, the receiver has to buffer only 3 frames, not 4. Say
   we are waiting for packet 4. We can flush frames 0, 1, 2, 3, 4, 5,
   6; we are therefore holding 8, 9, 12. Packet 4 arrives, allowing
   us to emit 7, 8, 9, 10, and we are holding 12, 13, 16. Each arriving
   packet contains 4 frames, and allows 4 frames to be flushed.

   In the above example the AU-Index is coded with the value 0, as
   required for the modes defined in this document. To reconstruct the
   original order, the RTP time stamp and the AU-Index-delta are used.
   See also 3.2.3.2.

   If there is loss, again the receiver has to wait to emit the erasure
   frames. In this case, say packet 3 is lost. We were holding frames
   4, 5, and 8.
   On the arrival of packet 4 (time stamp of frame 7), we now know
   that frame 3 was lost and can emit frames 4 and 5; we also know that
   frame 6 must be lost, and we emit frame 7, which is in the packet
   that arrived. Then, on the arrival of packet 5 (time stamp 11), we
   can emit 8, indicate loss of 9, and emit 10 and 11. Finally, the
   arrival of packet 6 (time stamp 15) indicates that 12 must be lost;
   we have now detected all the lost frames.
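
   The buffering figures quoted in this appendix can be reproduced with
   a small simulation. The Python sketch below is illustrative only and
   is not part of this specification. All function names are
   hypothetical. The frame placement formula used for continuous
   interleave is inferred from the example table (stride 3, 4 frames
   per packet) and is not defined by this document. The receiver model
   assumes in-order delivery without loss, and counts the frames still
   held after each packet has been processed.

```python
def group_interleave(group_size, frames_per_packet, num_groups):
    """Packets (lists of frame numbers) for regular group interleave.

    Packet k of a group carries frames k, k+N, k+2N, ... (N = group_size),
    offset by the first frame number of that group.
    """
    packets = []
    frames_per_group = group_size * frames_per_packet
    for group in range(num_groups):
        base = group * frames_per_group
        for k in range(group_size):
            packets.append([base + k + j * group_size
                            for j in range(frames_per_packet)])
    return packets


def continuous_interleave(stride, frames_per_packet, last_frame):
    """Packets for the continuous interleave example.

    Assumption: packet p carries frames (frames_per_packet * p - stride * r)
    for r = 0 .. frames_per_packet - 1, clipped to 0 .. last_frame. This
    reproduces the example table for stride 3 and 4 frames per packet; it is
    not claimed to be valid for arbitrary parameters.
    """
    packets = []
    packet_no = 0
    while True:
        frames = sorted(
            f
            for f in (frames_per_packet * packet_no - stride * r
                      for r in range(frames_per_packet))
            if 0 <= f <= last_frame
        )
        if not frames:
            break
        packets.append(frames)
        packet_no += 1
    return packets


def au_index_fields(frames):
    """AU-Index (0) for the first AU, then AU-Index-delta = gap - 1."""
    return [0] + [b - a - 1 for a, b in zip(frames, frames[1:])]


def peak_buffered(packets):
    """Largest number of frames held after flushing, assuming no loss."""
    held = set()
    next_frame = 0
    peak = 0
    for frames in packets:
        held.update(frames)
        # Emit every frame that is now contiguous with the output so far.
        while next_frame in held:
            held.remove(next_frame)
            next_frame += 1
        peak = max(peak, len(held))
    return peak


if __name__ == "__main__":
    group = group_interleave(3, 3, 2)
    continuous = continuous_interleave(3, 4, 20)
    print("group:", group[:4], au_index_fields(group[0]))
    print("continuous:", continuous)
    print("peak held, group:", peak_buffered(group))
    print("peak held, continuous:", peak_buffered(continuous))
```

   Under these assumptions the sketch reports a peak of 4 held frames
   for the group interleave with group size 3 and a peak of 3 held
   frames for the continuous interleave, in line with the figures given
   in A.1 and A.2.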