idnits 2.17.1 draft-ietf-avt-rtp-mpeg4-es-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. ** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 251 instances of too long lines in the document, the longest one being 4 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. 
If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (September 18, 2000) is 8620 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? '1' on line 14 looks like a reference -- Missing reference section? '3' on line 719 looks like a reference -- Missing reference section? '5' on line 719 looks like a reference -- Missing reference section? '2' on line 587 looks like a reference -- Missing reference section? '4' on line 587 looks like a reference -- Missing reference section? '6' on line 44 looks like a reference -- Missing reference section? '7' on line 153 looks like a reference -- Missing reference section? '9' on line 587 looks like a reference -- Missing reference section? '8' on line 815 looks like a reference -- Missing reference section? '10' on line 688 looks like a reference Summary: 9 errors (**), 0 flaws (~~), 1 warning (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 1 Internet Engineering Task Force Yoshihiro Kikuchi - Toshiba 2 Internet Draft Toshiyuki Nomura - NEC 3 Document: draft-ietf-avt-rtp-mpeg4-es-04.txt Shigeru Fukunaga - Oki 4 Yoshinori Matsui - Matsushita 5 Hideaki Kimata - NTT 6 September 18, 2000 8 RTP payload format for MPEG-4 Audio/Visual streams 10 Status of this Memo 12 This document is an Internet-Draft and is in full conformance with all 13 provisions of Section 10 of RFC2026 [1]. 15 Internet-Drafts are working documents of the Internet Engineering Task 16 Force (IETF), its areas, and its working groups. Note that other groups 17 may also distribute working documents as Internet-Drafts. Internet-Drafts 18 are draft documents valid for a maximum of six months and may be updated, 19 replaced, or obsoleted by other documents at any time. It is 20 inappropriate to use Internet- Drafts as reference material or to cite 21 them other than as "work in progress." 22 The list of current Internet-Drafts can be accessed at 23 http://www.ietf.org/ietf/1id-abstracts.txt 24 The list of Internet-Draft Shadow Directories can be accessed at 25 http://www.ietf.org/shadow.html. 27 Abstract 29 This document describes respective RTP payload formats for carrying each 30 of MPEG-4 Audio and MPEG-4 Visual bitstreams without using MPEG-4 31 Systems. For the purpose of directly mapping MPEG-4 Audio/Visual 32 bitstreams onto RTP packets, it provides specifications for the use of 33 RTP header fields and also specifies fragmentation rules. It also 34 provides specifications for MIME type registrations and the use of SDP. 36 1. Introduction 38 The RTP payload formats described in this document specify a way of how 39 MPEG-4 Audio [3][5] and MPEG-4 Visual streams [2][4] are to be fragmented 40 and mapped directly onto RTP packets. 
42 These RTP payload formats make it possible to carry MPEG-4 Audio/Visual streams
43 without using the synchronization and stream management functionality of
44 MPEG-4 Systems [6]. Such RTP payload formats will be used in systems that
45 have intrinsic stream management functionality and thus require no such
46 functionality in MPEG-4 Systems. H.323 terminals are an example of such
47 systems. There, MPEG-4 Audio/Visual streams are managed not by MPEG-4 Systems
48 Object Descriptors but by H.245. The streams are directly mapped onto RTP
49 packets without using the MPEG-4 Systems Sync Layer. Other examples are SIP
50 and RTSP where MIME and SDP are used. MIME types and SDP usages of the
51 RTP payload formats described in this document are defined to directly
52 specify the attributes of Audio/Visual streams (e.g. media type,
53 packetization format and codec configuration) without using MPEG-4
54 Systems. This is basically the same approach as that taken by RTP payload
55 formats for the existing audio/video codecs. The obvious benefit is that
56 these MPEG-4 Audio/Visual RTP payload formats can be handled in a
57 unified way together with those formats defined for non-MPEG-4 codecs.

59 The semantics of RTP headers in such cases need to be clearly defined,
60 including the association with MPEG-4 Audio/Visual data elements. In
61 addition, it would be beneficial to define the fragmentation rules of RTP
62 packets for MPEG-4 Video streams so as to enhance error resiliency by
63 utilizing the error resilience tools provided inside the MPEG-4 Video
64 stream. These issues, however, have yet to be addressed by other MPEG-4
65 RTP payload format specifications.

67 1.1 MPEG-4 Visual RTP payload format

69 MPEG-4 Visual is a visual coding standard with many new features: high
70 coding efficiency; high error resiliency; multiple, arbitrary shape
71 object-based coding; etc. [2]. It covers a wide range of bitrates from
72 scores of Kbps to several Mbps.
It also covers a wide variety of 73 networks, ranging from those guaranteed to be almost error-free to mobile 74 networks with high error rates. 76 With respect to the fragmentation rules for an MPEG-4 visual bitstream 77 defined in this document, since MPEG-4 Visual is used for a wide variety 78 of networks, it is desirable not to apply too much restriction on 79 fragmentation, and a fragmentation rule such as "a single video packet 80 shall always be mapped on a single RTP packet" may be inappropriate. On 81 the other hand, careless, media unaware fragmentation may cause 82 degradation in error resiliency and bandwidth efficiency. The 83 fragmentation rules described in this document are flexible but manage to 84 define the minimum rules for preventing meaningless fragmentation while 85 utilizing the error resilience functionalities of MPEG-4 Visual. 87 The fragmentation rule recommends not to map more than one VOP in an RTP 88 packet so that RTP timestamp uniquely indicates the VOP time framing. On 89 the other hand, MPEG-4 video may generate VOPs of very small size, in 90 cases with a not coded VOP containing only VOP header or an arbitrary 91 shaped VOP with a small number. To reduce the overhead for such cases, 92 the fragmentation rule permits concatenating multiple VOPs in an RTP 93 packet. (See fragmentation rule (4) in section 3.2 and marker bit and 94 timestamp in section 3.1.) 96 While the additional media specific RTP header defined for such video 97 coding tools as H.261 or MPEG-1/2 is effective in helping to recover 98 picture headers corrupted by packet losses, MPEG-4 Visual has already 99 error resilience functionalities for recovering corrupt headers, and 100 these can be used on RTP/IP networks as well as on other networks 101 (H.223/mobile, MPEG-2/TS, etc.). Therefore, no extra RTP header fields 102 are defined in this MPEG-4 Visual RTP payload format. 
104 1.2 MPEG-4 Audio RTP payload format 106 MPEG-4 Audio is a new kind of audio standard that integrates many 107 different types of audio coding tools. It also supports a mechanism for 108 representing synthesized sounds. Low-overhead MPEG-4 Audio Transport 109 Multiplex (LATM) manages the sequences of audio data with relatively 110 small overhead. In audio-only applications, then, it is desirable for 111 LATM-based MPEG-4 Audio bitstreams to be directly mapped onto the RTP 112 packets without using MPEG-4 Systems. 114 While LATM has several multiplexing features as follows; 115 - Carrying configuration information with audio data, 116 - Concatenation of multiple audio frames in one audio stream, 117 - Multiplexing multiple objects (programs), 118 - Multiplexing scalable layers, 119 in RTP transmission there is no need for the last two features that 120 multiplex payloads of different objects and scalable layers into one RTP 121 packet. Therefore, these two features SHOULD NOT be used in applications 122 based on RTP packetization specified by this document. 124 For transmission of scalable streams, audio data of each layer should be 125 packetized onto different RTP packets. On the other hand, all 126 configuration data of the scalable streams are contained in one LATM 127 configuration data "StreamMuxConfig" and every scalable layer shares the 128 StreamMuxConfig. The mapping between each layer and its configuration 129 data is achieved by LATM header information attached to the audio data. 130 In order to indicate the dependency information of the scalable streams, 131 a restriction is applied to the dynamic assignment rule of payload type 132 (PT) values (see section 4.2). 134 For MPEG-4 Audio coding tools except synthesis tools, as is true for 135 other audio coders, if the payload of a packet is a single audio frame, 136 packet loss will not impair the decodability of adjacent packets. 
On the
137 other hand, MPEG-4 Audio synthesis tools may be sensitive to error. For
138 example, an SA_access_unit in the payload may set a global value to a new
139 value, which is then referenced throughout the audio content to make a
140 macro change in the performance. In this case, an error in the payload
141 influences all audio data produced after the error. In order to enhance
142 error resiliency, the element of SA_access_unit that makes the above
143 macro change should be transmitted repeatedly across several
144 SA_access_units. The number of repetitions will depend on the network
145 condition. Therefore, an additional media specific header for recovering
146 errors will not be required for MPEG-4 Audio.

148 2. Conventions used in this document

150 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
151 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
152 document are to be interpreted as described in RFC 2119 [7].

154 3. RTP Packetization of MPEG-4 Visual bitstream

156 This section specifies RTP packetization rules for MPEG-4 Visual content.
157 An MPEG-4 Visual bitstream is mapped directly onto the RTP payload
158 without any addition of extra header fields or any removal of Visual
159 syntax elements. The Combined Configuration/Elementary stream mode is
160 used so that configuration information will be carried to the same RTP
161 port as the elementary stream. (See 6.2.1 "Start codes" of ISO/IEC 14496-2
162 [2][9][4].) The configuration information MAY additionally be specified
163 by some out-of-band means; in H.323 terminals, H.245 codepoint
164 "decoderConfigurationInformation" MAY be used for this purpose; in
165 systems using MIME content type and SDP parameters, e.g. SIP and RTSP,
166 the optional parameter "config" MAY be used to specify the configuration
167 information.
(see 5.1 and 5.2)

169 When the short video header mode is used, the RTP payload format used MAY
170 be that specified for H.263 in the relevant RFCs or in other relevant
171 standards. (e.g., RFC 2190 or RFC 2429)
172  0                   1                   2                   3
173  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
174 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
175 |V=2|P|X|  CC   |M|     PT      |       sequence number         | RTP
176 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
177 |                           timestamp                           | Header
178 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
179 |           synchronization source (SSRC) identifier            |
180 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
181 |            contributing source (CSRC) identifiers             |
182 |                             ....                              |
183 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
184 |                                                               | RTP
185 |              MPEG-4 Visual stream (byte aligned)              | Payload
186 |                                                               |
187 |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
188 |                               :...OPTIONAL RTP padding        |
189 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

191 Figure 1 - An RTP packet for MPEG-4 Visual stream

193 3.1 Use of RTP header fields for MPEG-4 Visual

195 Payload Type (PT): Payload type is to be specifically assigned as the
196 MPEG-4 Visual RTP payload format. If this assignment is to be carried out
197 dynamically, it can be performed by such out-of-band means as H.245, SDP,
198 etc.

200 Extension (X) bit: Defined by the RTP profile used.

202 Sequence Number: Incremented by one for each RTP data packet sent,
203 starting, for security reasons, with a random initial value.

205 Marker (M) bit: The marker bit is set to one to indicate the last RTP
206 packet (or only RTP packet) of a VOP. When multiple VOPs are carried in
207 the same RTP packet, the marker bit is set to 1.

209 Timestamp: The timestamp indicates the composition time, or the
210 presentation time in a no-compositor decoder. A constant offset, which is
211 random, is added for security reasons.
The detailed definition of the
212 timestamp is as follows:
213 - For a video object plane, it is defined as vop_time_increment (in units
214   of 1/vop_time_increment_resolution seconds) plus the cumulative number
215   of whole seconds specified by modulo_time_base and, if present,
216   time_code of Group_of_VideoObjectPlane() fields.

218 - In the case of interlaced video, a VOP will consist of lines from two
219   fields, and the timestamp will indicate the composition time of the
220   first field.
221 - For a video object plane with short header, the timestamps (after the
222   first random timestamp) are equal to the presentation time sequence
223   associated with the semantics of the temporal_reference field.
224   Specifically, each timestamp value SHALL be calculated by rounding the
225   value of a precise clock that advances delta_time with each successive
226   video object plane with short header. The time increment SHOULD be
227   calculated as delta_time = ((temporal_reference + 256 -
228   (temporal_reference of previous VOP)) modulo 256) * 1001/30000 for each
229   successive video object plane with short header. The RTP timestamp
230   should be consistently rounded or truncated to the resolution of the
231   RTP timestamp field.
232 - When multiple VOPs are carried in the same RTP packet, the timestamp
233   indicates the earliest of the composition times within the VOPs carried
234   in the RTP packet. Timestamp information of the rest of the VOPs is
235   derived from the timestamp fields in the VOP header (modulo_time_base
236   and vop_time_increment), or from the temporal_reference field in the
237   case of short video header.
238 - If the RTP packet contains only configuration information and/or
239   Group_of_VideoObjectPlane() fields, the composition time of the next
240   VOP in the coding order is used.
241 - If the RTP packet contains only visual_object_sequence_end_code
242   information, the composition time of the immediately preceding VOP in
243   the coding order is used.
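The short-header timestamp rule above can be sketched as follows. This is only an illustration, not part of the draft: the function names, the use of exact fractions for the "precise clock", the reading of delta_time as seconds, and the choice of truncation (rather than rounding) are all assumptions. The first VOP is taken to sit at the random initial timestamp.

```python
from fractions import Fraction

def short_header_delta_time(temporal_reference, previous_temporal_reference):
    """delta_time for one VOP, read here as seconds (assumption).

    Implements ((tr + 256 - previous tr) modulo 256) * 1001/30000.
    """
    steps = (temporal_reference + 256 - previous_temporal_reference) % 256
    return steps * Fraction(1001, 30000)

def short_header_rtp_timestamps(temporal_references, initial_timestamp,
                                clock_rate=90000):
    """Hypothetical helper: RTP timestamps for successive short-header VOPs."""
    precise_clock = Fraction(0)          # exact clock, never rounded
    previous = temporal_references[0]    # first VOP has delta_time 0
    timestamps = []
    for tr in temporal_references:
        precise_clock += short_header_delta_time(tr, previous)
        previous = tr
        ticks = int(precise_clock * clock_rate)   # consistent truncation
        timestamps.append((initial_timestamp + ticks) % 2**32)
    return timestamps
```

With the default 90 kHz resolution one temporal_reference step is exactly 3003 ticks, so the wrap from 255 to 0 in `short_header_rtp_timestamps([254, 255, 0, 2], 1000)` still advances the clock by a single step.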
245 The resolution of the timestamp is set to its default value of 90KHz, 246 unless specified by an out-of-band means (e.g. SDP parameter or MIME 247 parameter as defined in section 5). 249 SSRC, CC and CSRC fields are used as described in RFC 1889 [8]. 251 3.2 Fragmentation of MPEG-4 Visual bitstream 253 A fragmented MPEG-4 Visual bitstream is mapped directly onto the RTP 254 payload without any addition of extra header fields or any removal of 255 Visual syntax elements. The Combined Configuration/Elementary streams 256 mode is used. The following rules apply for the fragmentation. 258 (1) Configuration information and Group_of_VideoObjectPlane() fields 259 SHALL be placed at the beginning of the RTP payload (just after the RTP 260 header) or just after the header of the syntactically upper layer 261 function. 263 (2) If one or more headers exist in the RTP payload, the RTP payload 264 SHALL begin with the header of the syntactically highest function. 265 Note: The visual_object_sequence_end_code is regarded as the lowest 266 function. 268 (3) A header SHALL NOT be split into a plurality of RTP packets. 270 (4) Different VOPs SHOULD be fragmented into different RTP packets so 271 that one RTP packet consists of the data bytes associated with a unique 272 presentation time (that is indicated in the timestamp field in the RTP 273 packet header), with the exception that more than one integral number of 274 consecutive VOPs MAY be carried within one RTP packet in the decoding 275 order if the size of the VOPs is small. 276 Note: When multiple VOPs are carried in one RTP payload, the presentation 277 time of the VOPs after the first one may be calculated by the decoder. 278 This operation is necessary only for RTP packets in which the marker bit 279 equals to one and the beginning of RTP payload corresponds to a start 280 code. (See timestamp and marker bit in section 3.1) 282 (5) A single video packet SHOULD NOT be split into a plurality of RTP 283 packets. 
The size of a video packet SHOULD be adjusted in such a way that 284 the resulting RTP packet is not larger than the path-MTU. A video packet 285 MAY be split into a plurality of RTP packets when the size of the video 286 packet is large. 287 Note: Rule (5) does not apply when the video packet is disabled by the 288 coder configuration (by setting resync_marker_disable in the VOL header 289 to 1), or in coding tools where the video packet is not supported. In 290 this case, a VOP MAY be split at arbitrary byte-positions. 292 Here, header means: 293 - Configuration information (Visual Object Sequence Header, Visual Object 294 Header and Video Object Layer Header) 295 - visual_object_sequence_end_code 296 - The header of the entry point function for an elementary stream 297 (Group_of_VideoObjectPlane() or the header of VideoObjectPlane(), 298 video_plane_with_short_header(), MeshObject() or FaceObject()) 299 - The video packet header (video_packet_header() excluding 300 next_resync_marker()) 301 - The header of gob_layer() 302 See 6.2.1 "Start codes" of ISO/IEC 14496-2[2][9][4] for the definition of 303 the configuration information and the entry point functions. 305 The video packet starts with the VOP header or the video packet header, 306 followed by motion_shape_texture(), and ends with next_resync_marker() or 307 next_start_code(). 309 3.3 Examples of packetized MPEG-4 Visual bitstream 311 Considering the fact that MPEG-4 Visual covers a wide variety of networks 312 ranging from scores of Kbps to several Mbps, and from those guaranteed to 313 be almost error-free to mobile networks with high error rates, it is 314 desirable not to apply too much restriction on fragmentation. On the 315 other hand, careless, media unaware fragmentation will cause degradation 316 in error resiliency and bandwidth efficiency. The fragmentation criteria 317 described in 3.2 are flexible but serve to define the minimum rules to 318 prevent meaningless fragmentation. 
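As a rough illustration of the note to rule (5) in 3.2, the sketch below splits one VOP at fixed byte positions and sets the marker bit only on the last fragment, following the marker bit definition in 3.1. The function name and the (payload, marker) tuple layout are hypothetical. It assumes max_payload is at least as large as the VOP header, so that rule (3) is not violated by the first fragment.

```python
def split_vop_fixed_length(vop, max_payload):
    """Split one VOP into fixed-length RTP payloads (video packets disabled).

    Returns a list of (payload_bytes, marker_bit) tuples. The marker bit is
    True only for the last (or only) fragment of the VOP.
    Assumption: max_payload >= size of the VOP header, so the header is
    never split across packets.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    fragments = [vop[i:i + max_payload]
                 for i in range(0, len(vop), max_payload)]
    last = len(fragments) - 1
    return [(fragment, index == last)
            for index, fragment in enumerate(fragments)]
```

All fragments of one VOP would carry the same RTP timestamp. A sender on an error-prone network would more likely enable video packets and map them as in rule (5) instead.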
320 Figure 2 shows examples of RTP packets generated based on the criteria 321 described in 3.2 323 (a) is an example of the first RTP packet or the random access point of 324 an MPEG-4 visual bitstream containing the configuration information. 325 According to criterion (1), the Visual Object Sequence Header(VS header) 326 is placed at the beginning of the RTP payload, preceding the Visual 327 Object Header and the Video Object Layer Header(VO header, VOL header). 328 Since the fragmentation rule defined in 3.2 guarantees that the 329 configuration information, starting with 330 visual_object_sequence_start_code, is always placed at the beginning of 331 the RTP payload, RTP receivers can detect the random access point by 332 checking if the first 32-bit field of the RTP payload is 333 visual_object_sequence_start_code. 335 (b) is another example of the RTP packet containing the configuration 336 information. It differs from example (a) in that the RTP packet also 337 contains a video packet in the VOP following the configuration 338 information. Since the length of the configuration information is 339 relatively short (typically scores of bytes) and an RTP packet containing 340 only the configuration information may thus increase the overhead, the 341 configuration information and the immediately following GOV and/or (a 342 part of) VOP can be effectively packetized into a single RTP packet as in 343 this example. 345 (c) is an example of the RTP packet that contains 346 Group_of_VideoObjectPlane(GOV). Following criterion (1), the GOV is 347 placed at the beginning of the RTP payload. It would be a waste of RTP/IP 348 header overhead to generate an RTP packet containing only a GOV whose 349 length is 7 bytes. Therefore, (a part of) the following VOP can be placed 350 in the same RTP packet as shown in (c). 352 (d) is an example of the case where one video packet is packetized into 353 one RTP packet. 
When the packet-loss rate of the underlying network is
354 high, this kind of packetization is recommended. It is also recommended to set
355 resync_marker_disable to 0 in the VOL header to enable the adjustment of
356 the video packet size. Even when the RTP packet containing the VOP header
357 is discarded by a packet loss, the other RTP packets can be decoded by
358 using the HEC (Header Extension Code) information in the video packet
359 header. No extra RTP header field is necessary.

361 (e) is an example of the case where more than one video packet is
362 packetized into one RTP packet. This kind of packetization is effective
363 in reducing the overhead of RTP/IP headers when the bit-rate of the
364 underlying network is low. However, it will decrease the packet-loss
365 resiliency because multiple video packets are discarded by a single RTP
366 packet loss. The optimal number of video packets in an RTP packet and the
367 length of the RTP packet can be determined considering the packet-loss
368 rate and the bit-rate of the underlying network.

370 (f) is an example of the case when the video packet is disabled by
371 setting resync_marker_disable in the VOL header to 1. In this case, a VOP
372 may be split into a plurality of RTP packets at arbitrary byte-positions.
373 For example, it is possible to split a VOP into fixed-length packets.
374 This kind of coder configuration and RTP packet fragmentation may be used
375 when the underlying network is guaranteed to be error-free. On the other
376 hand, it is not recommended to use it in an error-prone environment since it
377 provides only poor packet loss resiliency.

379 Figure 3 shows examples of RTP packets prohibited by the criteria of 3.2.

381 Fragmentation of a header into multiple RTP packets, as in (a), will not
382 only increase the overhead of RTP/IP headers but also decrease the error
383 resiliency. Therefore, it is prohibited by criterion (3).

385 When concatenating more than one video packet into an RTP packet, the VOP
386 header or video_packet_header() shall not be placed in the middle of the
387 RTP payload. The packetization as in (b) is not allowed by criterion (2)
388 for reasons of error resiliency. Comparing this example with
389 Figure 2(d), although two video packets are mapped onto two RTP packets
390 in both cases, the packet-loss resiliency is not identical. Namely, if
391 the second RTP packet is lost, both video packets 1 and 2 are lost in the
392 case of Figure 3(b) whereas only video packet 2 is lost in the case of
393 Figure 2(d).

395        +------+------+------+------+
396    (a) | RTP  |  VS  |  VO  | VOL  |
397        |header|header|header|header|
398        +------+------+------+------+

400        +------+------+------+------+------------+
401    (b) | RTP  |  VS  |  VO  | VOL  |Video Packet|
402        |header|header|header|header|            |
403        +------+------+------+------+------------+

405        +------+-----+------------------+
406    (c) | RTP  | GOV |Video Object Plane|
407        |header|     |                  |
408        +------+-----+------------------+

410        +------+------+------------+  +------+------+------------+
411    (d) | RTP  | VOP  |Video Packet|  | RTP  |  VP  |Video Packet|
412        |header|header|    (1)     |  |header|header|    (2)     |
413        +------+------+------------+  +------+------+------------+

415        +------+------+------------+------+------------+------+------------+
416    (e) | RTP  |  VP  |Video Packet|  VP  |Video Packet|  VP  |Video Packet|
417        |header|header|    (1)     |header|    (2)     |header|    (3)     |
418        +------+------+------------+------+------------+------+------------+

420        +------+------+------------+  +------+------------+
421    (f) | RTP  | VOP  |VOP fragment|  | RTP  |VOP fragment|
422        |header|header|    (1)     |  |header|    (2)     | ___
423        +------+------+------------+  +------+------------+

425 Figure 2 - Examples of RTP packetized MPEG-4 Visual bitstream

427        +------+-------------+  +------+------------+------------+
428    (a) | RTP  |First half of|  | RTP  |Last half of|Video Packet|
429        |header|  VP header  |  |header| VP header  |            |
430        +------+-------------+  +------+------------+------------+

432        +------+------+----------+  +------+---------+------+------------+
433    (b) | RTP  | VOP  |First half|  | RTP  |Last half|  VP  |Video Packet|
434        |header|header| of VP(1) |  |header| of VP(1)|header|    (2)     |
435        +------+------+----------+  +------+---------+------+------------+

437 Figure 3 - Examples of prohibited RTP packetization for MPEG-4 Visual
438 bitstream

440 4. RTP Packetization of MPEG-4 Audio bitstream

442 This section specifies RTP packetization rules for MPEG-4 Audio
443 bitstreams. MPEG-4 Audio streams are formatted by the LATM (Low-overhead
444 MPEG-4 Audio Transport Multiplex) tool [5], and the LATM-based streams are
445 then mapped onto RTP packets as described in the three sections below.

447 4.1 RTP Packet Format

449 LATM-based streams consist of a sequence of audioMuxElements that include
450 one or more audio frames. A complete audioMuxElement or a part of one
451 SHALL be mapped directly onto an RTP payload without any removal of
452 audioMuxElement syntax elements (see Figure 4). The first byte of each
453 audioMuxElement SHALL be located at the first payload location in an RTP
454 packet.

456  0                   1                   2                   3
457  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
458 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
459 |V=2|P|X|  CC   |M|     PT      |       sequence number         |RTP
460 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
461 |                           timestamp                           |Header
462 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
463 |           synchronization source (SSRC) identifier            |
464 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
465 |            contributing source (CSRC) identifiers             |
466 |                             ....                              |
467 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
468 |                                                               |RTP
469 :                audioMuxElement (byte aligned)                 :Payload
470 |                                                               |
471 |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
472 |                               :...OPTIONAL RTP padding        |
473 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
474 Figure 4 - An RTP packet for MPEG-4 Audio

476 In order to decode the audioMuxElement, the following muxConfigPresent
477 information is required to be indicated by an out-of-band means.

479 muxConfigPresent: If this value is set to 1, the audioMuxElement SHALL
480 include an indication bit "useSameStreamMux" and MAY include the
481 configuration information for audio compression "StreamMuxConfig". The
482 useSameStreamMux bit indicates whether the StreamMuxConfig element in the
483 previous frame is applied in the current frame.

485 4.2 Use of RTP Header Fields for MPEG-4 Audio

487 Payload Type (PT): Payload type is to be specifically assigned as the
488 MPEG-4 Audio RTP payload format. If this assignment is to be carried out
489 dynamically, it can be performed by such out-of-band means as H.245, SDP,
490 etc. In the dynamic assignment of RTP payload types for scalable streams,
491 a different value should be assigned to each layer. The assigned values
492 should be in order of enhancement layer dependency, where the base layer has
493 the smallest value.

495 Marker (M) bit: The marker bit indicates audioMuxElement boundaries. It
496 is set to one to indicate that the RTP packet contains a complete
497 audioMuxElement or the last fragment of an audioMuxElement.

499 Timestamp: The timestamp indicates composition time, or presentation time
500 in a no-compositor decoder. Timestamps are recommended to start at a
501 random value for security reasons.

503 Unless specified by an out-of-band means, the resolution of the timestamp
504 is set to its default value of 90 kHz.

506 Sequence Number: Incremented by one for each RTP packet sent, starting,
507 for security reasons, with a random value.
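To make the header fields above concrete, here is a minimal sketch that packs the 12-byte fixed RTP header shown in Figure 4, with the marker bit set as described for audioMuxElement boundaries. The function name is hypothetical. It assumes no padding, no extension and no CSRC entries (P=0, X=0, CC=0). The payload type 96 in the usage note is just a commonly used dynamic value, not one assigned by this document.

```python
import struct

def pack_rtp_fixed_header(payload_type, sequence_number, timestamp, ssrc,
                          marker):
    """Pack the fixed RTP header (V=2, P=0, X=0, CC=0) in network byte order.

    marker should be True when the packet carries a complete
    audioMuxElement or the last fragment of one.
    """
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type is a 7-bit field")
    first_byte = 2 << 6                              # version 2, no P/X/CC
    second_byte = (int(bool(marker)) << 7) | payload_type
    return struct.pack("!BBHII",
                       first_byte,
                       second_byte,
                       sequence_number & 0xFFFF,     # 16-bit, wraps around
                       timestamp & 0xFFFFFFFF,       # 32-bit, wraps around
                       ssrc & 0xFFFFFFFF)
```

A sender would typically pick the initial sequence number and timestamp offset at random (for example with the standard `secrets` module) and then call `pack_rtp_fixed_header(96, seq, ts, ssrc, marker=True)` for each packet that ends an audioMuxElement.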
509 SSRC, CC and CSRC fields are used as described in RFC 1889 [8]. 511 4.3 Fragmentation of MPEG-4 Audio bitstream 513 It is desirable to put one audioMuxElement in each RTP packet. If the 514 size of an audioMuxElement can be kept small enough that the size of the 515 RTP packet containing it does not exceed the size of the path-MTU, this 516 will be no problem. If it cannot, the audioMuxElement MAY be fragmented 517 and spread across multiple packets, following the rules below: 519 (1) "payloadMux", which consists of payload elements, MAY be fragmented 520 across several RTP packets, so that each of those RTP packets will 521 contain one or more payload elements. Individual payload elements 522 themselves SHOULD NOT be fragmented. 524 (2) If the audioMuxElement includes StreamMuxConfig, StreamMuxConfig 525 SHALL be included in the RTP packet that contains the first payload 526 element. 528 5. MIME type registration for MPEG-4 Audio/Visual streams 530 The following sections describe the MIME type registrations for MPEG-4 531 Audio/Visual streams. MIME type registration and SDP usage for the MPEG-4 532 Visual stream are described in Sections 5.1 and 5.2, respectively, while 533 MIME type registration and SDP usage for MPEG-4 Audio stream are 534 described in Sections 5.3 and 5.4, respectively. 536 (In the following sections, the RFC number "XXXX" represents the RFC 537 number, which should be assigned for this document.) 539 5.1 MIME type registration for MPEG-4 Visual 541 MIME media type name: video 543 MIME subtype name: MP4V 545 Required parameters: none 547 Optional parameters: 548 rate: This parameter is used only for RTP transport. It indicates the 549 resolution of the timestamp field in the RTP header. If this parameter 550 is not specified, its default value of 90000 (90KHz) is used. 

   profile-level-id: A decimal representation of the MPEG-4 Visual
   Profile and Level indication value (profile_and_level_indication)
   defined in Table G-1 of ISO/IEC 14496-2 [2][4]. This parameter MAY be
   used in the capability exchange or session setup procedure to
   indicate the MPEG-4 Visual Profile and Level combination of which the
   MPEG-4 Visual codec is capable. If this parameter is not specified by
   the procedure, its default value of 1 (Simple Profile/Level 1) is
   used.

   config: This parameter indicates the configuration of the
   corresponding MPEG-4 Visual bitstream. It SHALL NOT be used to
   indicate the codec capability in the capability exchange procedure.
   It is a hexadecimal representation of an octet string that expresses
   the MPEG-4 Visual configuration information, as defined in subclause
   6.2.1 "Start codes" of ISO/IEC 14496-2 [2][4][9]. The configuration
   information is mapped onto the octet string on an MSB-first basis.
   The first bit of the configuration information SHALL be located at
   the MSB of the first octet. The configuration information indicated
   by this parameter SHALL be the same as the configuration information
   in the corresponding MPEG-4 Visual stream, except for
   first_half_vbv_occupancy and latter_half_vbv_occupancy, if present,
   which may vary in the repeated configuration information inside an
   MPEG-4 Visual stream (see 6.2.1 "Start codes" of ISO/IEC 14496-2).

   Example usages for these parameters are:
   - MPEG-4 Visual Simple Profile/Level 1:
     Content-type: video/mp4v; profile-level-id=1

   - MPEG-4 Visual Core Profile/Level 2:
     Content-type: video/mp4v; profile-level-id=34

   - MPEG-4 Visual Advanced Real Time Simple Profile/Level 1:
     Content-type: video/mp4v; profile-level-id=145

Published specification:
   The specifications for MPEG-4 Visual streams are presented in ISO/IEC
   14496-2 [2][4][9]. The RTP payload format is described in RFCXXXX.

Encoding considerations:
   Video bitstreams must be generated according to MPEG-4 Visual
   specifications (ISO/IEC 14496-2). A video bitstream is binary data
   and must be encoded for non-binary transport (for Email, the Base64
   encoding is sufficient). This type is also defined for transfer via
   RTP. The RTP packets MUST be packetized according to the MPEG-4
   Visual RTP payload format defined in RFCXXXX.

Security considerations:
   See section 6 of RFCXXXX.

Interoperability considerations:
   MPEG-4 Visual provides a large and rich set of tools for the coding
   of visual objects. For effective implementation of the standard,
   subsets of the MPEG-4 Visual tool sets have been provided for use in
   specific applications. These subsets, called 'Profiles', limit the
   size of the tool set a decoder is required to implement. In order to
   restrict computational complexity, one or more Levels are set for
   each Profile. A Profile@Level combination allows:

   o a codec builder to implement only the needed subset of the
     standard, while maintaining interworking with other MPEG-4 devices
     included in the same combination, and

   o checking whether MPEG-4 devices comply with the standard
     ('conformance testing').

   The visual stream SHALL be compliant with the MPEG-4 Visual
   Profile@Level specified by the parameter "profile-level-id".
   Interoperability between a sender and a receiver may be achieved by
   specifying the parameter "profile-level-id" in MIME content, or by
   arranging in the capability exchange/announcement procedure to set
   this parameter mutually to the same value.

Applications which use this media type:
   Audio and visual streaming and conferencing tools, Internet messaging
   and Email applications.

Additional information: none

Person & email address to contact for further information:
   The authors of RFCXXXX.
   (See section 8)

Intended usage: COMMON

Author/Change controller:
   The authors of RFCXXXX. (See section 8)

5.2 SDP usage of MPEG-4 Visual

The MIME media type video/MP4V string is mapped to fields in the Session
Description Protocol (SDP), RFC 2327, as follows:

o The MIME type (video) goes in SDP "m=" as the media name.

o The MIME subtype (MP4V) goes in SDP "a=rtpmap" as the encoding name.

o The optional parameter "rate" goes in "a=rtpmap" as the clock rate.

o The optional parameters "profile-level-id" and "config" MAY go in the
"a=fmtp" line to indicate the coder capability and configuration,
respectively. These parameters are expressed as a MIME media type
string, in the form of a semicolon-separated list of parameter=value
pairs.

The following are some examples of media representation in SDP:

Simple Profile/Level 1, rate=90000 (90 kHz), "profile-level-id" and
"config" are present in the "a=fmtp" line:
   m=video 49170/2 RTP/AVP 98
   a=rtpmap:98 MP4V/90000
   a=fmtp:98 profile-level-id=1;config=000001B001000001B50900000100
      00000120008440FA282C2090A21F

Core Profile/Level 2, rate=90000 (90 kHz), "profile-level-id" is present
in the "a=fmtp" line:
   m=video 49170/2 RTP/AVP 98
   a=rtpmap:98 MP4V/90000
   a=fmtp:98 profile-level-id=34

Advanced Real Time Simple Profile/Level 1, rate=25 (25 Hz),
"profile-level-id" is present in the "a=fmtp" line:
   m=video 49170/2 RTP/AVP 98
   a=rtpmap:98 MP4V/25
   a=fmtp:98 profile-level-id=145

5.3 MIME type registration of MPEG-4 Audio

MIME media type name: audio

MIME subtype name: MP4A

Required parameters:
   rate: The rate parameter indicates the RTP timestamp clock rate. The
   default value is 90000. Other rates can be specified only if they are
   set to the same value as the audio sampling rate (number of samples
   per second).

Optional parameters:

   profile-level-id: A decimal representation of the MPEG-4 Audio
   Profile and Level indication value defined in ISO/IEC 14496-1 [10].
   This parameter indicates which MPEG-4 Audio tool subsets the decoder
   is capable of using. If this parameter is not specified in the
   capability exchange or session setup procedure, its default value of
   30 (Natural Audio Profile/Level 1) is used.

   object: A decimal representation of the MPEG-4 Audio Object Type
   value defined in ISO/IEC 14496-3 [5]. This parameter specifies the
   tool to be used by the coder. It can be used to limit the capability
   within the specified "profile-level-id".

   bitrate: The data rate for the audio bitstream.

   cpresent: This parameter indicates whether audio payload
   configuration data has been multiplexed into an RTP payload (see
   section 4.1 in this document). The default value is 1.

   config: A hexadecimal representation of an octet string that
   expresses the audio payload configuration data "StreamMuxConfig", as
   defined in ISO/IEC 14496-3 [5]. Configuration data is mapped onto the
   octet string on an MSB-first basis. The first bit of the
   configuration data SHALL be located at the MSB of the first octet. In
   the last octet, zero-padding bits, if necessary, shall follow the
   configuration data. If the configuration data is large, it is
   RECOMMENDED that it be carried in-band (cpresent is set to 1).

   ptime: RECOMMENDED duration of each packet in milliseconds.

Published specification:
   Payload format specifications are described in this document.
   Encoding specifications are provided in ISO/IEC 14496-3 [3][5].

Encoding considerations:
   This type is only defined for transfer via RTP.

Security considerations:
   See Section 6 of RFCXXXX.

Interoperability considerations:
   MPEG-4 Audio provides a large and rich set of tools for the coding of
   audio objects. For effective implementation of the standard, subsets
   of the MPEG-4 Audio tool sets similar to those used in MPEG-4 Visual
   have been provided (see section 5.1).

   The audio stream SHALL be compliant with the MPEG-4 Audio
   Profile@Level specified by the parameter "profile-level-id".
   Interoperability between a sender and a receiver may be achieved by
   specifying the parameter "profile-level-id" in MIME content, or by
   arranging in the capability exchange procedure to set this parameter
   mutually to the same value. Furthermore, the "object" parameter can
   be used to limit the capability within the specified Profile@Level in
   capability exchange.

Applications which use this media type:
   Audio and video streaming and conferencing tools.

Additional information: none

Person & email address to contact for further information:
   See Section 8 of RFCXXXX.

Intended usage: COMMON

Author/Change controller:
   See Section 8 of RFCXXXX.

5.4 SDP usage of MPEG-4 Audio

The MIME media type audio/MP4A string is mapped to fields in the Session
Description Protocol (SDP), RFC 2327, as follows:

o The MIME type (audio) goes in SDP "m=" as the media name.

o The MIME subtype (MP4A) goes in SDP "a=rtpmap" as the encoding name.

o The required parameter "rate" goes in "a=rtpmap" as the clock rate.

o The optional parameter "ptime" goes in the SDP "a=ptime" attribute.

o The optional parameter "profile-level-id" goes in the "a=fmtp" line to
indicate the coder capability. The "object" parameter goes in the
"a=fmtp" attribute. The payload-format-specific parameters "bitrate",
"cpresent" and "config" go in the "a=fmtp" line.
If the string after "config=" is large, the config data should not be
transmitted by SDP but should be transmitted in-band. These parameters
are expressed as a MIME media type string, in the form of a
semicolon-separated list of parameter=value pairs.

The following are some examples of the media representation in SDP:

For 6 kb/s CELP bitstreams (with an audio sampling rate of 8 kHz),
   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 MP4A/8000
   a=fmtp:96 profile-level-id=9;object=8;cpresent=0;config=9128B1071070
   a=ptime:20

For 64 kb/s AAC LC stereo bitstreams (with an audio sampling rate of 24
kHz),
   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 MP4A/24000
   a=fmtp:96 profile-level-id=1; bitrate=64000; cpresent=0;
      config=9122620000

In the above two examples, audio configuration data is not multiplexed
into the RTP payload and is described only in SDP. Furthermore, the
"clock rate" is set to the audio sampling rate.

If the clock rate has been set to its default value and it is necessary
to obtain the audio sampling rate, this can be done by parsing the
"config" parameter (see the following example).

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 MP4A/90000
   a=fmtp:96 object=8; cpresent=0; config=9128B1071070

The following example shows a case in which the audio configuration data
appears in the RTP payload.

   m=audio 49230 RTP/AVP 96
   a=rtpmap:96 MP4A/90000
   a=fmtp:96 object=13; cpresent=1

6. Security Considerations

RTP packets using the payload format defined in this specification are
subject to the security considerations discussed in the RTP
specification [8]. This implies that confidentiality of the media
streams is achieved by encryption. Because the data compression used
with this payload format is applied end-to-end, encryption may be
performed on the compressed data, so there is no conflict between the
two operations.

The complete MPEG-4 system allows for transport of a wide range of
content, including Java applets (MPEG-J) and scripts. Since this payload
format is restricted to audio and video streams, it is not possible to
transport such active content in this format.

7. References

1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9,
   RFC 2026, October 1996.

2  ISO/IEC 14496-2:1999, "Information technology - Coding of
   audio-visual objects - Part 2: Visual", December 1999.

3  ISO/IEC 14496-3:1999, "Information technology - Coding of
   audio-visual objects - Part 3: Audio", December 1999.

4  ISO/IEC 14496-2:1999/FDAM1:2000, December 1999.

5  ISO/IEC 14496-3:1999/FDAM1:2000, December 1999.

6  ISO/IEC 14496-1:1999, "Information technology - Coding of
   audio-visual objects - Part 1: Systems", December 1999.

7  Bradner, S., "Key words for use in RFCs to Indicate Requirement
   Levels", BCP 14, RFC 2119, March 1997.

8  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A
   Transport Protocol for Real-Time Applications", RFC 1889, January
   1996.

9  ISO/IEC 14496-2:1999/COR1:2000, "Information technology - Coding of
   audio-visual objects - Part 2: Visual, Technical corrigendum 1",
   August 2000.

10 ISO/IEC 14496-1:1999/FDAM1:2000, December 1999.

8. Authors' Addresses

Yoshihiro Kikuchi
Toshiba Corporation
1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki, 212-8582, Japan
Email: yoshihiro.kikuchi@toshiba.co.jp

Yoshinori Matsui
Matsushita Electric Industrial Co., Ltd.
1006, Kadoma, Kadoma-shi, Osaka, Japan
Email: matsui@drl.mei.co.jp

Toshiyuki Nomura
NEC Corporation
4-1-1, Miyazaki, Miyamae-ku, Kawasaki, Japan
Email: t-nomura@ccm.cl.nec.co.jp

Shigeru Fukunaga
Oki Electric Industry Co., Ltd.
1-2-27 Shiromi, Chuo-ku, Osaka 540-6025, Japan
Email: fukunaga444@oki.co.jp

Hideaki Kimata
Nippon Telegraph and Telephone Corporation
1-1, Hikari-no-oka, Yokosuka-shi, Kanagawa, Japan
Email: kimata@nttvdt.hil.ntt.co.jp

Full Copyright Statement

"Copyright (C) The Internet Society (date). All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.