idnits 2.17.1 

draft-ietf-avt-rtp-mpeg4-es-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 251 instances of too long lines in the document, the longest
     one being 4 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (September 18, 2000) is 8620 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? '1' on line 14 looks like a reference

  -- Missing reference section? '3' on line 719 looks like a reference

  -- Missing reference section? '5' on line 719 looks like a reference

  -- Missing reference section? '2' on line 587 looks like a reference

  -- Missing reference section? '4' on line 587 looks like a reference

  -- Missing reference section? '6' on line 44 looks like a reference

  -- Missing reference section? '7' on line 153 looks like a reference

  -- Missing reference section? '9' on line 587 looks like a reference

  -- Missing reference section? '8' on line 815 looks like a reference

  -- Missing reference section? '10' on line 688 looks like a reference


     Summary: 9 errors (**), 0 flaws (~~), 1 warning (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force                 Yoshihiro Kikuchi - Toshiba
2	Internet Draft                                       Toshiyuki Nomura - NEC
3	Document: draft-ietf-avt-rtp-mpeg4-es-04.txt         Shigeru Fukunaga - Oki
4	                                              Yoshinori Matsui - Matsushita
5	                                                       Hideaki Kimata - NTT
6	                                                         September 18, 2000

8	             RTP payload format for MPEG-4 Audio/Visual streams

10	Status of this Memo

12	   This document is an Internet-Draft and is in full conformance with all
13	      provisions of Section 10 of RFC2026 [1].

15	   Internet-Drafts are working documents of the Internet Engineering Task
16	   Force (IETF), its areas, and its working groups. Note that other groups
17	   may also distribute working documents as Internet-Drafts. Internet-Drafts
18	   are draft documents valid for a maximum of six months and may be updated,
19	   replaced, or obsoleted by other documents at any time. It is
20	   inappropriate to use Internet-Drafts as reference material or to cite
21	   them other than as "work in progress."
22	   The list of current Internet-Drafts can be accessed at
23	   http://www.ietf.org/ietf/1id-abstracts.txt
24	   The list of Internet-Draft Shadow Directories can be accessed at
25	   http://www.ietf.org/shadow.html.

27	                                   Abstract

29	   This document describes respective RTP payload formats for carrying each
30	   of MPEG-4 Audio and MPEG-4 Visual bitstreams without using MPEG-4
31	   Systems. For the purpose of directly mapping MPEG-4 Audio/Visual
32	   bitstreams onto RTP packets, it provides specifications for the use of
33	   RTP header fields and also specifies fragmentation rules. It also
34	   provides specifications for MIME type registrations and the use of SDP.

36	1. Introduction

38	   The RTP payload formats described in this document specify how
39	   MPEG-4 Audio [3][5] and MPEG-4 Visual streams [2][4] are to be fragmented
40	   and mapped directly onto RTP packets.

42	   These RTP payload formats allow MPEG-4 Audio/Visual streams to be carried
43	   without using the synchronization and stream management functionality of
44	   MPEG-4 Systems [6]. Such RTP payload formats will be used in systems that
45	   have intrinsic stream management functionality and thus require no such
46	   functionality in MPEG-4 Systems. H.323 terminals are an example of such
47	   systems. MPEG-4 Audio/Visual streams are not managed by MPEG-4 Systems
48	   Object Descriptors but by H.245. The streams are directly mapped onto RTP
49	   packets without using MPEG-4 Systems Sync Layer. Other examples are SIP
50	   and RTSP where MIME and SDP are used. MIME types and SDP usages of the
51	   RTP payload formats described in this document are defined to directly
52	   specify the attribute of Audio/Visual streams (e.g. media type,
53	   packetization format and codec configuration) without using MPEG-4
54	   Systems. It is basically the same approach as those taken by RTP payload
55	   formats for the existing audio/video codecs. The obvious benefit is that
56	   these MPEG-4 Audio/Visual RTP payload formats can be handled in a
57	   unified way together with those formats defined for non-MPEG-4 codecs.

59	   The semantics of RTP headers in such cases need to be clearly defined,
60	   including the association with MPEG-4 Audio/Visual data elements. In
61	   addition, it would be beneficial to define the fragmentation rules of RTP
62	   packets for MPEG-4 Video streams so as to enhance error resiliency by
63	   utilizing the error resilience tools provided inside the MPEG-4 Video
64	   stream.  These issues, however, have yet to be addressed by other MPEG-4
65	   RTP payload format specifications.

67	1.1 MPEG-4 Visual RTP payload format

69	   MPEG-4 Visual is a visual coding standard with many new features: high
70	   coding efficiency; high error resiliency; multiple, arbitrary shape
71	   object-based coding; etc. [2]. It covers a wide range of bitrates, from
72	   scores of kbps to several Mbps. It also covers a wide variety of
73	   networks, ranging from those guaranteed to be almost error-free to mobile
74	   networks with high error rates.

76	   With respect to the fragmentation rules for an MPEG-4 visual bitstream
77	   defined in this document, since MPEG-4 Visual is used for a wide variety
78	   of networks, it is desirable not to apply too much restriction on
79	   fragmentation, and a fragmentation rule such as "a single video packet
80	   shall always be mapped on a single RTP packet" may be inappropriate. On
81	   the other hand, careless, media unaware fragmentation may cause
82	   degradation in error resiliency and bandwidth efficiency. The
83	   fragmentation rules described in this document are flexible but manage to
84	   define the minimum rules for preventing meaningless fragmentation while
85	   utilizing the error resilience functionalities of MPEG-4 Visual.

87	   The fragmentation rule recommends not to map more than one VOP in an RTP
88	   packet so that RTP timestamp uniquely indicates the VOP time framing. On
89	   the other hand, MPEG-4 video may generate VOPs of very small size, in
90	   cases with a not-coded VOP containing only a VOP header or an arbitrarily
91	   shaped VOP with a small number of coding blocks. To reduce the overhead
92	   for such cases, the fragmentation rule permits concatenating multiple
93	   VOPs in an RTP packet. (See fragmentation rule (4) in section 3.2 and
94	   marker bit and timestamp in section 3.1.)

96	   While the additional media specific RTP header defined for such video
97	   coding tools as H.261 or MPEG-1/2 is effective in helping to recover
98	   picture headers corrupted by packet losses, MPEG-4 Visual already has
99	   error resilience functionalities for recovering corrupt headers, and
100	   these can be used on RTP/IP networks as well as on other networks
101	   (H.223/mobile, MPEG-2/TS, etc.). Therefore, no extra RTP header fields
102	   are defined in this MPEG-4 Visual RTP payload format.

104	1.2 MPEG-4 Audio RTP payload format

106	   MPEG-4 Audio is a new kind of audio standard that integrates many
107	   different types of audio coding tools. It also supports a mechanism for
108	   representing synthesized sounds. Low-overhead MPEG-4 Audio Transport
109	   Multiplex (LATM) manages the sequences of audio data with relatively
110	   small overhead. In audio-only applications, then, it is desirable for
111	   LATM-based MPEG-4 Audio bitstreams to be directly mapped onto the RTP
112	   packets without using MPEG-4 Systems.

114	   LATM has several multiplexing features:
115	   - Carrying configuration information with audio data,
116	   - Concatenation of multiple audio frames in one audio stream,
117	   - Multiplexing multiple objects (programs),
118	   - Multiplexing scalable layers.
119	   In RTP transmission, however, there is no need for the last two features,
120	   which multiplex payloads of different objects and scalable layers into
121	   one RTP packet. Therefore, these two features SHOULD NOT be used in
122	   applications based on RTP packetization specified by this document.

124	   For transmission of scalable streams, audio data of each layer should be
125	   packetized onto different RTP packets. On the other hand, all
126	   configuration data of the scalable streams are contained in one LATM
127	   configuration data "StreamMuxConfig" and every scalable layer shares the
128	   StreamMuxConfig. The mapping between each layer and its configuration
129	   data is achieved by LATM header information attached to the audio data.
130	   In order to indicate the dependency information of the scalable streams,
131	   a restriction is applied to the dynamic assignment rule of payload type
132	   (PT) values (see section 4.2).

134	   For MPEG-4 Audio coding tools except synthesis tools, as is true for
135	   other audio coders, if the payload of a packet is a single audio frame,
136	   packet loss will not impair the decodability of adjacent packets.  On the
137	   other hand, MPEG-4 Audio synthesis tools may be sensitive to errors. For
138	   example, an SA_access_unit in the payload may set a global value to a new
139	   value, which is then referenced throughout the audio content to make a
140	   macro change in the performance. In this case, an error in the payload
141	   influences all audio data produced after the error. In order to enhance
142	   error resiliency, the element of SA_access_unit that makes the above
143	   macro change should be transmitted repeatedly across several
144	   SA_access_units. The number of repetitions will depend on the network
145	   condition. Therefore, the additional media specific header for recovering
146	   errors will not be required for MPEG-4 Audio.

148	2. Conventions used in this document

150	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
151	   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
152	   document are to be interpreted as described in RFC-2119 [7].

154	3. RTP Packetization of MPEG-4 Visual bitstream

156	   This section specifies RTP packetization rules for MPEG-4 Visual content.
157	   An MPEG-4 Visual bitstream is mapped directly onto the RTP payload
158	   without any addition of extra header fields or any removal of Visual
159	   syntax elements. The Combined Configuration/Elementary stream mode is
160	   used so that configuration information will be carried to the same RTP
161	   port as the elementary stream (see 6.2.1 "Start codes" of ISO/IEC
162	   14496-2 [2][9][4]). The configuration information MAY additionally be
163	   specified by some out-of-band means; in H.323 terminals, H.245 codepoint
164	   "decoderConfigurationInformation" MAY be used for this purpose; in
165	   systems using MIME content type and SDP parameters, e.g. SIP and RTSP,
166	   the optional parameter "config" MAY be used to specify the configuration
167	   information. (see 5.1 and 5.2)

169	   When the short video header mode is used, the RTP payload format used MAY
170	   be that specified for H.263 in the relevant RFCs or in other relevant
171	   standards (e.g., RFC 2190 or RFC 2429).
172	   0                   1                   2                   3
173	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
174	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
175	   |V=2|P|X|  CC   |M|     PT      |       sequence number         | RTP
176	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
177	   |                           timestamp                           | Header
178	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
179	   |           synchronization source (SSRC) identifier            |
180	   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
181	   |            contributing source (CSRC) identifiers             |
182	   |                             ....                              |
183	   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
184	   |                                                               | RTP
185	   |       MPEG-4 Visual stream (byte aligned)                     | Payload
186	   |                                                               |
187	   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
188	   |                               :...OPTIONAL RTP padding        |
189	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

191	        Figure 1 - An RTP packet for MPEG-4 Visual stream
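
   As a rough illustration of the layout in Figure 1, the following Python
   sketch packs the fixed RTP header and appends the MPEG-4 Visual bytes
   unchanged. The function names and the choice of P=0 and X=0 are
   assumptions made for this example only; they are not defined by this
   document.

```python
import struct


def pack_rtp_header(payload_type, sequence_number, timestamp, ssrc,
                    marker=False, csrc_list=()):
    """Pack an RTP header with V=2, P=0, X=0 (hypothetical helper)."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    if len(csrc_list) > 15:
        raise ValueError("CC is a 4-bit field: at most 15 CSRC identifiers")
    version = 2
    byte0 = (version << 6) | len(csrc_list)      # V=2, P=0, X=0, CC
    byte1 = (int(marker) << 7) | payload_type    # M bit, then 7-bit PT
    header = struct.pack("!BBHII", byte0, byte1,
                         sequence_number & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    for csrc in csrc_list:
        header += struct.pack("!I", csrc & 0xFFFFFFFF)
    return header


def build_visual_rtp_packet(header, visual_bytes):
    """Append the byte-aligned visual stream with no extra header fields."""
    return header + visual_bytes
```

   With no CSRC identifiers the header is 12 bytes long, and the payload
   starts immediately after it.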

193	3.1 Use of RTP header fields for MPEG-4 Visual

195	   Payload Type (PT): Payload type is to be specifically assigned as the
196	   MPEG-4 Visual RTP payload format. If this assignment is to be carried out
197	   dynamically, it can be performed by such out-of-band means as H.245, SDP,
198	   etc.

200	   Extension (X) bit: Defined by the RTP profile used.

202	   Sequence Number: Incremented by one for each RTP data packet sent,
203	   starting, for security reasons, with a random initial value.

205	   Marker (M) bit: The marker bit is set to one to indicate the last RTP
206	   packet (or only RTP packet) of a VOP. When multiple VOPs are carried in
207	   the same RTP packet, the marker bit is set to 1.

209	   Timestamp: The timestamp indicates the composition time, or the
210	   presentation time in a no-compositor decoder. A constant offset, which is
211	   random, is added for security reasons. The detailed definition of the
212	   timestamp is as follows:
213	   - For a video object plane, it is defined as vop_time_increment (in units
214	     of 1/vop_time_increment_resolution seconds) plus the cumulative number
215	     of whole seconds specified by modulo_time_base and, if present,
216	     time_code of Group_of_VideoObjectPlane() fields.

218	   - In the case of interlaced video, a VOP will consist of lines from two
219	     fields, and the timestamp will indicate the composition time of the
220	     first field.
221	   - For a video object plane with short header, the timestamps (after the
222	     first random timestamp) are equal to the presentation time sequence
223	     associated with the semantics of the temporal_reference field.
224	     Specifically, each timestamp value SHALL be calculated by rounding the
225	     value of a precise clock that advances delta_time with each successive
226	     video object plane with short header. The time increment SHOULD be
227	     calculated as delta_time = ((temporal_reference + 256 -
228	     (temporal_reference of previous VOP)) modulo 256) * 1001/30000 for each
229	     successive video object plane with short header. The RTP timestamp
230	     should be consistently rounded or truncated to the resolution of the
231	     RTP timestamp field.
232	   - When multiple VOPs are carried in the same RTP packet, the timestamp
233	     indicates the earliest of the composition times within the VOPs carried
234	     in the RTP packet. Timestamp information of the rest of the VOPs is
235	     derived from the timestamp fields in the VOP header (modulo_time_base
236	     and vop_time_increment), or from the temporal_reference field in the
237	     case of short video header.
238	   - If the RTP packet contains only configuration information and/or
239	     Group_of_VideoObjectPlane() fields, the composition time of the next
240	     VOP in the coding order is used.
241	   - If the RTP packet contains only visual_object_sequence_end_code
242	     information, the composition time of the immediately preceding VOP in
243	     the coding order is used.

245	   The resolution of the timestamp is set to its default value of 90 kHz,
246	   unless specified by an out-of-band means (e.g. SDP parameter or MIME
247	   parameter as defined in section 5).
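
   The timestamp rules above can be sketched as follows. This is only an
   illustration: the function names are hypothetical, delta_time is taken
   to be in seconds, truncation is chosen as the consistent rounding
   mode, and the default 90 kHz resolution is assumed.

```python
from fractions import Fraction

# One temporal_reference step in seconds, per the formula in section 3.1.
FRAME_PERIOD = Fraction(1001, 30000)


def vop_timestamp(whole_seconds, vop_time_increment,
                  vop_time_increment_resolution,
                  clock_rate=90000, random_offset=0):
    """RTP timestamp for a VOP from its time fields (hypothetical helper)."""
    composition_time = whole_seconds + Fraction(
        vop_time_increment, vop_time_increment_resolution)
    return (random_offset + int(composition_time * clock_rate)) & 0xFFFFFFFF


def delta_time(temporal_reference, previous_temporal_reference):
    """Time advance in seconds between two short-header VOPs."""
    steps = (temporal_reference + 256 - previous_temporal_reference) % 256
    return steps * FRAME_PERIOD


def short_header_timestamps(temporal_references,
                            clock_rate=90000, random_offset=0):
    """Accumulate a precise clock, then truncate to RTP ticks."""
    if not temporal_references:
        return []
    precise = Fraction(0)
    previous = temporal_references[0]
    timestamps = []
    for tr in temporal_references:
        precise += delta_time(tr, previous)
        previous = tr
        ticks = int(precise * clock_rate)   # consistent truncation
        timestamps.append((random_offset + ticks) & 0xFFFFFFFF)
    return timestamps
```

   Keeping the running clock as an exact fraction avoids accumulating
   rounding error; at 90 kHz one temporal_reference step is exactly 3003
   ticks, and the modulo handles the wrap from 255 to 0.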

249	   SSRC, CC and CSRC fields are used as described in RFC 1889 [8].

251	3.2 Fragmentation of MPEG-4 Visual bitstream

253	   A fragmented MPEG-4 Visual bitstream is mapped directly onto the RTP
254	   payload without any addition of extra header fields or any removal of
255	   Visual syntax elements. The Combined Configuration/Elementary streams
256	   mode is used. The following rules apply for the fragmentation.

258	   (1) Configuration information and Group_of_VideoObjectPlane() fields
259	   SHALL be placed at the beginning of the RTP payload (just after the RTP
260	   header) or just after the header of the syntactically upper layer
261	   function.

263	   (2) If one or more headers exist in the RTP payload, the RTP payload
264	   SHALL begin with the header of the syntactically highest function.
265	   Note: The visual_object_sequence_end_code is regarded as the lowest
266	   function.

268	   (3) A header SHALL NOT be split into a plurality of RTP packets.

270	   (4) Different VOPs SHOULD be fragmented into different RTP packets so
271	   that one RTP packet consists of the data bytes associated with a unique
272	   presentation time (that is indicated in the timestamp field in the RTP
273	   packet header), with the exception that more than one integral number of
274	   consecutive VOPs MAY be carried within one RTP packet in the decoding
275	   order if the size of the VOPs is small.
276	   Note: When multiple VOPs are carried in one RTP payload, the presentation
277	   time of the VOPs after the first one may be calculated by the decoder.
278	   This operation is necessary only for RTP packets in which the marker bit
279	   equals one and the beginning of the RTP payload corresponds to a start
280	   code. (See timestamp and marker bit in section 3.1)

282	   (5) A single video packet SHOULD NOT be split into a plurality of RTP
283	   packets. The size of a video packet SHOULD be adjusted in such a way that
284	   the resulting RTP packet is not larger than the path-MTU. A video packet
285	   MAY be split into a plurality of RTP packets when the size of the video
286	   packet is large.
287	   Note: Rule (5) does not apply when the video packet is disabled by the
288	   coder configuration (by setting resync_marker_disable in the VOL header
289	   to 1), or in coding tools where the video packet is not supported. In
290	   this case, a VOP MAY be split at arbitrary byte-positions.

292	   Here, header means:
293	   - Configuration information (Visual Object Sequence Header, Visual Object
294	     Header and Video Object Layer Header)
295	   - visual_object_sequence_end_code
296	   - The header of the entry point function for an elementary stream
297	     (Group_of_VideoObjectPlane() or the header of VideoObjectPlane(),
298	     video_plane_with_short_header(), MeshObject() or FaceObject())
299	   - The video packet header (video_packet_header() excluding
300	     next_resync_marker())
301	   - The header of gob_layer()
302	   See 6.2.1 "Start codes" of ISO/IEC 14496-2[2][9][4] for the definition of
303	   the configuration information and the entry point functions.

305	   The video packet starts with the VOP header or the video packet header,
306	   followed by motion_shape_texture(), and ends with next_resync_marker() or
307	   next_start_code().
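
   One possible way to apply rules (2), (4) and (5) to a single VOP is
   sketched below. It assumes the encoder already delivers the VOP as a
   list of video packets, each beginning with its VOP header or video
   packet header, and that every header is shorter than the payload
   limit, so that rule (3) is not violated when an oversized video packet
   is split. The function name and the max_payload parameter are
   hypothetical.

```python
def packetize_vop(video_packets, max_payload):
    """Group the video packets of one VOP into RTP payloads.

    Returns a list of (payload, marker) tuples; the marker is True only
    for the last RTP packet of the VOP.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    payloads = []
    current = b""
    for vp in video_packets:
        if len(vp) > max_payload:
            # Rule (5): a large video packet MAY be split. Flush first so
            # that no later header lands behind a trailing fragment.
            if current:
                payloads.append(current)
                current = b""
            for offset in range(0, len(vp), max_payload):
                payloads.append(vp[offset:offset + max_payload])
        elif len(current) + len(vp) <= max_payload:
            # Whole video packets may share one RTP payload.
            current += vp
        else:
            payloads.append(current)
            current = vp
    if current:
        payloads.append(current)
    last = len(payloads) - 1
    return [(p, i == last) for i, p in enumerate(payloads)]
```

   Every payload produced this way either starts at a video packet
   boundary or is a fragment of a single oversized video packet, and the
   concatenation of all payloads reproduces the VOP unchanged.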

309	3.3 Examples of packetized MPEG-4 Visual bitstream

311	   Considering the fact that MPEG-4 Visual covers a wide variety of networks
312	   ranging from scores of Kbps to several Mbps, and from those guaranteed to
313	   be almost error-free to mobile networks with high error rates, it is
314	   desirable not to apply too much restriction on fragmentation. On the
315	   other hand, careless, media unaware fragmentation will cause degradation
316	   in error resiliency and bandwidth efficiency. The fragmentation criteria
317	   described in 3.2 are flexible but serve to define the minimum rules to
318	   prevent meaningless fragmentation.

320	   Figure 2 shows examples of RTP packets generated based on the criteria
321	   described in 3.2

323	   (a) is an example of the first RTP packet or the random access point of
324	   an MPEG-4 visual bitstream containing the configuration information.
325	   According to criterion (1), the Visual Object Sequence Header(VS header)
326	   is placed at the beginning of the RTP payload, preceding the Visual
327	   Object Header and the Video Object Layer Header(VO header, VOL header).
328	   Since the fragmentation rule defined in 3.2 guarantees that the
329	   configuration information, starting with
330	   visual_object_sequence_start_code, is always placed at the beginning of
331	   the RTP payload, RTP receivers can detect the random access point by
332	   checking if the first 32-bit field of the RTP payload is
333	   visual_object_sequence_start_code.
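
   A receiver-side check along these lines might look as follows. The
   start code value 0x000001B0 is taken from ISO/IEC 14496-2 rather than
   from this document and should be verified against that specification;
   the function name is hypothetical.

```python
# Assumed value of visual_object_sequence_start_code (ISO/IEC 14496-2).
VISUAL_OBJECT_SEQUENCE_START_CODE = 0x000001B0


def is_random_access_point(rtp_payload):
    """True if the payload begins with the sequence start code."""
    if len(rtp_payload) < 4:
        return False
    first_word = int.from_bytes(rtp_payload[:4], "big")
    return first_word == VISUAL_OBJECT_SEQUENCE_START_CODE
```

   Because only the first four payload bytes are inspected, the check
   needs no parsing of the rest of the visual bitstream.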

335	   (b) is another example of the RTP packet containing the configuration
336	   information. It differs from example (a) in that the RTP packet also
337	   contains a video packet in the VOP following the configuration
338	   information. Since the length of the configuration information is
339	   relatively short (typically scores of bytes) and an RTP packet containing
340	   only the configuration information may thus increase the overhead, the
341	   configuration information and the immediately following GOV and/or (a
342	   part of) VOP can be effectively packetized into a single RTP packet as in
343	   this example.

345	   (c) is an example of the RTP packet that contains
346	   Group_of_VideoObjectPlane(GOV). Following criterion (1), the GOV is
347	   placed at the beginning of the RTP payload. It would be a waste of RTP/IP
348	   header overhead to generate an RTP packet containing only a GOV whose
349	   length is 7 bytes. Therefore, (a part of) the following VOP can be placed
350	   in the same RTP packet as shown in (c).

352	   (d) is an example of the case where one video packet is packetized into
353	   one RTP packet. When the packet-loss rate of the underlying network is
354	   high, this kind of packetization is recommended. It is recommended to set
355	   resync_marker_disable to 0 in the VOL header to enable the adjustment of
356	   the video packet size. Even when the RTP packet containing the VOP header
357	   is discarded by a packet loss, the other RTP packets can be decoded by
358	   using the HEC(Header Extension Code) information in the video packet
359	   header. No extra RTP header field is necessary.

361	   (e) is an example of the case where more than one video packet is
362	   packetized into one RTP packet. This kind of packetization is effective
363	   in reducing the overhead of RTP/IP headers when the bit-rate of the
364	   underlying network is low. However, it will decrease the packet-loss
365	   resiliency because multiple video packets are discarded by a single RTP
366	   packet loss. The optimal number of video packets in an RTP packet and the
367	   length of the RTP packet can be determined considering the packet-loss
368	   rate and the bit-rate of the underlying network.

370	   (f) is an example of the case when the video packet is disabled by
371	   setting resync_marker_disable in the VOL header to 1. In this case, a VOP
372	   may be split into a plurality of RTP packets at arbitrary byte-positions.
373	   For example, it is possible to split a VOP into fixed-length packets.
374	   This kind of coder configuration and RTP packet fragmentation may be used
375	   when the underlying network is guaranteed to be error-free. On the other
376	   hand, it is not recommended to use it in error-prone environments since it
377	   provides only poor packet loss resiliency.

379	   Figure 3 shows examples of RTP packets prohibited by the criteria of 3.2.

381	   Fragmentation of a header into multiple RTP packets, as in (a), will not
382	   only increase the overhead of RTP/IP headers but also decrease the error
383	   resiliency. Therefore, it is prohibited by the criterion (3).

385	   When concatenating more than one video packet into an RTP packet, VOP
386	   header or video_packet_header() shall not be placed in the middle of the
387	   RTP payload. The packetization as in (b) is not allowed by criterion (2)
388	   for reasons of error resiliency. Comparing this example with
389	   Figure 2(d), although two video packets are mapped onto two RTP packets
390	   in both cases, the packet-loss resiliency is not identical. Namely, if
391	   the second RTP packet is lost, both video packets 1 and 2 are lost in the
392	   case of Figure 3(b) whereas only video packet 2 is lost in the case of
393	   Figure 2(d).

395	       +------+------+------+------+
396	   (a) | RTP  |  VS  |  VO  | VOL  |
397	       |header|header|header|header|
398	       +------+------+------+------+

400	       +------+------+------+------+------------+
401	   (b) | RTP  |  VS  |  VO  | VOL  |Video Packet|
402	       |header|header|header|header|            |
403	       +------+------+------+------+------------+

405	       +------+-----+------------------+
406	   (c) | RTP  | GOV |Video Object Plane|
407	       |header|     |                  |
408	       +------+-----+------------------+

410	       +------+------+------------+  +------+------+------------+
411	   (d) | RTP  | VOP  |Video Packet|  | RTP  |  VP  |Video Packet|
412	       |header|header|    (1)     |  |header|header|    (2)     |
413	       +------+------+------------+  +------+------+------------+

415	       +------+------+------------+------+------------+------+------------+
416	   (e) | RTP  |  VP  |Video Packet|  VP  |Video Packet|  VP  |Video Packet|
417	       |header|header|     (1)    |header|    (2)     |header|    (3)     |
418	       +------+------+------------+------+------------+------+------------+

420	       +------+------+------------+  +------+------------+
421	   (f) | RTP  | VOP  |VOP fragment|  | RTP  |VOP fragment|
422	       |header|header|    (1)     |  |header|    (2)     |
423	       +------+------+------------+  +------+------------+

425	        Figure 2 - Examples of RTP packetized MPEG-4 Visual bitstream

427	       +------+-------------+  +------+------------+------------+
428	   (a) | RTP  |First half of|  | RTP  |Last half of|Video Packet|
429	       |header|  VP header  |  |header|  VP header |            |
430	       +------+-------------+  +------+------------+------------+

432	       +------+------+----------+  +------+---------+------+------------+
433	   (b) | RTP  | VOP  |First half|  | RTP  |Last half|  VP  |Video Packet|
434	       |header|header| of VP(1) |  |header| of VP(1)|header|    (2)     |
435	       +------+------+----------+  +------+---------+------+------------+

437	   Figure 3 - Examples of prohibited RTP packetization for MPEG-4 Visual
438	   bitstream

440	4. RTP Packetization of MPEG-4 Audio bitstream

442	   This section specifies RTP packetization rules for MPEG-4 Audio
443	   bitstreams. MPEG-4 Audio streams are formatted by LATM (Low-overhead
444	   MPEG-4 Audio Transport Multiplex) tool[5], and the LATM-based streams are
445	   then mapped onto RTP packets as described in the three sections below.

447	4.1 RTP Packet Format

449	   LATM-based streams consist of a sequence of audioMuxElements that include
450	   one or more audio frames. A complete audioMuxElement or a part of one
451	   SHALL be mapped directly onto an RTP payload without any removal of
452	   audioMuxElement syntax elements (see Figure 4). The first byte of each
453	   audioMuxElement SHALL be located at the first payload location in an RTP
454	   packet.

456	   0                   1                   2                   3
457	   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
458	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
459	   |V=2|P|X|  CC   |M|     PT      |       sequence number         |RTP
460	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
461	   |                           timestamp                           |Header
462	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
463	   |           synchronization source (SSRC) identifier            |
464	   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
465	   |            contributing source (CSRC) identifiers             |
466	   |                             ....                              |
467	   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
468	   |                                                               |RTP
469	   :                 audioMuxElement (byte aligned)                :Payload
470	   |                                                               |
471	   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
472	   |                               :...OPTIONAL RTP padding        |
473	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
474	                Figure 4 - An RTP packet for MPEG-4 Audio

476	   In order to decode the audioMuxElement, the following muxConfigPresent
477	   information is required to be indicated by an out-of-band means.

479	   muxConfigPresent: If this value is set to 1, the audioMuxElement SHALL
480	   include an indication bit "useSameStreamMux" and MAY include the
481	   configuration information for audio compression "StreamMuxConfig". The
482	   useSameStreamMux bit indicates whether the StreamMuxConfig element in the
483	   previous frame is applied in the current frame.
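As an illustration of the muxConfigPresent rule above, the following minimal Python sketch inspects the first bit of an RTP payload that begins an audioMuxElement. It assumes, following the LATM syntax in ISO/IEC 14496-3, that useSameStreamMux is the first (most significant) bit of the element and that a StreamMuxConfig follows when the bit is 0. The function name is hypothetical and is not defined by this document.

```python
def carries_new_stream_mux_config(payload, mux_config_present):
    """Return True if this audioMuxElement is expected to carry a StreamMuxConfig.

    payload: bytes of an RTP payload that starts an audioMuxElement.
    mux_config_present: value signaled out of band (0 or 1).
    Assumption: useSameStreamMux is the MSB of the first payload octet.
    """
    if not mux_config_present:
        # No in-band indication bit at all; configuration comes from out of band.
        return False
    if not payload:
        raise ValueError("empty payload")
    use_same_stream_mux = payload[0] >> 7
    # 1 means the previous StreamMuxConfig still applies; 0 means a new one follows.
    return use_same_stream_mux == 0
```

A receiver could use such a check to decide whether to re-parse the configuration before decoding the audio frames in the element.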

485	4.2 Use of RTP Header Fields for MPEG-4 Audio

487	   Payload Type (PT): Payload type is to be specifically assigned as the
488	   MPEG-4 Audio RTP payload format. If this assignment is to be carried out
489	   dynamically, it can be performed by such out-of-band means as H.245, SDP,
490	   etc. In the dynamic assignment of RTP payload types for scalable streams,
491	   a different value should be assigned to each layer. The assigned values
492	   should be in order of enhancement layer dependency, where the base layer has
493	   the smallest value.

495	   Marker (M) bit: The marker bit indicates audioMuxElement boundaries. It
496	   is set to one to indicate that the RTP packet contains a complete
497	   audioMuxElement or the last fragment of an audioMuxElement.

499	   Timestamp: The timestamp indicates composition time, or presentation time
500	   in a no-compositor decoder. Timestamps are recommended to start at a
501	   random value for security reasons.

503	   Unless specified by an out-of-band means, the resolution of the timestamp
504	   is set to its default value of 90 kHz.

506	   Sequence Number: Incremented by one for each RTP packet sent, starting,
507	   for security reasons, with a random value.

509	   SSRC, CC and CSRC fields are used as described in RFC 1889 [8].
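The header field usage above can be sketched in Python as follows. This is an illustrative sketch only: it packs the 12-octet fixed RTP header with no padding, no extension and no CSRC entries, and the function names are hypothetical. The marker argument is expected to be true for a complete audioMuxElement or for its last fragment.

```python
import secrets
import struct


def random_initial_values():
    """Pick random starting values for the sequence number and timestamp."""
    return secrets.randbits(16), secrets.randbits(32)


def build_rtp_packet(payload, payload_type, seq, timestamp, ssrc, marker):
    """Prepend a fixed RTP header (V=2, P=0, X=0, CC=0) to the payload."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    byte0 = 2 << 6                              # version 2, no padding/extension/CSRC
    byte1 = (int(bool(marker)) << 7) | payload_type
    header = struct.pack(
        "!BBHII",
        byte0,
        byte1,
        seq & 0xFFFF,                           # sequence number wraps at 16 bits
        timestamp & 0xFFFFFFFF,                 # timestamp wraps at 32 bits
        ssrc & 0xFFFFFFFF,
    )
    return header + payload
```

A sender would call random_initial_values() once per stream, then increment the sequence number by one for each packet sent.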

511	4.3 Fragmentation of MPEG-4 Audio bitstream

513	   It is desirable to put one audioMuxElement in each RTP packet. If the
514	   size of an audioMuxElement can be kept small enough that the size of the
515	   RTP packet containing it does not exceed the size of the path-MTU, this
516	   will be no problem. If it cannot, the audioMuxElement MAY be fragmented
517	   and spread across multiple packets, following the rules below:

519	   (1) "payloadMux", which consists of payload elements, MAY be fragmented
520	   across several RTP packets, so that each of those RTP packets will
521	   contain one or more payload elements. Individual payload elements
522	   themselves SHOULD NOT be fragmented.

524	   (2) If the audioMuxElement includes StreamMuxConfig, StreamMuxConfig
525	   SHALL be included in the RTP packet that contains the first payload
526	   element.
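The two fragmentation rules above can be sketched in Python as follows. This is a simplified sketch, not a complete LATM packetizer. It assumes that the part of the audioMuxElement preceding the first payload element (for example, StreamMuxConfig) and each payload element are already available as separate byte strings. The function and parameter names are hypothetical.

```python
def fragment_audio_mux_element(prefix, payload_elements, max_payload):
    """Split one audioMuxElement into RTP payloads.

    prefix: bytes that precede the first payload element (may be empty).
    payload_elements: list of bytes objects, one per payload element.
    max_payload: largest RTP payload size the path MTU allows.
    Returns a list of (payload_bytes, marker) tuples; marker is True only
    for the last fragment.
    """
    if not payload_elements:
        raise ValueError("need at least one payload element")
    packets = []
    current = bytes(prefix)      # rule (2): prefix travels with the first element
    count = 0                    # payload elements placed in the current packet
    for element in payload_elements:
        # Start a new packet only at a payload element boundary (rule (1)).
        if count > 0 and len(current) + len(element) > max_payload:
            packets.append(current)
            current = b""
            count = 0
        current += element
        count += 1
    packets.append(current)
    last = len(packets) - 1
    return [(data, index == last) for index, data in enumerate(packets)]
```

Note that a single payload element larger than max_payload is still emitted whole in this sketch, since individual payload elements SHOULD NOT be fragmented; a real sender would need a policy for that case.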

528	5. MIME type registration for MPEG-4 Audio/Visual streams

530	   The following sections describe the MIME type registrations for MPEG-4
531	   Audio/Visual streams. MIME type registration and SDP usage for the MPEG-4
532	   Visual stream are described in Sections 5.1 and 5.2, respectively, while
533	   MIME type registration and SDP usage for MPEG-4 Audio stream are
534	   described in Sections 5.3 and 5.4, respectively.

536	   (In the following sections, "XXXX" represents the RFC number to be
537	   assigned to this document.)

539	5.1 MIME type registration for MPEG-4 Visual

541	   MIME media type name: video

543	   MIME subtype name: MP4V

545	   Required parameters: none

547	   Optional parameters:
548	     rate: This parameter is used only for RTP transport. It indicates the
549	     resolution of the timestamp field in the RTP header. If this parameter
550	     is not specified, its default value of 90000 (90 kHz) is used.

552	     profile-level-id: A decimal representation of MPEG-4 Visual Profile
553	     Level indication value (profile_and_level_indication) defined in Table
554	     G-1 of ISO/IEC 14496-2 [2][4]. This parameter MAY be used in the
555	     capability exchange or session setup procedure to indicate MPEG-4
556	     Visual Profile and Level combination of which the MPEG-4 Visual codec
557	     is capable. If this parameter is not specified by the procedure, its
558	     default value of 1 (Simple Profile/Level 1) is used.

560	     config: This parameter indicates the configuration of the
561	     corresponding MPEG-4 visual bitstream. It SHALL NOT be used to
562	     indicate the codec capability in the capability exchange procedure. It
563	     is a hexadecimal representation of an octet string that expresses the
564	     MPEG-4 Visual configuration information, as defined in subclause 6.2.1
565	     Start codes of ISO/IEC14496-2[2][4][9]. The configuration information
566	     is mapped onto the octet string in an MSB-first basis. The first bit
567	     of the configuration information SHALL be located at the MSB of the
568	     first octet. The configuration information indicated by this parameter
569	     SHALL be the same as the configuration information in the
570	     corresponding MPEG-4 Visual stream, except for
571	     first_half_vbv_occupancy and latter_half_vbv_occupancy, if they exist,
572	     which may vary in the repeated configuration information inside an
573	     MPEG-4 Visual stream (See 6.2.1 Start codes of ISO/IEC14496-2).

575	     Example usages for these parameters are:
576	       - MPEG-4 Visual Simple Profile/Level 1:
577	          Content-type: video/mp4v; profile-level-id=1

579	       - MPEG-4 Visual Core Profile/Level 2:
580	          Content-type: video/mp4v; profile-level-id=34

582	       - MPEG-4 Visual Advanced Real Time Simple Profile/Level 1:
583	          Content-type: video/mp4v; profile-level-id=145

585	   Published specification:
586	     The specifications for MPEG-4 Visual streams are presented in ISO/IEC
587	     14496-2[2][4][9]. The RTP payload format is described in RFCXXXX.

589	   Encoding considerations:
590	     Video bitstreams must be generated according to MPEG-4 Visual
591	     specifications (ISO/IEC 14496-2). A video bitstream is binary data and
592	     must be encoded for non-binary transport (for Email, the Base64
593	     encoding is sufficient).  This type is also defined for transfer via
594	     RTP. The RTP packets MUST be packetized according to the MPEG-4 Visual
595	     RTP payload format defined in RFCXXXX.

597	   Security considerations:
598	     See section 6 of RFCXXXX.

600	   Interoperability considerations:
601	     MPEG-4 Visual provides a large and rich set of tools for the coding of
602	     visual objects. For effective implementation of the standard, subsets
603	     of the MPEG-4 Visual tool sets have been provided for use in specific
604	     applications. These subsets, called 'Profiles', limit the size of the
605	     tool set a decoder is required to implement. In order to restrict
606	     computational complexity, one or more Levels are set for each Profile.
607	     A Profile@Level combination allows:

609	     o a codec builder to implement only the subset of the standard he
610	     needs, while maintaining interworking with other MPEG-4 devices
611	     included in the same combination, and

613	     o checking whether MPEG-4 devices comply with the standard
614	     ('conformance testing').

616	     The visual stream SHALL be compliant with the MPEG-4 Visual
617	     Profile@Level specified by the parameter "profile-level-id".
618	     Interoperability between a sender and a receiver may be achieved by
619	     specifying the parameter "profile-level-id" in MIME content, or by
620	     arranging in the capability exchange/announcement procedure to set this
621	     parameter mutually to the same value.

623	   Applications which use this media type:
624	     Audio and visual streaming and conferencing tools, Internet messaging
625	     and Email applications.

627	   Additional information: none

629	   Person & email address to contact for further information:
630	     The authors of RFCXXXX. (See section 8)

632	   Intended usage: COMMON

634	   Author/Change controller:
635	     The authors of RFCXXXX. (See section 8)

637	5.2 SDP usage of MPEG-4 Visual
638	   The MIME media type video/MP4V string is mapped to fields in the Session
639	   Description Protocol (SDP), RFC 2327, as follows:

641	   o The MIME type (video) goes in SDP "m=" as the media name.

643	   o The MIME subtype (MP4V) goes in SDP "a=rtpmap" as the encoding name.

645	   o The optional parameter "rate" goes in "a=rtpmap" as the clock rate.

647	   o The optional parameter "profile-level-id" and "config" MAY go in the
648	   "a=fmtp" line to indicate the coder capability and configuration,
649	   respectively. These parameters are expressed as a MIME media type string,
650	   in the form of a semicolon-separated list of parameter=value pairs.

652	   The following are some examples of media representation in SDP:

654	   Simple Profile/Level 1, rate=90000 (90 kHz), "profile-level-id" and
655	   "config" are present in "a=fmtp" line:
656	     m=video 49170/2 RTP/AVP 98
657	     a=rtpmap:98 MP4V/90000
658	     a=fmtp:98 profile-level-id=1;config=000001B001000001B50900000100
659	        00000120008440FA282C2090A21F

661	   Core Profile/Level 2, rate=90000 (90 kHz), "profile-level-id" is present in
662	   "a=fmtp" line:
663	     m=video 49170/2 RTP/AVP 98
664	     a=rtpmap:98 MP4V/90000
665	     a=fmtp:98 profile-level-id=34

667	   Advanced Real Time Simple Profile/Level 1, rate=25 (25 Hz),
668	   "profile-level-id" is present in "a=fmtp" line:
669	     m=video 49170/2 RTP/AVP 98
670	     a=rtpmap:98 MP4V/25
671	     a=fmtp:98 profile-level-id=145
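For illustration, a receiver might read the "a=fmtp" lines in the examples above with a small Python sketch such as the following. It assumes the whole attribute is on a single line (the config value in the first example is wrapped only for layout), and the function names are hypothetical. The default of 1 for "profile-level-id" follows Section 5.1.

```python
def parse_fmtp(line):
    """Parse 'a=fmtp:<pt> name=value;name=value' into (pt, dict of strings)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    payload_type, _, params_text = line[len(prefix):].partition(" ")
    params = {}
    for item in params_text.split(";"):
        name, sep, value = item.strip().partition("=")
        if sep:                                  # skip empty or malformed items
            params[name.strip().lower()] = value.strip()
    return int(payload_type), params


def visual_profile_level(params):
    """Return profile-level-id, falling back to 1 (Simple Profile/Level 1)."""
    return int(params.get("profile-level-id", 1))
```

The "config" value is left as a hexadecimal string here; bytes.fromhex() can convert it to the octet string when needed.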

673	5.3 MIME type registration of MPEG-4 Audio

675	   MIME media type name: audio

677	   MIME subtype name: MP4A

679	   Required parameters:
680	     rate: the rate parameter indicates the RTP time stamp clock rate. The
681	     default value is 90000. Other rates CAN be specified only if they are
682	     set to the same value as the audio sampling rate (number of samples
683	     per second).

685	   Optional parameters:

687	     profile-level-id: a decimal representation of MPEG-4 Audio Profile
688	     Level indication value defined in ISO/IEC 14496-1 [10]. This parameter
689	     indicates which MPEG-4 Audio tool subsets the decoder is capable of
690	     using. If this parameter is not specified in the capability exchange
691	     or session setup procedure, its default value of 30 (Natural Audio
692	     Profile/Level 1) is used.

694	     object: a decimal representation of the MPEG-4 Audio Object Type value
695	     defined in ISO/IEC 14496-3 [5]. This parameter specifies the tool to
696	     be used by the coder. It CAN be used to limit the capability within
697	     the specified "profile-level-id".

699	     bitrate: the data rate for the audio bit stream.

701	     cpresent: this parameter indicates whether audio payload configuration
702	     data has been multiplexed into an RTP payload (See section 4.1 in this
703	     document). The default value is 1.

705	     config: a hexadecimal representation of an octet string that expresses
706	     the audio payload configuration data "StreamMuxConfig", as defined in
707	     ISO/IEC 14496-3 [5]. Configuration data is mapped onto the octet
708	     string in an MSB-first basis. The first bit of the configuration data
709	     SHALL be located at the MSB of the first octet. In the last octet,
710	     zero-padding bits, if necessary, shall follow the configuration data.
711	     If the configuration data is large, it is RECOMMENDED that it be
712	     carried in-band in the RTP payload (cpresent set to 1) rather than
713	     in this parameter.

715	     ptime: RECOMMENDED duration of each packet in milliseconds.
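The MSB-first mapping and zero-padding described for the "config" parameter can be sketched in Python as follows. The sketch assumes the configuration data is available as a string of '0' and '1' characters in transmission order, and the function names are hypothetical.

```python
def bits_to_config_hex(bits):
    """Map a bit string onto octets MSB first, zero-padding the last octet."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    padded = bits + "0" * (-len(bits) % 8)       # pad up to an octet boundary
    octets = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return octets.hex().upper()


def config_hex_to_bytes(config):
    """Convert the hexadecimal 'config' value back to its octet string."""
    return bytes.fromhex(config)
```

Because padding bits are not distinguishable from data in the octet string, a receiver has to rely on the StreamMuxConfig syntax itself to find where the configuration ends.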

717	   Published specification:
718	     Payload format specifications are described in this document. Encoding
719	     specifications are provided in ISO/IEC 14496-3 [3][5].

721	   Encoding considerations:
722	     This type is only defined for transfer via RTP.

724	   Security considerations:
725	     See Section 6 of RFCXXXX.

727	   Interoperability considerations:
728	     MPEG-4 Audio provides a large and rich set of tools for the coding of
729	     audio objects. For effective implementation of the standard, subsets of
730	     the MPEG-4 Audio tool sets similar to those used in MPEG-4 Visual have
731	     been provided (see section 5.1).

733	     The audio stream SHALL be compliant with the MPEG-4 Audio
734	     Profile@Level specified by the parameter "profile-level-id".
735	     Interoperability between a sender and a receiver may be achieved by
736	     specifying the parameter "profile-level-id" in MIME content, or by
737	     arranging in the capability exchange procedure to set this parameter
738	     mutually to the same value. Furthermore, the "object" parameter can be
739	     used to limit the capability within the specified Profile@Level in
740	     capability exchange.

742	   Applications which use this media type:
743	     Audio and video streaming and conferencing tools.

745	   Additional information: none

747	   Person & email address to contact for further information:
748	     See Section 8 of RFCXXXX.

750	   Intended usage: COMMON

752	   Author/Change controller:
753	     See Section 8 of RFCXXXX.

755	5.4 SDP usage of MPEG-4 Audio

757	   The MIME media type audio/MP4A string is mapped to fields in the Session
758	   Description Protocol (SDP), RFC 2327, as follows:

760	   o The MIME type (audio) goes in SDP "m=" as the media name.

762	   o The MIME subtype (MP4A) goes in SDP "a=rtpmap" as the encoding name.

764	   o The required parameter "rate" goes in "a=rtpmap" as the clock rate.

766	   o The optional parameter "ptime" goes in SDP "a=ptime" attribute.

768	   o The optional parameter "profile-level-id" goes in the "a=fmtp" line to
769	   indicate the coder capability. The "object" parameter goes in the
770	   "a=fmtp" attribute. The payload-format-specific parameters "bitrate",
771	   "cpresent" and "config" go in the "a=fmtp" line. If the string after
772	   "config=" is quite large, such large config data should not be
773	   transmitted by SDP but should be transmitted by in-band mode. These
774	   parameters are expressed as a MIME media type string, in the form of a
775	   semicolon-separated list of parameter=value pairs.

777	   The following are some examples of the media representation in SDP:

779	   For 6 kb/s CELP bitstreams (with an audio sampling rate of 8 kHz),
780	     m=audio 49230 RTP/AVP 96
781	     a=rtpmap:96 MP4A/8000
782	     a=fmtp:96 profile-level-id=9;object=8;cpresent=0;config=9128B1071070
783	     a=ptime:20

785	   For 64 kb/s AAC LC stereo bitstreams (with an audio sampling rate of 24
786	   kHz),
787	     m=audio 49230 RTP/AVP 96
788	     a=rtpmap:96 MP4A/24000
789	     a=fmtp:96 profile-level-id=1; bitrate=64000; cpresent=0;
790	     config=9122620000

792	   In the above two examples, audio configuration data is not multiplexed
793	   into the RTP payload and is described only in SDP. Furthermore, the
794	   "clock rate" is set to the audio sampling rate.

796	   If the clock rate has been set to its default value and it is necessary
797	   to obtain the audio sampling rate, this can be done by parsing the
798	   "config" parameter (see the following example).

800	     m=audio 49230 RTP/AVP 96
801	     a=rtpmap:96 MP4A/90000
802	     a=fmtp:96 object=8; cpresent=0; config=9128B1071070

804	   The following example shows that the audio configuration data appears in
805	   the RTP payload.

807	   m=audio 49230 RTP/AVP 96
808	   a=rtpmap:96 MP4A/90000
809	   a=fmtp:96 object=13; cpresent=1
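As a rough illustration of the mapping in this section, the following Python sketch assembles the SDP lines for an MPEG-4 Audio stream from its MIME parameters. The function name and argument layout are hypothetical, and the sketch does not validate parameter values.

```python
def audio_sdp_lines(port, payload_type, rate=90000, fmtp=None, ptime=None):
    """Build the m=, a=rtpmap, a=fmtp and a=ptime lines for audio/MP4A.

    fmtp: optional dict of payload-format parameters such as
    profile-level-id, object, bitrate, cpresent and config.
    """
    lines = [
        f"m=audio {port} RTP/AVP {payload_type}",
        f"a=rtpmap:{payload_type} MP4A/{rate}",   # rate is the RTP clock rate
    ]
    if fmtp:
        pairs = "; ".join(f"{name}={value}" for name, value in fmtp.items())
        lines.append(f"a=fmtp:{payload_type} {pairs}")
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    return lines
```

Calling audio_sdp_lines(49230, 96, fmtp={"object": 13, "cpresent": 1}) reproduces the last example above, in which the configuration data is carried in the RTP payload.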

811	6. Security Considerations

813	   RTP packets using the payload format defined in this specification are
814	   subject to the security considerations discussed in the RTP specification
815	   [8]. This implies that confidentiality of the media streams is achieved
816	   by encryption. Because the data compression used with this payload format
817	   is applied end-to-end, encryption may be performed on the compressed data
818	   so there is no conflict between the two operations.

820	   The complete MPEG-4 system allows for transport of a wide range of
821	   content, including Java applets (MPEG-J) and scripts.  Since this payload
822	   format is restricted to audio and video streams, it is not possible to
823	   transport such active content in this format.

825	7. References

827	   1  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9,
828	      RFC 2026, October 1996.

830	   2 ISO/IEC 14496-2:1999, "Information technology - Coding of audio-visual
831	      objects - Part2: Visual", December 1999.

833	   3 ISO/IEC 14496-3:1999, "Information technology - Coding of audio-visual
834	      objects - Part3: Audio", December 1999.

836	   4 ISO/IEC 14496-2:1999/FDAM1:2000, December 1999.

838	   5 ISO/IEC 14496-3:1999/FDAM1:2000, December 1999.

840	   6 ISO/IEC 14496-1:1999, "Information technology - Coding of audio-visual
841	      objects - Part1: Systems", December 1999.

843	   7  Bradner, S., "Key words for use in RFCs to Indicate Requirement
844	      Levels", BCP 14, RFC 2119, March 1997

846	   8 H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson "RTP: A Transport
847	      Protocol for Real Time Applications",  RFC 1889, Internet Engineering
848	      Task Force, January 1996.

850	   9  ISO/IEC 14496-2:1999/COR1:2000, "Information technology - Coding of
851	      audio-visual objects - Part2: Visual, Technical corrigendum 1", August
852	      2000.

854	   10 ISO/IEC 14496-1:1999/FDAM1:2000, December 1999.

856	8. Authors' Addresses

858	   Yoshihiro Kikuchi
859	   Toshiba Corporation
860	   1, Komukai Toshiba-cho, Saiwai-ku, Kawasaki, 212-8582, Japan
861	   Email: yoshihiro.kikuchi@toshiba.co.jp

863	   Yoshinori Matsui
864	   Matsushita Electric Industrial Co., LTD.
865	   1006, Kadoma, Kadoma-shi, Osaka, Japan
866	   Email: matsui@drl.mei.co.jp

868	   Toshiyuki Nomura
869	   NEC Corporation
870	   4-1-1,Miyazaki,Miyamae-ku,Kawasaki,JAPAN
871	   Email: t-nomura@ccm.cl.nec.co.jp

873	   Shigeru Fukunaga
874	   Oki Electric Industry Co., Ltd.
875	   1-2-27 Shiromi, Chuo-ku, Osaka 540-6025 Japan.
876	   Email: fukunaga444@oki.co.jp

878	   Hideaki Kimata
879	   Nippon Telegraph and Telephone Corporation
880	   1-1, Hikari-no-oka, Yokosuka-shi, Kanagawa, Japan
881	   Email: kimata@nttvdt.hil.ntt.co.jp

883	Full Copyright Statement

885	   "Copyright (C) The Internet Society (date). All Rights Reserved.

887	   This document and translations of it may be copied and furnished to
888	   others, and derivative works that comment on or otherwise explain it
889	   or assist in its implementation may be prepared, copied, published
890	   and distributed, in whole or in part, without restriction of any
891	   kind, provided that the above copyright notice and this paragraph
892	   are included on all such copies and derivative works. However, this
893	   document itself may not be modified in any way, such as by removing
894	   the copyright notice or references to the Internet Society or other
895	   Internet organizations, except as needed for the purpose of
896	   developing Internet standards in which case the procedures for
897	   copyrights defined in the Internet Standards process must be
898	   followed, or as required to translate it into languages other than
899	   English.

901	   The limited permissions granted above are perpetual and will not be
902	   revoked by the Internet Society or its successors or assigns.