idnits 2.17.1 

draft-wenger-avt-rtp-jvt-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The abstract seems to contain references ([2]), which it shouldn't. 
     Please replace those with straight textual mentions of the documents in
     question.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 327: '...and, hence, PSIs SHOULD NOT be used in...'
     RFC 2119 keyword, line 330: '... such cases PSIs MAY be used.  Severe ...'
     RFC 2119 keyword, line 333: '...his reason, PSIs MUST NOT be used in a...'
     RFC 2119 keyword, line 336: '...rotocol messages MUST NOT be used that...'
     RFC 2119 keyword, line 343: '...he PSIs (when used) SHOULD be conveyed...'
     (39 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119.  Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).
     
     Found 'MUST not' in this paragraph:
     
     Marker bit (M): 1 bit Set for the very last packet of the picture
     indicated by the RTP timestamp, in line with the normal use of the M bit
     and to allow an efficient playout buffer handling.  Decoders MAY use this
     bit as an early indication of the last packet of a coded picture, but
     MUST not rely on this property because the last packet of the picture may
     get lost, and because the use of MTAPs does not always preserve the M bit.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (December 2002) is 7803 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? '2' on line 815 looks like a reference

  -- Missing reference section? '1' on line 812 looks like a reference

  -- Missing reference section? '3' on line 817 looks like a reference

  -- Missing reference section? '4' on line 818 looks like a reference

  -- Missing reference section? '5' on line 819 looks like a reference

  -- Missing reference section? '6' on line 822 looks like a reference

  -- Missing reference section? '7' on line 824 looks like a reference

  -- Missing reference section? '8' on line 826 looks like a reference

  -- Missing reference section? '9' on line 827 looks like a reference

  -- Missing reference section? '10' on line 830 looks like a reference

  -- Missing reference section? '11' on line 831 looks like a reference

  -- Missing reference section? '12' on line 834 looks like a reference

  -- Missing reference section? '13' on line 836 looks like a reference


     Summary: 6 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Draft                                               S. Wenger
2	Document: draft-wenger-avt-rtp-jvt-01.txt                M. Hannuksela
3	Expires: December 2002                                  T. Stockhammer
4	                                                             June 2002
5	                                                 Expires December 2002

7	                   RTP Payload Format for JVT Video

9	Status of this Memo

11	This document is an Internet-Draft and is in full conformance with all
12	provisions of Section 10 of RFC2026.  Internet-Drafts are working
13	documents of the Internet Engineering Task Force (IETF), its areas, and
14	its working groups.  Note that other groups may also distribute working
15	documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time.  It is inappropriate to use Internet-Drafts as reference material
20	or to cite them other than as "work in progress."

22	The list of current Internet-Drafts can be accessed at
23	http://www.ietf.org/1id-abstracts.txt

25	The list of Internet-Draft Shadow Directories can be accessed at
26	http://www.ietf.org/shadow.html

28	Abstract

30	   This memo describes an RTP Payload format for the JVT codec.  This
31	   codec is designed as a joint project of the ITU-T SG 16 VCEG, and
32	   the ISO/IEC JTC1/SC29/WG11 MPEG groups.  The most up-to-date draft
33	   of the video codec was specified in early May 2002, is due for
34	   revision in late July 2002, and is available for public review [2].

36	1. The JVT codec

38	   This memo specifies an RTP payload format for a new video
39	   codec that is currently under development by the Joint Video Team
40	   (JVT), which is formed of video coding experts of MPEG and the ITU-
41	   T.  After the likely approval by the two parent bodies, the codec
42	   specification will have the status of an ITU-T Recommendation
43	   (likely H.264) and become part of the MPEG-4 specification (ISO/IEC
44	   14496 Part 10).  The current project timeline of the JVT project is
45	   such that a technically frozen specification (pending bug fixes) is
46	   expected in July 2002 in the form of an ISO/IEC Final Committee
47	   Draft (FCD).  Before JVT was formed in late 2001, this project used
48	   the ITU-T project name H.26L and the JVT project inherited all the
49	   technical concepts of the H.26L project.

51	   The JVT video codec has a very broad application range that covers
52	   the whole range from low bit rate Internet Streaming applications to
53	   HDTV broadcast and Digital Cinema applications with near loss-less
54	   coding.  Most, if not all, relevant companies in all of these fields
55	   (including TV broadcast) have participated in the standardization,
56	   which gives hope that this wide application range is more than an
57	   illusion and may materialize, probably in a relatively short time
58	   frame.  The overall performance of the JVT codec is such that bit
59	   rate savings of 50% or more, compared to the current state of
60	   technology, are reported.  Digital Satellite TV quality, for
61	   example, was reported to be achievable at 1.5 Mbit/s, compared to
62	   the current operation point of MPEG 2 video at around 3.5 Mbit/s
63	   [1].

65	   The codec specification [2] itself distinguishes between a video
66	   coding layer (VCL), and a network abstraction layer (NAL).  The VCL
67	   contains the signal processing functionality of the codec, things
68	   such as transform, quantization, motion search/compensation, and the
69	   loop filter.  It follows the general concept of most of today's
70	   video codecs: a macroblock-based coder that utilizes inter-picture
71	   prediction with motion compensation, and transform coding of the
72	   residual signal.  The VCL outputs slices: each slice is a bit string
73	   that contains the macroblock data of an integer number of
74	   macroblocks, and the information of the slice header (containing the
75	   spatial address of the first macroblock in the slice, the initial
76	   quantization parameter, and similar).  Macroblocks in slices are
77	   ordered in scan order unless a different macroblock allocation is
78	   specified, using the so-called Flexible Macroblock Ordering syntax.
79	   In-picture prediction is used only within a slice.

81	   The NAL encapsulates the slice output of the VCL into Network
82	   Abstraction Layer Units (NALUs), which are suitable for the
83	   transmission over packet networks or the use in packet oriented
84	   multiplex environments.  JVT's Annex B defines an encapsulation
85	   process to transmit such NALUs over byte-stream oriented networks.
86	   In the scope of this memo Annex B is not relevant.

88	   Neither VCL nor NAL is claimed to be media or network independent -
89	   the VCL needs to know transmission characteristics in order to
90	   appropriately select the error resilience strength, slice size,
91	   etc., whereas the NAL needs information like the importance of a bit
92	   string provided by the VCL to select the appropriate application
93	   layer protection.

95	   Internally, the NAL uses NAL Units or NALUs.  A NALU consists of a
96	   one-byte header and the payload byte string.  The header also serves
97	   as the RTP payload header and indicates the type of the NALU, the
98	   (potential) presence of bit errors in the NALU payload, and whether
99	   this NALU is required for maintaining the synchronicity of the
100	   encoder/decoder loops.  This RTP payload
101	   specification is designed to be unaware of the bit string in the
102	   NALU payload.

104	   One of the main properties of the JVT codec is its support for
105	   Reference Picture Selection.  For each macroblock the
106	   reference picture to be used can be selected independently.  The
107	   reference pictures may be used in a first-in, first-out fashion, but
108	   it is also possible to handle the reference picture buffers
109	   explicitly.  A consequence of this new feature (it was available
110	   before only in H.263++ [3]) is the complete decoupling of the
111	   transmission time, the decoding time, and the sampling or
112	   presentation time of slices and pictures.  For this reason, the
113	   handling of the RTP timestamp requires some special considerations
114	   for those NALUs for which the sampling or presentation time is not
115	   defined, or, at transmission time, unknown.

117	2. Status of JVT, and Changes relative to the -00 version

119	   [This section will be removed in a future version of this draft.]

121	2.1. Status of the JVT standardization, and recent changes to JVT

123	   Since the last draft, JVT has met and a new JVT working draft was
124	   produced.  This JVT working draft is currently in the first stage of
125	   the ISO/IEC approval process, the ballot on the so-called Committee
126	   Draft.  Procedural provisions are taken by interested ISO/IEC
127	   members to ensure that changes relative to this draft are still
128	   possible, even after the ballot.

130	   The meeting brought many changes to the VCL, which have no
131	   direct influence on this memo.  However, numerous changes were also
132	   introduced to the NAL.  They to some extent break the clean design
133	   of the NAL as it was presented at the Minneapolis IETF, in order to
134	   save bits in a byte stream environment.  This memo reflects the
135	   current JVT working draft, but please see the following section on
136	   our expectations regarding future changes of the NAL design.

138	   The main changes of the JVT NAL relative to the pre-Fairfax design
139	   are as follows:

141	   - Introduction of a picture header
142	   - A means to carry redundant copies of the picture header
143	   - Addition of a "Disposable Flag" to the NALU type.

145	   - Addition of many more slice types to the NALU type (were 8, now 30)

147	   The next JVT meeting will take place in Klagenfurt, Austria, in the
148	   week after the Japan IETF.  This will be the last meeting in which
149	   significant changes (anything but bug fixes) can be made.

151	2.2. Authors' comments and expectation regarding JVT NAL design

153	   The authors deem many of the changes to the NAL as technically
154	   problematic, and are working within JVT to fix the freshly
155	   introduced and, from the RTP point-of-view, problematic features.
156	   The re-introduction of the picture header concept will lead to an
157	   undesirable overhead in packet network environments, by making
158	   mechanisms such as header repetition necessary.  It also breaks the
159	   clean Parameter Set concept, making it easier for people to take
160	   shortcuts.

162	   We know that we can show that the number of bits that can be saved
163	   in a byte stream environment through the picture header concept is
164	   negligible, and insignificant when compared to the problems the
165	   packet world has with this concept.  We are confident that we can
166	   replace the picture header mechanism with something like a
167	   hierarchical Parameter Set concept.

169	   If we can convince JVT to go back to the clean JVT NAL design, the
170	   number of NALU types (30, plus one for the aggregation packets now)
171	   would go down to something more reasonable, freeing codepoint
172	   space for future extensions.  Otherwise, the draft will require
173	   language that recommends the amount of redundant picture header data
174	   to be sent.

176	2.3. Changes relative to draft-wenger-avt-rtp-jvt-00.txt

178	   This memo reflects the current JVT WD, and hence required alignment
179	   with this draft.  In addition to editorial changes (mostly to
180	   reflect the changed terminology in the JVT draft), the discussion of
181	   the NAL unit types was aligned.

183	   As a response to the last IETF meeting's request, the RTP timestamp
184	   is now the sampling/presentation timestamp.  (It is unclear to us
185	   how to distinguish between the two).

187	   The RTP clock is now fixed at 90 kHz.

189	   Compound Packets are renamed to Aggregation Packets.

191	   Since the timestamp now carries vital information, a second type of
192	   an aggregation packet is necessary.  The compound packet of draft-
193	   wenger-avt-rtp-jvt-00.txt can now be used only to aggregate packets
194	   that share the same RTP timestamp, and is now called Single-Time
195	   Aggregation Packet (STAP).  Usually, this packet type can only be
196	   used to aggregate packets belonging to the same picture.  The second
197	   aggregated packet type adds a 16-bit timestamp offset to the
198	   aggregated packet data structure for each of the aggregated NALUs,
199	   and is called Multi-Time Aggregation Packet (MTAP).  At a 90 kHz
200	   clock, this packet type can aggregate NALUs that are up to roughly
201	   0.7 seconds apart.  It is believed that such a distance is a good
202	   compromise between the requirements of the streaming industry (they
203	   want to packetize NALUs belonging to more than one picture into one
204	   packet) and the overhead constraints (16 bits per NALU).  See
205	   section 11 (Open issues) for a more flexible concept.
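
	   The reach of the 16-bit timestamp offset can be checked with a
	   little arithmetic.  The following is only an illustrative sketch:
	   the constant and function names are hypothetical, and a constant
	   frame rate is assumed for the second helper.

```python
# Illustrative arithmetic only; all names here are hypothetical.
RTP_CLOCK_HZ = 90_000          # RTP clock rate fixed by this memo
MAX_OFFSET_TICKS = 2**16 - 1   # largest value of a 16-bit unsigned offset


def max_span_seconds(clock_hz: int = RTP_CLOCK_HZ) -> float:
    """Largest time distance expressible by the 16-bit offset, in seconds."""
    return MAX_OFFSET_TICKS / clock_hz


def max_picture_intervals(frame_rate: float, clock_hz: int = RTP_CLOCK_HZ) -> int:
    """Whole picture intervals covered by the offset (constant frame rate assumed)."""
    ticks_per_picture = clock_hz / frame_rate
    return int(MAX_OFFSET_TICKS // ticks_per_picture)
```

	   Under these assumptions the offset spans about 0.73 seconds, i.e.
	   21 picture intervals at 30 pictures per second.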

207	   In the JVT meeting a "Disposable Flag" was introduced in the NALU
208	   header.  That bit is documented here as well.

210	3. Scope

212	   This payload specification can only be used to carry the "naked" JVT
213	   NALU stream over RTP.  Likely, the first applications of a Standards
214	   Track RFC resulting from this draft will be in the conversational
215	   multimedia field, video telephone or video conference.  The draft is
216	   not intended for the use in conjunction with the Byte Stream format
217	   of Annex B of the JVT working draft, the MPEG 4 system layer [4] or
218	   other multiplexing schemes.

220	4. NAL basics

222	   Tutorial information on the NAL design can be found in [5] and
223	   [6].  For the precise definition of the NAL, see [2].
224	   This section tries to provide a very short overview of the concepts
225	   used.

227	4.1. Parameter Set Concept

229	   One very fundamental design concept of the JVT codec is to generate
230	   self-contained packets, to make mechanisms such as the header
231	   duplication of RFC2429 [7] or MPEG-4's HEC [8] unnecessary.  (Please
232	   see section 2.2 regarding the authors' opinion on the Picture
233	   header.)  This was achieved by decoupling information
234	   that is relevant for more than one slice from the media stream.
235	   This higher layer meta information should be sent reliably and
236	   asynchronously from the RTP packet stream that contains the slice
237	   packets.  The combination of the higher level parameters is called a
238	   Parameter Set.  The Parameter Set contains information such as

240	     o picture size,
241	     o display window,
242	     o optional coding modes employed,
243	     o and others.

245	   In order to be able to change picture parameters (such as the
246	   picture size) without needing to transmit Parameter Set
247	   updates synchronously to the slice packet stream, the encoder and
248	   decoder can maintain a list of more than one Parameter Set.  Each
249	   slice header contains a codeword that indicates the Parameter Set to
250	   be used.
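
	   The list of Parameter Sets described above can be pictured as a
	   small table indexed by the codeword carried in the slice header.
	   The sketch below is an assumption-laden illustration: the class
	   names, the two example fields, and the integer identifiers are
	   hypothetical and are not taken from the JVT working draft.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    """Hypothetical subset of the higher-level parameters."""
    width: int
    height: int


class ParameterSetTable:
    """Holds several Parameter Sets so slices can switch between them."""

    def __init__(self) -> None:
        self._sets = {}

    def store(self, set_id: int, params: ParameterSet) -> None:
        # A new entry or an update may arrive asynchronously to the slices.
        self._sets[set_id] = params

    def lookup(self, set_id: int) -> ParameterSet:
        # Called with the codeword found in a slice header.
        try:
            return self._sets[set_id]
        except KeyError:
            raise LookupError(f"Parameter Set {set_id} not available") from None
```

	   A decoder built this way can change the picture size from one
	   slice to the next simply by referencing a different entry,
	   provided that entry was stored beforehand.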

252	   This mechanism allows the transmission of the Parameter Sets to be
253	   decoupled from the packet stream, so that they can be transmitted by
254	   external means, e.g. as a side effect of the capability exchange, or
255	   through a (reliable or unreliable) control protocol.  They may even
256	   never be transmitted at all, but be fixed by an application
257	   design specification.

259	   Although, conceptually, the Parameter Set updates are not designed
260	   to be sent in the synchronous packet stream, this memo contains a
261	   means to convey them in the RTP packet stream.

263	4.2. Network Abstraction Layer Unit (NALU) Types

265	   Every NALU starts with a single NALU type octet, which also serves
266	   as the payload header.  The payload of the NALU follows immediately.

268	   The NALU type octet has the following format:

270	   +---------------+
271	   |0|1|2|3|4|5|6|7|
272	   +-+-+-+-+-+-+-+-+
273	   |E|  Type   |P|D|
274	   +---------------+

276	   E: 1 bit
277	      The Error Indication bit, when cleared, assures a bit-error free
278	      payload of the NALU and of the NALU type octet.  When set, the
279	      decoder is advised that bit errors may be present in the payload
280	      or in the NALU type octet.  A prudent reaction of decoders that
281	      are incapable of handling bit errors is to discard such packets.

283	   Type: 5 bits
284	      The NAL Unit payload type as defined in table 8.2 of [2].

286	   P: 1 bit
287	      Picture Header Flag.  Indicates the presence of a Picture Header
288	      at the beginning of the payload.

290	   D: 1 bit
291	      The Disposable Flag indicates that the payload of the NALU, after
292	      decoding, will not be used for future prediction.  Hence, the
293	      decoder and/or media aware network elements can discard such
294	      packets without hurting the codec performance or starting error
295	      propagation due to predicted coding.  However, the user
296	      experience will suffer (most likely due to lower frame rates).
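
	   The field layout above can be sketched as a pair of helpers.
	   This is an illustration, not a normative parser: it assumes the
	   usual IETF diagram convention that bit 0 is the most significant
	   bit of the octet, and the names NaluHeader, parse_nalu_header and
	   build_nalu_header are hypothetical.

```python
from typing import NamedTuple


class NaluHeader(NamedTuple):
    error: bool           # E: bit errors may be present
    nalu_type: int        # Type: 5-bit NAL Unit payload type
    picture_header: bool  # P: Picture Header present
    disposable: bool      # D: not used for future prediction


def parse_nalu_header(octet: int) -> NaluHeader:
    """Split the NALU type octet into its fields (bit 0 assumed to be the MSB)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("NALU type octet must fit in one byte")
    return NaluHeader(
        error=bool(octet & 0x80),
        nalu_type=(octet >> 2) & 0x1F,
        picture_header=bool(octet & 0x02),
        disposable=bool(octet & 0x01),
    )


def build_nalu_header(header: NaluHeader) -> int:
    """Inverse of parse_nalu_header."""
    if not 0 <= header.nalu_type <= 0x1F:
        raise ValueError("Type is a 5-bit field")
    return (
        (int(header.error) << 7)
        | (header.nalu_type << 2)
        | (int(header.picture_header) << 1)
        | int(header.disposable)
    )
```

	   A media aware network element could, for example, inspect only
	   the disposable field of the parsed header when deciding which
	   packets to drop under congestion.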

298	   For a reference of all currently defined NALU types and their
299	   semantics please see section 8.2 in [2].  Because we anticipate
300	   significant changes to this table, only a few remarks on those NALU
301	   types shall be provided here.

303	   NAL Units of the type X Picture Header (where X is Intra, Inter, B,
304	   SI, or SP) indicate a payload that consists of a picture header of
305	   the indicated type.

307	   All NAL Unit types called X slice contain exactly one coded slice of
308	   the specified type.  In some cases it is also assured that not only
309	   this slice, but also all other slices of the coded picture are of
310	   the same slice type.  This can help the resource allocation process
311	   at the decoder.  An instantaneous decoder refresh picture (IDER
312	   picture) is an I or SI picture that can be used as a random access
313	   point.

315	   NAL units of the types DPB and DPC carry Data Partitions
316	   consisting only of Intra and Inter CBPs and coefficients.

318	   The Supplemental Enhancement Information type (SEI) is used to carry
319	   metadata that is not necessary to keep the loops in encoder and
320	   decoder synchronized.  A prime example of SEI information is the
321	   presentation time in networks that do not have a time property
322	   comparable to the RTP timestamp.

324	   Parameter Set Information NALUs (PSIs) are used to carry new
325	   Parameter Sets or updates to previous Parameter Sets.  Normally, the
326	   transmission and update of Parameter Sets is a function of a control
327	   protocol and, hence, PSIs SHOULD NOT be used in such systems where
328	   adequate protocol support is available.  However, there are
329	   applications where the packet stream has to be self-contained.  In
330	   such cases PSIs MAY be used.  Severe synchronization problems
331	   between the RTP stream containing PSIs and control protocol messages
332	   can occur if PSIs and control protocol messages are used in the same
333	   RTP session.  For this reason, PSIs MUST NOT be used in an RTP
334	   session whose Parameter Sets were already changed by control
335	   protocol messages during the lifetime of the RTP session.
336	   Similarly, control protocol messages MUST NOT be used that affect
337	   any RTP session on which at least one PSI was sent.
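
	   The two MUST NOT rules above amount to a per-session exclusion
	   between in-band PSIs and control protocol updates.  A minimal
	   sketch of such a check follows; the class and method names are
	   hypothetical, and it assumes that every call to
	   apply_control_message represents a Parameter Set change made
	   during the lifetime of the RTP session.

```python
class ParameterSetGuard:
    """Tracks, per RTP session, which Parameter Set update path is in use."""

    def __init__(self) -> None:
        self.changed_by_control = False  # a control message changed the sets
        self.psi_sent = False            # at least one PSI was sent in-band

    def send_psi(self) -> None:
        """Record an in-band PSI, or refuse if control messages were used."""
        if self.changed_by_control:
            raise RuntimeError("PSIs MUST NOT be used in this RTP session")
        self.psi_sent = True

    def apply_control_message(self) -> None:
        """Record a control protocol change, or refuse if a PSI was sent."""
        if self.psi_sent:
            raise RuntimeError("control messages MUST NOT affect this RTP session")
        self.changed_by_control = True
```

	   Whichever path is used first stays usable for the remainder of
	   the session; the other one is rejected from then on.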

339	   The Parameter Set mechanism is designed to decouple the transmission
340	   of picture/GOP/sequence header information from the picture data
341	   that is composed of the other NALU types.  To successfully decode a
342	   picture, all Parameter Sets (referenced by the slice Header) need to
343	   be available.  Hence, the PSIs (when used) SHOULD be conveyed
344	   significantly before their content is first referenced.

346	4.3. Aggregation Packets
347	   Aggregation packets are the packet aggregation scheme of this
348	   payload specification.  The scheme is introduced to reflect the
349	   dramatically different MTU sizes of two target networks -- wireline
350	   IP networks (with an MTU size that is often limited by the Ethernet
351	   MTU size -- roughly 1500 bytes), and IP or non-IP (e.g. H.324/M)
352	   based wireless networks with preferred transmission unit sizes of
353	   254 bytes or less.  Aggregation makes media transcoding between
354	   the two worlds unnecessary and avoids undesirable packetization
355	   overhead.

357	   Two types of Aggregation packets are defined by this specification:

359	   o Single-Time Aggregation Packets (STAPs) aggregate NALUs with
360	     identical NALU-time.
361	   o Multi-Time Aggregation Packets (MTAPs) aggregate NALUs with
362	     potentially differing NALU-time.

364	   The term NALU-time is defined as the value the RTP timestamp would
365	   have if that NALU were transported in its own RTP packet.

367	   MTAP and STAP share the following packetization rules:

369	   The disposable flag MUST be set if it is set in all aggregated
370	   NALUs, otherwise it MUST be cleared.  The Type field of the NALU
371	   type octet MUST be zero.  The E bit MUST be cleared if all E bits of
372	   the aggregated NALUs are zero, otherwise it MUST be set.

374	   For MTAPs and STAPs (identified by type = 0 in the NALU type byte)
375	   the Picture Header flag is overloaded with a new semantic.  A zero
376	   in the Picture Header flag indicates a STAP, a one indicates an
377	   MTAP.
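
	   The shared rules for the NALU type octet of an aggregation packet
	   can be sketched as follows.  The function name is hypothetical,
	   and the bit positions assume that bit 0 of the octet diagram in
	   section 4.2 is the most significant bit.

```python
def aggregation_header(nalu_type_octets, multi_time: bool) -> int:
    """Derive the NALU type octet of an STAP or MTAP.

    nalu_type_octets: the NALU type octets of all aggregated NALUs.
    multi_time: True for an MTAP, False for an STAP.
    """
    octets = list(nalu_type_octets)
    if not octets:
        raise ValueError("an aggregation packet carries at least one NALU")
    error = any(o & 0x80 for o in octets)       # E set if any E bit is set
    disposable = all(o & 0x01 for o in octets)  # D set only if all D bits are set
    header = 0                                  # Type field is zero
    if error:
        header |= 0x80
    if multi_time:
        header |= 0x02                          # P flag overloaded: 1 means MTAP
    if disposable:
        header |= 0x01
    return header
```

	   Note that a single non-disposable NALU clears the D bit of the
	   whole packet, whereas a single NALU with possible bit errors sets
	   its E bit.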

379	   The Marker bit in the RTP header MUST be set to the value the marker
380	   bit of the last NALU of the aggregated packet would have if it were
381	   transported in its own RTP packet.

383	   The NALU Payload of an aggregation packet consists of one or more
384	   aggregation units.  See sections 4.3.1 and 4.3.2 for the two
385	   different types of aggregation units.  An aggregation packet can
386	   carry as many aggregation units as necessary; however, the total
387	   amount of data in an aggregation packet obviously MUST fit into an
388	   IP packet, and the size SHOULD be chosen such that the resulting IP
389	   packet is smaller than the MTU size.

391	4.3.1. Single-Time Aggregation Packet

393	   A Single-Time Aggregation Packet (STAP) SHOULD be used when
394	   aggregating NALUs that share the same NALU-time.  The Picture Header
395	   Flag MUST be set to zero in order to distinguish an STAP from an
396	   MTAP.

398	   The NALU payload of an STAP consists of Single-Picture Aggregation
399	   units.

401	   A Single-Picture Aggregation Unit consists of 16-bit unsigned size
402	   information that indicates the size of the following NALU in bytes
403	   (excluding these two octets, but including the NALU type octet of
404	   the NALU), followed by the NALU itself including its NALU type
405	   byte.
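
	   A sketch of how the aggregation units of an STAP might be written
	   and read is given below.  It covers only the bytes that follow
	   the STAP's own NALU type octet.  The function names are
	   hypothetical, and network byte order is assumed for the 16-bit
	   size field, which this memo does not state explicitly.

```python
import struct


def pack_stap_payload(nalus) -> bytes:
    """Concatenate NALUs, each preceded by its 16-bit size (big-endian assumed)."""
    out = bytearray()
    for nalu in nalus:
        if not 1 <= len(nalu) <= 0xFFFF:
            raise ValueError("NALU size must fit the 16-bit size field")
        out += struct.pack("!H", len(nalu)) + nalu
    return bytes(out)


def unpack_stap_payload(payload: bytes) -> list:
    """Split an STAP payload back into NALUs, each including its type octet."""
    nalus = []
    pos = 0
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if size == 0 or pos + size > len(payload):
            raise ValueError("aggregation unit exceeds the payload")
        nalus.append(payload[pos:pos + size])
        pos += size
    return nalus
```

	   The size counts the NALU type octet of the aggregated NALU but
	   not the two size octets themselves, as specified above.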

407	4.3.2. Multi-Time Aggregation Packet (MTAP)

409	   An MTAP has an architecture similar to that of an STAP.  It consists
410	   of the NALU header byte and one or more Multi-Picture Aggregation
411	   Units.  The Picture Header flag in the MTAP NALU type byte is set to
412	   1 to distinguish an MTAP from an STAP.

414	   This Memo does not specify how the NALUs within an MTAP are
415	   ordered.  In most cases, the natural "decoding order" SHOULD be
416	   used, in particular in conjunction with bi-predicted pictures that
417	   use a forward reference picture.  However, all other NALU ordering
418	   schemes that are legal in JVT video MAY be used as well.

420	   A Multi-Picture Aggregation Unit consists of 16 bits of unsigned size
421	   information for the following NALU (same as the size information
422	   in the STAP).  These 16 bits are followed by 16 bits of timing
423	   information for this NALU.  The timing information field MUST be set
424	   so that the RTP timestamp of an RTP packet of each NALU in the MTAP
425	   (the NALU-time) can be generated by subtracting the timing
426	   information from the RTP timestamp of the MTAP.

428	   For the "latest" Multi-Picture Aggregation Unit in an MTAP the
429	   timing offset MUST be zero.  Hence, the RTP timestamp of the MTAP
430	   itself is identical to the latest NALU-time.
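
	   The timing rule can be illustrated with a small sketch that
	   writes and reads Multi-Picture Aggregation Units.  Several points
	   are assumptions rather than statements of this memo: both 16-bit
	   fields are taken to be in network byte order, the size is taken
	   to count only the NALU (not the timing field), and the
	   subtraction is done modulo 2^32 to follow RTP timestamp
	   wrap-around.  The function names are hypothetical.

```python
import struct


def pack_mtap_payload(units, mtap_timestamp: int) -> bytes:
    """units: iterable of (nalu_time, nalu_bytes); returns the unit bytes."""
    out = bytearray()
    for nalu_time, nalu in units:
        offset = (mtap_timestamp - nalu_time) & 0xFFFFFFFF
        if offset > 0xFFFF:
            raise ValueError("NALU-time too far from the MTAP timestamp")
        if not 1 <= len(nalu) <= 0xFFFF:
            raise ValueError("NALU size must fit the 16-bit size field")
        out += struct.pack("!HH", len(nalu), offset) + nalu
    return bytes(out)


def unpack_mtap_payload(payload: bytes, mtap_timestamp: int) -> list:
    """Return (nalu_time, nalu_bytes) pairs recovered from an MTAP payload."""
    units = []
    pos = 0
    while pos < len(payload):
        if pos + 4 > len(payload):
            raise ValueError("truncated aggregation unit header")
        size, offset = struct.unpack_from("!HH", payload, pos)
        pos += 4
        if size == 0 or pos + size > len(payload):
            raise ValueError("aggregation unit exceeds the payload")
        # NALU-time = MTAP timestamp minus the timing information.
        nalu_time = (mtap_timestamp - offset) & 0xFFFFFFFF
        units.append((nalu_time, payload[pos:pos + size]))
        pos += size
    return units
```

	   When the MTAP timestamp equals the latest NALU-time, the unit
	   carrying that NALU ends up with a timing offset of zero, as
	   required above.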

432	5. RTP Packetization Process

434	   The RTP packetization process of the JVT codec is straightforward
435	   and follows the general principles outlined in RFC1889.  When using
436	   one NALU per RTP packet, the RTP payload consists of the bit buffer
437	   containing the NALU.  The RTP payload (and the settings for some RTP
438	   header bits) for aggregation packets was already defined in section
439	   4.3 above.  There is no specific RTP payload header -- the NALU type
440	   byte doubles as the payload header.  The RTP header information is
441	   set as follows:

443	   Timestamp: 32 bits
444	      The RTP timestamp is set to the presentation/sampling timestamp
445	      of the content.  If the NALU has no timing properties of its own
446	      (e.g. PSIs, SEI), or if the presentation/sampling time is unknown,
447	      the RTP timestamp is set to the RTP timestamp of the last
448	      transmitted RTP packet in the session.  The setting of the RTP
449	      Timestamp for MTAPs is defined in section 4.3.2 above.

451	   Marker bit (M): 1 bit
452	      Set for the very last packet of the picture indicated by the RTP
453	      timestamp, in line with the normal use of the M bit and to allow
454	      an efficient playout buffer handling.  Decoders MAY use this bit
455	      as an early indication of the last packet of a coded picture, but
456	      MUST NOT rely on this property because the last packet of the
457	      picture may get lost, and because the use of MTAPs does not
458	      always preserve the M bit.

460	   Sequence No (Seq): 16 bits
461	      Increased by one for each sent packet.  Set to a random value
462	      during startup as per RFC1889.

464	   Version (V): 2 bits
465	      set to 2

467	   Padding (P): 1 bit
468	      set to 0

470	   Extension (X): 1 bit
471	      set to 0

473	   Payload Type (PT): 7 bits
474	      established dynamically during connection establishment

476	   All other RTP header fields are set as per RFC1889.
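
	   As an illustration of the settings above, the 12-octet fixed RTP
	   header of RFC1889 could be assembled as sketched below.  In that
	   header the payload type occupies seven bits, next to the marker
	   bit.  The function name and its parameters are hypothetical, and
	   a CSRC count of zero is assumed.

```python
import struct


def pack_rtp_header(payload_type: int, seq: int, timestamp: int,
                    ssrc: int, marker: bool = False) -> bytes:
    """Build the fixed RTP header with V=2, P=0, X=0 and CC=0 (assumed)."""
    if not 0 <= payload_type <= 0x7F:
        raise ValueError("payload type is a 7-bit field")
    byte0 = 2 << 6                                  # V=2, P=0, X=0, CC=0
    byte1 = (0x80 if marker else 0) | payload_type  # M bit, then PT
    return struct.pack(
        "!BBHII",
        byte0,
        byte1,
        seq & 0xFFFF,            # 16-bit sequence number, wraps around
        timestamp & 0xFFFFFFFF,  # 32-bit timestamp on the 90 kHz clock
        ssrc & 0xFFFFFFFF,
    )
```

	   The RTP payload, starting with the NALU type octet, would follow
	   these twelve octets directly.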

478	6. Packetization Rules

480	   Two sets of packetization rules are distinguished by whether
481	   packets belonging to more than a single picture may be put into a
482	   single aggregated packet (using STAPs or MTAPs).

484	6.1. Unrestricted Mode (Multiple Picture Model)

486	   This mode MAY be supported by some receivers.  Usually, the
487	   capability of a receiver to support this mode is indicated by one of
488	   the profiles of the JVT codec (this is not yet defined in [2]). The
489	   following packetization rules MUST be enforced by the sender:

491	   o Single slice packets belonging to the same picture (and hence
492	     sharing the same RTP timestamp value) MAY be sent in any order,
493	     although, for delay critical systems, they SHOULD be sent in their
494	     original coding order to minimize the delay.  Note that the coding
495	     order is not necessarily the scan order, but the order the NAL
496	     packets become available to the RTP stack.

498	   o Both MTAPs and STAPs MAY be used.

500	   o SEI packets MAY be sent anytime.

502	   o PSIs MUST NOT be sent in an RTP session whose Parameter Sets were
503	     already changed by control protocol messages during the lifetime
504	     of the RTP session.  If PSIs are allowed by this condition, they
505	     MAY be sent at any time.

507	   o All NALU types MAY be mixed freely, provided that the above
508	     rules are obeyed.  In particular, it is allowed to mix slices in
509	     data-partitioned and single-slice mode.

511	   o Network elements MAY convert multiple RTP packets carrying
512	     individual NALUs into one aggregated RTP packet, convert an
513	     aggregated RTP packet into several RTP packets carrying individual
514	     NALUs, or mix both concepts.  However, when doing so they SHOULD
515	     take into account at least the following parameters: path MTU
516	     size, unequal protection mechanisms (e.g. through packet
517	     duplication, packet-based FEC carried by RFC2198, especially for
518	     header and Type A Data Partitioning packets), bearable latency of
519	     the system, and buffering capabilities of the receiver.

521	   o NALUs of all types MAY be conveyed as aggregation units of an STAP
522	     or MTAP rather than individual RTP packets.  Special care SHOULD
523	     be taken (particularly in gateways) to avoid more than a single
524	     copy of identical NALUs in a single STAP/MTAP in order to avoid
525	     unnecessary data transfers without any improvement in QoS.

527	6.2. Restricted Mode (Single Picture Model)

529	   This mode MUST be supported by all receivers.  It is primarily
530	   intended for low delay applications.  Its main difference from the
531	   Unrestricted Mode is to forbid the packetization of data belonging
532	   to more than one picture in a single RTP packet.  Hence, MTAPs MUST
533	   NOT be used.  The following packetization rules MUST be enforced by
534	   the sender:

536	   o All rules of the Unrestricted Mode above apply, with the
537	     following restriction, which replaces the rule on MTAPs:

539	   o Only STAPs MAY be used; MTAPs MUST NOT be used.  This implies that
540	     aggregated packets MUST NOT include slices or data partitions
541	     belonging to different pictures.
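
   The mode rules above can be sketched as a sender-side check.  This
   is a minimal, non-normative sketch: the Nalu record, the packet kind
   labels ("single", "STAP", "MTAP") and the function check_packet are
   hypothetical names introduced for illustration only.  It assumes
   that an STAP carries NALUs of a single RTP timestamp.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Nalu:
    """Hypothetical NALU record; only the fields the check needs."""
    rtp_timestamp: int   # timestamp of the picture the NALU belongs to
    payload: bytes


def check_packet(kind: str, nalus: List[Nalu], restricted: bool) -> List[str]:
    """Return a list of rule violations for one outgoing RTP packet.

    kind is "single", "STAP" or "MTAP" (labels assumed for this sketch).
    """
    problems = []
    if kind == "single" and len(nalus) != 1:
        problems.append("single NALU packet must carry exactly one NALU")
    if restricted and kind == "MTAP":
        problems.append("MTAPs MUST NOT be used in Restricted Mode")
    if kind == "STAP" and len({n.rtp_timestamp for n in nalus}) > 1:
        problems.append("STAP mixes NALUs of different pictures")
    # SHOULD-level rule: no identical NALU twice in one aggregation packet.
    if len(set(nalus)) != len(nalus):
        problems.append("duplicate NALU inside one aggregation packet")
    return problems
```

   An empty result means that none of the checked rules is violated;
   the rules on PSIs and control protocol messages are outside the
   scope of this sketch.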

543	7. De-Packetization Process

545	   The de-packetization process is implementation dependent.  Hence,
546	   the following description should be seen as an example of a suitable
547	   implementation.  Other schemes MAY be used as well.  Optimizations
548	   relative to the described algorithms are likely possible.

550	   The general concept behind these de-packetization rules is to
551	   collect all packets belonging to a picture, bring them into a
552	   reasonable order, discard anything that is unusable, and pass the
553	   rest to the decoder.  Aggregation packets are handled by unloading
554	   their payload into individual RTP packets carrying NALUs.  Those
555	   NALUs are processed as if they were received in separate RTP
556	   packets, in the order they were arranged in the Aggregation Packet.

558	   The following de-packetization rules MAY be used to implement an
559	   operational JVT de-packetizer:

561	   o NALUs are presented to the JVT decoder in the order of the
562	     RTP sequence number.

564	   o NALUs carried in an Aggregation Packet are presented in their
565	     order in the Aggregation packet.  All NALUs of the Aggregation
566	     packet are processed before the next RTP packet is processed.

568	   o Intelligent RTP receivers (e.g. in Gateways) MAY identify lost
569	     DPAs. If a lost DPA is found, the Gateway MAY decide not to send
570	     the DPB and DPC partitions, as their information is meaningless
571	     for the JVT Decoder.  In this way a network element can reduce
572	     network load by discarding useless packets, without parsing a
573	     complex bit stream.

575	   o Intelligent receivers MAY discard all packets that have the
576	     Disposable Flag set.  However, they SHOULD process those packets
577	     if possible, because the user experience may suffer if the packets
578	     are discarded.
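
   The rules above can be sketched as follows.  This is one possible
   arrangement under stated assumptions, not a reference
   implementation: the Nalu and RtpPacket records, the kind labels and
   the slice_id field that links the partitions of one slice are all
   hypothetical, and sequence number wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Nalu:
    kind: str            # "slice", "DPA", "DPB", "DPC", ... (labels assumed)
    slice_id: int        # hypothetical key linking DPA/DPB/DPC of one slice
    disposable: bool = False


@dataclass
class RtpPacket:
    seq: int             # RTP sequence number (wrap-around ignored here)
    nalus: List[Nalu]    # one NALU, or several for an Aggregation Packet


def depacketize(packets: Iterable[RtpPacket],
                drop_disposable: bool = False) -> List[Nalu]:
    """Return the NALUs in the order they would be handed to the decoder."""
    ordered = sorted(packets, key=lambda p: p.seq)
    # Aggregation Packets are unloaded in place, keeping their inner order.
    nalus = [n for p in ordered for n in p.nalus]
    received_dpa = {n.slice_id for n in nalus if n.kind == "DPA"}
    out = []
    for n in nalus:
        if n.kind in ("DPB", "DPC") and n.slice_id not in received_dpa:
            continue     # DPA lost: DPB/DPC are meaningless to the decoder
        if drop_disposable and n.disposable:
            continue     # permitted, but may degrade the user experience
        out.append(n)
    return out
```

   A gateway could apply the same filtering before forwarding instead
   of before decoding.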

580	8. MIME Considerations

582	   This section is to be completed later.

584	9. Security Considerations

586	   So far, no security considerations beyond those of RFC1889 have been
587	   identified.

589	   Currently, the JVT CD does not allow carrying any type of active
590	   payload.  However, the inclusion of a "user data" mechanism is under
591	   consideration, which could potentially be used for mechanisms such
592	   as remote software updates of the video decoder and similar tasks.

594	10. Informative Appendix: Application Examples

596	   This payload specification is very flexible in its use, to cover the
597	   extremely wide application space that is anticipated for the JVT
598	   codec.  However, such a great flexibility also makes it difficult
599	   for an implementer to decide on a reasonable packetization scheme.
600	   Some information on how to apply this specification to real-world
601	   scenarios is likely to appear in the form of academic publications
602	   and a Test Model in the near future.  In the meantime, some
603	   preliminary usage scenarios are described here.

605	10.1. Video Telephony, no Data Partitioning, no packet aggregation

607	   The RTP part of this scheme is implemented and tested (though not
608	   the control-protocol part, see below).

610	   In most real-world video telephony applications, the picture
611	   parameters such as picture size or optional modes never change
612	   during the lifetime of a connection.  Hence, all necessary Parameter
613	   Sets (usually only one) are sent as a side effect of the capability
614	   exchange/announcement process.  An example of such a capability
615	   exchange with an SDP-like syntax can be found in [9], but other
616	   schemes such as ASN.1 are possible as well.  Since all necessary
617	   Parameter Set information is established before the RTP session
618	   starts, there is no need for sending any PSIs.  Data Partitioning is
619	   not used either.  Hence, the RTP packet stream consists basically of
620	   NALUs that carry single slices of video information.

622	   The size of those single-slice NALUs is chosen by the encoder such
623	   that they offer the best performance.  Often, this is done by
624	   adapting the coded slice size to the MTU size of the IP network.
625	   For small picture sizes this may result in a one-picture-per-one-
626	   packet strategy.  The loss of packets and the resulting drift-
627	   related artifacts are cleaned up by Intra refresh algorithms.

629	10.2. Video Telephony, Interleaved Packetization using Packet
630	Aggregation

632	   This scheme allows better error concealment and is widely used in
633	   H.263-based designs using RFC 2429 packetization.  It has also
634	   been implemented, and good results were reported [5].

636	   The source picture is coded by the VCL such that all MBs of one MB
637	   line are assigned to one slice.  All slices with even MB row
638	   addresses are combined into one STAP, and all slices with odd MB row
639	   addresses into another STAP.  Those STAPs are transmitted as RTP
640	   packets.  The establishment of the Parameter Sets is performed as
641	   discussed above.
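
   The assignment of slices to the two STAPs can be sketched as
   follows.  The function name interleave_rows is hypothetical, and
   the sketch assumes exactly one slice per MB row, identified by its
   row index.

```python
from typing import List, Tuple


def interleave_rows(num_mb_rows: int) -> Tuple[List[int], List[int]]:
    """Split MB row indices (one slice per row) into two STAP payloads."""
    even_stap = [row for row in range(num_mb_rows) if row % 2 == 0]
    odd_stap = [row for row in range(num_mb_rows) if row % 2 == 1]
    return even_stap, odd_stap


# CIF is 352x288 luma samples, i.e. 288 / 16 = 18 macroblock rows.
even_stap, odd_stap = interleave_rows(288 // 16)
```

   If one of the two STAPs is lost, every missing MB row still has its
   vertical neighbours in the other STAP, which is what makes
   concealment effective.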

643	   Note that the use of STAPs is essential here, because the high
644	   number of individual slices (18 for a CIF picture) would lead to
645	   unacceptably high IP/UDP/RTP header overhead (unless the source
646	   coding tool FMO is used, which is not assumed in this scenario).
647	   Furthermore, some wireless video transmission systems, such as
648	   H.324M and the IP-based video telephony specified in 3GPP, are
649	   likely to use relatively small transport packet size.  For example,
650	   a typical MTU size of H.223 AL3 SDU is around 100 bytes [10].
651	   Coding individual slices according to this packetization scheme
652	   provides a further advantage in communication between wired and
653	   wireless networks, as individual slices are likely to be smaller
654	   than the preferred maximum packet size of wireless systems.
655	   Consequently, a gateway can convert the STAPs used in a wired
656	   network to several RTP packets with only one NALU that are preferred
657	   in a wireless network and vice versa.
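
   The header overhead argument above can be made concrete with a
   small calculation.  The sketch assumes IPv4 without options, an RTP
   header without CSRC list or extension, and a picture rate of 15
   pictures per second; the picture rate is an illustrative assumption
   and the additional bytes of the aggregation headers are not counted.

```python
IPV4_HEADER = 20   # bytes, no options
UDP_HEADER = 8     # bytes
RTP_HEADER = 12    # bytes, no CSRC list or header extension


def header_overhead_bps(packets_per_picture: int,
                        pictures_per_second: float) -> float:
    """IP/UDP/RTP header bit rate, excluding any aggregation headers."""
    per_packet = IPV4_HEADER + UDP_HEADER + RTP_HEADER
    return packets_per_picture * per_packet * 8 * pictures_per_second


one_slice_per_packet = header_overhead_bps(18, 15)   # 18 slices of a CIF picture
two_staps = header_overhead_bps(2, 15)               # even-row and odd-row STAP
```

   Under these assumptions the headers alone consume 86400 bit/s with
   one slice per packet, compared to 9600 bit/s with two STAPs per
   picture.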

659	10.3. Video Telephony, with Data Partitioning

661	   This scheme is implemented and was shown to offer good performance
662	   especially at higher packet loss rates [5].
663	   Data Partitioning is known to be useful only when some form of
664	   unequal error protection is available.  Normally, in single-session
665	   RTP environments, even error characteristics are assumed --
666	   statistically, the packet loss probability of all packets of the
667	   session is the same.  However, there are means to reduce the packet
668	   loss probability of individual packets in an RTP session.  One
669	   simple way is known as Packet Duplication: simply send the to-be-
670	   protected packet twice, with the same sequence number.  If both
671	   packets survive, the receiver will assume a packet duplication by
672	   UDP and discard one of the two packets.  Other means of unequal
673	   protection within the same RTP session include the use of RFC 2198
674	   [11] (for this application it is essentially a packet duplication
675	   process as well, with some saved bytes for the second RTP header),
676	   or packet-based Forward Error Correction [12] carried in RFC2198.

678	   The implemented software uses the simple packet duplication process
679	   to reduce the loss probability of all DPA NALUs.  The incurred
680	   overhead is substantial, but of the same order of magnitude as the
681	   number of bits that would otherwise be spent on intra information.
682	   However, this mechanism does not add any delay to the system.
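
   Packet duplication as described above can be sketched as follows.
   The tuple layout of Packet and both function names are hypothetical;
   the receiver side ignores sequence number wrap-around and keeps an
   unbounded set of seen numbers, which a real implementation would
   have to limit.

```python
from typing import Iterable, Iterator, List, Tuple

Packet = Tuple[int, str, bytes]   # (RTP sequence number, NALU kind, payload)


def send_with_duplication(packets: Iterable[Packet]) -> Iterator[Packet]:
    """Emit every packet once and every DPA packet a second time."""
    for pkt in packets:
        yield pkt
        if pkt[1] == "DPA":
            yield pkt            # identical copy, same sequence number


def drop_duplicates(received: Iterable[Packet]) -> List[Packet]:
    """Receiver side: keep the first packet seen for each sequence number."""
    seen = set()
    unique = []
    for pkt in received:
        if pkt[0] not in seen:
            seen.add(pkt[0])
            unique.append(pkt)
    return unique
```

   The receiver needs no knowledge of the scheme: a surviving second
   copy looks like an ordinary duplicate and is dropped.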

684	   Again, the complete Parameter Set establishment is performed through
685	   control protocol means.

687	10.4. MPEG-2 Transport to RTP Gateway

689	   This example is not implemented completely, but the basic mechanisms
690	   are part of the interim file format the JVT group uses and, hence,
691	   well tested.

693	   When using JVT video in satellite/cable broadcast environments,
694	   there is no control protocol available that can be used for the
695	   transmission of Parameter Sets.  Furthermore, a receiver has to be
696	   able to "tune" into an ongoing packet stream at any time, without
697	   much delay and artifacts.  For this reason, PSIs that contain all
698	   Parameter Set information are included in the packet stream at every
699	   Instantaneous Decoder Refresh Point (which is similar to a Key Frame
700	   in earlier coding standards).  IDERP packets are used to signal
701	   these "key frames" so that a decoder can most easily determine where
702	   to start in its decoding process.

704	   Since the byte stream format used in satellite/cable broadcast
705	   environments does not include timing information in the video
706	   stream, the gateway needs to use external timing information (e.g.
707	   from the MPEG-2 system layer) to generate the RTP timestamp.  Please
708	   note that this timestamp is also a 90 kHz clock -- hence, in most
709	   cases, the conversion should be relatively simple.
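
   The conversion can be sketched as follows, assuming that the
   external timing information is the 33-bit, 90 kHz presentation time
   stamp (PTS) of the MPEG-2 system layer.  The function name and the
   gateway state (first_pts, rtp_offset) are hypothetical.

```python
RTP_TS_MODULUS = 1 << 32   # RTP timestamps are 32 bits wide
PTS_MODULUS = 1 << 33      # MPEG-2 PTS values are 33 bits wide


def pts_to_rtp_timestamp(pts: int, first_pts: int, rtp_offset: int) -> int:
    """Map a 90 kHz MPEG-2 PTS to a 90 kHz RTP timestamp.

    first_pts is the PTS of the first forwarded picture and rtp_offset the
    (ideally random) initial RTP timestamp; both are gateway state assumed
    for this sketch.
    """
    elapsed = (pts - first_pts) % PTS_MODULUS   # ticks since start, PTS wrap-safe
    return (rtp_offset + elapsed) % RTP_TS_MODULUS
```

   Because both clocks run at 90 kHz, no rate conversion is needed;
   only the different counter widths and the initial offset have to be
   handled.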

711	   The simplest possible MPEG-2 transport to RTP gateway could take the
712	   NALUs as they come from the MPEG-2 transport stream (after de-
713	   framing), and send them, each NALU in one RTP packet, with
714	   increasing RTP sequence numbers.  However, non-zero packet
715	   loss rates would lead to very poor performance of such a system.
716	   Instead, a Gateway could use the protection mechanisms discussed
717	   above to unequally protect the most important packets, e.g. all PSIs
718	   (very strong protection) and IDERPs (weak protection), and transmit
719	   everything else best effort.  The Gateway can do this without
720	   parsing the bit stream, by simply using the NALU type byte.
721	   A more sophisticated Gateway may be able to combine some small NALUs
722	   into a larger STAP or MTAP in order to save the bytes used for the
723	   IP/UDP/RTP headers.
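
   The classification by NALU type byte can be sketched as follows.
   The type codes below are placeholders and not the values assigned
   by this specification, and the sketch assumes that the whole first
   byte of the NALU identifies its type.

```python
# Hypothetical type codes: the real values are those assigned to each NALU
# type by this specification and are not reproduced here.
NALU_TYPE_PSI = 0x01
NALU_TYPE_IDERP = 0x02

PROTECTION = {
    NALU_TYPE_PSI: "strong",
    NALU_TYPE_IDERP: "weak",
}


def protection_class(nalu: bytes) -> str:
    """Choose a protection class from the first byte of the NALU only."""
    if not nalu:
        raise ValueError("empty NALU")
    return PROTECTION.get(nalu[0], "best-effort")
```

   The lookup touches a single byte per NALU, which illustrates why no
   bit stream parsing is required in the Gateway.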

725	   A similar mechanism is, of course, also possible in H.320 to RTP
726	   gateways.  Here, however, the system environment does not include
727	   any timing information, and exact presentation timing is carried in
728	   the form of SEIs.  Hence, in the H.320 to IP data path, the gateway
729	   has the additional duty to filter out SEIs containing timing
730	   information and to set the RTP timestamp of the following video
731	   packets accordingly.  In the reverse direction, SEIs need to be
732	   generated using the RTP timestamp as a guideline.

734	10.5. Low-Bit-Rate Streaming

736	   This scheme has been implemented with H.263 and gave good results
737	   [13].  There is no technical reason why similarly good results could
738	   not be achieved using the JVT codec.

740	   In today's Internet streaming, some of the offered bit-rates are
741	   relatively low in order to allow terminals with dial-up modems to
742	   access the content.  In wired IP networks, relatively large packets,
743	   say 500 - 1500 bytes, are preferred to smaller and more frequently
744	   occurring packets in order to reduce network congestion.  Moreover,
745	   use of large packets decreases the amount of RTP/UDP/IP header
746	   overhead.  For low-bit-rate video, the use of large packets means
747	   that sometimes a few pictures have to be encapsulated in one
748	   packet.

750	   However, loss of such a packet would have drastic consequences in
751	   visual quality, as there is practically no other way to conceal a
752	   loss of an entire picture than to repeat the previous one.  One way
753	   to construct relatively large packets and maintain possibilities for
754	   successful loss concealment is to construct MTAPs that contain
755	   slices from several pictures in an interleaved manner.  An MTAP
756	   should not contain spatially adjacent slices from the same picture
757	   or spatially overlapping slices from any picture.  If a packet is
758	   lost, it is likely that a lost slice is surrounded by spatially
759	   adjacent slices of the same picture and spatially corresponding
760	   slices of the temporally previous and succeeding pictures.
761	   Consequently, concealment of the lost slice is likely to succeed
762	   relatively well.
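
   One way to build such MTAPs can be sketched as follows.  This is
   an illustrative assignment, not the scheme evaluated in [13]: it
   assumes one slice per MB row, the same slice structure in every
   picture, and the hypothetical function name build_mtaps.

```python
from typing import Dict, List, Tuple


def build_mtaps(num_pictures: int,
                slices_per_picture: int) -> Dict[int, List[Tuple[int, int]]]:
    """Assign slice (picture, row) to MTAP number (picture + row) mod M.

    Uses M = max(num_pictures, 2) MTAPs, so that no MTAP holds two slices
    with the same row index or two neighbouring rows of one picture.
    """
    num_mtaps = max(num_pictures, 2)
    mtaps = {m: [] for m in range(num_mtaps)}
    for picture in range(num_pictures):
        for row in range(slices_per_picture):
            mtaps[(picture + row) % num_mtaps].append((picture, row))
    return mtaps
```

   For example, build_mtaps(3, 9) spreads three pictures of nine
   slices each over three MTAPs of nine slices each, so the loss of
   one MTAP removes only every third row of each picture.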

764	11. Open Issues
765	   There are several open issues on which the authors would like to
766	   receive opinions.  They are listed below.

768	   MTAPs: are they efficient enough?  And is a 16-bit unsigned offset
769	   to a 90 kHz timestamp enough?  Input from the streaming industry is
770	   needed.  One solution would be to create five different xTAPs, with
771	   0, 8, 16, 24, and 32 bit timestamps per aggregation unit.  Another
772	   option would be a more complex payload header that signals presence
773	   (and size) of the timing information per aggregation unit.
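
   To put the offset sizes into perspective, the time span that an
   unsigned offset can express at 90 kHz can be computed as follows.
   The function name is hypothetical, and the sketch assumes that the
   offset counts clock ticks directly, without scaling.

```python
CLOCK_RATE_HZ = 90_000


def max_offset_seconds(offset_bits: int) -> float:
    """Largest time span an unsigned offset of offset_bits can express."""
    return (2 ** offset_bits - 1) / CLOCK_RATE_HZ


spans = {bits: max_offset_seconds(bits) for bits in (8, 16, 24, 32)}
```

   Under this assumption an 8 bit offset spans less than 3 ms, a 16 bit
   offset about 0.73 seconds, a 24 bit offset about 186 seconds, and a
   32 bit offset more than 13 hours.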

775	   Since JVT will likely be approved as the advanced video codec of
776	   MPEG-4, it may be desirable to align this payload specification with
777	   other payload specifications for MPEG-4.  The authors of this I-D
778	   and some authors of the MPEG-4 packetization I-Ds are discussing the
779	   issue, and there is a chance that in the future changes to this I-D
780	   will be proposed to AVT to reflect the outcome of these discussions.

782	12. Full Copyright Statement
783	   Copyright (C) The Internet Society (2002). All Rights Reserved.

785	   This document and translations of it may be copied and furnished to
786	   others, and derivative works that comment on or otherwise explain it
787	   or assist in its implementation may be prepared, copied, published
788	   and distributed, in whole or in part, without restriction of any
789	   kind, provided that the above copyright notice and this paragraph
790	   are included on all such copies and derivative works.

792	   However, this document itself may not be modified in any way, such
793	   as by removing the copyright notice or references to the Internet
794	   Society or other Internet organizations, except as needed for the
795	   purpose of developing Internet standards in which case the
796	   procedures for copyrights defined in the Internet Standards process
797	   must be followed, or as required to translate it into languages
798	   other than English.

800	   The limited permissions granted above are perpetual and will not be
801	   revoked by the Internet Society or its successors or assigns.

803	   This document and the information contained herein is provided on an
804	   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
805	   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
806	   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
807	   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
808	   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

810	13. Bibliography

812	   [1]  P. Borgwardt, "Handling Interlaced Video in H.26L", VCEG-N57r2,
813	        available from ftp://standard.pictel.com/video-
814	        site/0109_San/VCEG-N57r2.doc, September 2001
815	   [2]  JVT Joint Committee Draft, available from ftp://ftp.imtc-
816	        files.org/jvt-experts/2002_05_Fairfax/JVT-C167.doc
817	   [3]  ITU-T Recommendation H.263-2000
818	   [4]  ISO/IEC IS 14496-1
819	   [5]  S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and
820	        Systems for Video Technology, to appear (April 2002)

822	   [6]  S. Wenger, "H.26L over IP: The IP Network Adaptation Layer",
823	        Proceedings Packet Video Workshop 02, April 2002, to appear.
824	   [7]  C. Bormann et al., "RTP Payload Format for the 1998 Version of
825	        ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998
826	   [8]  ISO/IEC IS 14496-2
827	   [9]  S. Wenger, T. Stockhammer, "H.26L over IP and H.324 Framework",
828	        VCEG-N52, available from ftp://standard.pictel.com/video-
829	        site/0109_San/VCEG-N52.doc, September 2001
830	   [10] ITU-T Recommendation H.223 (1999)
831	   [11] C. Perkins et al., "RTP Payload for Redundant Audio Data", RFC
832	        2198, September 1997

834	   [12] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for
835	        Generic Forward Error Correction", RFC 2733, December 1999
836	   [13] V. Varsa, M. Karczewicz, "Slice interleaving in compressed video
837	        packetization", Packet Video Workshop 2000

839	   Authors' Addresses

841	   Stephan Wenger                     Phone: +49-172-300-0813
842	   TU Berlin / Teles AG               Email: stewe@cs.tu-berlin.de
843	   Franklinstr. 28-29
844	   D-10587 Berlin
845	   Germany

847	   Thomas Stockhammer                 Phone: +49-89-28923474
848	   Institute for Communications Eng.  Email: stockhammer@ei.tum.de
849	   Munich University of Technology
850	   D-80290 Munich
851	   Germany

853	   Miska M. Hannuksela                Phone: +358 40 5212845
854	   Nokia Corporation                  Email: miska.hannuksela@nokia.com
855	   P.O. Box 68
856	   33721 Tampere
857	   Finland