Hi,

My suggestion is to start today's meeting with Ye-Kui's suggestion on
cross-layer decoding and, if time permits, go over the other issues from
Mike's suggestion. We will need to continue with those items via email, but
I think it will be good if we can come to some agreement, at least about how
to present the two solutions for achieving consensus. Here are the inputs,
to make it easier for you.

To help people make their judgments,
let me try to summarize the pros and cons of the two modes. Many of these have
already been discussed earlier in the mailing list - but they still need to be
considered for the group to make a conscious decision.

Cons of "the CL-DON mode" and pros of "the classical mode":

- The single NAL unit packetization mode cannot be used in "the CL-DON
mode", while all three packetization modes can be used in "the classical
mode". This backward-compatibility issue is the major constraint of "the
CL-DON mode". However, this constraint is diluted by the following facts.
A) As discussed earlier, it is still possible to encapsulate only one video
NAL unit into one packet, together with a small PACSI NAL unit that carries
the CL-DON information. B) Currently we are not aware of existing
implementations of RFC 3984 that are capable of doing multicast AND use only
the single NAL unit packetization mode.

Pros of "the CL-DON mode"
and cons of "the classical mode":

- "The CL-DON mode" does not rely on timestamp values (be it the RTP
timestamp or anything else) to synchronize NAL units carried in different
RTP sessions, while "the classical mode" does. Using the RTP timestamp for
this purpose requires RTP packets from the same access unit but carried in
different RTP sessions to have the same RTP timestamp value, which seems to
have been basically concluded as not viable in the previous AVT meeting. As
for using other existing RTP synchronization mechanisms for this purpose,
there is the issue of the initial delay before the synchronization among all
the RTP sessions is done.
- "The CL-DON mode" does not need to run separate NAL unit decoding order
recovery processes for each of the sessions before recovering the decoding
order across all the sessions. Only one process is needed for all the
sessions in "the CL-DON mode", and the process is almost exactly the same as
the de-interleaving process in RFC.
- Generation and insertion of
additional video NAL units to the original bitstream is never needed by
"the CL-DON mode" but may be needed by "the classical mode". This generation
and insertion of additional NAL units may have to be done by the RTP
packetizer (not the video encoder) when pre-encoded content is used.
Moreover, these additional NAL units may make the original bitstream, or a
subset of it, non-conforming to the standard because of conflicts in
buffering (i.e. HRD, hypothetical reference decoder) parameters.
- "The CL-DON mode"
outputs all received video NAL units to the video decoder, while "the
classical mode" has to discard some received video NAL units in some cases
of packet loss. In those cases the decoded video quality is degraded,
because useful video NAL units are discarded. Furthermore, from an
architecture point of view, handling (including discarding) of video NAL
units should be the video decoder's task, not that of the RTP payload
depacketization process.
- "The CL-DON mode" allows non-VCL NAL units such as parameter sets and SEI
messages to be carried only when they are required (i.e. non-VCL NAL units
can be carried in the RTP session that carries the VCL NAL units with which
they are associated), while in "the classical mode" all the non-VCL NAL
units always have to be carried in the base RTP session. Carrying this
additional data in the base RTP session requires additional bandwidth for
transmission of the base RTP session, and the additional data may make the
bitstream subset carried in the base RTP session non-conforming to the
standard because of conflicts in buffering (i.e. HRD, hypothetical reference
decoder) parameters.

Finally, here are the points that I think still
need to be clarified in subsection 8.1.1 ("the classical mode") for people
to better understand the process. The text for subsection 8.1.2 ("the CL-DON
mode"), which I should be responsible for, is clear enough to me. However,
please let me know if I still need to improve any parts therein.

- What exactly is the "timestamp"? If not the RTP timestamp, what is the
exact process to derive that "timestamp"?
- What is the size of the de-session-multiplexing buffer that contains the
output packets from the RTP process as specified in subsection 8.1 before
the NAL units are depacketized and sent to the video decoder?
- When does the decoding order recovery between RTP streams of different RTP
sessions start (e.g. after what number of packets in the
de-session-multiplexing buffer, or how long after the first packet has
entered the de-session-multiplexing buffer)?
- When generation and insertion of additional video NAL units to the
original bitstream is needed, what is the exact process for generation and
insertion of the additional video NAL units? How can the received bitstream
subsets be ensured to conform to the SVC coding standard?
- Packet loss may require received NAL units to be discarded. When should
that happen? What is the exact process? The exact process is needed because
the size of the de-session-multiplexing buffer and the time when the
decoding order recovery between RTP streams of different RTP sessions starts
will be affected by the NAL unit discarding process.
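
As a rough illustration of the single decoding order recovery process
claimed above for "the CL-DON mode", the following Python sketch pools NAL
units from all RTP sessions into one buffer and releases them in CL-DON
order, much like a de-interleaving buffer. It is not taken from the draft:
the class name ClDonReorderBuffer, the depth parameter and the
(cl_don, session, payload) tuple layout are assumptions made for the
example, and wrap-around of the CL-DON value is ignored.

```python
import heapq


class ClDonReorderBuffer:
    """One reorder buffer shared by all RTP sessions (hypothetical helper)."""

    def __init__(self, depth):
        # depth: how many NAL units may be held back before the one with
        # the smallest CL-DON is released (assumed parameter).
        self.depth = depth
        self._heap = []

    def push(self, cl_don, session, payload):
        """Store one received NAL unit; return the units that are now ready."""
        heapq.heappush(self._heap, (cl_don, session, payload))
        ready = []
        while len(self._heap) > self.depth:
            ready.append(heapq.heappop(self._heap))
        return ready

    def flush(self):
        """Release everything still buffered, smallest CL-DON first."""
        out = []
        while self._heap:
            out.append(heapq.heappop(self._heap))
        return out


# NAL units arrive interleaved from a base and an enhancement session.
arrivals = [(0, "base", b"a"), (2, "base", b"c"),
            (1, "enh", b"b"), (3, "enh", b"d")]
buf = ClDonReorderBuffer(depth=2)
to_decoder = []
for cl_don, session, payload in arrivals:
    to_decoder += buf.push(cl_don, session, payload)
to_decoder += buf.flush()
print([nal[0] for nal in to_decoder])
```

Note that the session a NAL unit arrived on plays no role in the ordering;
only the CL-DON value is compared, which is why one process suffices.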
The problem to be solved
can be summarised as follows. The video encoder, or other source of coded video
data, produces a sequence of chunks of data known as NAL units. These are to be
transmitted over two or more RTP sessions. At the receiver, the data is to be
put back into a single sequence with the same order as in the original sequence
from the encoder or data source. This data is then input to the video decoder
and is decoded and output. There are other variants, where for example, the
receiver is not a decoder, but some other device such as a MANE, but the core
problem of re-establishing the original order of NAL units is the same.

One of the solutions to this problem, the CL-DON solution, allocates a
monotonically increasing sequence number to each of the NAL units from the
encoder, transports these numbers through the network, and uses these
numbers to re-establish the original order of NAL units. The NAL units
received on the multiple RTP sessions are simply ordered according to this
monotonically increasing sequence of numbers.

The other solution to
this problem, the classical solution, when operated according to the rules
in the current version of the draft, operates as follows. The NAL units from
the encoder are grouped into access units and associated with a
non-monotonic number (the timestamp, representing output (display) order
rather than decoding order). Effectively the NAL units are being labelled
with (almost) arbitrary labels. These labelled NAL units are then separated
into multiple RTP streams, and a monotonically increasing sequence number is
applied independently in each RTP stream. Note that both of these steps are
also performed in the CL-DON solution, but there they do not have to be used
to restore decoding order. At the receiver, the independent monotonically
increasing sequence numbers are used to re-order packets in each RTP stream.
These are then grouped according to label (timestamp) in each RTP stream.
Then the NAL units from the lower layers are "merged" with the NAL units in
the highest enhancement layer, grouping together NAL units with the same
label (timestamp). Finally, SEI NAL units must be moved to the start of each
group (access unit), if they were transmitted anywhere other than the base
RTP session.

This suffers from the
need for the highest layer to have NAL units at every time instant for which
there is a NAL unit in any lower layer. And because this process has to work
regardless of how many of the RTP sessions are received, the same has to
apply to any layer with regard to the layers below it. While this can be
overcome by inserting filler data NAL units, it does seem to have a problem
with packet loss, as this situation cannot be guaranteed after loss. Given
that the highest layer may often be transmitted with the least error
protection, this is a major limitation of this approach.

However, the classical solution can be operated in a different way at the
receiver to overcome this limitation, at the cost of additional complexity.
As before, at the receiver, the independent monotonically increasing
sequence numbers are used to re-order packets in each RTP stream, and then
these are grouped according to label (timestamp) in each RTP stream. Then
the sequences of labels (timestamps) in each stream can be analysed, and in
many (but not all) cases, the decoding order of the labels (timestamps) can
be deduced, and then used to restore the decoding order.

In the example below, the
top two RTP sessions operate at a given frame rate and the base layer is
operating at half the frame rate. Packet loss has affected one access unit
in the top layer and one access unit in the middle layer.

Top layer:       4       1   3   8   6   5   7  12  10
Middle layer:    4   2   1   3   8       5   7  12  10
Base layer:      4   2           8   6          12  10

However, decoding order can be restored by noticing from the middle layer
that NAL units with label=2 are to be decoded before those with label=1.
Similarly, the top layer tells us that NAL units with label=6 are to be
decoded before those with label=5.

But if both middle and
top layers lost their NAL units with label=2, as shown below, it would be
more difficult to re-establish decoding order, as from the RTP and payload
layer it is not possible to determine whether label=2 comes before or after
label=1. It may be possible to determine the order by looking into
pic_timing SEI messages, if present (not guaranteed), or a best guess could
be made by making assumptions based on previous GOP structures (and the
order of timestamps). Alternatively, it may be better to discard all NAL
units with labels 1 and 3 rather than to risk feeding data to the decoder in
the wrong order.

Top layer:       4       1   3   8   6   5   7  12  10
Middle layer:    4       1   3   8       5   7  12  10
Base layer:      4   2           8   6          12  10

My conclusion is that
while using a non-monotonic set of numbers (timestamps) to re-establish
decoding order is possible in many but not all cases, it is a fairly complex
process, particularly if it is to make best use of all packets received when
some are lost, as in the second method above. And in practice I feel that
the second method would be implemented, because the performance of the first
in the case of packet loss could be unacceptably poor.

The major weakness of the CL-DON method is that it is not backwards
compatible with the single NAL unit mode of RFC 3984.

One way to overcome this
would be to use some backward compatible mechanism to transport the CL-DON
information in the base RTP session operating in single NAL unit mode. The
RTP header extension mechanism is one way that this could be done, but I
know that there are objections to doing this.

However, the single NAL unit mode was introduced into RFC 3984 primarily
"for low-delay applications that are compatible with systems using ITU-T
Recommendation H.241".

Hence, if there is a need
for backwards compatibility with the single NAL unit mode, and this is
itself very debatable, then this need would seem to be restricted to
low-delay applications, where it is unlikely that access units would be
encoded in an order different from output (display) order.

Consequently, a solution to the whole problem of restoring the decoding
order of NAL units is to define a class of receiver that supports the full
CL-DON method, and the classical method restricted to cases where the
timestamps are monotonically increasing. This restricted case of the
classical method is much simpler to implement than the general case, and
provides backwards compatibility with the intended applications of the
single NAL unit mode.
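
To make the general (second) receiver method and its failure case more
concrete, here is a small Python sketch that tries to deduce the decoding
order of labels (timestamps) from the per-stream label sequences of the two
packet loss examples above. It is an illustration only, not a procedure
from the draft: the function name deduce_decoding_order is hypothetical,
labels are assumed to be unique within the window analysed, each stream is
assumed to be already re-ordered by RTP sequence number, and ties are
broken by picking the smallest label, which is an arbitrary guess.

```python
def deduce_decoding_order(streams):
    """Return (order, ambiguous) for a list of per-stream label sequences.

    order:     one ordering consistent with every stream's own order
    ambiguous: labels whose relative position could not be determined
    """
    preds = {}  # label -> labels that some stream shows immediately before it
    for labels in streams:
        for label in labels:
            preds.setdefault(label, set())
        for earlier, later in zip(labels, labels[1:]):
            preds[later].add(earlier)

    order, ambiguous, done = [], set(), set()
    remaining = set(preds)
    while remaining:
        # Labels whose known predecessors have all been placed already.
        ready = sorted(l for l in remaining if preds[l] <= done)
        if not ready:
            raise ValueError("streams contradict each other")
        if len(ready) > 1:
            ambiguous.update(ready)  # no stream orders these against each other
        chosen = ready[0]            # arbitrary tie-break (assumption)
        order.append(chosen)
        done.add(chosen)
        remaining.remove(chosen)
    return order, ambiguous


base = [4, 2, 8, 6, 12, 10]

# First example: top layer lost label 2, middle layer lost label 6.
top = [4, 1, 3, 8, 6, 5, 7, 12, 10]
middle = [4, 2, 1, 3, 8, 5, 7, 12, 10]
order1, ambiguous1 = deduce_decoding_order([top, middle, base])

# Second example: the middle layer has also lost label 2.
middle_lossy = [4, 1, 3, 8, 5, 7, 12, 10]
order2, ambiguous2 = deduce_decoding_order([top, middle_lossy, base])

print(order1, ambiguous1)
print(order2, ambiguous2)
```

In the first example the order is fully determined. In the second, labels
1, 2 and 3 are reported as ambiguous, and the tie-break guess may well be
wrong, which is the situation in which discarding labels 1 and 3 was
suggested. In the restricted case with monotonically increasing timestamps
none of this analysis is needed, since sorting by label is sufficient.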
More architecturally, I wonder if
cross-layer decoding is a question that should be addressed in a
generic manner rather than per-media. Several drafts have been presented
recently for audio codecs which do session multiplexing in a very
similar way to H.264/SVC, and they also need to be able to indicate a global
decoding order. Should this be addressed as a generic problem?

Roni Even

_______________________________________________
Audio/Video Transport Working Group
avt at ietf.org
http://www.ietf.org/mailman/listinfo/avt