
[AVT] SVC ad-hoc meeting February 19th



Hi,

My suggestion is to start today's meeting with Ye-Kui's suggestion on cross-layer decoding and, if time permits, go over the other issues from Mike's suggestion. We will need to continue with those items via email, but I think it will be good if we can come to some agreement at least about how to present the two solutions for achieving consensus.

Here are the inputs, to make it easier for you:

  1. From Ye-Kui

 

To help people make their judgments, let me try to summarize the pros and cons of the two modes. Many of these have already been discussed earlier on the mailing list, but they still need to be considered for the group to make a conscious decision.

 

Cons of "the CL-DON mode" and pros of "the classical mode":

 

- The single NAL unit packetization mode cannot be used in "the CL-DON mode", while all three packetization modes can be used in "the classical mode". This backward-compatibility issue is the major constraint of "the CL-DON mode". However, this constraint is diluted by the following facts. A) As discussed earlier, it is still possible to encapsulate only one video NAL unit into one packet, together with a small PACSI NAL unit that carries the CL-DON information. B) Currently we are not aware of existing implementations of RFC 3984 that are capable of doing multicast AND use only the single NAL unit packetization mode.

 

Pros of "the CL-DON mode" and cons of "the classical mode":

 

- "The CL-DON mode" does not rely on timestamp values (be it the RTP timestamp or anything else) to synchronize NAL units carried in different RTP sessions, while "the classical mode" does. Using the RTP timestamp for this purpose requires RTP packets from the same access unit but carried in different RTP sessions to have the same RTP timestamp value, which seems to have been basically concluded as not viable in the previous AVT meeting. As for using other existing RTP synchronization mechanisms for this purpose, there is the issue of the initial delay before synchronization among all the RTP sessions is done.

 

- "The CL-DON mode" does not need to run a separate NAL unit decoding order recovery process for each session before recovering the decoding order across all the sessions. Only one process is needed for all the sessions in "the CL-DON mode", and that process is almost exactly the same as the de-interleaving process in RFC 3984. In contrast, "the classical mode" needs to run a separate NAL unit decoding order recovery process for each session before recovering the decoding order across all the sessions. Therefore, a total of N+1 processes are needed for N sessions in "the classical mode", wherein the N separate processes are exactly the same as in RFC 3984, and the last process is totally new.

 

- Generation and insertion of additional video NAL units into the original bitstream is never needed by "the CL-DON mode" but may be needed by "the classical mode". This generation and insertion of additional NAL units may have to be done by the RTP packetizer (not the video encoder) when pre-encoded content is used. Moreover, these additional NAL units may make the original bitstream or a subset of it non-conforming to the standard because of conflicts in buffering (i.e. HRD - hypothetical reference decoder) parameters.

 

- "The CL-DON mode" outputs all received video NAL units to the video decoder, while "the classical mode" has to discard some received video NAL units in some cases of packet loss. In some cases this will degrade the decoded video quality, because some useful video NAL units are discarded. Furthermore, from an architecture point of view, handling (including discarding) of video NAL units should be the video decoder's task, not that of the RTP payload depacketization process.

 

- "The CL-DON mode" allows non-VCL NAL units such as parameter sets and SEI messages to be carried only when they are required (i.e. non-VCL NAL units can be carried in the RTP session that carries the VCL NAL units with which they are associated), while in "the classical mode" all the non-VCL NAL units always have to be carried in the base RTP session. Carrying this additional data in the base RTP session requires additional bandwidth for transmission of the base RTP session, and the additional data may make the bitstream subset carried in the base RTP session non-conforming to the standard because of conflicts in buffering (i.e. HRD - hypothetical reference decoder) parameters.

 

Finally, here is what I think still needs to be clarified in subsection 8.1.1 ("the classical mode") for people to better understand the process. The text for subsection 8.1.2 ("the CL-DON mode"), which I should be responsible for, is clear enough to me. However, please let me know if I still need to improve any parts therein.

 

- What exactly is the "timestamp"? If not the RTP timestamp, what is the exact process to derive that "timestamp"?

 

- What is the size of the de-session-multiplexing buffer that contains the output packets from the RTP process as specified in subsection 8.1 before the NAL units are depacketized and sent to the video decoder?

 

- When does the decoding order recovery between RTP streams of different RTP sessions start (e.g. after what number of packets in the de-session-multiplexing buffer or how long an initial time after the first packet has entered the de-session-multiplexing buffer)?

 

- When generation and insertion of additional video NAL units into the original bitstream is needed, what is the exact process for generation and insertion of the additional video NAL units? How is it ensured that the received bitstream subsets conform to the SVC coding standard?

 

- Packet loss may require received NAL units to be discarded. When should that happen? What is the exact process? The exact process is needed because the size of the de-session-multiplexing buffer, and the time when the decoding order recovery between RTP streams of different RTP sessions starts, will be affected by the NAL unit discarding process.

 

  2. From Mike

 

The problem to be solved can be summarised as follows. The video encoder, or other source of coded video data, produces a sequence of chunks of data known as NAL units. These are to be transmitted over two or more RTP sessions. At the receiver, the data is to be put back into a single sequence with the same order as in the original sequence from the encoder or data source. This data is then input to the video decoder and is decoded and output. There are other variants where, for example, the receiver is not a decoder but some other device such as a MANE, but the core problem of re-establishing the original order of NAL units is the same.

 

One of the solutions to this problem, the CL-DON solution, allocates a monotonic increasing sequence number to each of the NAL units from the encoder, transports these numbers through the network, and uses these numbers to re-establish the original order of NAL units. The NAL units received on the multiple RTP sessions are simply ordered according to this monotonic increasing sequence of numbers.
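
As an illustration only, the receiver side of this idea can be sketched in a few lines of Python. The type `NalUnit` and its field names are hypothetical and are not taken from the draft. The sketch also assumes that all NAL units of interest are already buffered and that the sequence numbers do not wrap around, which a real receiver would have to handle.

```python
from typing import Iterable, List, NamedTuple


class NalUnit(NamedTuple):
    cl_don: int     # cross-layer decoding order number (hypothetical field name)
    session: int    # index of the RTP session the unit arrived on
    payload: bytes


def restore_decoding_order(sessions: Iterable[List[NalUnit]]) -> List[NalUnit]:
    """Merge the NAL units of all sessions into one list ordered by CL-DON.

    Assumption: no wraparound of the numbers within the buffered window.
    """
    merged = [nal for session in sessions for nal in session]
    merged.sort(key=lambda nal: nal.cl_don)
    return merged


base = [NalUnit(0, 0, b"a"), NalUnit(3, 0, b"d")]
enhancement = [NalUnit(2, 1, b"c"), NalUnit(1, 1, b"b")]
ordered = restore_decoding_order([base, enhancement])
```

A single ordering step covers all sessions, and a lost NAL unit only leaves a gap in the numbers without affecting the order of the units that did arrive.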

 

The other solution to this problem, the classical solution, when operated according to the rules in the current version of the draft, works as follows. The NAL units from the encoder are grouped into access units and associated with a non-monotonic number (the timestamp, representing output (display) order rather than decoding order). Effectively the NAL units are being labelled with (almost) arbitrary labels. These labelled NAL units are then separated into multiple RTP streams, and a monotonic increasing sequence number is applied independently in each RTP stream. Note that both of these steps are also performed in the CL-DON solution, but there they do not have to be used to restore decoding order. At the receiver, the independent monotonic increasing sequence numbers are used to re-order packets in each RTP stream. These are then grouped according to label (timestamp) in each RTP stream. Then the NAL units from the lower layers are “merged” with the NAL units in the highest enhancement layer, grouping together NAL units with the same label (timestamp). Finally, SEI NAL units must be moved to the start of each group (access unit), if they were transmitted anywhere other than the base RTP session.
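
A rough Python sketch of these receiver steps, as I read them, is given below. It is not the text of the draft: the type `Packet`, the function name and the decision to let the highest received layer dictate the order of the groups are assumptions made for illustration. Sequence number wraparound and the final SEI re-ordering step are left out.

```python
from typing import Dict, List, NamedTuple, Sequence


class Packet(NamedTuple):
    seq: int         # per-stream RTP sequence number (no wraparound assumed)
    timestamp: int   # the label shared by the NAL units of one access unit
    nal: str         # stand-in for the NAL unit payload


def classical_merge(streams: Sequence[List[Packet]]) -> List[str]:
    """streams[0] is the base layer, streams[-1] the highest received layer."""
    grouped: List[Dict[int, List[str]]] = []
    for packets in streams:
        groups: Dict[int, List[str]] = {}
        # Step 1: re-order within the stream by sequence number.
        for packet in sorted(packets, key=lambda p: p.seq):
            # Step 2: group by label (timestamp), keeping arrival order of labels.
            groups.setdefault(packet.timestamp, []).append(packet.nal)
        grouped.append(groups)

    output: List[str] = []
    # Step 3: walk the labels in the order seen in the highest layer and
    # merge in the groups with the same label from the lower layers.
    for timestamp in grouped[-1]:
        for groups in grouped:
            output.extend(groups.get(timestamp, []))
    return output


base = [Packet(0, 40, "b4"), Packet(1, 20, "b2")]
top = [Packet(1, 10, "e1"), Packet(0, 40, "e4"), Packet(2, 20, "e2")]
merged = classical_merge([base, top])

# If the top-layer packet with label 20 is lost, "b2" has no group to join.
merged_after_loss = classical_merge([base, top[:2]])
```

In this sketch a lower-layer NAL unit is emitted only if the highest layer carries the same label, so the second call silently drops "b2".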

 

This suffers from the need for the highest layer to have NAL units at every time instant for which there is a NAL unit in any lower layer. And due to the need for this process to work regardless of how many of the RTP sessions are received, the same has to apply to any layer with regard to the layers below it. While this can be overcome by inserting filler data NAL units, it does seem to have a problem with packet loss, as this situation cannot be guaranteed after loss. Given that the highest layer may often be transmitted with the least error protection, this is a major limitation of this approach.
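
The constraint can be stated as a small check. The following Python sketch is illustrative only (the function name is made up): for each layer it lists the labels that appear in some lower layer but are absent from that layer. The input sets are the labels received per layer in the packet-loss example given further down in this message.

```python
from typing import Dict, List, Sequence, Set


def missing_instants(layers: Sequence[Set[int]]) -> Dict[int, List[int]]:
    """For each layer (index 0 = base), list the labels that are present in a
    lower layer but absent from this layer."""
    gaps: Dict[int, List[int]] = {}
    seen_below: Set[int] = set()
    for index, labels in enumerate(layers):
        missing = sorted(seen_below - set(labels))
        if missing:
            gaps[index] = missing
        seen_below |= set(labels)
    return gaps


base = {4, 2, 8, 6, 12, 10}
middle = {4, 2, 1, 3, 8, 5, 7, 12, 10}      # label 6 lost
top = {4, 1, 3, 8, 6, 5, 7, 12, 10}         # label 2 lost
gaps = missing_instants([base, middle, top])
```

An empty result means the constraint holds for what was received; any entry marks a time instant at which the simple merge has nothing in the higher layer to attach the lower-layer NAL units to.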

 

However, the classical solution can be operated in a different way at the receiver to overcome this limitation, at the cost of additional complexity. As before, at the receiver, the independent monotonic increasing sequence numbers are used to re-order packets in each RTP stream, and then these are grouped according to label (timestamp) in each RTP stream. Then the sequences of labels (timestamps) in each stream can be analysed, and in many (but not all) cases, the decoding order of the labels (timestamps) can be deduced, and then used to restore the decoding order.

 

In the example below, the top two RTP sessions operate at a given frame rate and the base layer is operating at half the frame rate. Packet loss has affected one access unit in the top layer and one access unit in the middle layer.

 

 4     1  3  8  6  5  7 12 10

 4  2  1  3  8     5  7 12 10

 4  2        8  6       12 10

 

However, decoding order can be restored by noticing from the middle layer that NAL units with label=2 are to be decoded before those with label=1. Similarly, the top layer tells us that NAL units with label=6 are to be decoded before those with label=5.
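
One possible way to mechanise this reasoning, offered as a sketch and not as the procedure of the draft, is to treat every pair of consecutive labels within a stream as a "decode before" relation and to run a topological sort over the union of these relations. The function name is hypothetical, and labels are assumed to be unique within the window being analysed.

```python
from typing import Dict, List, Optional, Sequence, Set


def deduce_decoding_order(streams: Sequence[Sequence[int]]) -> Optional[List[int]]:
    """Return the unique label order implied by all streams, or None if the
    order is ambiguous or contradictory."""
    successors: Dict[int, Set[int]] = {}
    pending: Dict[int, int] = {}   # number of unprocessed predecessors per label
    for labels in streams:
        for label in labels:
            successors.setdefault(label, set())
            pending.setdefault(label, 0)
        for earlier, later in zip(labels, labels[1:]):
            if later not in successors[earlier]:
                successors[earlier].add(later)
                pending[later] += 1

    order: List[int] = []
    ready = [label for label, count in pending.items() if count == 0]
    while ready:
        if len(ready) > 1:
            return None            # two candidates for the next label
        label = ready.pop()
        order.append(label)
        for nxt in successors[label]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    return order if len(order) == len(pending) else None


top = [4, 1, 3, 8, 6, 5, 7, 12, 10]
middle = [4, 2, 1, 3, 8, 5, 7, 12, 10]
base = [4, 2, 8, 6, 12, 10]
order = deduce_decoding_order([top, middle, base])
```

For the example above this yields the order 4, 2, 1, 3, 8, 6, 5, 7, 12, 10, with each stream filling in the label that another stream lost.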

 

 

But if both middle and top layers lost their NAL units with label=2, as shown below, it would be more difficult to re-establish decoding order as from the RTP and payload layer it is not possible to determine if label=2 comes before or after label=1. It may be possible to determine order by looking into pic_timing SEI messages, if present (not guaranteed), or a best guess could be made by making assumptions based on previous GOP structures (and the order of timestamps). Alternatively it may be better to discard all NAL units with labels 1 and 3 rather than to risk feeding data to the decoder in the wrong order.
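
To make the ambiguity concrete, the sketch below (illustrative only, with a hypothetical function name) builds the same "decode before" relations from consecutive labels in each stream and reports every pair of labels whose relative order cannot be derived from them. It is applied to the label sequences of this second case, with label=2 missing from both the middle and top layers.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def unordered_pairs(streams: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Return the pairs of labels whose relative decoding order is not implied
    by the label sequences of the streams."""
    successors: Dict[int, Set[int]] = {}
    for labels in streams:
        for label in labels:
            successors.setdefault(label, set())
        for earlier, later in zip(labels, labels[1:]):
            successors[earlier].add(later)

    def reachable(start: int) -> Set[int]:
        # Depth-first search over the "decode before" relations.
        seen: Set[int] = set()
        stack = [start]
        while stack:
            for nxt in successors[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {label: reachable(label) for label in successors}
    return [
        (a, b)
        for a, b in combinations(sorted(successors), 2)
        if b not in reach[a] and a not in reach[b]
    ]


top = [4, 1, 3, 8, 6, 5, 7, 12, 10]
middle = [4, 1, 3, 8, 5, 7, 12, 10]
base = [4, 2, 8, 6, 12, 10]
ambiguous = unordered_pairs([top, middle, base])
```

Here the unresolved pairs are (1, 2) and (2, 3): nothing at the RTP or payload level says whether label=2 goes before or after labels 1 and 3, which is why a receiver has to fall back on SEI messages, a guess, or discarding.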

 

 4     1  3  8  6  5  7 12 10

 4     1  3  8     5  7 12 10

 4  2        8  6       12 10

 

My conclusion is that while using a non-monotonic set of numbers (timestamps) to re-establish decoding order is possible in many but not all cases, it is a fairly complex process, particularly if it is to make best use of all packets received when some are lost, as in the second method above. And in practice I feel that the second method would be implemented because the performance of the first in the case of packet loss could be unacceptably poor.

 

The major weakness of the CL-DON method is that it is not backwards compatible with the single NAL unit mode of RFC 3984.

 

One way to overcome this would be to use some backward compatible mechanism to transport the CL-DON information in the base RTP session operating in single NAL unit mode. The RTP header extension mechanism is one way that this could be done, but I know that there are objections to doing this.

 

However, the single NAL unit mode was introduced into RFC 3984 primarily “for low-delay applications that are compatible with systems using ITU-T Recommendation H.241”.

 

Hence, if there is a need for backwards compatibility with the single NAL unit mode, and this is itself very debatable, then this need would seem to be restricted to low delay applications, where it is unlikely that access units would be encoded in a different order to output (display) order.

 

Consequently, a solution to the whole problem of restoring the decoding order of NAL units is to define a class of receiver that supports the full CL-DON method, and the classical method restricted to cases where the timestamps are monotonic increasing. This restricted case of the classical method is much simpler to implement than the general case, and provides backwards compatibility with the intended applications of the single NAL unit mode.
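
To show why the restricted case is simpler, here is a minimal Python sketch under stated assumptions: timestamps never decrease within a stream, they do not wrap around, and within one access unit the NAL units of a lower layer are decoded before those of a higher layer. The function name is hypothetical. With these assumptions the merge is an ordinary merge of sorted lists.

```python
import heapq
from typing import List, Sequence, Tuple


def merge_monotonic(streams: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Merge per-layer timestamp sequences (index 0 = base layer) into one
    list of (timestamp, layer) pairs in decoding order.

    Raises ValueError if a stream violates the monotonic restriction.
    """
    for layer, timestamps in enumerate(streams):
        if any(b < a for a, b in zip(timestamps, timestamps[1:])):
            raise ValueError(f"timestamps decrease in layer {layer}")
    keyed = [
        [(timestamp, layer) for timestamp in timestamps]
        for layer, timestamps in enumerate(streams)
    ]
    # Each keyed list is already sorted, so a plain k-way merge is enough.
    return list(heapq.merge(*keyed))


base = [0, 2, 4]
enhancement = [0, 1, 2, 3, 4]
decoding_order = merge_monotonic([base, enhancement])
```

No analysis of label sequences across streams is needed, and a lost packet in one layer does not affect the ordering of the packets that were received.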

 

 

  3. Jonathan's suggestion

 

More architecturally, I wonder if cross-layer decoding is a question that should be addressed in a generic manner rather than per-media. Several drafts have been presented recently for audio codecs which do session multiplexing in a very similar way to H.264/SVC, and they also need to be able to indicate a global decoding order. Should this be addressed as a generic problem?

 

 

Roni Even

_______________________________________________
Audio/Video Transport Working Group
avt at ietf.org
http://www.ietf.org/mailman/listinfo/avt