
Re: [AVT] draft-ietf-avt-rtp-svc-07.txt: classical RTP Decoding Order Recovery Mode



I still concur with Randell on the following point:
- The solution for the decoding order recovery process (in the final
RFC) must be able to deal with both constant and variable frame rate
use cases.

I also concur with Yann on the following point:
- The solution for the decoding order recovery process (in the final
RFC) must be able to deal GRACEFULLY with packet losses. In the extreme
case, the process must be able to pass all the received NAL units to the
video decoder in decoding order. Having the payload format
(depacketization) process discard NAL units seems to be an architectural
issue - NAL unit discarding should be handled by the video decoder.

Unfortunately, the biggest open issue for this draft has stayed the same
since before the previous IETF meeting: NAL unit decoding order recovery
for session multiplexing!

Are people now ready to make a judgment on this issue in terms of the
following?
- Should the draft have only one of the two options in subsections 8.1.1
("the classical mode") and 8.1.2 ("the CL-DON mode"), or both?
- If only one, which one?

If people are still not ready to think about the issue because the text
is unclear, please say which parts of the related text are still not
clear enough, e.g. as Randell commented at the bottom, "8.1.1 is still
more than a bit confusing, even with the example".

To help people make their judgments, let me try to summarize the pros
and cons of the two modes. Many of these have already been discussed
earlier on the mailing list - but they still need to be considered for
the group to make an informed decision.

Cons of "the CL-DON mode" and pros of "the classical mode":

- The single NAL unit packetization mode cannot be used in "the CL-DON
mode", while all three packetization modes can be used in "the
classical mode". This backward compatibility issue is the major
constraint of "the CL-DON mode". However, the constraint is mitigated by
the following facts. A) As discussed earlier, it is still possible to
encapsulate only one video NAL unit into one packet, together with a
small PACSI NAL unit that carries the CL-DON information. B) Currently
we are not aware of any existing implementations of RFC 3984 that are
capable of doing multicast AND use only the single NAL unit
packetization mode.

Pros of "the CL-DON mode" and cons of "the classical mode": 

- "The classical mode" relies on timestamp values (be it the RTP
timestamp or anything else) to synchronize NAL units carried in
different RTP sessions, while "the CL-DON mode" does not. Using the RTP
timestamp for this purpose requires that the RTP packets from the same
access unit but carried in different RTP sessions have the same RTP
timestamp value, which seems to have been basically concluded as not
viable in the previous AVT meeting. As for using other existing RTP
synchronization mechanisms for this purpose, there is the issue of the
initial delay before synchronization among all the RTP sessions is
done.

- "The CL-DON mode" does not need to run separate NAL unit decoding
order recovery processes for each of the sessions before recovering the
decoding order across all the sessions. Only one process is needed for
all the sessions in "the CL-DON mode", and that process is almost
exactly the same as the de-interleaving process in RFC 3984. In
contrast, "the classical mode" needs to run a separate NAL unit decoding
order recovery process for each session before recovering the decoding
order across all the sessions. A total of N+1 processes are therefore
needed for N sessions in "the classical mode", wherein the N separate
processes are exactly the same as in RFC 3984, and the last process is
totally new.

- Generation and insertion of additional video NAL units into the
original bitstream is never needed by "the CL-DON mode" but may be
needed by "the classical mode". This generation and insertion of
additional NAL units may have to be done by the RTP packetizer (not the
video encoder) when pre-encoded content is used. Moreover, these
additional NAL units may make the original bitstream or a subset of it
non-conforming to the standard because of conflicts in buffering (i.e.
HRD - hypothetical reference decoder) parameters.

- "The CL-DON mode" outputs all received video NAL units to the video
decoder, while "the classical mode" has to discard some received video
NAL units in some cases of packet loss. In those cases the decoded video
quality will be degraded, because some useful video NAL units are
discarded. Furthermore, from an architecture point of view, handling
(including discarding) of video NAL units should be the video decoder's
task, not that of the RTP payload depacketization process.

- "The CL-DON mode" allows non-VCL NAL units such as parameter sets and
SEI messages to be carried only where they are required (i.e. non-VCL
NAL units can be carried in the RTP session that carries the VCL NAL
units with which they are associated), while in "the classical mode" all
the non-VCL NAL units always have to be carried in the base RTP session.
Carrying this additional data in the base RTP session requires
additional bandwidth for transmission of the base RTP session, and the
additional data may make the bitstream subset carried in the base RTP
session non-conforming to the standard because of conflicts in buffering
(i.e. HRD - hypothetical reference decoder) parameters.
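
To make the "only one process for all the sessions" point a little more
concrete, here is a minimal Python sketch. It is an illustration only,
not text from the draft: the tuple layout, the function names, and the
assumption that the CL-DON is a 16-bit wrapping counter compared in the
spirit of don_diff in RFC 3984 are all assumptions made for this
example.

```python
def don_diff(m, n):
    """Signed distance from DON m to DON n, modulo 2**16.

    A positive result means n follows m in decoding order. Modelled on
    the don_diff idea in RFC 3984; the exact half-range tie is resolved
    arbitrarily here.
    """
    d = (n - m) % 65536
    return d if d < 32768 else d - 65536


def recover_decoding_order(units):
    """Order (session_id, cl_don, nal) tuples from ALL sessions in one pass.

    Assumption: every CL-DON lies within half the 16-bit range of the
    first received unit. Nothing is discarded here; a lost unit only
    leaves a gap in the CL-DON sequence.
    """
    if not units:
        return []
    ref = units[0][1]  # CL-DON of the first unit to arrive
    return sorted(units, key=lambda u: don_diff(ref, u[1]))


# Two hypothetical sessions in interleaved arrival order, with the
# CL-DON wrapping past 65535; the unit with CL-DON 0 was lost.
received = [
    (1, 65535, "enh-A"),
    (0, 65534, "base-A"),
    (1, 2, "enh-B"),
    (0, 1, "base-B"),
]
ordered = [nal for _, _, nal in recover_decoding_order(received)]
```

In this sketch the session number plays no role in the ordering, and
every received unit is still handed on despite the loss; what to do
about the gap is left to the video decoder.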

Finally, here is what I think still needs to be clarified in subsection
8.1.1 ("the classical mode") for people to better understand the
process. The text for subsection 8.1.2 ("the CL-DON mode"), for which I
am responsible, is clear enough to me. However, please let me know if I
still need to improve any parts therein.

- What exactly is the "timestamp"? If not the RTP timestamp, what is the
exact process to derive that "timestamp"?

- What is the size of the de-session-multiplexing buffer that contains
the output packets from the RTP process as specified in subsection 8.1
before the NAL units are depacketized and sent to the video decoder?

- When does the decoding order recovery between RTP streams of different
RTP sessions start (e.g. after how many packets are in the
de-session-multiplexing buffer, or after how long an initial delay once
the first packet has entered the de-session-multiplexing buffer)?

- When generation and insertion of additional video NAL units into the
original bitstream is needed, what is the exact process for generating
and inserting the additional video NAL units? How is it ensured that the
received bitstream subsets conform to the SVC coding standard?

- Packet loss may require discarding received NAL units. When should
that happen? What is the exact process? The exact process is needed
because the size of the de-session-multiplexing buffer and the time when
the decoding order recovery between RTP streams of different RTP
sessions starts will be affected by the NAL unit discarding process.
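
To show why the buffer size and start condition need to be pinned down,
here is a purely hypothetical Python sketch of a de-session-multiplexing
buffer. It is NOT the process of subsection 8.1.1: the class name, the
single `start_after` parameter, and the ordering key (a common
"timestamp" first, then the session's position in dependency order, 0
being the base session) are assumptions made for illustration.

```python
import heapq


class DeSessionMuxBuffer:
    """Hypothetical buffer that releases NAL units in (timestamp, session) order."""

    def __init__(self, start_after):
        # Number of buffered units required before anything is released.
        # This is exactly the kind of value the questions above ask for.
        self.start_after = start_after
        self.heap = []
        self.count = 0  # arrival counter; breaks ties, keeps payloads uncompared

    def push(self, ts, session, nal):
        """Buffer one unit and return the units released to the decoder."""
        heapq.heappush(self.heap, (ts, session, self.count, nal))
        self.count += 1
        out = []
        while len(self.heap) >= self.start_after:
            out.append(heapq.heappop(self.heap)[3])
        return out

    def flush(self):
        """Release everything still buffered, in order (e.g. at end of stream)."""
        out = []
        while self.heap:
            out.append(heapq.heappop(self.heap)[3])
        return out


# The enhancement-session unit E0 arrives before the base-session unit B0
# of the same access unit.
buf = DeSessionMuxBuffer(start_after=3)
released = []
for ts, session, nal in [(0, 1, "E0"), (0, 0, "B0"), (1, 0, "B1"), (1, 1, "E1")]:
    released += buf.push(ts, session, nal)
released += buf.flush()
```

With start_after=3 the units come out as B0, E0, B1, E1. With
start_after=1 the same code would release E0 before B0 has even
arrived, which is why the draft needs to state this value, or how a
receiver derives it.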

BR, YK 

>-----Original Message-----
>From: ext Randell Jesup [mailto:rjesup at wgate.com] 
>Sent: Wednesday, February 06, 2008 6:49 PM
>To: Thomas Schierl
>Cc: Wang Ye-Kui (Nokia-NRC/Tampere); avt at ietf.org; 
>Yann.Leprovost at alcatel-lucent.fr
>Subject: Re: [AVT] draft-ietf-avt-rtp-svc-07.txt: classical 
>RTPDecodingOrder Recovery Mode
>
>"Thomas Schierl" <schierl at hhi.fhg.de> writes:
>>Randell Jesup [mailto:rjesup at wgate.com] writes:
>>> "Thomas Schierl" <schierl at hhi.fhg.de> writes:
>>> >The main target for session multiplexing will be layered multicast
>>> e.g. over
>>> >a wireless broadcast/multicast channel as DVB-H. In that case 
>>> >somebody
>>> may
>>> >rely or may not rely on the frame rate. Anyway, when having SVC
>>> conformance
>>> >in mind, a session carrying an enhancement layer must have or must
>>> enhance a
>>> >lower session to the same or a higher frame rate than contained in 
>>> >the
>>> lower
>>> >session.
>>>
>>> Ok, though that just says that a major use of this will do that.
>>
>>Yes, and it further says that we may not need to cover the case 
>>described by Yann. And so, we may end the "frame rate discussion".
>
>But I disagree with that as a requirement for SVC:  Why must 
>an enhancement layer have the same or higher frame rate?  Even 
>ignoring the non-continuous/variable framerate issues (and 
>they exist too), I would think that there would be use for 
>something like a 60Hz SD layer with a 24Hz hi-resolution 
>layer.  I'm not sure of that, though.
>
>But this is straying from my main point, which is you need to 
>consider (or mandate no use of) variable framerates and 
>skipping during encoding.
>
>>> Important
>>> to remember, but as with most protocols it's better to not 
>tie it too 
>>> closely to one particular application.
>>
>>Sure, the current approach (section 8.1.1 in 
>>draft-ietf-avt-rtp-svc-07.txt) is supporting all use cases.
>
>This is where my not following the details of the draft hurts 
>me.  I'll try to catch up.  8.1.1. mandates no use of CL-DON.
>
>BTW, typo in 8.1.1: "decdoding oreder"
>
>>> >Another point: If we are talking about telephony, we typically do 
>>> >not
>>> have
>>> >frames with decoding order different from presentation order. This 
>>> >is
>>> due to
>>> >the mentioned delay constraints.
>>>
>>> Face-to-Face Video Telephony normally wouldn't run CL-DON 
>type modes, 
>>> true.  However other applications with variable frame rate 
>would, for 
>>> example security cameras, and not every source of a video 
>signal is a 
>>> fixed-framerate capture device; the "video" could be an 
>>> algorithmically-generated visual stream, for example for 
>>> visualization uses, and scalability might be very useful there when 
>>> the receivers (power or bandwidth) vary, or if it's a conference.
>>
>>Why should CL-DON be run by use cases different from video 
>telephony? I 
>>do not see the point. The discussion is about "why do we need CL-DON, 
>>when
>>8.1.1 covers all cases?" The slides, sent out by Yann yesterday, show 
>>that even in packet loss scenarios 8.1.1 works.
>
>Since I'm not up to speed on 8.1.1 (which, BTW, is still more 
>than a bit confusing, even with the example), I can't comment, 
>other than to say that no solution should require 'fixed' 
>framerates, and I don't think (but am not 100% sure) that 
>higher layers should be *required* to have same or higher 
>framerates than lower layers.
>
>--
>Randell Jesup
>rjesup at wgate.com
>"The fetters imposed on liberty at home have ever been forged 
>out of the weapons provided for defence against real, 
>pretended, or imaginary dangers from abroad."
>		- James Madison, 4th US president (1751-1836)
>
>
_______________________________________________
Audio/Video Transport Working Group
avt at ietf.org
http://www.ietf.org/mailman/listinfo/avt