Re: [AVT] Submission and request for feedback on draft-valin-celt-rtp-profile-00.txt
Randell Jesup wrote:
Noted. We missed that when changing some SHOULDs for MUSTs. We hesitated a
lot in deciding what's a SHOULD and what's a MUST. Basically, CELT's
advantage is that it can operate with almost any sampling rate, frame size
or bit-rate. The disadvantage is that unless we specify "baseline
requirements", we might end up with several implementations that are unable
to inter-operate. Also, feel free to suggest a better baseline if you think
we didn't select the right one.
Is this intended to be used as a "speech" codec at all; as an alternative
to iLBC/G.729/G.722.x/etc? If so, then support for 8 kHz and/or 16 kHz may
be important to mandate for interoperability reasons.
CELT is (so far at least) not intended for lower sampling rates like 8
kHz or 16 kHz and doesn't operate in the same space as the codecs above
or Speex. It's closer to codecs such as AAC-LD, G.722.1C, G.719, and
ULD, though only ULD has a delay as short as CELT's.
You DON'T want to be
mixing sample rates if multiple codecs are accepted. (In theory it can be
done, but in practice it would be risky, especially in the face of packet
loss.) For example, if you have this:
Random example (I probably have the G722 media type wrong):
m=audio 4321 RTP/AVP 0 97 98
a=rtpmap:0 PCMU/8000
a=rtpmap:97 G722/16000
a=rtpmap:98 CELT/48000
To quote from an earlier AVT email I wrote on this subject on 3 Dec 2007:
Subject: Re: [AVT] I-D ACTION:draft-ietf-avt-rtcpxr-audio-01.txt
[big SNIP]
This means the timestamp rate can change at any point, on a
packet-by-packet basis. It's even theoretically allowable to alternate
G711 and G722 packets. Totally odd and non-useful, but it illustrates
the point. More realistic is a change from one to the other half-way
through an RTCP monitoring period.
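
To make the timestamp issue concrete, here is a minimal sketch (not from the thread) of how far the RTP timestamp advances per packet for each payload type in the example SDP. The clock rates are copied from the rtpmap lines as written, including the G722 line the author flags as uncertain; the function names are made up for illustration.

```python
# Clock rates as they appear in the example SDP (payload type -> Hz).
CLOCK_RATE_HZ = {0: 8000, 97: 16000, 98: 48000}

def timestamp_increment(payload_type: int, duration_ms: int) -> int:
    """RTP timestamp ticks covered by duration_ms of audio for this payload type."""
    return CLOCK_RATE_HZ[payload_type] * duration_ms // 1000

def advance(timestamp: int, payload_type: int, duration_ms: int) -> int:
    """Timestamp of the next packet; RTP timestamps wrap at 32 bits."""
    return (timestamp + timestamp_increment(payload_type, duration_ms)) & 0xFFFFFFFF

# The same 20 ms of audio advances the timestamp by a different amount
# depending on which payload type carried it.
for pt in (0, 97, 98):
    print(pt, timestamp_increment(pt, 20))
```

A receiver that sees only the timestamp difference between two packets cannot turn it into elapsed time unless it knows which clock rate applied to every packet in between, which is what makes lost packets risky when rates are mixed.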
So you mean that you'd need to maintain a coherent timestamp despite
changing between codecs that have different sampling rates? I wasn't
aware of that, so I guess it's something that needs to be addressed.
Maybe just by saying that different sampling rates SHOULD NOT be used
with the same m=
The wording there is very lawyer-ese, and really addressing how it's
affected by a multicast or conference setting.
If the b=AS is at the m= level of the SDP (not above all the m='s), then it
only applies to that one media stream. However, exactly what b=AS *means*
is very fuzzy. b=AS is not a codec parameter; it's a stream parameter, and
it's also a reception parameter, not an "I plan to send" parameter.
There's been a lot of discussion about b=TIAS (RFC 3890) as a better way to
specify bandwidth. Note that b=AS INCLUDES RTP/UDP/IP overhead, and thus
implicitly is dependent on packet rate.
I've seen many G.711 implementations using b=AS:64, so I thought we
could use it in a way that excludes the overhead.
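
As a rough worked example of why the overhead matters (a sketch, not from the thread), the following computes the bandwidth of a stream including IPv4, UDP and RTP headers. It assumes IPv4 without options and a fixed 12-byte RTP header with no CSRC list or extensions; the function name is made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12   # assumes no CSRC list and no header extension

def bandwidth_with_overhead_kbps(codec_kbps: float, ptime_ms: float) -> float:
    """Total IP-level bandwidth for a constant-rate codec at a given packet duration."""
    payload_bytes = codec_kbps * 1000 / 8 * ptime_ms / 1000
    packet_bytes = payload_bytes + IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES
    packets_per_second = 1000 / ptime_ms
    return packet_bytes * 8 * packets_per_second / 1000

# G.711 at 64 kbit/s: the total depends on how often packets are sent.
print(bandwidth_with_overhead_kbps(64, 20))  # 20 ms packets
print(bandwidth_with_overhead_kbps(64, 10))  # 10 ms packets
```

Under these assumptions, 64 kbit/s G.711 needs 80 kbit/s with 20 ms packets and 96 kbit/s with 10 ms packets, so a value of 64 can only describe the codec payload, not the stream.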
More to the point, b=AS is just one way to specify bandwidth.
Another huge blocker for using b=AS (or b=anything) in this way: what if
another codec also offered in this stream wanted to re-use b=AS as well?
And what if the preferred bitrates (or max-bitrate) for each codec was
different?
How much do you *need* to specify bandwidth here? Realize that most
devices don't have a good idea what receive bandwidth is even theoretically
available, let alone practically. Most configuration is done at the sender
end, or by explicit choice of codec and bitrate.
I suggest reviewing how other multiple-bitrate codecs like G.722.x handle
this (AMR-WB, etc).
Also, isn't CELT true variable-bitrate? If so, the bitrate to use
(initially?) might be very different than the "maximum" bitrate.
CELT can change bit-rate at any time, but so far it only changes based
on what the sender wants to use, i.e. to adapt to congestion. Even if
used with b=AS:, I was thinking that it would be more like a max
bit-rate anyway.
Well, I didn't see that to be a problem considering that one would probably
want the same packetization time for any codec. Do you see a case where you
wouldn't want that? I'm not quite sure how people use the ptime in
practice and how much it is followed.
Sure. You might prefer 10ms, which G.711 and some others support, but iLBC
only supports 20 and 30 ms frames, and thus only multiples of those for the
actual packetization time. You can still specify a ptime of 10 when using
iLBC.
ptime is merely an "I would prefer to receive" parameter. Do not rely on it
for anything. The actual packetization time does NOT have to be the same
in each direction, or even the same from one RTP packet to the next.
(packet 1 could have 1 frame and packet 2 could have 10).
iLBC is really negotiating the framesize, not the packetization time.
Well, I was thinking of using the ptime just for a preference. You
specify the frame size and ptime helps decide how many frames get sent
per packet.
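
One possible way to turn a ptime preference into a frame count, sketched here as an assumption rather than anything the draft specifies: round down to a whole number of frames, but always send at least one. The function name is hypothetical.

```python
import math

def frames_per_packet(ptime_ms: float, frame_ms: float) -> int:
    """Largest whole number of frames that fits in ptime_ms, never fewer than one."""
    return max(1, math.floor(ptime_ms / frame_ms))

# A ptime of 10 with 20 ms frames (the iLBC case above) still yields one
# frame per packet, so the actual packet duration is 20 ms, not 10 ms.
print(frames_per_packet(10, 20))
print(frames_per_packet(20, 5))
```

Since ptime is only a preference, a sender could equally round up; the point is that the frame size, not ptime, fixes the granularity.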
Ok, but this really isn't part of the codec. Doesn't hurt to tell people.
(I'm assuming that the SDP spec for maxptime allows ignoring it (somewhat)
- if it doesn't, then you have to reject that payload.)
As far as I understand, maxptime is a SHOULD in the rfc, so I thought we'd
just mention its interpretation wrt CELT.
Ok, then say something like "per [RFC 4566], if the maximum is lower..."
OK, will do that.
OK, so maybe we should add a BNF. As for the examples, I think they're
adequate, but let me know if that's not the case.
BNF may not be *needed*, but if there's anything at all complex it's handy for
avoiding mistakes.
Noted.
The fundamental issue here is that one needs to know the frame size to be
able to initialise the decoder. Just as a codec like iLBC has two modes
(for 20 ms and 30 ms frames), CELT has a *very large* number of modes: one
for each combination of frame size and sampling rate. So the idea was that
one side offers a list of frame sizes and the other side responds with the
one it likes best and both sides use that. There is no way to decode media
without knowing the frame size. Any idea what's the best way to handle
that?
a) use different payloads. Clear, easy, wastes space in SDP
b) include framesize in the bitstream in *every* packet. Clear, easy,
no-fuss, wastes some bandwidth all the time. And isn't this required
anyways if there's more than one channel?
c) ignore media until you get an answer with a clear selection. Clear,
easy, could be a major loss of media on a delayed answer.
d) require media sent (at least until acknowledgment of an answer) be
clearly distinguishable - i.e. do not vary framesize from the offered
values until an ACK, and do not allow packet sizes that are common
multiples of offered framesizes.
For example, if you offer (say) 20 and 30, do not send a packet
containing 60, 120, etc. You can send 20, 30, 40, 80, 90, 100, since
they can't be mis-understood.
Complex, doesn't waste bits, artificial constraints, no adaptation to
congestion/etc until ACK.
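
The rule in option (d) can be checked mechanically. The sketch below (helper names are made up for illustration) treats a packet duration as safe only if exactly one offered frame size divides it evenly.

```python
from typing import List

def possible_frame_sizes(packet_ms: int, offered_ms: List[int]) -> List[int]:
    """Offered frame sizes that could have produced a packet of this duration."""
    return [frame for frame in offered_ms if packet_ms % frame == 0]

def is_unambiguous(packet_ms: int, offered_ms: List[int]) -> bool:
    """True when exactly one offered frame size fits the packet duration."""
    return len(possible_frame_sizes(packet_ms, offered_ms)) == 1

offered = [20, 30]
for duration in (20, 30, 40, 60, 80, 90, 100, 120):
    print(duration, possible_frame_sizes(duration, offered), is_unambiguous(duration, offered))
```

With an offer of 20 and 30, this flags 60 and 120 as ambiguous and accepts 20, 30, 40, 80, 90 and 100, matching the example above.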
OK, I'll need to give this a bit more thought.
We're also considering having a "configuration" packet to be sent at the
beginning of any stream and that includes even more mode-specific data to
increase flexibility. However, we haven't found a good way of doing that
yet (wrt loss of the configuration packet). Any thought on that?
Yes: you can't assume 0-loss. Also, if possible, the bitstream should be
decodable without the out-of-band channel information. Video people have
struggled with this with sprop-parameter-sets in H.264 (RFC 3984).
Downside will be bandwidth used. You can amortize the overhead by sending
the config packets only periodically, but you'll probably need to send
them reasonably often to allow mid-stream join (think conferences).
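
To get a feel for the trade-off, here is a back-of-the-envelope sketch. The 60-byte configuration size is purely hypothetical (the thread gives no figure), and the 40 bytes of per-packet headers assume IPv4/UDP/RTP without options or extensions.

```python
def config_overhead_kbps(config_bytes: int, interval_s: float, header_bytes: int = 40) -> float:
    """Extra bandwidth used by resending one configuration packet every interval_s seconds."""
    return (config_bytes + header_bytes) * 8 / interval_s / 1000

# Hypothetical 60-byte configuration payload, resent at different intervals.
# Ignoring loss, a late joiner waits at most one interval for the next copy.
for interval in (0.1, 1.0, 5.0):
    print(interval, config_overhead_kbps(60, interval))
```

With these assumed numbers, resending once per second costs under 1 kbit/s, while resending every 100 ms costs 8 kbit/s, which is noticeable next to a low-bitrate audio stream.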
Yes, I'm aware of the loss problem and I'm not sure how to handle that.
I'll have a look at the RFC you mention.
I don't see a reason to make it different from other codecs considering
that CELT can handle about any ptime. I wouldn't mind doing it though if
there's a use case for it.
Then specify a value, and let CELT round up (or down) as needed.
It might be nice to show how this might interact with alternative payloads
that might want multiples of 10.
I guess the rounding would just be a bit more. Will add an example.
Actually, we're not yet sure the reduced overhead is worth the trouble of
adding another layout. Any thought?
Probably not. How big a saving is it? This mandates (effectively) CBR.
Too bad there's no way in-stream to know when you've finished decoding
the channel (stop token or the like). Then you don't need channel
sizes at all (though perhaps you might want them).
Well, the stop token would waste about the same space as the size value...
Well, we have the same fundamental issue everywhere. If we don't know what
frame size was selected, we cannot decode anything. So we always need the
answer. The only way I can see to go around that would be to use a
different payload type for every parameter combination, but I think that
would be ugly.
You really *need* to handle the media-before-answer case somehow, even if
it's separate payloads.
Hadn't realised it was an issue. Need to give more thought to this.
Other suggestions welcome.
That would be (I assume) really 1 byte per channel per frame, not 1 byte
per frame.
Well, technically you can increase the bit-rate by one byte for only one
channel...
I meant that the change rate is one byte per channel per frame (max). This
implies the frame-size rate of change could be anywhere in the range of
plus/minus the number of channels (in bytes).
I assume the decoder would handle large frame gaps with corresponding large
changes in framesize
Basically, any channel of any frame can have a different size and change
arbitrarily from one frame to the next. You can have one channel jumping
from 64 to 128 kbps while the other goes from 96 to 48 kbps. That's why
there's one byte per channel per frame used for the size.
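
As an illustration of the sizes involved (a sketch under stated assumptions, not taken from the draft), the following converts a per-channel bit-rate into bytes per frame. The 256-sample frame at 48 kHz is a hypothetical choice, and the sketch assumes the size byte counts payload bytes directly.

```python
def bytes_per_frame(bitrate_bps: int, frame_samples: int, sample_rate_hz: int) -> float:
    """Average payload bytes per frame for one channel at the given bit-rate."""
    return bitrate_bps * frame_samples / sample_rate_hz / 8

# Hypothetical mode: 256-sample frames at 48 kHz (about 5.3 ms per frame).
for kbps in (48, 64, 96, 128):
    size = bytes_per_frame(kbps * 1000, 256, 48000)
    # A single unsigned byte can hold values up to 255.
    print(kbps, round(size, 2), size <= 255)
```

Under these assumptions, every bit-rate mentioned above fits in a single size byte, and a channel jumping from 64 to 128 kbps roughly doubles its per-frame size from one frame to the next.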
Cheers,
Jean-Marc