[rtcweb] Number of samples (ptime) to be supported by required codecs (draft-ietf-rtcweb-audio-05)

Magnus Westerlund <magnus.westerlund@ericsson.com> Tue, 18 February 2014 08:59 UTC

Message-ID: <530320F7.4090300@ericsson.com>
Date: Tue, 18 Feb 2014 09:59:35 +0100
From: Magnus Westerlund <magnus.westerlund@ericsson.com>
To: "rtcweb@ietf.org" <rtcweb@ietf.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/rtcweb/31j3JovsKWrXtdvXjBNmMKdesmg
Subject: [rtcweb] Number of samples (ptime) to be supported by required codecs (draft-ietf-rtcweb-audio-05)
Precedence: list

Hi,
(as individual)

I just reviewed the -05 of the audio draft and realized that it removed
all discussion of what packetization times are expected to be supported
by an implementation. For Opus this is not that difficult, as the range
is limited to multiples of the audio frames it can produce.

The current edit, I think, comes from Jean-Marc and Cullen's private
discussion, whose outcome was communicated to the list on 2014-01-31.
The main part of the message reads:

> We keep the part about what happens with RTP in
> draft-ietf-rtcweb-audio but move the parts about SDP off to JSEP. I
> think that means all we need here is basically MUST implement G.711 &
> Opus along with their RTP payload formats.
> 
> The ranges of size of packets, frames  and other things seem to be
> adequately covered by the specs for the codecs and WebRTC is not
> chaining theses codecs so seems good enough. The JSEP draft that is
> pointing at all the parts of SDP that need to be supported can deal
> with the ptime and maxptime in SDP.

I didn't get to comment on this immediately as I went on vacation. But
here is my follow-up on this thread. If you don't want to read what the
existing specs say and the background motivations, jump to the end and
read from "Trying to conclude:".

First of all, let's investigate what the two specs say about
packetization time.

Opus:
http://tools.ietf.org/id/draft-ietf-payload-rtp-opus-01.txt

4.2.  Payload Structure

   The Opus encoder can be set to output encoded frames representing
   2.5, 5, 10, 20, 40, or 60 ms of speech or audio data.  Further, an
   arbitrary number of frames can be combined into a packet.  The
   maximum packet length is limited to the amount of encoded data
   representing 120 ms of speech or audio data.

Section 6.1

   maxptime:  the decoder's maximum length of time in milliseconds
      rounded up to the next full integer value represented by the media
      in a packet that can be encapsulated in a received packet
      according to Section 6 of [RFC4566].  Possible values are 3, 5,
      10, 20, 40, and 60 or an arbitrary multiple of Opus frame sizes
      rounded up to the next full integer value up to a maximum value of
      120 as defined in Section 4.  If no value is specified, 120 is
      assumed as default.  This value is a recommendation by the
      decoding side to ensure the best performance for the decoder.  The
      decoder MUST be capable of accepting any allowed packet sizes to
      ensure maximum compatibility.

   ptime:  the decoder's recommended length of time in milliseconds
      rounded up to the next full integer value represented by the media
      in a packet according to Section 6 of [RFC4566].  Possible values
      are 3, 5, 10, 20, 40, or 60 or an arbitrary multiple of Opus frame
      sizes rounded up to the next full integer value up to a maximum
      value of 120 as defined in Section 4.  If no value is specified,
      20 is assumed as default.  If ptime is greater than maxptime,
      ptime MUST be ignored.  This parameter MAY be changed during a
      session.  This value is a recommendation by the decoding side to
      ensure the best performance for the decoder.  The decoder MUST be
      capable of accepting any allowed packet sizes to ensure maximum
      compatibility.

   minptime:  the decoder's minimum length of time in milliseconds
      rounded up to the next full integer value represented by the media
      in a packet that SHOULD be encapsulated in a received packet
      according to Section 6 of [RFC4566].  Possible values are 3, 5,
      10, 20, 40, and 60 or an arbitrary multiple of Opus frame sizes
      rounded up to the next full integer value up to a maximum value of
      120 as defined in Section 4.  If no value is specified, 3 is
      assumed as default.  This value is a recommendation by the
      decoding side to ensure the best performance for the decoder.  The
      decoder MUST be capable to accept any allowed packet sizes to
      ensure maximum compatibility.


Thus, I agree that for Opus this is well-defined. A receiver MUST support
any combination of frames that the encoder can produce up to a total of
120 ms. The format also has well-defined usage of ptime and maxptime, and
additionally defines a minptime.
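
To make the Opus numbers above concrete, here is a minimal sketch (my
own illustration, not text from either draft) of how a packet's duration
maps onto the integer ptime values quoted above. The function name is
hypothetical, and it assumes all frames in a packet have the same
duration.

```python
# Frame durations from Section 4.2 of the Opus payload draft, in microseconds.
OPUS_FRAME_US = (2500, 5000, 10000, 20000, 40000, 60000)
# Upper limit on the audio carried in one packet (120 ms).
MAX_PACKET_US = 120000


def opus_packet_ptime(frame_us, n_frames):
    """Return the packet duration in ms, rounded up to the next integer.

    Hypothetical helper: assumes every frame in the packet has the same
    duration, and rejects combinations that exceed 120 ms.
    """
    if frame_us not in OPUS_FRAME_US:
        raise ValueError("not an Opus frame duration")
    total_us = frame_us * n_frames
    if n_frames < 1 or total_us > MAX_PACKET_US:
        raise ValueError("packet must carry between one frame and 120 ms")
    # Integer ceiling division: 2.5 ms is signalled as 3, 7.5 ms as 8.
    return -(-total_us // 1000)


print(opus_packet_ptime(2500, 1))   # a single 2.5 ms frame
print(opus_packet_ptime(20000, 3))  # three 20 ms frames
```

The rounding step is why the drafts list 3 rather than 2.5 as a possible
value.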

Let's then look at G.711:

This is the whole PCMA and PCMU payload format definition in RFC3551:

4.5.14 PCMA and PCMU

   PCMA and PCMU are specified in ITU-T Recommendation G.711.  Audio
   data is encoded as eight bits per sample, after logarithmic scaling.
   PCMU denotes mu-law scaling, PCMA A-law scaling.  A detailed
   description is given by Jayant and Noll [15].  Each G.711 octet SHALL
   be octet-aligned in an RTP packet.  The sign bit of each G.711 octet
   SHALL correspond to the most significant bit of the octet in the RTP
   packet (i.e., assuming the G.711 samples are handled as octets on the
   host machine, the sign bit SHALL be the most significant bit of the
   octet as defined by the host machine format).  The 56 kb/s and 48
   kb/s modes of G.711 are not applicable to RTP, since PCMA and PCMU
   MUST always be transmitted as 8-bit samples.

   See Section 4.1 regarding silence suppression.

This doesn't say anything about the packetization. Fortunately, Section
4.2 of RFC 3551 does talk about this, and as the RTP/SAVPF profile used
by WebRTC derives from RTP/AVP (RFC 3551), this does apply.

4.2  Operating Recommendations

   The following recommendations are default operating parameters.
   Applications SHOULD be prepared to handle other values.  The ranges
   given are meant to give guidance to application writers, allowing a
   set of applications conforming to these guidelines to interoperate
   without additional negotiation.  These guidelines are not intended to
   restrict operating parameters for applications that can negotiate a
   set of interoperable parameters, e.g., through a conference control
   protocol.

   For packetized audio, the default packetization interval SHOULD have
   a duration of 20 ms or one frame, whichever is longer, unless
   otherwise noted in Table 1 (column "ms/packet").  The packetization
   interval determines the minimum end-to-end delay; longer packets
   introduce less header overhead but higher delay and make packet loss
   more noticeable.  For non-interactive applications such as lectures
   or for links with severe bandwidth constraints, a higher
   packetization delay MAY be used.  A receiver SHOULD accept packets
   representing between 0 and 200 ms of audio data.  (For framed audio
   encodings, a receiver SHOULD accept packets with a number of frames
   equal to 200 ms divided by the frame duration, rounded up.)  This
   restriction allows reasonable buffer sizing for the receiver.

As one can see, this recommends that 20 ms be supported by default, and
that receivers be capable of handling up to 200 ms.

So, for PCMA and PCMU the picture is less clear: there are
recommendations, but no hard requirements. Also, they are sample-based
codecs and thus can produce payloads of any length (bytes and samples).
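
As a rough illustration of what "sample based" means here: with G.711
at 8000 Hz and one octet per sample, payload size and duration convert
directly into each other. The helper names below are hypothetical, and
the 200 ms bound is the RFC 3551 recommendation, not a hard limit.

```python
G711_SAMPLE_RATE_HZ = 8000  # PCMA/PCMU clock rate; one octet per sample
MAX_RECEIVE_MS = 200        # RFC 3551 Section 4.2 "SHOULD accept" bound


def g711_payload_bytes(ptime_ms):
    """Payload size in octets for a given packetization time."""
    return G711_SAMPLE_RATE_HZ * ptime_ms // 1000


def g711_duration_ms(payload_bytes):
    """Duration of audio represented by a payload of the given size."""
    return payload_bytes * 1000 / G711_SAMPLE_RATE_HZ


def receiver_should_accept(payload_bytes, max_ms=MAX_RECEIVE_MS):
    """True if the payload falls within the recommended 0-200 ms range."""
    return 0 < g711_duration_ms(payload_bytes) <= max_ms


for ptime in (10, 20, 40, 60, 200):
    print(ptime, "ms ->", g711_payload_bytes(ptime), "octets")
```

Note that nothing stops a sender from emitting, say, 161 octets; any
octet count is a valid G.711 payload, which is exactly why the
requirements need to come from somewhere else.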

When it comes to the signalling, we do have ptime and maxptime defined in
the base spec of SDP [RFC4566]:

      a=ptime:<packet time>

         This gives the length of time in milliseconds represented by
         the media in a packet.  This is probably only meaningful for
         audio data, but may be used with other media types if it makes
         sense.  It should not be necessary to know ptime to decode RTP
         or vat audio, and it is intended as a recommendation for the
         encoding/packetisation of audio.  It is a media-level
         attribute, and it is not dependent on charset.

      a=maxptime:<maximum packet time>

         This gives the maximum amount of media that can be encapsulated
         in each packet, expressed as time in milliseconds.  The time
         SHALL be calculated as the sum of the time the media present in
         the packet represents.  For frame-based codecs, the time SHOULD
         be an integer multiple of the frame size.  This attribute is
         probably only meaningful for audio data, but may be used with
         other media types if it makes sense.  It is a media-level
         attribute, and it is not dependent on charset.  Note that this
         attribute was introduced after RFC 2327, and non-updated
         implementations will ignore this attribute.

Thus, these can be used to provide a single recommended packetization
interval and an upper limit if supported.
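
For illustration only, a minimal sketch of reading these two attributes
out of one SDP media section. The function name and the example media
section are made up for this message, and the sketch assumes integer
millisecond values.

```python
def parse_ptime_attrs(sdp_media_section):
    """Return (ptime, maxptime) in ms; each is None if not present.

    Hypothetical helper: assumes integer values and a single media section.
    """
    values = {"ptime": None, "maxptime": None}
    for line in sdp_media_section.splitlines():
        line = line.strip()
        for name in values:
            prefix = "a=" + name + ":"
            if line.startswith(prefix):
                values[name] = int(line[len(prefix):])
    return values["ptime"], values["maxptime"]


# Made-up example media section offering PCMU and PCMA.
EXAMPLE_MEDIA = """m=audio 49170 RTP/SAVPF 0 8
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=ptime:20
a=maxptime:60
"""

print(parse_ptime_attrs(EXAMPLE_MEDIA))
```

Both attributes are media-level, so they apply to every payload type on
the m= line; there is no way to give one value per codec.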

The fact that ptime can only indicate a single packetization interval
becomes a potential issue, as you can't determine a remote peer's
preferences for other intervals if a WebRTC endpoint would like to modify
its packet rate for congestion control reasons. Changing the
packetization interval is one of the tools that gives the most
significant bit-rate change for audio, and it can even be applied
without changing the encoding rate, something crucial for doing any
bit-rate adaptation for G.711.
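
To put rough numbers on this, here is a back-of-the-envelope sketch for
G.711. It assumes IPv4 + UDP + RTP headers of 20 + 8 + 12 = 40 octets
per packet, and ignores the SRTP authentication tag, RTP header
extensions and lower-layer framing, so real figures will be somewhat
higher.

```python
G711_PAYLOAD_BPS = 64000   # 8000 samples/s * 8 bits, independent of ptime
HEADER_BYTES = 20 + 8 + 12  # assumed IPv4 + UDP + RTP, no SRTP tag or extensions


def g711_total_bitrate_bps(ptime_ms, header_bytes=HEADER_BYTES):
    """Approximate on-the-wire bit rate for G.711 at a given ptime."""
    packets_per_second = 1000 / ptime_ms
    return G711_PAYLOAD_BPS + packets_per_second * header_bytes * 8


for ptime in (10, 20, 40, 60):
    print(ptime, "ms:", round(g711_total_bitrate_bps(ptime)), "bit/s")
```

Under these assumptions, going from 10 ms to 60 ms packets takes the
total from 96 kbit/s to roughly 69 kbit/s while the encoding rate stays
at 64 kbit/s.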

For your information, JSEP currently does not discuss packetization
times or the ptime or maxptime SDP parameters at all.

Trying to conclude:

I see an issue in that we don't provide firmer requirements on which
packetization intervals should be supported by a WebRTC receiver.

I would propose that we actually write into the audio draft in general
that a WebRTC endpoint SHALL support receiving audio RTP payloads that
contain up to 200 ms of audio if the RTP payload format supports it.

When it comes to sending, I would also like to provide some minimal
requirements. These may need to be on a per-codec basis, and I think it
is G.711 that is lacking here. Thus, I think a WebRTC endpoint SHALL be
capable of producing RTP payloads with the following packetization
times: 10, 20, 40, 60 ms.

I also think we should formalize the requirement to support the ptime
and maxptime signalling to maximize the possibility for interop with any
legacy systems.
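
As a sketch of how a sender could combine the proposed set of
packetization times with the peer's signalled ptime and maxptime: the
selection policy below (nearest supported value, 20 ms fallback, ptime
ignored when it exceeds maxptime) is only my assumption for
illustration, not something any of the drafts specify for G.711.

```python
SENDER_PTIMES_MS = (10, 20, 40, 60)  # the sending set proposed above
DEFAULT_PTIME_MS = 20                # RFC 3551 default packetization


def choose_send_ptime(remote_ptime=None, remote_maxptime=None,
                      supported=SENDER_PTIMES_MS):
    """Pick a packetization time that respects the peer's maxptime.

    Hypothetical policy: take the supported value nearest to the peer's
    ptime, preferring the shorter one on a tie.
    """
    candidates = [p for p in supported
                  if remote_maxptime is None or p <= remote_maxptime]
    if not candidates:
        raise ValueError("no supported packetization time fits maxptime")
    preferred = remote_ptime
    # Assumption borrowed from the Opus text: ignore ptime above maxptime.
    if preferred is None or (remote_maxptime is not None
                             and preferred > remote_maxptime):
        preferred = DEFAULT_PTIME_MS
    return min(candidates, key=lambda p: (abs(p - preferred), p))


print(choose_send_ptime(remote_ptime=20))
print(choose_send_ptime(remote_ptime=30, remote_maxptime=40))
```

A congestion controller could then move between the remaining
candidates without any further signalling, as long as it stays under
maxptime.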

I do see a need for the audio draft to discuss the potential issues here
that can affect interoperability.

Cheers

Magnus Westerlund
(As individual)

----------------------------------------------------------------------
Services, Media and Network features, Ericsson Research EAB/TXM
----------------------------------------------------------------------
Ericsson AB                 | Phone  +46 10 7148287
Färögatan 6                 | Mobile +46 73 0949079
SE-164 80 Stockholm, Sweden | mailto: magnus.westerlund@ericsson.com
----------------------------------------------------------------------
