[rtcweb] Number of samples (ptime) to be supported by required codecs (draft-ietf-rtcweb-audio-05)
Magnus Westerlund <magnus.westerlund@ericsson.com> Tue, 18 February 2014 08:59 UTC
Message-ID: <530320F7.4090300@ericsson.com>
Date: Tue, 18 Feb 2014 09:59:35 +0100
From: Magnus Westerlund <magnus.westerlund@ericsson.com>
To: "rtcweb@ietf.org" <rtcweb@ietf.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/rtcweb/31j3JovsKWrXtdvXjBNmMKdesmg
Hi,

(as individual)

I just reviewed the -05 of the audio draft and realized that it removed all discussion of what packetization times are expected to be supported by an implementation. For Opus this is not that difficult, as the range is limited to multiples of the audio frames it can produce.

The current edit, I think, comes from Jean-Marc and Cullen's private discussion, whose outcome was communicated to the list on 2014-01-31. The main part of the message reads:

> We keep the part about what happens with RTP in draft-ietf-rtcweb-audio but move the parts about SDP off to JSEP. I think that means all we need here is basically MUST implement G.711 & Opus along with their RTP payload formats.
>
> The ranges of size of packets, frames and other things seem to be adequately covered by the specs for the codecs and WebRTC is not chaining theses codecs so seems good enough. The JSEP draft that is pointing at all the parts of SDP that need to be supported can deal with the ptime and maxptime in SDP.

I didn't get to comment on this immediately as I went on vacation, but here is my follow-up on this thread. If you don't want to read what the existing specs say and the background motivations, jump to the end and read from "Trying to conclude:".

First of all, let's investigate what the two specs say about packetization time.

Opus: http://tools.ietf.org/id/draft-ietf-payload-rtp-opus-01.txt

4.2. Payload Structure

   The Opus encoder can be set to output encoded frames representing 2.5, 5, 10, 20, 40, or 60 ms of speech or audio data. Further, an arbitrary number of frames can be combined into a packet. The maximum packet length is limited to the amount of encoded data representing 120 ms of speech or audio data.

Section 6.1

   maxptime: the decoder's maximum length of time in milliseconds rounded up to the next full integer value represented by the media in a packet that can be encapsulated in a received packet according to Section 6 of [RFC4566].
   Possible values are 3, 5, 10, 20, 40, and 60 or an arbitrary multiple of Opus frame sizes rounded up to the next full integer value up to a maximum value of 120 as defined in Section 4. If no value is specified, 120 is assumed as default. This value is a recommendation by the decoding side to ensure the best performance for the decoder. The decoder MUST be capable of accepting any allowed packet sizes to ensure maximum compatibility.

   ptime: the decoder's recommended length of time in milliseconds rounded up to the next full integer value represented by the media in a packet according to Section 6 of [RFC4566]. Possible values are 3, 5, 10, 20, 40, or 60 or an arbitrary multiple of Opus frame sizes rounded up to the next full integer value up to a maximum value of 120 as defined in Section 4. If no value is specified, 20 is assumed as default. If ptime is greater than maxptime, ptime MUST be ignored. This parameter MAY be changed during a session. This value is a recommendation by the decoding side to ensure the best performance for the decoder. The decoder MUST be capable of accepting any allowed packet sizes to ensure maximum compatibility.

   minptime: the decoder's minimum length of time in milliseconds rounded up to the next full integer value represented by the media in a packet that SHOULD be encapsulated in a received packet according to Section 6 of [RFC4566]. Possible values are 3, 5, 10, 20, 40, and 60 or an arbitrary multiple of Opus frame sizes rounded up to the next full integer value up to a maximum value of 120 as defined in Section 4. If no value is specified, 3 is assumed as default. This value is a recommendation by the decoding side to ensure the best performance for the decoder. The decoder MUST be capable to accept any allowed packet sizes to ensure maximum compatibility.

Thus, I agree that for Opus this is well-defined. A receiver MUST support any combination of frames that the encoder can produce, up to a total of 120 ms.
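
As a rough illustration of the Opus text quoted above, the following sketch enumerates the integer ptime values that result from packing frames into one packet and rounding up to whole milliseconds. It assumes that all frames in a packet have the same duration; the function names and the half-millisecond representation are choices made for this example, not anything taken from the payload draft.

```python
# Opus frame durations listed in the payload draft, expressed in units of
# 0.5 ms so that the 2.5 ms frame stays an exact integer.
FRAME_SIZES_HALF_MS = (5, 10, 20, 40, 80, 120)
MAX_PACKET_HALF_MS = 240  # the 120 ms per-packet upper bound


def opus_ptime_values():
    """Return the sorted integer ptime values (in ms) reachable by packing
    one or more equal-sized frames into a packet of at most 120 ms.

    Assumption for this sketch: every frame in a packet has the same size.
    """
    values = set()
    for frame in FRAME_SIZES_HALF_MS:
        total = frame
        while total <= MAX_PACKET_HALF_MS:
            # Round the packet duration up to the next full millisecond.
            values.add((total + 1) // 2)
            total += frame
    return sorted(values)


def is_valid_opus_ptime(ptime_ms):
    """Check whether an integer ptime value is one of the reachable values."""
    return ptime_ms in opus_ptime_values()


if __name__ == "__main__":
    print(opus_ptime_values())
```

Because the 2.5 ms frame divides every other frame size, the reachable set is simply each multiple of 2.5 ms up to 120 ms, rounded up.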
The Opus payload format also has well-defined usage of ptime and maxptime, and additionally defines a minptime.

Let's then look at G.711. This is the whole PCMA and PCMU payload format definition in RFC 3551:

4.5.14 PCMA and PCMU

   PCMA and PCMU are specified in ITU-T Recommendation G.711. Audio data is encoded as eight bits per sample, after logarithmic scaling. PCMU denotes mu-law scaling, PCMA A-law scaling.

   A detailed description is given by Jayant and Noll [15]. Each G.711 octet SHALL be octet-aligned in an RTP packet. The sign bit of each G.711 octet SHALL correspond to the most significant bit of the octet in the RTP packet (i.e., assuming the G.711 samples are handled as octets on the host machine, the sign bit SHALL be the most significant bit of the octet as defined by the host machine format).

   The 56 kb/s and 48 kb/s modes of G.711 are not applicable to RTP, since PCMA and PCMU MUST always be transmitted as 8-bit samples.

   See Section 4.1 regarding silence suppression.

This doesn't say anything about the packetization. Fortunately, Section 4.2 of RFC 3551 does talk about this, and as the RTP/SAVPF profile used by WebRTC derives from RTP/AVP (RFC 3551), this does apply.

4.2 Operating Recommendations

   The following recommendations are default operating parameters. Applications SHOULD be prepared to handle other values. The ranges given are meant to give guidance to application writers, allowing a set of applications conforming to these guidelines to interoperate without additional negotiation. These guidelines are not intended to restrict operating parameters for applications that can negotiate a set of interoperable parameters, e.g., through a conference control protocol.

   For packetized audio, the default packetization interval SHOULD have a duration of 20 ms or one frame, whichever is longer, unless otherwise noted in Table 1 (column "ms/packet").
   The packetization interval determines the minimum end-to-end delay; longer packets introduce less header overhead but higher delay and make packet loss more noticeable. For non-interactive applications such as lectures or for links with severe bandwidth constraints, a higher packetization delay MAY be used.

   A receiver SHOULD accept packets representing between 0 and 200 ms of audio data. (For framed audio encodings, a receiver SHOULD accept packets with a number of frames equal to 200 ms divided by the frame duration, rounded up.) This restriction allows reasonable buffer sizing for the receiver.

As one can see, this recommends supporting 20 ms by default, and that receivers be capable of handling up to 200 ms. So, for PCMA and PCMU the picture is less clear: there are recommendations, but no hard requirements. Also, they are sample-based codecs and thus can produce payloads of any length (bytes and samples).

When it comes to the signalling, we do have ptime and maxptime defined in the base spec of SDP [RFC4566]:

   a=ptime:<packet time>

      This gives the length of time in milliseconds represented by the media in a packet. This is probably only meaningful for audio data, but may be used with other media types if it makes sense. It should not be necessary to know ptime to decode RTP or vat audio, and it is intended as a recommendation for the encoding/packetisation of audio. It is a media-level attribute, and it is not dependent on charset.

   a=maxptime:<maximum packet time>

      This gives the maximum amount of media that can be encapsulated in each packet, expressed as time in milliseconds. The time SHALL be calculated as the sum of the time the media present in the packet represents. For frame-based codecs, the time SHOULD be an integer multiple of the frame size. This attribute is probably only meaningful for audio data, but may be used with other media types if it makes sense. It is a media-level attribute, and it is not dependent on charset.
      Note that this attribute was introduced after RFC 2327, and non-updated implementations will ignore this attribute.

Thus, these can be used to provide a single recommended packetization interval and an upper limit, if supported. The fact that ptime can only indicate a single rate becomes a potential issue, as you can't determine a remote peer's preferences for other rates if a WebRTC endpoint would like to modify its rate for congestion control reasons. Changing the packetization rate is one of the tools that give the most significant bit-rate change for audio, and it can even be applied without changing the encoding rate, something crucial for doing any bit-rate adaptation for G.711.

For your notes, JSEP currently does not discuss packetization times or the ptime or maxptime SDP parameters at all.

Trying to conclude:

I see an issue in that we don't provide firmer requirements on what packetization intervals should be supported by a WebRTC receiver. I would propose that we actually write into the audio draft, in general, that a WebRTC endpoint SHALL support receiving audio RTP payloads that contain up to 200 ms of audio if the RTP payload format supports it.

When it comes to sending, I would also like to provide some minimal requirements. These may need to be on a per-codec basis, and I think it is G.711 that is lacking here. Thus, I think a WebRTC endpoint SHALL be capable of producing RTP payloads with the following packetization times: 10, 20, 40, 60 ms.

I also think we should formalize the requirement to support the ptime and maxptime signalling, to maximize the possibility for interop with any legacy systems.

I do see a need for the audio draft to discuss the potential issues here that can affect interoperability.
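
To put rough numbers on the point about packetization time as a bit-rate tool for G.711, here is a small sketch that computes the payload size and the IP-level bit rate for the four proposed sending intervals. G.711 at 8000 samples per second with one octet per sample follows from RFC 3551; the 40 bytes of per-packet overhead (IPv4 + UDP + RTP, with no SRTP authentication tag and no header extensions) is an assumption made for this example, as are the function names.

```python
G711_SAMPLE_RATE_HZ = 8000   # PCMA/PCMU clock rate
G711_BYTES_PER_SAMPLE = 1    # always sent as 8-bit samples

# Assumed per-packet overhead: IPv4 (20) + UDP (8) + RTP (12) bytes.
# An SRTP authentication tag or RTP header extensions would add to this.
ASSUMED_HEADER_BYTES = 20 + 8 + 12


def g711_payload_bytes(ptime_ms):
    """Payload size in bytes for one G.711 packet covering ptime_ms."""
    return G711_SAMPLE_RATE_HZ * ptime_ms // 1000 * G711_BYTES_PER_SAMPLE


def ip_bitrate_kbps(ptime_ms, header_bytes=ASSUMED_HEADER_BYTES):
    """Bit rate in kbit/s including the assumed per-packet headers."""
    packet_bits = (g711_payload_bytes(ptime_ms) + header_bytes) * 8
    # bits per millisecond is numerically equal to kbit/s
    return packet_bits / ptime_ms


if __name__ == "__main__":
    for ptime in (10, 20, 40, 60):
        print(f"ptime={ptime:3d} ms  payload={g711_payload_bytes(ptime):3d} B  "
              f"rate={ip_bitrate_kbps(ptime):.1f} kbit/s")
```

Under these assumptions, going from 10 ms to 60 ms packets reduces the total rate from 96 kbit/s to roughly 69 kbit/s while the codec rate itself stays at 64 kbit/s.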

Cheers

Magnus Westerlund
(As individual)

----------------------------------------------------------------------
Services, Media and Network features, Ericsson Research EAB/TXM
----------------------------------------------------------------------
Ericsson AB                 | Phone  +46 10 7148287
Färögatan 6                 | Mobile +46 73 0949079
SE-164 80 Stockholm, Sweden | mailto: magnus.westerlund@ericsson.com
----------------------------------------------------------------------