Internet Engineering Task Force                    Ari Lakaniemi, Nokia
Audio Video Transport WG                              Pasi Ojala, Nokia
INTERNET-DRAFT                                  Johan Sjöberg, Ericsson
February 23, 2001                           Magnus Westerlund, Ericsson
Expires: August 23, 2001


                     RTP payload format for AMR-WB


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/lid-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is an individual submission to the IETF. Comments
   should be directed to the authors.

Abstract

   This document specifies a real-time transport protocol (RTP) payload
   format for Adaptive Multi-Rate Wideband (AMR-WB) speech encoded
   signals. The AMR-WB payload format is designed to be able to
   interoperate with existing AMR-WB transport formats. This document
   also includes a MIME type registration for AMR-WB.

Lakaniemi/Ojala/Sjoberg/Westerlund                              [Page 1]

INTERNET-DRAFT        RTP Payload Format for AMR-WB    February 23, 2001

1. Introduction

   The Adaptive Multi-Rate Wideband (AMR-WB) speech codec [1] was
   originally developed by the Third Generation Partnership Project
   (3GPP) to be used in GSM and 3G systems. The AMR-WB codec will
   therefore be widely used in cellular systems. The AMR-WB codec is
   developed to preserve high speech quality under a wide range of
   transmission conditions.
   The AMR-WB codec is a multi-mode speech codec with 9 wideband speech
   coding modes with bit-rates between 6.6 and 23.85 kbps. The sampling
   frequency is 16000 Hz and processing is performed on 20 ms frames,
   i.e. 320 speech samples per frame. The AMR-WB modes are closely
   related to each other and employ the same coding framework.

   Mode adaptation functionality is one valuable aspect of the AMR-WB
   operation. In mobile radio systems (GSM) it allows the system to
   adapt the balance between speech coding and error protection to
   enable the best possible speech quality in prevailing transmission
   conditions. AMR-WB mode adaptation can also be used to adapt to
   varying available transmission bandwidth. In principle, the mode can
   change to any other mode at any time.

   The name and operational principles of the AMR-WB codec largely
   resemble those of the Adaptive Multi-Rate (AMR-NB) codec [2,12].
   However, these are two separate speech codecs, the principal
   difference being that AMR-NB is a so-called narrowband speech codec
   using an 8000 Hz sampling frequency, compared to the 16000 Hz of
   AMR-WB.

   The AMR-WB codec is designed with a voice activity detector (VAD)
   [6] and generation of comfort noise (CN) parameters during silence
   periods [5]. Hence, the AMR-WB codec can reduce the number of
   transmitted bits and packets during silence periods to a minimum.
   The operation of sending silence descriptor (SID) frames containing
   CN parameters at regular intervals during non-speech periods is
   usually called discontinuous transmission (DTX) or source controlled
   rate (SCR) operation [4].

   AMR-WB implementations must support all 9 speech coding modes.
   AMR-WB mode switching can occur between any speech frames, and the
   current mode must be indicated by transmitting the mode information
   together with the speech encoded bits. The objective of AMR-WB
   design has been to enable the highest possible speech quality under
   a variety of transmission channel conditions.
   To realize the mode adaptation, the receiver needs to signal the
   AMR-WB mode it prefers to receive to the transmitter.

   Due to the flexibility and robustness of AMR-WB, it is also suitable
   for purposes other than circuit switched cellular systems. Other
   suitable applications are real-time services over packet switched
   networks.

   The payload format should be designed for robustness against both
   bit errors and packet loss. The speech encoded bits have different
   perceptual sensitivity to bit errors, and cellular systems exploit
   this by using unequal error protection and detection (UEP and UED).
   The UED/UEP mechanism focuses the correction and detection of
   corrupted bits on the perceptually most sensitive bits. A speech
   frame is only declared damaged if there are bit errors in the most
   sensitive bits, i.e. class A bits. It is acceptable to have some bit
   errors in the other bits, i.e. class B and C. A damaged frame is
   also still useful for error concealment in the decoding, which uses
   some of the less sensitive bits of the damaged data. This improves
   the speech quality compared to discarding the data.

   Today there exist some link layers that do not discard packets with
   bit errors, e.g. SLIP and some wireless links (with the Internet
   traffic pattern shifting towards a more media-centric one, more link
   layers of such nature may emerge in the future). With transport
   layer support for partial checksums, for example those supported by
   UDP-Lite [14], bit error tolerant AMR-WB traffic could achieve
   better performance over these types of links.

   There are at least two basic approaches for carrying AMR-WB traffic
   over bit error tolerant networks:

   1) Using a partial checksum to cover headers and the most important
      AMR-WB speech bits of the payload. It is recommended that at
      least all class A bits are covered by the checksum.
   2) Using a partial checksum to cover only headers, but a frame CRC
      to cover the class A bits of each AMR-WB frame in the payload.

   In either approach, at least part of the class B/C bits are left
   without error-check and thus bit error tolerance is achieved.

   It is still important that the network designer pays attention to
   the class B and C residual bit error rate. Though less sensitive to
   errors than class A bits, class B and C bits are not insignificant,
   and undetected errors in these bits cause degradation in speech
   quality. An example of residual error rates considered acceptable
   for AMR-WB in UMTS can be found in [17].

   Approach 1 is a bit-efficient, flexible and simple way, but comes
   with two disadvantages, namely, a) bit errors in protected speech
   bits will cause the payload to be discarded, and b) when
   transporting multiple frames in a payload there is the possibility
   that a single bit error in protected bits gets all the frames
   discarded.

   These disadvantages can be avoided if needed, with some overhead in
   the form of a frame-wise CRC (Approach 2). For problem a), the CRC
   makes it possible to detect bit errors in class A bits and use the
   frame for error concealment, which gives a small improvement in
   speech quality. For problem b), when transporting multiple frames
   in a payload, the CRCs remove the possibility that a single bit
   error in a class A bit gets all the frames discarded. Avoiding that
   gives an improvement in speech quality when multiple frames are
   transported and subject to bit errors.

   The choice between the two approaches must be made based on the
   available bandwidth and the desired tolerance to bit errors. Neither
   solution is appropriate for all cases.

   To achieve better robustness against packet loss, the payload
   supports Forward Error Correction (FEC). The simple scheme of
   repetition of previously sent data is one possibility.
   Another possible scheme, which is more bandwidth efficient, is to
   use payload external FEC, e.g. RFC 2733, which generates extra
   packets containing repair data. The whole payload can also be sorted
   in sensitivity order to support external FEC schemes using UEP.
   There is work in progress on a generic version of such a scheme
   [15].

   Yet another mechanism to enhance error robustness is the
   interleaving of AMR-WB speech frames. Sometimes several frames can
   be encapsulated into a single RTP packet to decrease protocol
   overhead. One drawback of this approach is that a packet loss then
   means the loss of several consecutive speech frames, which usually
   causes clearly audible distortion in the reconstructed speech. The
   interleaving of frames can improve the speech quality in such cases
   by distributing the consecutive losses into a series of single frame
   losses. However, interleaving and bundling several frames per
   payload will also increase end-to-end delay and is therefore not
   applicable to all usage scenarios. Streaming applications, for
   example, are likely to be able to exploit interleaving to improve
   speech quality in lossy transmission conditions.

2. Requirements

   The AMR-WB RTP payload format was designed to meet the following
   requirements:

   o  Different levels of robustness must be supported, from no
      redundant data to extreme robustness capable of handling very
      high packet loss rates with no or small speech quality
      degradation.

   o  Fast, bandwidth efficient, frame-wise AMR-WB mode adaptation must
      be supported. This means that it must be possible to send Codec
      Mode Requests back from the receiving side to the transmitting
      side with information on the preferred mode.

   o  Source controlled rate operation (SCR) (also called DTX) and
      comfort noise parameter (CN) transmission defined in AMR-WB must
      be supported.

3. Payload format

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [9].

   The AMR-WB payload format supports transmission of multiple frames
   per payload, the use of fast codec mode adaptation, and robustness
   against packet losses and bit errors.

   The AMR-WB payload format consists of one payload header, a table of
   content, optionally one CRC per payload frame, and zero or more
   AMR-WB payload frames. The payload format is made as bandwidth
   efficient as possible by not using octet alignment for the payload
   header, table of content or the payload frames. However, the full
   payload is octet aligned. Therefore any unused bits in the last
   octet MUST be padded with zeros.

   If the option to transmit a robust sorted payload is enabled by the
   receiver, the transmitter may choose to sort the bits in the payload
   according to descending bit error sensitivity in order to enable
   UEP/UED outside RTP (e.g. UDP-lite). The sensitivity order for
   AMR-WB encoded speech bits for each mode is defined in Annex B of
   [3], the original bit order being as delivered by the AMR-WB speech
   encoder [1]. The AMR-WB frame types, or modes, are defined in [3].

   Robustness against packet loss can be accomplished by using the
   possibility to retransmit previously transmitted frames together
   with the current (new) frame or frames. Another approach is using
   interleaving to reduce the speech quality effect of packet losses.
   Note that the usage of these options can be restricted by the MIME
   parameters during the session set-up.

   The AMR-WB performance over error tolerant links can be improved by
   also delivering the speech frames that have been corrupted with bit
   errors. However, UEP/UED MUST be used in such a way that bit errors
   are allowed only in the least error sensitive bits. Bit errors in
   class A bits MUST NOT be allowed in any circumstances.
   This payload format provides two alternative methods to implement
   UED:

   A. CRC calculation over the class A speech bits

      If several consecutive speech frames are encapsulated into each
      payload, the optional CRC may be used to protect the class A
      speech bits of each frame, see Table 1. The number of class A
      bits is specified as informative in [3] and therefore copied into
      Table 1 as normative for this payload format. Speech frames with
      errors in class A bits MUST be marked with SPEECH_BAD for
      corrupted speech frames (FT=0..8) or SID_BAD for corrupted SID
      frames (FT=9), and be sent to the speech decoder to assist error
      concealment, see [7]. In this case the RTP header, payload
      header, and table of content should be covered by a transport
      layer CRC, e.g. UDP-lite. A packet MUST be discarded if the
      transport layer CRC detects errors in these bits.

   B. Robust sorting of payload bits

      Robust behavior can also be accomplished by robust sorting of the
      payload. This enables the use of UED (e.g. UDP-lite) and UEP
      (e.g. ULP [15]). Note that payloads containing a single frame are
      sorted in the same robust way regardless of the use of simple or
      robust sorting. The UED and/or UEP is recommended to cover at
      least the RTP header, payload header, table of content and all
      class A bits from all frames in the payload.

   Support for unequal error detection is OPTIONAL. If either scheme is
   to be used, it MUST be signaled out of band (see section 8).

                             Class A   Total speech
      Index   Mode            bits         bits
      ---------------------------------------------
        0     AMR-WB 6.6       54          132
        1     AMR-WB 8.85      64          177
        2     AMR-WB 12.65     72          253
        3     AMR-WB 14.25     72          285
        4     AMR-WB 15.85     72          317
        5     AMR-WB 18.25     72          365
        6     AMR-WB 19.85     72          397
        7     AMR-WB 23.05     72          461
        8     AMR-WB 23.85     72          477
        9     AMR-WB SID       40           40

      Table 1. Specification of the number of class A bits for AMR-WB.
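
   As an illustration of how Table 1 might be used by an implementation,
   the following is a minimal sketch in C of a frame size lookup. The
   function and array names are hypothetical and not part of this
   specification. The total frame sizes are an assumption derived from
   the mode bit-rates and the 20 ms frame length (e.g. 12.65 kbps gives
   253 bits, which matches the bit range f(0...252) used in section
   7.1); the class A counts are taken from Table 1.

```c
#include <assert.h>

/* Class A bits per frame type FT = 0..9, copied from Table 1. */
static const int amrwb_class_a_bits[10] = {
    54, 64, 72, 72, 72, 72, 72, 72, 72, 40
};

/* Total bits per frame. Assumption: bit-rate x 20 ms for FT = 0..8,
 * and 40 bits for the SID frame (FT = 9). */
static const int amrwb_total_bits[10] = {
    132, 177, 253, 285, 317, 365, 397, 461, 477, 40
};

/* Hypothetical helper: number of frame bits carried for a given FT.
 * Returns 0 for FT=14/15 (no frame data present) and -1 for the
 * values 10..13, which this document does not describe. */
int amrwb_frame_bits(int ft)
{
    if (ft >= 0 && ft <= 9)
        return amrwb_total_bits[ft];
    if (ft == 14 || ft == 15)
        return 0;
    return -1;
}

/* Hypothetical helper: number of class A bits for a given FT, with the
 * same 0 / -1 conventions as amrwb_frame_bits(). */
int amrwb_class_a(int ft)
{
    int total = amrwb_frame_bits(ft);
    if (total <= 0)
        return total;
    assert(amrwb_class_a_bits[ft] <= total);
    return amrwb_class_a_bits[ft];
}
```

   A receiver could use amrwb_class_a() to decide how many leading bits
   of a frame the optional CRC covers, and amrwb_frame_bits() to step
   through the frames of a simple sorted payload.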
   The speech quality in channel error conditions can be improved by
   also delivering to the receiver the frames that were corrupted, e.g.
   in transmission over a radio link. Despite the bit errors, providing
   damaged frames to the error concealment unit can improve the speech
   quality compared to the case where corrupted frames are dropped.
   However, to accomplish this, a frame quality indicator is needed to
   mark the corrupted frames for the decoder.

   In many communication scenarios the AMR-WB frames will be
   transmitted from one IP/UDP/RTP terminal to a terminal in a system
   with another transport format and/or vice versa. The transport
   format transcoding will be done in a gateway. A second likely
   scenario is that IP/UDP/RTP is used as transport between other
   systems, i.e. IP is originated and terminated in gateways on both
   sides of the IP transport.

   AMR-WB over    +------+                        +----------+
   3G Iu or       |      |   IP/UDP/RTP/AMR-WB    |          |
   -------------->|  GW  |----------------------->| TERMINAL |
   GSM Abis       |      |                        |          |
   etc.           +------+                        +----------+

              Figure 1: GW to VoIP terminal scenario.

   AMR-WB over  +------+                     +------+  AMR-WB over
   3G Iu or     |      |  IP/UDP/RTP/AMR-WB  |      |  3G Iu or
   ------------>|  GW  |-------------------->|  GW  |------------->
   GSM Abis     |      |                     |      |  GSM Abis
   etc.         +------+                     +------+  etc.

                   Figure 2. GW to GW scenario.

   The speech quality in case of packet losses when transmitting
   several AMR-WB frames per packet can be improved by using OPTIONAL
   frame interleaving. The interleaving improves perceived speech
   quality since it introduces single frame errors instead of several
   consecutive frame errors. Note that interleaving can be applied only
   if the receiver has signaled support for it in its capability
   description.

3.1. The payload header

   The length of the payload header is either 7 or 15 bits, depending
   on whether interleaving is used or not. Figures 3a and 3b illustrate
   the header structure.
   Header bits are specified in the following two subclauses.

3.1.1. Required fields of the payload header

   S (1 bit): Indicates, if set, that the bits in the payload are
      robust sorted. If not set, simple payload sorting is employed.
      Note that this bit can be set only if the receiver has signaled
      support for the OPTIONAL robust payload sorting.

   C (1 bit): Indicates the existence of OPTIONAL CRC fields in the
      payload table of content. Note that this bit can be set only if
      the receiver has signaled support for the OPTIONAL CRC.

   I (1 bit): Indicates, if set, that frames in this payload are
      interleaved, and that ILL and ILP fields are present in the
      payload header. If not set, frames in this payload are successive
      frames and ILL and ILP fields are not present in the payload
      header. Note that this bit can be set only if the receiver has
      signaled support for interleaving.

   CMR (4 bits): Indicates the Codec Mode Requested for the other
      communication direction. It is only allowed to request one of the
      AMR-WB speech modes (frame type index 0...8, see Table 1a in
      [3]). CMR value 15 indicates that no mode request is present;
      other values are for future use.

3.1.2. Optional fields of the payload header

   ILL (4 bits): OPTIONAL field that is present only if I=1. The value
      of this field specifies the interleaving length used for frames
      in this payload.

   ILP (4 bits): OPTIONAL field that is present only if I=1. The value
      of this field indicates the interleaving index for frames in this
      payload. The value of ILP MUST be smaller than or equal to the
      value of ILL. An erroneous value of ILP SHOULD cause the payload
      to be discarded.

   The value of the ILL field defines the length of an interleave
   group: ILL=L implies that frames in (L+1)-frame intervals are picked
   into the same interleaved payload, and the interleave group consists
   of L+1 payloads.
   The value of ILP=p in payloads belonging to the same group runs from
   0 to L. The interleaving is meaningful only when the number of
   frames per payload N is greater than or equal to 2. Thus, when N
   frames are transmitted in each payload of a group, the interleave
   group consists of payloads with sequence numbers s...s+L, and the
   frames encapsulated into these payloads are n...n+N*(L+1)-1.

   To put this in the form of an equation, assume that the first frame
   of an interleave group is n, the first payload of the group is s,
   the number of frames per payload is N, ILL=L and ILP=p (p in range
   0...L). The frames contained in payload s+p are then
   n + p + k*(L+1), where k runs from 0 to N-1. I.e.

   The first packet of an interleave group: ILL=L, ILP=0
      Payload: s
      Frames: n, n+(L+1), n+2*(L+1), ..., n+(N-1)*(L+1)

   The second packet of an interleave group: ILL=L, ILP=1
      Payload: s+1
      Frames: n+1, n+1+(L+1), n+1+2*(L+1), ..., n+1+(N-1)*(L+1)
      ...

   The last packet of an interleave group: ILL=L, ILP=L
      Payload: s+L
      Frames: n+L, n+L+(L+1), n+L+2*(L+1), ..., n+L+(N-1)*(L+1)

   Interleaved frames MUST be stored in the payload in
   timestamp-increasing order. Furthermore, the interleaved payloads
   within an interleave group MUST be sent in increasing order of the
   ILP field, and each payload of an interleave group MUST contain an
   equal number of frames.

   It is RECOMMENDED that ILL remains constant throughout the session.
   If ILL is to be changed, the change SHOULD be done between
   interleaving groups, i.e. when the ILP of the previous packet was L.
   Furthermore, because of the inter-frame dependent nature of AMR-WB
   coding, it is RECOMMENDED that ILL values greater than or equal to 2
   are used to enable better error recovery in the decoder in case of a
   lost interleaved payload. Note also that using the value ILL=0, or
   using interleaving for a payload carrying only one frame, is not
   meaningful.
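
   The interleaving rule above, frame = n + p + k*(L+1), can be
   sketched in C as follows. This is an illustration only; the function
   names are hypothetical and the inverse mapping is a convenience that
   this document does not define.

```c
#include <assert.h>

/* Frame number carried in slot k (0..N-1) of the payload with ILP = p,
 * where n is the first frame of the interleave group and ILL = L. */
long amrwb_interleaved_frame(long n, int L, int p, int k)
{
    assert(L >= 0 && p >= 0 && p <= L && k >= 0);
    return n + p + (long)k * (L + 1);
}

/* Hypothetical inverse mapping: for a frame at offset d = frame - n
 * inside the group, return the ILP of the payload that carries it and
 * store the slot index within that payload in *k. */
int amrwb_interleave_position(long d, int L, int *k)
{
    assert(d >= 0 && L >= 0);
    *k = (int)(d / (L + 1));
    return (int)(d % (L + 1));
}
```

   For example, with L=2 and N=3 the three payloads of a group starting
   at frame n carry frames {n, n+3, n+6}, {n+1, n+4, n+7} and
   {n+2, n+5, n+8}, so the loss of one payload removes three isolated
   frames rather than three consecutive ones.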
    0
    0 1 2 3 4 5 6
   +-+-+-+-+-+-+-+
   |S|C|I|  CMR  |
   +-+-+-+-+-+-+-+

   Figure 3a: AMR-WB payload header, I=0.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|C|I|  CMR  |  ILL  |  ILP  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 3b: AMR-WB payload header, I=1.

3.2. The payload table of content and CRCs

   The table of content (ToC) consists of one table of content entry
   for each speech frame in the payload. A table of content entry
   includes the following fields:

   F (1 bit): Indicates if this frame is followed by further frames in
      this payload. F=1 further frames follow, F=0 last frame.

   FT (4 bits): Frame type indicator, indicating the AMR-WB speech
      coding mode or comfort noise (CN) mode. The mapping of AMR-WB
      modes to FT is given in Table 1a in [3]. If FT=14 (lost frame) or
      FT=15 (no transmission/no reception), no CRC or payload frame is
      present.

   Q (1 bit): The frame quality bit indicates, if not set, that the
      payload is corrupted and the receiver should set the RX_TYPE (see
      [4]) to SPEECH_BAD or SID_BAD depending on the frame type (FT).

    0
    0 1 2 3 4 5
   +-+-+-+-+-+-+
   |F|  FT   |Q|
   +-+-+-+-+-+-+

   Figure 4: Table of content (ToC) entry field.

   CRC (8 bits): OPTIONAL field, exists if the payload header bit C is
      set (C=1). The 8 bit CRC is used for error detection. These 8
      parity bits are generated according to section 4.1.4 in [3].

    0
    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |      CRC      |
   +-+-+-+-+-+-+-+-+

   Figure 5: CRC field.

   The ToC and CRCs are arranged with all table of content entry fields
   first, followed by all CRC fields. The ToC starts with the frame
   data belonging to the oldest speech frame in the payload.
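
   To make the bit layout of the payload header and ToC concrete, the
   following is a minimal parsing sketch in C. It assumes that bit 0 of
   each octet is the most significant bit, as is usual for RTP formats;
   the type and function names and the limit AMRWB_MAX_TOC are
   hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AMRWB_MAX_TOC 32   /* arbitrary limit chosen for this sketch */

typedef struct {
    int s, c, i, cmr;        /* required header fields */
    int ill, ilp;            /* meaningful only when i == 1 */
    int n;                   /* number of ToC entries */
    int ft[AMRWB_MAX_TOC];   /* frame type of each entry */
    int q[AMRWB_MAX_TOC];    /* frame quality bit of each entry */
    size_t bits_used;        /* header + ToC bits consumed */
} amrwb_toc;

/* Read nbits (1..8) at bit offset *pos, assuming bit 0 of an octet is
 * its most significant bit. Returns -1 if the buffer is too short. */
static int get_bits(const uint8_t *buf, size_t len, size_t *pos, int nbits)
{
    int v = 0;
    assert(nbits >= 1 && nbits <= 8);
    for (int j = 0; j < nbits; j++, (*pos)++) {
        size_t octet = *pos >> 3;
        if (octet >= len)
            return -1;
        v = (v << 1) | ((buf[octet] >> (7 - (*pos & 7))) & 1);
    }
    return v;
}

/* Parse the payload header and the ToC. Returns 0 on success and -1 if
 * the payload is truncated, ILP > ILL, or the ToC is too long. */
int amrwb_parse_toc(const uint8_t *buf, size_t len, amrwb_toc *t)
{
    size_t pos = 0;
    int v = get_bits(buf, len, &pos, 7);      /* S, C, I, CMR */
    if (v < 0)
        return -1;
    t->s = (v >> 6) & 1;
    t->c = (v >> 5) & 1;
    t->i = (v >> 4) & 1;
    t->cmr = v & 15;
    t->ill = t->ilp = 0;
    if (t->i) {                               /* ILL, ILP present */
        v = get_bits(buf, len, &pos, 8);
        if (v < 0)
            return -1;
        t->ill = v >> 4;
        t->ilp = v & 15;
        if (t->ilp > t->ill)                  /* ILP MUST be <= ILL */
            return -1;
    }
    t->n = 0;
    for (;;) {                                /* one 6-bit entry per frame */
        if (t->n == AMRWB_MAX_TOC)
            return -1;
        v = get_bits(buf, len, &pos, 6);
        if (v < 0)
            return -1;
        t->ft[t->n] = (v >> 1) & 15;
        t->q[t->n] = v & 1;
        t->n++;
        if (((v >> 5) & 1) == 0)              /* F=0: last entry */
            break;
    }
    t->bits_used = pos;
    return 0;
}
```

   Applied to the first three octets of the example in section 7.2
   (0x5F 0x39 0x20 under the bit order assumed here), the sketch reports
   C=1, CMR=15 and two ToC entries with FT=3 and FT=4, consuming 19
   bits before the CRC fields start.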
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|  FT   |Q|F|  FT   |Q|F|  FT   |Q|      CRC      |    CRC    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   |      CRC      |
   +-+-+-+-+-+-+-+-+-+-+

   Figure 6: The ToC and CRCs for a payload with three speech frames.

3.3. AMR-WB speech frame

   An AMR-WB speech frame represents one speech frame encoded using the
   mode given by the FT field in the ToC entry corresponding to this
   frame. The length of this field is implicitly defined by the AMR-WB
   mode in the FT field. The AMR-WB speech bits SHALL be sorted
   according to Appendix B of [3].

3.4. Compound AMR-WB payload

   The compound AMR-WB payload consists of one AMR-WB payload header,
   the table of content, and one or more AMR-WB payload frames, see
   sections 3.1, 3.2 and 3.3. These can be combined by using either
   robust or simple payload sorting. The S-bit in the AMR-WB payload
   header indicates which method is used.

   Definitions for describing the compound AMR-WB payload:

      b(m)   - bit m of the compound AMR-WB payload
      t(n,m) - bit m in the table of content entry for speech frame n
      p(n,m) - bit m in the CRC for speech frame n
      f(n,m) - bit m in speech frame n
      F(n)   - number of bits in speech frame n, defined by FT
      h(m)   - bit m of payload header
      H      - number of bits in payload header, 7 or 15 bits
      C      - number of CRC bits, 0 or 8 bits
      N      - number of payload frames in the payload
      S      - number of unused bits in the last octet of the payload

   Payload frames f(n,m) are ordered in the order they are delivered by
   the AMR-WB speech encoder, i.e. frame n precedes frame n+1. All
   frames between the oldest one and the most recent one MUST be
   present in the payload. The only exception is interleaving, where
   the frame order is defined in section 3.1.2.
   If some of the frames are not available because of a frame loss, or
   they are not transmitted, e.g. due to DTX, those MUST be replaced by
   lost speech or by no transmission/no reception type frames,
   respectively.

3.4.1. Robust payload sorting

   As described earlier, a bit error in a more sensitive bit is
   subjectively more annoying than in a less sensitive bit. Therefore,
   to enable protection of only the most sensitive bits of a payload
   with a forward error detection code, e.g. a CRC outside RTP, the
   bits inside a payload can be ordered into sensitivity order. The
   protection SHOULD cover an appropriate number of octets from the
   beginning of the payload, covering at least the AMR-WB payload
   header, ToC, and class A bits (see Table 1). Exactly how many octets
   need protection depends on the network and application. To maintain
   sensitivity ordering inside the AMR-WB payload when more than one
   speech frame is transmitted in one payload, reordering of the bits
   in the payload is needed.

   The AMR-WB payload header, ToC and CRCs SHALL still be placed
   unchanged at the beginning of the robust sorted payload. Thereafter,
   the payload frames are sorted with one bit alternating from each
   AMR-WB payload frame.

   The robust payload sorting algorithm is defined in C-style as:

      /* payload header */
      k=0;
      for (i = 0; i < H; i++){
        b(k++) = h(i);
      }
      /* table of content */
      for (j = 0; j < N; j++){
        for (i = 0; i < 6; i++){
          b(k++) = t(j,i);
        }
      }
      /* CRCs */
      for (j = 0; j < N; j++){
        for (i = 0; i < C; i++){
          b(k++) = p(j,i);
        }
      }
      /* payload frames */
      max = max(F(0),..,F(N-1));
      for (i = 0; i < max; i++){
        for (j = 0; j < N; j++){
          if (i < F(j)){
            b(k++) = f(j,i);
          }
        }
      }
      /* padding */
      S = 8 - k%8;
      if (S < 8){
        for (i = 0; i < S; i++){
          b(k++) = 0;
        }
      }

3.4.2. Simple payload sorting

   If multiple frames are encapsulated into the payload and robust
   payload sorting is not used, the payload is formed as a
   concatenation of the AMR-WB payload header, ToC, the optional CRC
   fields if present, and the AMR-WB speech frames. However, the bits
   inside each AMR-WB payload frame are ordered into sensitivity order
   as defined in Annex B of [3].

   The simple payload sorting algorithm is defined in C-style as:

      /* payload header */
      k=0;
      for (i = 0; i < H; i++){
        b(k++) = h(i);
      }
      /* table of content */
      for (j = 0; j < N; j++){
        for (i = 0; i < 6; i++){
          b(k++) = t(j,i);
        }
      }
      /* CRCs */
      for (j = 0; j < N; j++){
        for (i = 0; i < C; i++){
          b(k++) = p(j,i);
        }
      }
      /* payload frames */
      for (j = 0; j < N; j++){
        for (i = 0; i < F(j); i++){
          b(k++) = f(j,i);
        }
      }
      /* padding */
      S = 8 - k%8;
      if (S < 8){
        for (i = 0; i < S; i++){
          b(k++) = 0;
        }
      }

3.5. Decoding security consideration

   If the payload length calculation based on the C, I, F and FT fields
   does not indicate the same length as the actually received payload
   size, the payload should be dropped as erroneous. Decoding AMR-WB
   frames that are parsed based on erroneous header information could
   severely degrade the speech quality.

4. RTP header usage

   The RTP header marker bit (M) is used to mark (M=1) the payloads
   containing the first speech frame after a CN period. For all other
   payloads the marker bit is set to 0 (M=0).

   The timestamp corresponds to the sampling time of the first sample
   of the first encoded AMR-WB frame in the payload. A frame can either
   be encoded speech, comfort noise parameters, LOST_FRAME, or
   NO_TRANSMISSION. The unit used to compute the timestamp is one
   sample. The duration of one AMR-WB speech frame is 20 ms and the
   sampling frequency is 16 kHz, corresponding to 320 speech samples
   per frame. Thus, the timestamp is increased by 320 for each
   consecutive frame.
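
   The timestamp rule can be sketched in C as shown below. The function
   name is hypothetical. The treatment of interleaved payloads is an
   inference from section 3.1.2 (frames within one payload are L+1
   frames apart) rather than something stated in this section; for
   non-interleaved payloads (I=0) the caller passes ill = 0.

```c
#include <assert.h>
#include <stdint.h>

#define AMRWB_SAMPLES_PER_FRAME 320u   /* 20 ms at 16 kHz */

/* Timestamp of the frame in slot k (0..N-1) of a payload whose RTP
 * timestamp is rtp_ts. With interleaving, consecutive slots are
 * ill + 1 frames apart; without it, ill is 0 and they are adjacent.
 * The arithmetic wraps modulo 2^32 like the RTP timestamp itself. */
uint32_t amrwb_frame_timestamp(uint32_t rtp_ts, unsigned k, unsigned ill)
{
    assert(ill <= 15);                 /* ILL is a 4-bit field */
    return (uint32_t)(rtp_ts +
                      (uint32_t)AMRWB_SAMPLES_PER_FRAME * k * (ill + 1u));
}
```

   For example, the third frame (k=2) of a non-interleaved payload with
   RTP timestamp 1000 has timestamp 1640.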
   If the optional interleaving functionality is not used, all frames
   in a packet MUST be successive frames, stored in the same order as
   delivered by the AMR-WB speech encoder. If interleaving is employed,
   the frames encapsulated into a payload MUST be picked as defined in
   section 3.1.2.

5. Congestion Control

   The need for congestion control for data transported with RTP has to
   be considered. AMR-WB speech data have some elastic properties due
   to the different bandwidth demand of each mode. Another parameter
   that can reduce the bandwidth demand for AMR-WB is the number of
   frames of speech data encapsulated in each payload. Encapsulating
   more frames per payload reduces the number of packets and the
   overhead from IP/UDP/RTP headers. If forward error correction (FEC)
   is used, there is also a need to regulate its amount, so that the
   FEC itself does not worsen the problem.

   Therefore, it is RECOMMENDED that applications using this payload
   format implement congestion control. The actual mechanism for
   congestion control is not specified but should be suitable for
   real-time flows, e.g. [16].

6. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [10]. This implies that confidentiality of the media
   streams is achieved by encryption. Because the payload format is
   arranged end-to-end, encryption MAY be performed after encapsulation
   so there is no conflict between the two operations.

   This payload type does not exhibit any significant non-uniformity in
   the receiver side computational complexity for packet processing
   that could cause a potential denial-of-service threat.

   As this format transports encoded speech data, the main security
   issues are confidentiality and authentication of the speech itself.
   Some other smaller issues also exist.
   The payload format itself does not have any support for security.
   These issues have to be solved by a payload external mechanism.

6.1. Confidentiality

   To achieve confidentiality of the encoded speech, all speech data
   bits must be encrypted. There is less need to encrypt the payload
   header or the frame header as they only carry information about the
   requested AMR-WB mode, AMR-WB frame type, and frame quality. This
   information could be useful to some third party, e.g. for quality
   monitoring.

   The type of encryption used can have an impact not only on the
   confidentiality but also on error robustness. There will be no
   robustness against bit errors unless an encryption method without
   error propagation is used, e.g. a stream cipher. This is only an
   issue when using UEP/UED, when bit errors can be accepted in some
   part of the payload.

6.2. Authentication

   To authenticate the sender of the speech, an external mechanism has
   to be added. It is recommended that such a mechanism protects all
   the speech data bits. Note that the use of UED/UEP is difficult to
   combine with authentication.

   To prevent a man in the middle from tampering with the packetization
   of the speech data, some extra data could be protected. The data is:
   RTP timestamp, RTP sequence number, RTP marker bit. Tampering could
   result in erroneous decapsulation/decoding that could lower speech
   quality. Tampering with the AMR-WB mode request field can cause the
   sender to receive speech of a different quality than desired.

7. Examples

7.1. Simple example

   In the simple example one AMR-WB frame is encapsulated into the
   payload. Simple payload sorting is used (S=0), no CRC fields are
   present (C=0), and interleaving is not used (I=0). A 23.05 kbps mode
   is requested for the reverse link (CMR=7), and the payload was not
   damaged at IP origin (Q=1). The AMR-WB mode is the 12.65 kbps mode
   (FT=2).
   The speech encoded bits are put into f(0...252) in descending
   sensitivity order according to [3].

      |                            Bit no.                            |
   Oct|   0       1       2       3       4       5       6       7   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    0 |  S=0  |  C=0  |  I=0  |   0   |   1   |   1   |   1   |  F=0  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    1 |   0   |   0   |   1   |   0   |  Q=1  | f(0)  | f(1)  |  ...  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   32 |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  | f(249)| f(250)|
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   33 | f(251)| f(252)|   0   |   0   |   0   |   0   |   0   |   0   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+

   Figure 7: One AMR-WB frame per payload example.

7.2. Example with CRCs

   In this example two frames are transmitted in one payload. Simple
   payload sorting is used (S=0), CRC fields are present (C=1), and
   interleaving is not used (I=0). No mode request is sent (CMR=15),
   and neither of the frames is corrupted (Q=1). The payload contains
   one frame at the 14.25 kbps mode (FT=3) and one frame at the 15.85
   kbps mode (FT=4). Bits p1(0...7) and p2(0...7) denote the CRC
   checksums for the first and second frames, respectively. The bits of
   the first frame are denoted by f1(0...284), and the bits of the
   second frame by f2(0...316). The 19 header bits, 16 CRC bits, and
   602 speech bits occupy octets 0 to 79, with three padding bits at
   the end.

      |                            Bit no.                            |
   Oct|   0       1       2       3       4       5       6       7   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    0 |  S=0  |  C=1  |  I=0  |   1   |   1   |   1   |   1   |  F=1  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    1 |   0   |   0   |   1   |   1   |  Q=1  |  F=0  |   0   |   1   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    2 |   0   |   0   |  Q=1  | p1(0) | p1(1) | p1(2) | p1(3) | p1(4) |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    3 | p1(5) | p1(6) | p1(7) | p2(0) | p2(1) | p2(2) | p2(3) | p2(4) |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    4 | p2(5) | p2(6) | p2(7) | f1(0) | f1(1) |  ...  |  ...  |  ...  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   39 |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |f1(283)|f1(284)|
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   40 | f2(0) | f2(1) |  ...  |  ...  |  ...  |  ...  |  ...  |  ...  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   79 |  ...  |  ...  |  ...  |f2(315)|f2(316)|   0   |   0   |   0   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+

   Figure 8: Example with two AMR-WB frames and CRCs.

7.3. Example with multiple frames per payload and robust sorting

   In this example two frames are transmitted in one payload with
   robust sorting (S=1). No CRC is used (C=0), interleaving is not used
   (I=0), and the 8.85 kbps mode is requested for the reverse link
   (CMR=1). Both frames are undamaged (Q=1), and the two frames in the
   payload are encoded at the 14.25 kbps (FT=3) and 15.85 kbps (FT=4)
   modes. The first frame is represented by f1(0...284) and the
   subsequent frame by f2(0...316). The 19 header bits and 602 speech
   bits occupy octets 0 to 77, with three padding bits at the end.

      |                            Bit no.                            |
   Oct|   0       1       2       3       4       5       6       7   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    0 |  S=1  |  C=0  |  I=0  |   0   |   0   |   0   |   1   |  F=1  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    1 |   0   |   0   |   1   |   1   |  Q=1  |  F=0  |   0   |   1   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
    2 |   0   |   0   |  Q=1  | f1(0) | f2(0) | f1(1) | f2(1) |  ...  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   73 |  ...  |f1(283)|f2(283)|f1(284)|f2(284)|f2(285)|f2(286)|  ...  |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+
   77 |  ...  |  ...  |  ...  |f2(315)|f2(316)|   0   |   0   |   0   |
   ---+-------+-------+-------+-------+-------+-------+-------+-------+

   Figure 9: Example with two AMR-WB frames per payload and robust
   sorting.

8. The AMR-WB MIME type registration

   This chapter defines the MIME type for the Adaptive Multi-Rate
   Wideband (AMR-WB) speech codec.

   AMR-WB implementations according to [1] MUST support all nine coding
   modes. Fast mode adaptation is supported by transmitting the mode
   information in-band together with the encoded speech data, which
   allows mode changes without any additional signaling. Furthermore,
   fast mode adaptation requires transmission of the codec mode request
   inside the payload.

   In addition to the speech codec, the AMR-WB specifications also
   include Discontinuous Transmission / comfort noise (DTX/CN)
   functionality [4]. DTX/CN switches the transmission off during
   silent periods of the speech, and only SID frames containing CN
   parameter updates are sent at regular intervals. The AMR-WB DTX/CN
   functionality MUST also be supported.

   The receiver may want to receive only a certain AMR-WB mode or a
   subset of AMR-WB modes because of link limitations in some cellular
   systems, e.g. the GSM/GERAN radio link can require that only a
   subset of AMR-WB modes is used.
   Therefore, it is possible to request a specific set of AMR-WB modes
   in the capability description, and the encoder MUST abide by this
   request. If no mode set is requested, any mode may be used or
   requested.

   The AMR-WB codec can in principle perform a mode change at any time
   between any two modes. To support interoperability with GSM through
   a gateway, it is possible to set limitations on mode changes. The
   decoder can define the minimum number of frames between mode changes
   and can limit mode changes to neighboring modes only.

   The receiver can limit the number of AMR-WB frames encapsulated into
   one RTP packet. If a maximum number of frames per packet is given in
   the capability description, the transmitter MUST comply with this
   limitation. This is an OPTIONAL feature; if no such parameter is
   given in the capability description, the transmitter can encapsulate
   any number of AMR-WB speech frames into one RTP packet.

   The payload CRC UED MUST only be used if the receiver has signaled
   support for this functionality in the capability description.

   To enable unequal error protection and/or detection outside RTP, the
   payload format supports robust payload sorting. Robust payload
   sorting is an optional feature and MUST only be used if the receiver
   has signaled support for this functionality in the capability
   description.

   The speech quality in case of packet losses when transmitting
   several AMR-WB frames per packet can be improved by using OPTIONAL
   frame interleaving. Interleaving improves perceived speech quality
   since a packet loss then results in a series of single frame errors
   instead of several consecutive frame errors. Interleaving MUST only
   be applied if the receiver has signaled support for it, and if used,
   the interleaving length MUST NOT exceed the limit given in the
   capability description.
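
   The restrictions described above could be checked by a transmitter
   before it builds a packet, roughly as in the following sketch. The
   class, field, and function names are hypothetical and chosen only
   for illustration; the sketch assumes that the capability description
   has already been parsed and that a missing parameter means "no
   restriction" for the mode set and the frame count and "not
   supported" for CRCs, robust sorting, and interleaving.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ReceiverCaps:
    """Receiver capabilities (hypothetical container, not defined by this memo)."""
    mode_set: Optional[Set[int]] = None  # None: any of the modes 0..8 may be used
    maxframes: Optional[int] = None      # None: no limit on frames per packet
    crc: bool = False                    # True: payload CRCs are supported
    robust_sorting: bool = False         # True: robust payload sorting is supported
    interleaving: Optional[int] = None   # maximum interleaving length, None: unsupported


def violations(caps: ReceiverCaps, modes: List[int], use_crc: bool = False,
               use_robust: bool = False, ill: Optional[int] = None) -> List[str]:
    """Return the reasons why a planned packet exceeds the receiver's limits."""
    problems = []
    if caps.mode_set is not None and not set(modes) <= caps.mode_set:
        problems.append("mode outside the requested mode set")
    if caps.maxframes is not None and len(modes) > caps.maxframes:
        problems.append("too many frames in one packet")
    if use_crc and not caps.crc:
        problems.append("CRC support not signaled")
    if use_robust and not caps.robust_sorting:
        problems.append("robust sorting support not signaled")
    if ill is not None and (caps.interleaving is None or ill > caps.interleaving):
        problems.append("interleaving not signaled or interleaving length too large")
    return problems


# A receiver that accepts modes 2..6 and at most one frame per packet.
caps = ReceiverCaps(mode_set={2, 3, 4, 5, 6}, maxframes=1)
print(violations(caps, modes=[2]))
print(violations(caps, modes=[7, 2], use_crc=True))
```

   The first call describes a packet that respects all limits, so the
   returned list is empty; the second call breaks the mode set, the
   frame limit, and the CRC restriction at once.
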
   Note that the receiver can use the MIME parameters to limit the
   increased buffering requirements caused by interleaving. For
   example, when maxframes=N and interleaving=L are specified, the
   maximum size of an interleave group is N*(L+1) frames (see section
   3.1.2 for details on interleaving).

8.1. MIME Registration

   The MIME name for the AMR-WB codec is allocated from the IETF tree,
   since AMR-WB is expected to be a widely used speech codec in VoIP
   applications.

   Media Type name:     audio

   Media subtype name:  AMR-WB

   Required parameters: none

   Optional parameters:

      mode-set: Requested AMR-WB mode set. Restricts the active codec
                mode set to a subset of all modes. Possible values are
                a comma-separated list of modes from the set 0,...,8
                (see Table 1a [3]; an example is given in section
                8.2). If not present, all speech modes are available.

      mode-change-period: Specifies a number N that restricts mode
                changes so that they are only allowed at multiples of
                N frames; the initial state of the phase is arbitrary.
                If this parameter is not present, a mode change can
                happen at any time.

      mode-change-neighbor: If present, mode changes SHALL only be made
                to neighboring modes in the active codec mode set. If
                not present, a change between any two modes in the
                active codec mode set is allowed.

      maxframes: Maximum number of AMR-WB speech frames in one RTP
                packet. The receiver may set this parameter in order
                to limit the buffering requirements or delay.

      crc:      If present, transmission of CRCs in the payload is
                supported; otherwise it is not supported.

      robust-sorting: If present, robust payload sorting is supported;
                otherwise it is not supported and simple payload
                sorting SHALL be used.

      interleaving: Indicates that frame interleaving is supported and
                defines the maximum value of the interleaving length
                field ILL (see section 3.1.2). If this parameter is
                not present, interleaving is not supported.

   Encoding considerations: see section 3 of this document.

   Security considerations: see chapter 6 "Security Consideration".
   Public specification: please refer to chapter 9 "References".

   Person & email address to contact for further information:
      ari.lakaniemi@nokia.com
      pasi.s.ojala@nokia.com

   Intended usage: COMMON. It is expected that many VoIP applications
      (as well as mobile applications) will use this type.

   Author/Change controller:
      ari.lakaniemi@nokia.com
      pasi.s.ojala@nokia.com

8.2. Mapping to SDP Parameters

   Parameters are mapped to SDP [11] as usual. Example usage in SDP:

      m=audio 49120 RTP/AVP 97
      a=rtpmap:97 AMR-WB/16000
      a=fmtp:97 mode-set=2,3,4,5,6; maxframes=1

9. References

   [1]  3GPP TS 26.190 "AMR Wideband speech codec; Transcoding
        functions".

   [2]  3GPP TS 26.090 "AMR speech codec; Transcoding functions".

   [3]  3GPP TS 26.201 "AMR Wideband speech codec; Frame Structure".

   [4]  3GPP TS 26.193 "AMR Wideband Speech Codec; Source Controlled
        Rate operation".

   [5]  3GPP TS 26.192 "AMR Wideband speech codec; Comfort Noise
        aspects".

   [6]  3GPP TS 26.194 "AMR Wideband speech codec; Voice Activity
        Detector (VAD)".

   [7]  3GPP TS 26.191 "AMR Wideband speech codec; Error concealment of
        lost frames".

   [8]  3GPP TS 25.415 "UTRAN Iu Interface User Plane Protocols".

   [9]  IETF RFC 2119, "Key words for use in RFCs to Indicate
        Requirement Levels".

   [10] IETF RFC 1889, "RTP: A Transport Protocol for Real-Time
        Applications".

   [11] IETF RFC 2327, "SDP: Session Description Protocol", April 1998.

   [12] IETF draft-ietf-avt-rtp-amr-03.txt, "RTP payload format for
        AMR", work in progress.

   [13] IETF draft-westberg-realtime-cellular-01.txt, "Realtime Traffic
        over Cellular Access Networks", work in progress.

   [14] IETF draft-larzon-udplite-03.txt, "The UDP Lite Protocol", work
        in progress.

   [15] IETF draft-ietf-avt-ulp-00.txt, "An RTP Payload Format for
        Generic FEC with Uneven Level Protection", work in progress.

   [16] S. Floyd, M. Handley, J. Padhye, J. Widmer, "Equation-Based
        Congestion Control for Unicast Applications", ACM SIGCOMM 2000,
        Stockholm, Sweden.

   [17] 3GPP TS 26.202 "AMR Wideband speech codec; Interface to Iu and
        Uu".

10. Authors' addresses

   Ari Lakaniemi
   Nokia Research Center
   P.O.Box 407
   FIN-00045 Nokia Group
   Finland
   E-mail: ari.lakaniemi@nokia.com

   Pasi Ojala
   Nokia Research Center
   P.O.Box 100
   FIN-33721 Tampere
   Finland
   E-mail: pasi.s.ojala@nokia.com

   Johan Sjöberg
   Ericsson Research
   Ericsson Radio System AB
   Torshamsgatan 23
   SE-164 80 Stockholm
   SWEDEN
   E-mail: johan.sjoberg@ericsson.com

   Magnus Westerlund
   Ericsson Research
   Ericsson Radio System AB
   Torshamsgatan 23
   SE-164 80 Stockholm
   SWEDEN
   E-mail: magnus.westerlund@ericsson.com

   This Internet-Draft expires on August 23, 2001.