Network Working Group                                      Johan Sjoberg
INTERNET-DRAFT                                         Magnus Westerlund
Expires: March 2006                                             Ericsson
                                                           Ari Lakaniemi
                                                          Stephan Wenger
                                                                   Nokia
                                                      September 22, 2005

   RTP Payload Format for Extended AMR Wideband (AMR-WB+) Audio Codec
                  <draft-ietf-avt-rtp-amrwbplus-07.txt>

Status of this memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is a submission of the IETF AVT WG. Comments should
   be directed to the AVT WG mailing list, avt@ietf.org.

Abstract

   This document specifies a real-time transport protocol (RTP) payload
   format for Extended Adaptive Multi-Rate Wideband (AMR-WB+) encoded
   audio signals. The AMR-WB+ codec is an audio extension of the AMR-
   WB speech codec. It encompasses the AMR-WB frame types and a number
   of new frame types designed to support high quality music and
   speech. A media type registration for AMR-WB+ is included in this
   specification.

TABLE OF CONTENTS

   1. Definitions
      1.1. Glossary
      1.2. Terminology
   2. Introduction
   3. Background of AMR-WB+ and Design Principles
      3.1. The AMR-WB+ Audio Codec
      3.2. Multi-rate Encoding and Rate Adaptation
      3.3. Voice Activity Detection and Discontinuous Transmission
      3.4. Support for Multi-Channel Session
      3.5. Unequal Bit-error Detection and Protection
      3.6. Robustness against Packet Loss
         3.6.1. Use of Forward Error Correction (FEC)
         3.6.2. Use of Frame Interleaving
      3.7. AMR-WB+ Audio over IP scenarios
      3.8. Out-of-Band Signaling
   4. RTP Payload Format for AMR-WB+
      4.1. RTP Header Usage
      4.2. Payload Structure
      4.3. Payload Definitions
         4.3.1. Payload Header
         4.3.2. The Payload Table of Contents
         4.3.3. Audio Data
         4.3.4. Methods for Forming the Payload
         4.3.5. Payload Examples
      4.4. Interleaving Considerations
      4.5. Implementation Considerations
         4.5.1. ISF recovery in case of packet loss
         4.5.2. Decoding Validation
   5. Congestion Control
   6. Security Considerations
      6.1. Confidentiality
      6.2. Authentication and Integrity
   7. Payload Format Parameters
      7.1. Media Type Registration
      7.2. Mapping Media Type Parameters into SDP
         7.2.1. Offer-Answer Model Considerations
         7.2.2. Examples
   8. IANA Considerations
   9. Contributors
   10. Acknowledgements
   11. References
      11.1. Normative references
      11.2. Informative references
   12. Authors' Addresses
   13. IPR Notice
   14. Copyright Notice

1. Definitions

1.1. Glossary

   3GPP    - Third Generation Partnership Project
   AMR     - Adaptive Multi-Rate (Codec)
   AMR-WB  - Adaptive Multi-Rate Wideband (Codec)
   AMR-WB+ - Extended Adaptive Multi-Rate Wideband (Codec)
   CMR     - Codec Mode Request
   CN      - Comfort Noise
   DTX     - Discontinuous Transmission
   FEC     - Forward Error Correction
   FT      - Frame Type
   ISF     - Internal Sampling Frequency
   SCR     - Source Controlled Rate Operation
   SID     - Silence Indicator (the frames containing only CN
             parameters)
   TFI     - Transport Frame Index
   TS      - Timestamp
   VAD     - Voice Activity Detection
   UED     - Unequal Error Detection
   UEP     - Unequal Error Protection

1.2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

2. Introduction

   This document specifies the payload format for packetization of
   Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] encoded audio
   signals into the Real-time Transport Protocol (RTP) [3]. The
   payload format supports the transmission of mono or stereo audio,
   aggregating multiple frames per payload, and mechanisms enhancing
   the robustness of the packet stream against packet loss.

   The AMR-WB+ codec is an extension of the Adaptive Multi-Rate
   Wideband (AMR-WB) speech codec. New features include extended audio
   bandwidth to enable high quality for non-speech signals (e.g.
   music), native support for stereophonic audio, and the option to
   operate on, and switch between, several internal sampling
   frequencies (ISFs). The primary usage scenario for AMR-WB+ is the
   transport over IP. Therefore, interworking with other transport
   networks, as discussed for AMR-WB in [7], is not a major concern and
   hence not addressed in this memo.

   The expected key application for AMR-WB+ is streaming. To make the
   packetization process on a streaming server as efficient as
   possible, an octet-aligned payload format is desirable. Therefore,
   a bandwidth efficient mode as defined for AMR-WB in [7] is not
   specified herein; the bandwidth-savings of the bandwidth efficient
   mode would be very small anyway, since all extension frame types are
   octet aligned.

   The stereo encoding capability of AMR-WB+ renders the support for
   multi-channel transport at RTP payload format level, as specified
   for AMR-WB [7], obsolete. Therefore this feature is not included in
   this memo.

   This specification does not include a definition of a file format
   for AMR-WB+.
   Instead, the reader is referred to the ISO-based 3GP file format
   [14], which supports AMR-WB+ and provides all functionality
   required. The 3GP format also supports storage of AMR and AMR-WB,
   and many other multi-media formats, thereby allowing synchronized
   playback.

   The rest of the document is organized as follows: Background
   information on the AMR-WB+ codec, and design principles, can be
   found in Section 3. The payload format itself is specified in
   Section 4. Sections 5 and 6 discuss congestion control and security
   considerations, respectively. In Section 7, a media type
   registration is provided.

3. Background of AMR-WB+ and Design Principles

   The Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] audio codec
   is designed to compress speech and audio signals at low bit-rate and
   good quality. The codec is specified by the Third Generation
   Partnership Project (3GPP). The primary target applications are
   (1) the packet-switched streaming service (PSS) [13], (2) the
   multimedia messaging service (MMS), and (3) the multimedia broadcast
   and multicast service (MBMS). However, due to its flexibility and
   robustness, AMR-WB+ is also well suited for streaming services in
   other highly varying transport environments, for example the
   Internet.

3.1. The AMR-WB+ Audio Codec

   3GPP originally developed the AMR-WB+ audio codec for streaming and
   messaging services in Global System for Mobile communications (GSM)
   and third generation (3G) cellular systems. The codec is designed
   as an audio extension of the AMR-WB speech codec. The extension
   adds new functionality to the codec in order to provide high audio
   quality for a large range of signals including music. Stereophonic
   operation has also been added. A new, high-efficiency hybrid stereo
   coding algorithm enables stereo operation at bit-rates as low as 6.2
   kbit/s.

   The AMR-WB+ codec includes the nine frame types specified for AMR-
   WB, extended by new bit-rates ranging from 5.2 to 48 kbit/s. The
   AMR-WB frame types can employ only a 16000 Hz sampling frequency and
   operate only on monophonic signals. The newly introduced extension
   frame types, however, can operate at a number of internal sampling
   frequencies (ISFs), both in mono and stereo. Please see Table 24 in
   [1] for details. The output sampling frequency of the decoder is
   limited to 8, 16, 24, 32 or 48 kHz.

   An overview of the AMR-WB+ encoding operations is provided as
   follows. The encoder receives the audio sampled at, for example, 48
   kHz. The encoding process starts with pre-processing and resampling
   to the user-selected ISF. The encoding is performed on equally
   sized super-frames. Each super-frame corresponds to 2048 samples
   per channel, at the ISF. The codec carries out a number of encoding
   decisions for each super-frame, thereby choosing between different
   encoding algorithms and block lengths, so as to achieve a fidelity-
   optimized encoding adapted to the signal characteristics of the
   source. The stereo encoding (if used) executes separately from the
   monophonic core encoding, thus enabling the selection of different
   combinations of core and stereo encoding rates. The resulting
   encoded audio is produced in four transport frames of equal length.
   Each transport frame corresponds to 512 samples at the ISF, and is
   individually usable by the decoder, provided that its position in
   the super-frame structure is known.

   The codec supports 13 different ISFs, ranging from 12.8 up to 38.4
   kHz, as described by Table 24 of [1]. The high number of ISFs
   allows a trade-off between the audio bandwidth and the target bit-
   rate. As encoding is performed on 2048 samples at the ISF, the
   duration of a super-frame and the effective bit-rate of the frame
   type in use vary.
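
   The dependency of frame duration and bit-rate on the ISF can be
   illustrated with a short, non-normative sketch. The function names
   are hypothetical and chosen for illustration only. The sketch
   assumes 25600 Hz as the reference ISF for bit-rate scaling, which is
   consistent with the nominal value used in the remainder of this
   section; the authoritative bit-rate factors are those of Table 24 in
   [1].

```python
from fractions import Fraction

SAMPLES_PER_SUPER_FRAME = 2048     # per channel, at the ISF
SAMPLES_PER_TRANSPORT_FRAME = 512  # one quarter of a super-frame
REFERENCE_ISF_HZ = 25600           # assumption: reference ISF for scaling


def super_frame_ms(isf_hz):
    """Duration of one super-frame in milliseconds at the given ISF."""
    return Fraction(SAMPLES_PER_SUPER_FRAME * 1000) / Fraction(isf_hz)


def transport_frame_ms(isf_hz):
    """Duration of one transport frame in milliseconds at the given ISF."""
    return Fraction(SAMPLES_PER_TRANSPORT_FRAME * 1000) / Fraction(isf_hz)


def bit_rate_scale(isf_hz):
    """Bit-rate relative to the reference ISF (illustrative assumption).

    The number of samples per frame is fixed, so a higher ISF means
    shorter frames and therefore proportionally more bits per second.
    """
    return Fraction(isf_hz) / REFERENCE_ISF_HZ


if __name__ == "__main__":
    for isf in (12800, 25600, 32000, 38400):
        print(isf, float(super_frame_ms(isf)),
              float(transport_frame_ms(isf)), bit_rate_scale(isf))
```

   Exact fractions are used so that ISFs which are not integers in Hz
   (for example 12800 * 4/3) can be passed in without rounding error.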

   The ISF of 25600 Hz has a super-frame duration of 80 ms. It is the
   'nominal' value used to describe the encoding bit-rates henceforth.
   Assuming this normalization, the ISF selection results in bit-rate
   variations from 1/2 up to 3/2 of the nominal bit-rate.

   The encoding for the extension modes is performed as one monophonic
   core encoding and one stereo encoding. The core encoding is
   executed by splitting the monophonic signal into a lower and a
   higher frequency band. The lower band is encoded employing either
   algebraic code excited linear prediction (ACELP), or transform coded
   excitation (TCX). This selection can be made once per transport
   frame, but must obey certain limitations on legal combinations
   within the super-frame. The higher band is encoded using a low-rate
   parametric bandwidth extension approach.

   The stereo signal is encoded employing a similar frequency band
   decomposition; however, here the signal is divided into three bands
   that are individually parameterized.

   The total bit-rate produced by the extension is the result of the
   combination of the encoder's core rate, stereo rate and ISF. The
   extension supports 8 different core encoding rates producing bit-
   rates between 10.4 and 24.0 kbit/s; see Table 22 in [1]. There are
   16 stereo encoding rates generating bit-rates between 2.0 and 8.0
   kbit/s; see Table 23 in [1]. The frame type encodes the AMR-WB
   modes, 4 fixed extension rates (see below), 24 combinations of core
   and stereo rates for stereo signals, and the 8 core rates for mono
   signals, as listed in Table 25 in [1]. This results in AMR-WB+
   supporting encoding rates between 10.4 and 32 kbit/s, assuming an
   ISF of 25600 Hz.

   Different ISFs allow for additional freedom in the produced bit-
   rates and audio quality. The selection of an ISF changes the
   available audio bandwidth of the reconstructed signal, and also the
   total bit-rate.
   The bit-rate for a given combination of frame type and ISF is
   determined by multiplying the frame type's bit-rate by the bit-rate
   factor of the ISF in use; see Table 24 in [1].

   The extension also has four frame types which have fixed ISFs.
   Please see frame types 10-13 in Table 21 in [1]. These four pre-
   defined frame types have a fixed input sampling frequency at the
   encoder, which can be set either at 16 or 24 kHz. Like the AMR-WB
   frame types, transport frames encoded utilizing these frame types
   represent exactly 20 ms of the audio signal. However, they are also
   part of 80 ms super-frames. Frame types 0-13 (AMR-WB and fixed
   extension rates), as listed in Table 21 in [1], do not require an
   explicit ISF indication. The other frame types 14-47 require the
   ISF employed to be indicated.

   The 32 different frame types of the extension, in combination with
   13 ISFs, allow for great flexibility in bit-rate and selection of
   desired audio quality. A number of combinations exist that produce
   the same codec bit-rate. For example, a 32 kbit/s audio stream can
   be produced by utilizing frame type 41, i.e. 25.6 kbit/s, and the
   ISF of 32 kHz (5/4 * (19.2 + 6.4) = 32 kbit/s), or frame type 47 and
   the ISF of 25.6 kHz (1 * (24 + 8) = 32 kbit/s). Which combination
   is more beneficial for the perceived audio quality depends on the
   content. In the above example the first case provides a higher
   audio bandwidth, while the second one spends the same number of bits
   on somewhat narrower audio bandwidth but provides higher fidelity.
   Encoders are free to select the combination they deem most
   beneficial.

   Since a transport frame always corresponds to 512 samples at the
   used ISF, its duration is limited to the range 13.33 to 40 ms; see
   Table 1.
   An RTP Timestamp clock rate of 72000 Hz, as mandated by
   this specification, results in AMR-WB+ transport frame lengths of
   960 to 2880 timestamp ticks, depending solely on the selected ISF.

   Index   ISF    Duration(ms)  Duration(TS Ticks @ 72 kHz)
   ------------------------------------------------------
     0     N/A      20            1440
     1    12800     40            2880
     2    14400     35.56         2560
     3    16000     32            2304
     4    17067     30            2160
     5    19200     26.67         1920
     6    21333     24            1728
     7    24000     21.33         1536
     8    25600     20            1440
     9    28800     17.78         1280
    10    32000     16            1152
    11    34133     15            1080
    12    36000     14.22         1024
    13    38400     13.33          960

   Table 1: Normative number of RTP Timestamp Ticks for each
            Transport Frame depending on ISF (ISF and Duration in
            ms are rounded)

   The encoder is free to change both the ISF and the encoding frame
   type (both mono and stereo) during a session. For the extension
   frame types with index 10-13 and 16-47, the ISF and frame type
   changes are constrained to occur at super-frame boundaries. This
   implies that, for the frame types mentioned, the ISF is constant
   throughout a super-frame. This limitation does not apply to frame
   types with index 0-9, 14 and 15, i.e. the original AMR-WB frame
   types.

   A number of features of the AMR-WB+ codec require special
   consideration from a transport point of view, and solutions that
   could perhaps be viewed as unorthodox. First, there are constraints
   on the RTP timestamping, due to the relationship of the frame
   duration and the ISFs. Second, each frame of encoded audio must
   maintain information about its frame type, ISF and position in the
   super-frame.

3.2. Multi-rate Encoding and Rate Adaptation

   The multi-rate encoding capability of AMR-WB+ is designed to
   preserve high audio quality under a wide range of bandwidth
   requirements and transmission conditions.

   AMR-WB+ enables seamless switching between frame types that use the
   same number of audio channels and the same ISF. Every AMR-WB+ codec
   implementation is required to support all frame types defined by the
   codec, and must be able to handle switching between any two frame
   types. Switching between frame types employing a different number
   of audio channels or a different ISF must also be supported, but it
   may not be completely seamless. Therefore it is recommended to
   perform such switching infrequently and, if possible, during periods
   of silence.

3.3. Voice Activity Detection and Discontinuous Transmission

   AMR-WB+ supports the same algorithms as AMR-WB for voice activity
   detection (VAD) and generation of comfort noise (CN) parameters
   during silence periods. However, these functionalities can only be
   used in conjunction with the AMR-WB frame types (FT=0-8). This
   option allows reducing the number of transmitted bits and packets
   during silence periods to a minimum. The operation of sending CN
   parameters at regular intervals during silence periods is usually
   called discontinuous transmission (DTX) or source controlled rate
   (SCR) operation. The AMR-WB+ frames containing CN parameters are
   called Silence Indicator (SID) frames. More details about the VAD
   and DTX functionality are provided in [4] and [5].

3.4. Support for Multi-Channel Session

   Some of the AMR-WB+ frame types support the encoding of stereophonic
   audio. Because of this native support for a two-channel
   stereophonic signal, it does not seem necessary to support multi-
   channel transport with separate codec instances, as specified in the
   AMR-WB RTP payload [7]. The codec has the capability of stereo to
   mono downmixing as part of the decoding process.
   Thus, a receiver that is only capable of playout of monophonic audio
   must still be able to decode and play signals originally encoded and
   transmitted as stereo. However, to avoid spending bits on a stereo
   encoding that is not going to be utilized, a mechanism is defined in
   this specification to signal mono-only audio.

3.5. Unequal Bit-error Detection and Protection

   The audio bits encoded in each AMR-WB frame are sorted according to
   their different perceptual sensitivity to bit errors. In cellular
   systems, for example, this property can be exploited to achieve
   better voice quality, by using unequal error protection and
   detection (UEP and UED) mechanisms. However, the bits of the
   extension frame types of the AMR-WB+ codec do not have a consistent
   perceptual significance property and are not sorted in this order.
   Thus, UEP or UED is meaningless with the extension frame types. If
   there is a need to use UEP or UED for AMR-WB frame types, it is
   recommended to use RFC 3267 [7].

3.6. Robustness against Packet Loss

   The payload format supports two mechanisms to improve robustness
   against packet loss: simple forward error correction (FEC) and frame
   interleaving.

3.6.1. Use of Forward Error Correction (FEC)

   Generic forward error correction within RTP is defined, for example,
   in RFC 2733 [11]. Audio redundancy coding is defined in RFC 2198
   [12]. Either scheme can be used to add redundant information to the
   RTP packet stream and make it more resilient to packet losses, at
   the expense of a higher bit rate. Please see either RFC for a
   discussion of the implications of the higher bit rate to network
   congestion.

   In addition to these media-unaware mechanisms, this memo specifies
   an AMR-WB+ specific form of audio redundancy coding, which may be
   beneficial in terms of packetization overhead.
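
   As a non-normative illustration, the AMR-WB+ specific redundancy
   scheme can be thought of as a sliding window over the sequence of
   transport frames, as Figure 1 depicts for a window of two frames.
   The sketch below is a minimal model of that grouping only. The
   function name and the "window" parameter are hypothetical, and the
   frames are treated as opaque objects; payload headers and table of
   contents entries are not modelled.

```python
def redundant_payloads(frames, window=2):
    """Group frames into payloads using a sliding window.

    Each payload carries the newest frame plus up to (window - 1)
    previously transmitted frames, so with window=2 every frame is
    sent once more in the following payload.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    payloads = []
    for newest in range(len(frames)):
        oldest = max(0, newest - window + 1)
        payloads.append(frames[oldest:newest + 1])
    return payloads


if __name__ == "__main__":
    for payload in redundant_payloads(["f0", "f1", "f2", "f3"]):
        print(payload)
```

   A larger window increases robustness at the cost of bit rate and
   receiver delay, which is the trade-off the sender has to manage.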

   Conceptually, previously transmitted transport frame(s) are
   aggregated together with new one(s). A sliding window is used to
   group the frames to be sent in each payload. Figure 1 below shows
   an example.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--

     <---- p(n-1) ---->
              <----- p(n) ----->
                       <---- p(n+1) ---->
                                <---- p(n+2) ---->
                                         <---- p(n+3) ---->
                                                  <---- p(n+4) ---->

   Figure 1: An example of redundant transmission.

   Here, each frame is retransmitted once in the following RTP payload
   packet. The frames f(n-2)...f(n+4) denote a sequence of audio
   frames, and p(n-1)...p(n+4) a sequence of payload packets.

   The mechanism described does not require signaling at the session
   setup. In other words, the audio sender can choose to use this
   scheme without consulting the receiver. For a certain timestamp,
   the receiver may receive multiple copies of a frame containing
   encoded audio data or frames indicated as NO_DATA. The cost of this
   scheme is bandwidth and the receiver delay necessary to allow the
   redundant copy to arrive.

   This redundancy scheme provides functionality similar to that
   described in RFC 2198, but works only if both original frames and
   redundant representations are AMR-WB+ frames. When the use of other
   media coding schemes is desirable, one has to resort to RFC 2198.

   The sender is responsible for selecting an appropriate amount of
   redundancy based on feedback about the channel conditions, e.g. in
   the RTP Control Protocol (RTCP) [3] receiver reports. The sender is
   also responsible for avoiding congestion, which may be exacerbated
   by redundancy (see Section 5 for more details).

3.6.2. Use of Frame Interleaving

   To decrease protocol overhead, the payload design allows several
   audio transport frames to be encapsulated into a single RTP packet.
   One of the drawbacks of such an approach is that in case of packet
   loss several consecutive frames are lost. Consecutive frame loss
   normally renders error concealment less efficient and usually causes
   clearly audible and annoying distortions in the reconstructed audio.
   Interleaving of transport frames can improve the audio quality in
   such cases by distributing the consecutive losses into a number of
   isolated frame losses, which are easier to conceal. However,
   interleaving and bundling several frames per payload also increases
   end-to-end delay and sets higher buffering requirements. Therefore,
   interleaving is not appropriate for all use cases or devices.
   Streaming applications should most likely be able to exploit
   interleaving to improve audio quality in lossy transmission
   conditions.

   Note that this payload design supports the use of frame interleaving
   as an option. The usage of this feature needs to be negotiated in
   the session set-up.

   The interleaving supported by this format is rather flexible. For
   example, a continuous pattern can be defined, as depicted in Figure
   2.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--

              [  P(n)  ]
     [ P(n+1) ]                 [ P(n+1) ]
                       [ P(n+2) ]                 [ P(n+2) ]
                                         [ P(n+3) ]                 [P(
                                                           [ P(n+4) ]

   Figure 2: An example of interleaving pattern that has constant
             delay.

   In Figure 2 the consecutive frames, denoted f(n-2) to f(n+4), are
   aggregated into packets P(n) to P(n+4), each packet carrying two
   frames.
   This approach provides an interleaving pattern that allows
   for constant delay in both the interleaving and deinterleaving
   processes. The deinterleaving buffer needs to have room for at
   least three frames, including the one that is ready to be consumed.
   The storage space for three frames is needed, for example, when f(n)
   is the next frame to be decoded: since frame f(n) was received in
   packet P(n+2) carrying also frame f(n+3), both these frames are
   stored in the buffer. Furthermore, frame f(n+1) received in the
   previous packet P(n+1) is also in the deinterleaving buffer. Note
   also that in this example the buffer occupancy varies: when frame
   f(n+1) is the next one to be decoded, there are only two frames,
   f(n+1) and f(n+3), in the buffer.

3.7. AMR-WB+ Audio over IP scenarios

   Since the primary target application for the AMR-WB+ codec is
   streaming over packet networks, the most relevant usage scenario for
   this payload format is IP end-to-end between a server and a
   terminal, as shown in Figure 3.

   +----------+                          +----------+
   |          |    IP/UDP/RTP/AMR-WB+    |          |
   |  SERVER  |<------------------------>| TERMINAL |
   |          |                          |          |
   +----------+                          +----------+

   Figure 3: Server to terminal IP scenario

3.8. Out-of-Band Signaling

   Some of the options of this payload format remain constant
   throughout a session. Therefore, they can be controlled/negotiated
   at the session set-up. Throughout this specification, these options
   and variables are denoted as "parameters to be established through
   out-of-band means". In Section 7, all of the parameters are
   formally specified in the form of media type registration for the
   AMR-WB+ encoding.
   The method used to signal these parameters at
   session setup or to arrange prior agreement of the participants is
   beyond the scope of this document; however, Section 7.2 provides a
   mapping of the parameters into the Session Description Protocol
   (SDP) [6] for those applications that use SDP.

4. RTP Payload Format for AMR-WB+

   The main emphasis in the payload design for AMR-WB+ has been to
   minimize the overhead in typical use cases, while providing full
   flexibility with a slightly higher overhead. In order to keep the
   specification reasonably simple, we refrained from defining frame-
   specific parameters for each frame type. Instead, a few common
   parameters were specified that cover all types of frames.

   The payload format has two modes, basic mode and interleaved mode.
   The main structural difference between the two modes is the
   extension of the table of contents entries with frame displacement
   fields (when operating in the interleaved mode). The basic mode
   supports aggregation of multiple consecutive frames in a payload.
   The interleaved mode supports aggregation of multiple frames that
   are non-consecutive in time. In both modes it is possible to have
   frames encoded with different frame types in the same payload. The
   ISF must remain constant throughout the payload of a single packet.

   The payload format is designed around the property of AMR-WB+ frames
   that the frames are consecutive in time and share the same frame
   duration (in the absence of an ISF change). This enables the
   receiver to derive the timestamp for an individual frame within a
   payload. In basic mode, the deriving process is based on the order
   of frames. In interleaved mode, it is based on the compact
   displacement fields. The frame timestamps are used to regenerate
   the correct order of frames after reception, identify duplicates,
   and detect lost frames that require concealment.
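
   The receiver-side use of frame timestamps described above can be
   sketched as follows. This is a non-normative illustration under
   simplifying assumptions: the function name is hypothetical, the
   frame duration in timestamp ticks is constant (no ISF change), all
   timestamps lie on the same frame grid, and timestamp wrap-around is
   ignored.

```python
def rebuild_sequence(received, ticks_per_frame):
    """Restore playout order from (timestamp, frame) pairs.

    Duplicates (same timestamp) are dropped, keeping the first copy.
    Gaps in the timestamp grid are reported as (timestamp, None) so
    that the caller can apply error concealment for the lost frame.
    """
    unique = {}
    for timestamp, frame in received:
        unique.setdefault(timestamp, frame)  # ignore later duplicates
    if not unique:
        return []
    first, last = min(unique), max(unique)
    sequence = []
    timestamp = first
    while timestamp <= last:
        sequence.append((timestamp, unique.get(timestamp)))
        timestamp += ticks_per_frame
    return sequence


if __name__ == "__main__":
    arrivals = [(2880, "c"), (0, "a"), (2880, "c"), (5760, "e")]
    for entry in rebuild_sequence(arrivals, 1440):
        print(entry)
```

   A real receiver would additionally bound the buffer size and handle
   32-bit timestamp wrap-around and ISF changes between packets.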

   The interleaving scheme of this payload format is significantly more
   flexible than the one specified in RFC 3267. The AMR and AMR-WB
   payload format is only capable of using periodic patterns with
   frames taken from an interleaving group at fixed intervals. The
   interleaving scheme of this specification, in contrast, allows for
   any interleaving pattern, as long as the distance in decoding order
   between any two adjacent frames is not more than 256 frames. Note
   that even at the highest ISF this allows an interleaving depth up to
   3.41 seconds.

   To allow for error resiliency through redundant transmission, the
   periods covered by multiple packets MAY overlap in time. A receiver
   MUST be prepared to receive any audio frame multiple times. All
   redundantly sent frames MUST use the same frame type and ISF, and
   MUST have the same RTP timestamp, or MUST be a NO_DATA frame
   (FT=15).

   The payload consists of octet aligned elements (header, ToC and
   audio frames). Only the audio frames for AMR-WB frame types (0-9)
   require padding for octet alignment. If additional padding is
   desired, then the P bit in the RTP header MAY be set and padding MAY
   be appended as specified in [3].

4.1. RTP Header Usage

   The format of the RTP header is specified in [3]. This payload
   format uses the fields of the header in a manner consistent with
   that specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first frame in the packet. The timestamp
   clock frequency SHALL be 72000 Hz. This frequency allows the frame
   duration to be integer RTP timestamp ticks for the ISFs specified in
   Table 1. It also provides reasonable conversion factors to the
   input/output audio sampling frequencies supported by the codec. See
   section 4.3.1 for guidance on how to derive the RTP timestamp for
   any audio frame beyond the first one.
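
   The effect of the 72000 Hz clock can be checked with a short Python
   sketch. It assumes that each transport frame covers 512 samples at
   the internal sampling frequency, as stated in Section 4.3.3. The
   helper name frame_duration_ticks is hypothetical, and the three ISF
   values are merely the ones used in the examples of this document;
   Table 1 remains the authoritative list.

```python
from fractions import Fraction

RTP_CLOCK_HZ = 72000       # timestamp clock frequency required by this format
SAMPLES_PER_FRAME = 512    # samples per transport frame at the ISF (assumption
                           # taken from Section 4.3.3)


def frame_duration_ticks(isf_hz):
    """Return the transport frame duration in 72 kHz timestamp ticks."""
    ticks = Fraction(SAMPLES_PER_FRAME * RTP_CLOCK_HZ, isf_hz)
    if ticks.denominator != 1:
        # The clock rate was chosen so that this does not happen for the
        # ISFs of Table 1; other rates are rejected rather than rounded.
        raise ValueError("%d Hz gives a non-integer tick count" % isf_hz)
    return int(ticks)


for isf_hz in (25600, 32000, 38400):
    print(isf_hz, "Hz ->", frame_duration_ticks(isf_hz), "ticks")
```

   The resulting values (1440, 1152, and 960 ticks) are the frame
   durations that appear in the timestamp examples later in this
   section.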

   The RTP header marker bit (M) SHALL be set to 1 whenever the first
   frame carried in the packet is the first frame in a talkspurt (see
   definition of the talkspurt in section 4.1 of [9]). For all other
   packets the marker bit SHALL be set to zero (M=0).

   The assignment of an RTP payload type for the format defined in this
   memo is outside the scope of this document. The RTP profile in use
   either assigns a static payload type or mandates binding the payload
   type dynamically.

   The media type parameter "channels" is used to indicate the maximum
   number of channels allowed for a given payload type. A payload type
   for which channels=1 (mono) has been declared SHALL only carry mono
   content. A payload type for which channels=2 has been declared MAY
   carry both mono and stereo content. Note that this definition is
   different from the one in RFC 3551 [9]. As mentioned before, the
   AMR-WB+ codec handles the support of stereo content and the
   (eventual) downmixing of stereo to mono internally. This makes it
   unnecessary to negotiate for the number of channels for reasons
   other than bit-rate efficiency.

4.2. Payload Structure

   The payload consists of a payload header, a table of contents, and
   the audio data representing one or more audio frames. The following
   diagram shows the general payload format layout:

   +----------------+-------------------+----------------
   | payload header | table of contents | audio data ...
   +----------------+-------------------+----------------

   Payloads containing more than one audio frame are called compound
   payloads.

   The following sections describe the variations taken by the payload
   format depending on the mode in use, basic mode or interleaved mode.

4.3. Payload Definitions

4.3.1. Payload Header

   The payload header carries data that is common for all frames in the
   payload. The structure of the payload header is described below.

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |   ISF   |TFI|L|
   +-+-+-+-+-+-+-+-+

   ISF (5 bits): Indicates the Internal Sampling Frequency employed for
      all frames in this payload. The index value corresponds to the
      internal sampling frequency as specified in Table 24 in [1].
      This field SHALL be set to 0 for payloads containing frames with
      Frame Type values 0-13.

   TFI (2 bits): Transport Frame Index, from 0 (first) to 3 (last),
      indicating the position of the first transport frame of this
      payload in the AMR-WB+ super-frame structure. For payloads with
      frames of only Frame Type values 0-9 this field SHALL be set to
      0. The TFI value for a frame of type 0-9 SHALL be ignored. Note
      that the frame type is coded in the table of contents (as
      discussed later); hence the mentioned dependencies on the frame
      type can be applied easily by interpreting only values carried in
      the payload header. It is not necessary to interpret the audio
      bit stream itself.

   L (1 bit): Long displacement field flag for payloads in interleaved
      mode. If set to 0, four-bit displacement fields are used to
      indicate interleaving offset; if set to 1, displacement fields of
      eight bits are used (see section 4.3.2.2). For payloads in the
      basic mode this bit SHALL be set to 0 and SHALL be ignored by the
      receiver.

   Note that frames employing different ISF values require
   encapsulation in separate packets. Thus, special considerations
   apply when generating interleaved packets and an ISF change is
   executed. In particular, frames that, according to the previously
   used interleaving pattern, would be aggregated into a single packet
   have to be separated into different packets, so that the
   aforementioned condition (all frames in a packet share the ISF)
   remains true.
   A naive implementation that splits the frames with
   different ISF into different packets can result in up to twice the
   number of RTP packets, when compared to an optimal interleaved
   solution. Alteration of the interleaving before and after the ISF
   change may reduce the need for extra RTP packets.

4.3.2. The Payload Table of Contents

   The table of contents (ToC) consists of a list of entries; each
   entry corresponds to a group of audio frames carried in the payload,
   as depicted below.

   +----------------+----------------+- ... -+----------------+
   |  ToC entry #1  |  ToC entry #2  |       |  ToC entry #N  |
   +----------------+----------------+- ... -+----------------+

   When multiple groups of frames are present in a payload, the ToC
   entries SHALL be placed in the packet in order of increasing RTP
   timestamp value (modulo 2^32) of the first transport frame that the
   ToC entry represents.

4.3.2.1. ToC Entry in the Basic Mode

   A ToC entry of a payload in the basic mode has the following format:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |   #frames     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this ToC entry is followed by
      another ToC entry; if set to 0, indicates that this ToC entry is
      the last one in the ToC.

   Frame Type (FT) (7 bits): Indicates the audio codec frame type used
      for the group of frames referenced by this ToC entry. FT
      designates the combination of AMR-WB+ core and stereo rate, one
      of the special AMR-WB+ frame types, the AMR-WB rate, or comfort
      noise, as specified by Table 25 in [1].

   #frames (8 bits): Indicates the number of frames in the group
      referenced by this ToC entry. ToC entries with this field equal
      to 0 (that would indicate zero frames) SHALL NOT be used, and
      received packets with such a ToC entry SHALL be discarded.

4.3.2.2.
ToC Entry in the Interleaved Mode

   Two different ToC entry formats are defined in interleaved mode.
   They differ in the length of the displacement field, 4 bits or 8
   bits. The L-bit in the payload header differentiates between the
   two formats.

   If L=0, a ToC entry has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |   #frames     | DIS1  |  ...  | DISi  |  ...  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ...  |  ...  | DISn  | Padd  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F (1 bit): See definition in 4.3.2.1.

   Frame Type (FT) (7 bits): See definition in 4.3.2.1.

   #frames (8 bits): See definition in 4.3.2.1.

   DIS1...DISn (4 bits): A list of n (n=#frames) displacement fields
      indicating the displacement of the i:th (i=1..n) audio frame
      relative to the preceding audio frame in the payload, in units of
      frames. The four-bit unsigned integer displacement values may be
      between 0 and 15, indicating the number of audio frames in
      decoding order between the (i-1):th and the i:th frame in the
      payload. Note that for the first ToC entry of the payload the
      value of DIS1 is meaningless. It SHALL be set to zero by a
      sender, and SHALL be ignored by a receiver. This frame's location
      in the decoding order is uniquely defined by the RTP timestamp
      and TFI in the payload header. Note also that for subsequent ToC
      entries DIS1 indicates the number of frames between the last
      frame of the previous group and the first frame of this group.

   Padd (4 bits): To ensure octet alignment, four padding bits SHALL be
      included at the end of the ToC entry in case there is an odd
      number of frames in the group referenced by this entry. These
      bits SHALL be set to zero and SHALL be ignored by the receiver.
      If a group containing an even number of frames is referenced by
      this ToC entry, these padding bits SHALL NOT be included in the
      payload.

   If L=1, a ToC entry has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |   #frames     |     DIS1      |      ...      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      ...      |     DISn      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F (1 bit): See definition in 4.3.2.1.

   Frame Type (FT) (7 bits): See definition in 4.3.2.1.

   #frames (8 bits): See definition in 4.3.2.1.

   DIS1...DISn (8 bits): A list of n (n=#frames) displacement fields
      indicating the displacement of the i:th (i=1..n) audio frame
      relative to the preceding audio frame in the payload, in units of
      frames. The eight-bit unsigned integer displacement values may
      be between 0 and 255 indicating the number of audio frames in
      decoding order between the (i-1):th and the i:th frame in the
      payload. Note that for the first ToC entry of the payload the
      value of DIS1 is meaningless. It SHALL be set to zero by a
      sender, and SHALL be ignored by a receiver. This frame's location
      in the decoding order is uniquely defined by the RTP timestamp
      and TFI in the payload header. Note also that for subsequent ToC
      entries DIS1 indicates the displacement between the last frame of
      the previous group and the first frame of this group.

4.3.2.3. RTP Timestamp Derivation

   The RTP Timestamp value for a frame SHALL be the timestamp value of
   the first audio sample encoded in the frame. The timestamp value
   for a frame is derived differently depending on the payload mode,
   basic or interleaved. In both cases the first frame in a compound
   packet has an RTP timestamp equal to the one received in the RTP
   header.
   In the basic mode, the RTP timestamp for any subsequent frame is
   derived in two steps. First, the sum of the frame durations (see
   Table 1) of all the preceding frames in the payload is calculated.
   Then, this sum is added to the RTP header timestamp value. For
   example, if the RTP Header timestamp value is 12345, the payload
   carries four frames, and the frame duration is 16 ms (ISF = 32 kHz)
   corresponding to 1152 timestamp ticks, the RTP timestamp of the
   fourth frame in the payload is 12345 + 3 * 1152 = 15801.

   In interleaved mode, the RTP timestamp for each frame in the payload
   is derived from the RTP header timestamp and the sum of the time
   offsets of all preceding frames in this payload. The frame
   timestamps are computed based on displacement fields and the frame
   duration derived from the ISF value. Note that the displacement in
   time between frame i-1 and frame i is (DISi + 1) * frame duration
   because the duration of the (i-1):th frame must also be taken into
   account. The timestamp of the first frame of the first group of
   frames (TS(1)), i.e. the first frame of the payload, is the RTP
   header timestamp. For subsequent frames in the group the timestamp
   is computed by

      TS(i) = TS(i-1) + (DISi + 1) * frame duration,   2 <= i <= n

   For subsequent groups of frames the timestamp of the first frame is
   computed by

      TS(1) = TSprev + (DIS1 + 1) * frame duration,

   where TSprev denotes the timestamp of the last frame in the previous
   group. The timestamps of the subsequent frames in the group are
   computed in the same way as for the first group.
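
   The two derivation rules above can be sketched in Python as
   follows. This is an illustrative sketch only: the function names
   and the representation of the ToC as a list of displacement lists
   (one list per ToC entry) are assumptions made for this example, and
   the frame duration in ticks is taken as an input rather than
   derived from the ISF field.

```python
def basic_mode_timestamps(rtp_ts, frame_ticks, frame_count):
    """Timestamps of consecutive frames in a basic mode payload."""
    return [(rtp_ts + k * frame_ticks) % 2**32 for k in range(frame_count)]


def interleaved_mode_timestamps(rtp_ts, frame_ticks, groups):
    """Timestamps of the frames in an interleaved mode payload.

    groups -- one list of displacement values (DIS1..DISn) per ToC
              entry, in ToC order.
    """
    timestamps = []
    ts = None
    for displacements in groups:
        for dis in displacements:
            if ts is None:
                # First frame of the payload: its displacement value is
                # ignored and the RTP header timestamp is used.
                ts = rtp_ts % 2**32
            else:
                # The same step applies within a group and across groups,
                # because TSprev is the last timestamp computed so far.
                ts = (ts + (dis + 1) * frame_ticks) % 2**32
            timestamps.append(ts)
    return timestamps


print(basic_mode_timestamps(12345, 1152, 4))
print(interleaved_mode_timestamps(12345, 1152, [[0, 6, 4, 7]]))
```

   With a header timestamp of 12345 and 1152 ticks per frame, the
   fourth basic mode timestamp is 15801, and displacement values of
   0, 6, 4, and 7 yield 12345, 20409, 26169, and 35385.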

   The following example derives the RTP timestamps for the frames in
   an interleaved mode payload having the following header and ToC
   information:

      RTP header timestamp: 12345
      ISF = 32 kHz
      Frame 1 displacement field: DIS1 = 0
      Frame 2 displacement field: DIS2 = 6
      Frame 3 displacement field: DIS3 = 4
      Frame 4 displacement field: DIS4 = 7

   Assuming an ISF of 32 kHz, which implies a frame duration of 16 ms,
   one frame lasts 1152 ticks. The timestamp of the first frame in the
   payload is the RTP timestamp, i.e. TS(1) = RTP TS. Note that the
   displacement field value for this frame must be ignored. For the
   second frame in the payload the timestamp can be calculated as TS(2)
   = TS(1) + (DIS2 + 1) * 1152 = 20409. For the third frame the
   timestamp is TS(3) = TS(2) + (DIS3 + 1) * 1152 = 26169. Finally,
   for the fourth frame of the payload we have TS(4) = TS(3) + (DIS4 +
   1) * 1152 = 35385.

4.3.2.4. Frame Type Considerations

   The value of Frame Type (FT) is defined in Table 25 in [1]. FT=14
   (AUDIO_LOST) is used to denote frames that are lost. A NO_DATA
   (FT=15) frame can result from two conditions: first, no data has
   been produced by the audio encoder; second, no data is transmitted
   in the current payload. An example of the latter is a frame that
   has been or will be sent in an earlier or later packet. The
   duration for these non-included frames is dependent on the internal
   sampling frequency indicated by the ISF field.

   For frame types with index 0-13 the ISF field SHALL be set to 0.
   The frame duration for these frame types is fixed to 20 ms in time,
   i.e. 1440 ticks in 72 kHz. For payloads containing only frames of
   type 0-9, the TFI field SHALL be set to 0, and SHALL be ignored by
   the receiver.
   In a payload combining frames of type 0-9 and 10-13 the
   TFI value needs to be set to match the transport frames of type
   10-13. Thus, frames of type 0-9 will also have a derived TFI, which
   is ignored.

4.3.2.5. Other TOC Considerations

   If a ToC entry with an undefined FT value is received, the whole
   packet SHALL be discarded. This is to avoid the loss of data
   synchronization in the depacketization process, which can result in
   a severe degradation in audio quality.

   Packets containing only NO_DATA frames SHOULD NOT be transmitted.
   Also, NO_DATA frames at the end of a frame sequence to be carried in
   a payload SHOULD NOT be included in the transmitted packet. The
   AMR-WB+ SCR/DTX is identical with AMR-WB SCR/DTX described in [5]
   and can only be used in combination with the AMR-WB frame types
   (0-8).

   When multiple groups of frames are present, their ToC entries SHALL
   be placed in the ToC in the order of increasing RTP timestamp value
   (modulo 2^32) of the first transport frame the ToC entry represents,
   independent of the payload mode. In basic mode the frames SHALL be
   consecutive in time, while in interleaved mode the frames MAY be
   non-consecutive in time and MAY even have varying inter-frame
   distances.

4.3.2.6. ToC Examples

   The following example illustrates a ToC for three audio frames in
   basic mode. Note that in this case all audio frames are encoded
   using the same frame type, i.e. there is only one ToC entry.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Frame Type1 |  #frames = 3  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The next example depicts a ToC of three entries in basic mode. Note
   that in this case the payload also carries three frames, but three
   ToC entries are needed because the frames of the payload are encoded
   using different frame types.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Frame Type1 |  #frames = 1  |1| Frame Type2 |  #frames = 1  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Frame Type3 |  #frames = 1  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The following example illustrates a ToC with two entries in
   interleaved mode using four-bit displacement fields. The payload
   includes two groups of frames, the first one including a single
   frame, and the other one consisting of two frames.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1| Frame Type1 |  #frames = 1  | DIS1  | padd  |0| Frame Type2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  #frames = 2  | DIS1  | DIS2  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.3. Audio Data

   Audio data of a payload consists of zero or more audio frames, as
   described in the ToC of the payload.

   ToC entries with FT=14 or 15 represent frame types with a length of
   0. Hence, no data SHALL be placed in the audio data section to
   represent frames of this type.

   As discussed earlier, each audio frame of an extension frame
   type represents an AMR-WB+ transport frame corresponding to the
   encoding of 512 samples of audio, sampled with the internal sampling
   frequency specified by the ISF indicator. As an exception, frame
   types with index 10-13 are only capable of using a single internal
   sampling frequency (25600 Hz). The encoding rates (combination of
   core bit-rate and stereo bit-rate) are indicated in the frame type
   field of the corresponding ToC entry. The octet length of the audio
   frame is implicitly defined by the frame type field and is given in
   tables 21 and 25 of [1].
   The order and numbering notation of the
   bits are as specified in [1]. For the AMR-WB+ extension frame types
   and comfort noise frames, the bits are in the order produced by the
   encoder. The last octet of each audio frame MUST be padded with
   zeroes at the end if not all bits in the octet are used. In other
   words, each audio frame MUST be octet-aligned.

4.3.4. Methods for Forming the Payload

   The payload begins with the payload header, followed by the table of
   contents that consists of a list of ToC entries.

   The audio data follows the table of contents. All of the octets
   comprising an audio frame SHALL be appended to the payload as a
   unit. The audio frames are packetized in timestamp order within
   each group of frames (per ToC entry). The groups of frames are
   packetized in the same order as their corresponding ToC entries.
   Note that there are no data octets in a group having a ToC entry
   with FT=14 or FT=15.

4.3.5. Payload Examples

4.3.5.1. Example 1, Basic Mode Payload Carrying Multiple Frames Encoded
         Using the Same Frame Type

   Figure 4 depicts a payload that carries three AMR-WB+ frames encoded
   using 14 kbit/s frame type (FT=26) with a frame length of 280 bits
   (35 bytes). The internal sampling frequency in this example is 25.6
   kHz (ISF = 8). The TFI for the first frame is 2, indicating that
   the first transport frame in this payload is the third in a
   super-frame. Since this payload is in the basic mode the subsequent
   frames of the payload are consecutive frames in decoding order, i.e.
   the fourth transport frame of the current super-frame and the first
   transport frame of the next super-frame. Note that because the
   frames are all encoded using the same frame type, only one ToC entry
   is required.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF = 8 | 2 |0|0|   FT = 26   |  #frames = 3  |   f1(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      ...      | f1(272...279) |   f2(0...7)   |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f2(272...279) |   f3(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      ...                      | f3(272...279) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 4: An example of a basic mode payload carrying three frames
             of the same frame type.

4.3.5.2. Example 2, Basic Mode Payload Carrying Multiple Frames Encoded
         Using Different Frame Types

   Figure 5 depicts a payload that carries three AMR-WB+ frames; the
   first frame is encoded using 18.4 kbit/s frame type (FT=33) with a
   frame length of 368 bits (46 bytes), and the two subsequent frames
   are encoded using 20 kbit/s frame type (FT=35) having frame length
   of 400 bits (50 bytes). The internal sampling frequency in this
   example is 32 kHz (ISF = 10), implying the overall bit-rates of 23
   kbit/s for the first frame of the payload, and 25 kbit/s for the
   subsequent frames. The TFI for the first frame is 3, indicating
   that the first transport frame in this payload is the fourth in a
   super-frame. Since this is a payload in the basic mode, the
   subsequent frames of the payload are consecutive frames in decoding
   order, i.e. the first and second transport frames of the next
   super-frame.
   Note that since the payload carries two different
   frame types, there are two ToC entries.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF=10  | 3 |0|1|   FT = 33   |  #frames = 1  |0|   FT = 35   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  #frames = 2  |   f1(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f1(360...367) |   f2(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f2(392...399) |   f3(0...7)   |              ...              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f3(392...399) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 5: An example of a basic mode payload carrying three frames
             employing two different frame types.

4.3.5.3. Example 3, Payload in Interleaved Mode

   The example in Figure 6 depicts a payload in interleaved mode,
   carrying four frames encoded using 32 kbit/s frame type (FT=47) with
   frame length of 640 bits (80 bytes). The internal sampling
   frequency is 38.4 kHz (ISF = 13), implying a bit-rate of 48 kbit/s
   for all frames in the payload. The TFI for the first frame is 0,
   hence it is the first transport frame of a super-frame. The
   displacement fields for the subsequent frames are DIS2=18, DIS3=15,
   and DIS4=10, which indicates that the subsequent frames have the
   TFIs of 3, 3, and 2, respectively. The long displacement field flag
   L in the payload header is set to 1, which results in the use of
   eight bits for the displacement fields in the ToC entry.
   Note that since all frames of this payload are encoded using the
   same frame type, only a single ToC entry is needed. Furthermore,
   the displacement field for the first frame (corresponding to the
   first ToC entry with DIS1=0) must be ignored, since its timestamp
   and TFI are defined by the RTP timestamp and the TFI found in the
   payload header.

   The RTP timestamp values of the frames in this example are:

      Frame1: TS1 = RTP Timestamp
      Frame2: TS2 = TS1 + 19 * 960
      Frame3: TS3 = TS2 + 16 * 960
      Frame4: TS4 = TS3 + 11 * 960

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ISF=13  | 0 |1|0|   FT = 47   |  #frames = 4  |   DIS1 = 0    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   DIS2 = 18   |   DIS3 = 15   |   DIS4 = 10   |   f1(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f1(632...639) |   f2(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f2(632...639) |   f3(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f3(632...639) |   f4(0...7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | f4(632...639) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 6: An example of an interleaved mode payload carrying four
             frames at the same frame type.

4.4. Interleaving Considerations

   The use of interleaving requires further considerations.
   As presented in the example in Section 3.6.2, a given interleaving
   pattern requires a certain amount of deinterleaving buffer space.
   This buffer space, expressed in a number of transport frame slots,
   is indicated by the "interleaving" media parameter. The number of
   frame slots needed can be converted into actual memory requirements
   by considering the 80 bytes per frame used by the largest
   combination of AMR-WB+'s core and stereo rates.

   The information about the frame buffer size is not always sufficient
   to determine when it is appropriate to start consuming frames from
   the interleaving buffer. There are two cases in which additional
   information is needed: first, when switching of the ISF occurs, and
   second, when the interleaving pattern changes. The "int-delay" media
   type parameter is defined to convey this information. It allows a
   sender to indicate the minimal media time that needs to be present
   in the buffer before the decoder can start consuming frames from the
   buffer. Because the sender has full control over ISF changes and
   the interleaving pattern, it can calculate this value.

   In certain cases, for example if joining a multicast session with
   interleaving mid-session, a receiver may initially receive only part
   of the packets in the interleaving pattern. This initial partial
   reception (in frame sequence order) of frames can yield too few
   frames for acceptable quality from the audio decoding. This problem
   also arises when using encryption for access control, and the
   receiver does not have the previous key.

   Although the AMR-WB+ is robust and thus tolerant to a high random
   frame erasure rate, it would have difficulties handling consecutive
   frame losses at startup. Thus some special implementation
   considerations are described.
   In order to efficiently handle this
   type of startup, it must be noted that decoding is only possible to
   start at the beginning of a super-frame, and that holds true even if
   the first transport frame is indicated as lost. Secondly, decoding
   is only RECOMMENDED to start if at least 2 transport frames are
   available out of the 4 belonging to that super-frame.

   After receiving a number of packets, in the worst case as many
   packets as the interleaving pattern covers, the previously described
   effects disappear and normal decoding is resumed.

   Similar issues arise when a receiver leaves a session or has lost
   access to the stream. In the case of the receiver leaving the
   session, this would be a minor issue since playout is normally
   stopped. It is also a minor issue for the case of lost access, since
   the AMR-WB+ error concealment will fade out the audio if massive
   consecutive losses are encountered.

   The sender can avoid this type of problem in many sessions by
   starting and ending interleaving patterns correctly when risks of
   losses occur. One such example is a key-change done for access
   control to encrypted streams. If only some keys are provided to
   clients and there is a risk of them receiving content for which they
   do not have the key, it is recommended that interleaving patterns
   not overlap key changes.

4.5. Implementation Considerations

   An application implementing this payload format MUST understand all
   the payload parameters. Any mapping of the parameters to a
   signaling protocol MUST support all parameters. So an
   implementation of this payload format in an application using SDP is
   required to understand all the payload parameters in their
   SDP-mapped form. This requirement ensures that an implementation
   can always decide whether it is capable of communicating.

   Both basic and interleaving mode SHALL be implemented.
   The implementation burden of both is rather small and requiring both
   ensures interoperability. As the AMR-WB+ codec contains the full
   functionality of the AMR-WB codec, it is RECOMMENDED to also
   implement the payload format in RFC 3267 [7] for the AMR-WB frame
   types when implementing this specification. Doing so makes the
   interoperability with devices that only support AMR-WB more likely.

   The switching of ISF combined with packet loss could result in
   concealment using the wrong audio frame length. This can occur if
   packet loss(es) result in lost frames directly after the point of
   ISF change. The packet loss would prevent the receiver from
   noticing the changed ISF, so that it conceals the lost transport
   frame with the previous ISF instead of the new one. Such an error,
   although always detectable later, results in boundary misalignment,
   which can cause audio distortions and problems with synchronization,
   as too many or too few audio samples were created. This problem can
   be mitigated in most cases by performing ISF recovery prior to
   concealment as outlined in section 4.5.1 below.

4.5.1. ISF recovery in case of packet loss

   In case of packet loss, it is important that the AMR-WB+ decoder
   initiates a proper error concealment to replace the frames carried
   in the lost packet. A loss concealment algorithm requires a codec
   framing that matches the timestamps of the correctly received
   frames. Hence, it is necessary to recover the timestamps of the
   lost frames. Doing so is non-trivial because the codec frame
   length that is associated with the ISF may have changed during the
   frame loss.

   In the following, the recovery of the timestamp information of lost
   frames is illustrated by the means of an example.
   Two frames with timestamps t0 and t1 have been received properly,
   the first one being carried in the last packet before the loss, and
   the latter one in the first packet after the loss period. The ISF
   values for these packets are isf0 and isf1, respectively. The TFIs
   of these frames are tfi0 and tfi1, respectively. The associated
   frame lengths (in timestamp ticks) are given as L0 and L1,
   respectively. In this example three frames with timestamps x1 - x3
   have been lost. The example further assumes that ISF changes once
   from isf0 to isf1 during the frame loss period, as shown in the
   figure below.

   Since not all information required for the full recovery of the
   timestamps is generally known in the receiver, an algorithm is
   needed to estimate the ISF associated with the lost frames. Also
   the number of lost frames needs to be recovered.

     |<---L0--->|<---L0--->|<-L1->|<-L1->|<-L1->|

     |   Rxd    |   lost   | lost | lost | Rxd  |
   --+----------+----------+------+------+------+--

     t0         x1         x2     x3     t1

   Example Algorithm:

   Start:                               # check for frame loss
   If (t0 + L0) == t1 Then goto End     # no frame loss

   Step 1:                              # check case with no ISF change
   If (isf0 != isf1) Then goto Step 2   # At least one ISF change
   If (isFractional((t1 - t0)/L0)) Then goto Step 3
                                        # More than 1 ISF change

   Return recovered timestamps as
   x(n) = t0 + n*L1 and associated ISF equal to isf0, for 0