idnits 2.17.1 draft-ahmadi-avt-rtp-vmr-wb-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts -- however, there's a paragraph with a matching beginning. Boilerplate error? == It seems as if not all pages are separated by form feeds - found 0 form feeds but 32 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 33 instances of too long lines in the document, the longest one being 5 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 1529 has weird spacing: '...limited mode ...' == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'not RECOMMENDED' in this paragraph: VMR-WB has the capability to operate with 8000 Hz sampled input/output speech signals in all modes of operation [1]. Mode switching can be utilized to change the mode of operation while processing narrowband speech signals. 
However, during a session, transition between narrowband and wideband processing is not RECOMMENDED due to different timestamps and other likely synchronization problems. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'not RECOMMENDED' in this paragraph: For narrowband operation of VMR-WB, the input/output sampling frequency is 8 kHz, corresponding to 160 encoded speech samples per frame from each channel. Thus, the timestamp is increased by 160 for VMR-WB for each consecutive frame-block while processing narrowband input/output speech signals. The choice of sampling frequency MUST be indicated in the beginning of a session (see section 10). The default input/output sampling rate is 16 kHz. Note that during a session, the change of sampling rate is not RECOMMENDED. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'SHOULD not' in this paragraph: Late packets (i.e., unavailability of a packet when needed for decoding at the receiver) SHALL be treated as lost packets. Furthermore, if the late packet is part of an interleave group, depending upon the availability of the other packets in that interleave group, decoding MUST be resumed from the next (sequential order) available packet. In other words, the unavailability of a packet in an interleave group at certain time SHOULD not invalidate the other packets within that interleave group that MAY arrive later. == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). 
Found 'SHALL not' in this paragraph: interleaving: Indicates that frame-block level interleaving SHALL be used for the session and its value defines the maximum number of frame-blocks allowed in an interleaving group (see Section 6.3.1). If this parameter is not present, interleaving SHALL not be used. The presence of this parameter also implies automatically that octet-aligned operation SHALL be used. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 17, 2004) is 7283 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: '6' is defined on line 1581, but no explicit reference was found in the text == Unused Reference: '7' is defined on line 1585, but no explicit reference was found in the text == Unused Reference: '9' is defined on line 1592, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. '1' ** Obsolete normative reference: RFC 3267 (ref. '4') (Obsoleted by RFC 4867) ** Obsolete normative reference: RFC 2327 (ref. '5') (Obsoleted by RFC 4566) -- Obsolete informational reference (is this intentional?): RFC 3448 (ref. '6') (Obsoleted by RFC 5348) -- Obsolete informational reference (is this intentional?): RFC 2733 (ref. '7') (Obsoleted by RFC 5109) Summary: 6 errors (**), 0 flaws (~~), 10 warnings (==), 5 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Audio Video Transport WG Sassan Ahmadi 2 INTERNET-DRAFT Nokia Inc. 3 Category: Standards Track May 17, 2004 4 Expires: November 17, 2004 6 Real-Time Transport Protocol (RTP) Payload and File Storage 7 Formats for the Variable-Rate Multimode Wideband (VMR-WB) 8 Audio Codec 9 11 Status of this Memo 13 This document is an Internet-Draft and is in full conformance 14 with all provisions of Section 10 of RFC 2026 16 Internet-Drafts are working documents of the Internet 17 Engineering Task Force (IETF), its areas, and its working 18 groups. Note that other groups may also distribute working 19 documents as Internet-Drafts. 21 Internet-Drafts are draft documents valid for a maximum of 22 six Months and may be updated, replaced, or obsoleted by 23 other documents at any time. It is inappropriate to use 24 Internet-Drafts as reference material or cite them other than 25 as "work in progress". 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/lid-abstracts.txt 30 The list of Internet-Draft Shadow Directories can be accessed 31 at http://www.ietf.org/shadow.html 33 This document is an individual submission to the IETF 34 Comments should be directed to the authors 36 Copyright Notice 38 Copyright (C) The Internet Society (2004). All Rights 39 Reserved. 41 Abstract 43 This document specifies a real-time transport protocol (RTP) 44 payload format to be used for the Variable-Rate Multimode 45 Wideband (VMR-WB) speech codec. The payload format is 46 designed to be able to interoperate with existing VMR-WB 47 transport formats on non-IP networks. In addition, a file 48 format is specified for transport of VMR-WB speech data in 49 storage mode applications such as email. 
A MIME type 50 registration is included, for VMR-WB, specifying use of 51 both the RTP payload and the storage formats 53 VMR-WB is a variable-rate multimode wideband speech codec 54 that has a number of operating modes, one of which is 55 interoperable with AMR-WB (i.e., RFC 3267) audio codec at 56 certain rates. Therefore, provisions have been made in 57 this draft to facilitate and simplify data packet exchange 58 between VMR-WB and AMR-WB in the interoperable mode with no 59 transcoding function involved. 61 Table of Contents 63 1.Introduction.................................................3 64 2.Conventions and Acronyms.....................................3 65 3.The Variable-Rate Multimode Wideband (VMR-WB) Speech Codec...4 66 3.1. Narrowband Speech Processing...........................5 67 3.2. Continuous vs. Discontinuous Transmission..............5 68 3.3. Support for Multi-Channel Session......................6 69 4. Robustness against Packet Loss..............................6 70 4.1. Forward Error Correction (FEC).........................6 71 4.2. Frame Interleaving and Multi-Frame Encapsulation.......7 72 5. VMR-WB Voice over IP scenarios..............................8 73 5.1. IP Terminal to IP Terminal.............................8 74 5.2 IP Terminal to GW to IP Terminal.......................8 75 5.3. GW to IP Terminal......................................9 76 5.4. GW to GW (Between VMR-WB and AMR-WB Enabled Terminals)10 77 5.5. GW to GW (Between two VMR-WB Enabled Terminals).......11 78 6. VMR-WB RTP Payload Formats.................................11 79 6.1. RTP Header Usage.............................. .......13 80 6.2. Header-Free Payload Format............................12 81 6.3. Octet-Aligned Payload Format..........................14 82 6.3.1. Payload Structure................................14 83 6.3.2. The Payload Header...............................14 84 6.3.3. The Payload Table of Contents....................17 85 6.3.4. 
Speech Data......................................19 86 6.3.5. Payload Example..................................20 87 Basic Single Channel Payload Carrying Multiple Frames 88 6.4. Implementation Considerations.........................20 89 7. VMR-WB Storage Format......................................20 90 7.1. Single Channel Header.................................21 91 7.2. Multi-Channel Header..................................21 92 7.3. Speech Frames.........................................22 93 8. Congestion Control.........................................23 94 9. Security Considerations....................................24 95 9.1. Confidentiality.......................................24 96 9.2. Authentication........................................25 97 9.3. Decoding Validation and Provision for Lost or Late 98 Packets...............................................25 99 10. Payload Format Parameters.................................25 100 10.1. VMR-WB MIME Registration.............................26 101 10.2. Mapping MIME Parameters into SDP.....................28 102 10.3. Offer-Answer Model Considerations....................29 103 11. IANA Considerations.......................................30 104 12. Acknowledgements..........................................30 105 References....................................................30 106 Normative References.......................................30 107 Informative References.....................................30 109 Author's Address..............................................31 110 Full Copyright Statement......................................31 112 1. Introduction 114 This document specifies the payload format for packetization 115 of VMR-WB encoded speech signals into the Real-time Transport 116 Protocol (RTP) [3]. 
The VMR-WB payload formats support 117 transmission of single and multiple channels, frame 118 interleaving, multiple frames per payload, header-free 119 payload, the use of mode switching, and interoperation with 120 existing VMR-WB transport formats on non-IP networks, as 121 described in Section 3. 123 The payload format itself is specified in Section 6. A 124 related file format is specified in Section 7 for transport 125 of VMR-WB speech data in storage mode applications such as 126 email. In Section 10, a MIME type registration for VMR-WB is 127 provided. 129 Since VMR-WB is interoperable with AMR-WB at certain rates, 130 an attempt has been made throughout this document to maximize 131 the similarities with RFC 3267 while optimizing the payload 132 and storage formats for the non-interoperable modes of the 133 VMR-WB codec. 135 2. Conventions and Acronyms 137 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL 138 NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and 139 "OPTIONAL" in this document are to be interpreted as 140 described in RFC2119 [2]. 
142  The following acronyms are used in this document:

144   3GPP2  - The Third Generation Partnership Project 2
145   CDMA   - Code Division Multiple Access
146   WCDMA  - Wideband Code Division Multiple Access
147   GSM    - Global System for Mobile Communications
148   AMR-WB - Adaptive Multi-Rate Wideband Codec
149   VMR-WB - Variable-Rate Multimode Wideband Codec
150   CMR    - Codec Mode Request
151   GW     - Gateway
152   DTX    - Discontinuous Transmission
153   FEC    - Forward Error Correction
154   SID    - Silence Descriptor
155   TrFO   - Transcoder-Free Operation
156   UDP    - User Datagram Protocol
157   RTP    - Real-Time Transport Protocol
158   RTCP   - RTP Control Protocol
159   MIME   - Multipurpose Internet Mail Extensions
160   SDP    - Session Description Protocol
161   SIP    - Session Initiation Protocol

163  The term "frame-block" is used in this document to describe
164  the time-synchronized set of speech frames in a multi-channel
165  VMR-WB session. In particular, in an N-channel session, a
166  frame-block will contain N speech frames, one from each of
167  the channels, and all N speech frames represent exactly the
168  same time period.

170  3. The Variable-Rate Multimode Wideband (VMR-WB) Speech Codec

172  VMR-WB is the wideband speech-coding standard developed by the
173  Third Generation Partnership Project 2 (3GPP2) for
174  encoding/decoding wideband/narrowband speech content in
175  multimedia services in 3G CDMA cellular systems. VMR-WB is a
176  source-controlled variable-rate multimode wideband speech
177  codec. It has a number of operating modes, where each mode is
178  a tradeoff between voice quality and average data rate. The
179  operating mode in VMR-WB is chosen based on the traffic
180  condition of the network and the desired quality of
181  service [1].
The desired average data rate (ADR) in each mode 182 is obtained by encoding speech frames at different rates 183 compliant with CDMA Rate-Set II depending on the 184 instantaneous characteristics of input speech and the 185 maximum and minimum rate constraints imposed by the network 186 operator. While VMR-WB is a native CDMA codec complying with 187 all CDMA system requirements, it is further interoperable 188 with AMR-WB [4] at 12.65, 8.85, and 6.60 kbps. This is due to 189 the fact that VMR-WB and AMR-WB share the 190 same core technology. This feature enables Transcoder Free 191 (TrFO) interconnections between VMR-WB and AMR-WB across 192 different wireless/wireline systems (e.g., GSM/WCDMA and 193 CDMA2000) without use of unnecessary complex media format 194 conversion. 196 VMR-WB is able to transition between various modes with no 197 degradation in voice quality that is attributable to the mode 198 switching itself. The operation mode of the VMR-WB encoder 199 may be switched seamlessly without prior knowledge of the 200 decoder. Any non-interoperable mode (i.e., mode 0, 1, or 2) 201 can be chosen depending on the traffic conditions (e.g., 202 network congestion) and the desired quality of service. 204 While in the interoperable mode (i.e., VMR-WB mode 3), mode 205 switching is not allowed. There is only one AMR-WB 206 interoperable mode in VMR-WB. Since AMR-WB codec depending on 207 channel conditions may request a mode change, in-band data 208 included in VMR-WB frame structure (see Section 8 of [1] for 209 more details), is used during an interoperable 210 interconnection to switch between AMR-WB codec modes 0, 1, or 211 2. 213 As mentioned earlier, VMR-WB is compliant with CDMA Rate-Set 214 II (see Section 2 of [1]) with the permissible encoding rates 215 shown in Table 1. 
217  +--------------+-------------------+-----------------+
218  | Frame Type   | Bits per Packet   | Encoding Rate   |
219  |              | (Frame Size)      | (kbps)          |
220  +--------------+-------------------+-----------------+
221  | Full-Rate    | 266               | 13.3            |
222  | Half-Rate    | 124               | 6.2             |
223  | Quarter-Rate | 54                | 2.7             |
224  | Eighth-Rate  | 20                | 1.0             |
225  | Blank        | 0                 | -               |
226  | Erasure      | 0                 | -               |
227  +--------------+-------------------+-----------------+
228  Table 1: CDMA Rate-Set II frame types and their associated
229  encoding rates

231  VMR-WB is robust to a high percentage of packet loss and
232  packets with corrupted rate information. The reception of
233  an Erasure (SPEECH_LOST) frame type at the decoder invokes the
234  built-in frame error concealment mechanism. The built-in frame error
235  concealment mechanism in VMR-WB conceals the effect of lost
236  packets by exploiting in-band data and the information
237  available in the previous frames.

239  3.1. Narrowband Speech Processing

241  VMR-WB has the capability to operate with 8000 Hz sampled
242  input/output speech signals in all modes of operation [1].
243  Mode switching can be utilized to change the mode of
244  operation while processing narrowband speech signals.
245  However, during a session, transition between narrowband and
246  wideband processing is not RECOMMENDED due to different
247  timestamps and other likely synchronization problems.

249  3.2. Continuous vs. Discontinuous Transmission

251  The circuit-switched operation of VMR-WB within a CDMA
252  network requires continuous transmission of the speech data
253  during a conversation. The intrinsic source-controlled
254  variable-rate feature of the CDMA speech codecs is required
255  for optimal operation of the CDMA system and interference
256  control.
     However, VMR-WB has the capability to operate in a
257  discontinuous transmission mode for some packet-switched
258  applications over IP networks, where the number of
259  transmitted bits and packets during silence periods is
260  reduced to a minimum. The VMR-WB DTX operation is similar to
261  that of AMR-WB [4,12].

263  3.3. Support for Multi-Channel Session
264  Both the octet-aligned RTP payload format and the storage
265  format defined in this document support multi-channel audio
266  content (e.g., a stereophonic speech session).

268  Although the VMR-WB codec itself does not support encoding of
269  multi-channel audio content into a single bit stream, it can
270  be used to separately encode and decode each of the
271  individual channels.

273  To transport (or store) the separately encoded multi-channel
274  content, the speech frames for all channels that are framed
275  and encoded for the same 20 ms periods are logically
276  collected in a frame-block.

278  At the session setup, out-of-band signaling must be used to
279  indicate the number of channels in the session and the order
280  of the speech frames from different channels in each frame-
281  block. When using SDP for signaling, the number of
282  channels is specified in the rtpmap attribute and the order
283  of channels carried in each frame-block is implied by the
284  number of channels as specified in Section 4.1 in [10].

286  4. Robustness against Packet Loss

288  The octet-aligned payload format, described in this document,
289  supports several features including forward error correction
290  (FEC) and frame interleaving in order to increase robustness
291  against lost packets.

293  4.1. Forward Error Correction (FEC)

295  The simple scheme of repetition of previously sent data is
296  one way of achieving FEC. Another possible scheme, which is
297  more bandwidth efficient, is to use payload-external FEC, e.g.,
298  RFC2733 [5], which generates extra packets containing repair
299  data.
301  The repetition method involves the simple retransmission of
302  previously transmitted frame-blocks together with the current
303  frame-block(s). This is done by using a sliding window to
304  group the speech frame-blocks to send in each payload. Figure
305  1 illustrates an example.

307  --+--------+--------+--------+--------+--------+--------+--------+--
308    | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
309  --+--------+--------+--------+--------+--------+--------+--------+--

311     <---- p(n-1) ---->
312              <----- p(n) ----->
313                       <---- p(n+1) ---->
314                                <---- p(n+2) ---->
315                                         <---- p(n+3) ---->
316                                                  <---- p(n+4) ---->

318  Figure 1: An example of redundant transmission.

320  In this example each frame-block is retransmitted one time in
321  the following RTP payload packet. Here, f(n-2)..f(n+4)
322  denotes a sequence of speech frame-blocks and p(n-1)..p(n+4)
323  a sequence of payload packets.

325  The use of this approach does not require signaling at the
326  session setup. In other words, the speech sender can choose
327  to use this scheme without consulting the receiver. This is
328  because a packet containing redundant frames will not look
329  different from a packet with only new frames. The receiver
330  may receive multiple copies or versions of a frame for a
331  certain timestamp if no packet is lost. If multiple versions
332  of the same speech frame are received, it is RECOMMENDED that
333  the highest rate be used by the speech decoder.

335  This redundancy scheme provides the same functionality as the
336  one described in RFC 2198 "RTP Payload for Redundant Audio
337  Data" [10]. In most cases the mechanism in this payload
338  format is more efficient and simpler than requiring both
339  endpoints to support RFC 2198. If the spread in time required
340  between the primary and redundant encodings is larger than 5
341  frame times, the bandwidth overhead of RFC 2198 will be
342  lower.
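
     The sliding-window repetition described above can be sketched in a
     few lines of Python. This is an illustration only, not part of the
     payload format: the names build_redundant_payloads and
     select_frames, the (index, rate) tuple standing in for an encoded
     frame-block, and the window parameter are all assumptions made for
     the example. With window=2 the grouping matches Figure 1, and the
     receiver keeps the highest-rate copy of each frame-block.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for an encoded frame-block: (index n, rate in kbps).
# A real payload would also carry the encoded speech bits.
FrameBlock = Tuple[int, float]


def build_redundant_payloads(frame_blocks: List[FrameBlock],
                             window: int = 2) -> List[List[FrameBlock]]:
    """Group frame-blocks with a sliding window of `window` frame-blocks.

    With window=2, payload p(n) carries f(n-1) and f(n), as in Figure 1.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    payloads = []
    for i in range(len(frame_blocks)):
        start = max(0, i - (window - 1))  # first payloads have no history
        payloads.append(frame_blocks[start:i + 1])
    return payloads


def select_frames(received: List[List[FrameBlock]]) -> Dict[int, float]:
    """Keep one copy per frame-block index, preferring the highest rate."""
    best: Dict[int, float] = {}
    for payload in received:
        for n, rate_kbps in payload:
            if n not in best or rate_kbps > best[n]:
                best[n] = rate_kbps
    return best


if __name__ == "__main__":
    blocks = [(n, 13.3) for n in range(5)]
    payloads = build_redundant_payloads(blocks, window=2)
    # Simulate the loss of the third payload; its frame-block is still
    # carried as the redundant copy in the following payload.
    received = [p for i, p in enumerate(payloads) if i != 2]
    print(sorted(select_frames(received)))
```

     In this sketch a single lost payload costs no frame-blocks, at the
     price of roughly doubling the speech payload size, which is why the
     sender has to weigh redundancy against congestion.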
344  The sender is responsible for selecting an appropriate amount
345  of redundancy based on feedback about the channel, e.g., in
346  RTCP receiver reports, or network traffic. A sender should
347  not base selection of FEC on the CMR, as this parameter
348  most probably was set based on non-IP information. The
349  sender is also responsible for avoiding congestion, which may
350  be aggravated by redundant transmission.

352  4.2. Frame Interleaving and Multi-Frame Encapsulation

354  To decrease protocol overhead, the octet-aligned payload
355  format allows several speech frame-blocks to be encapsulated
356  into a single RTP packet. One drawback of this
357  approach is that a packet loss then means the loss of
358  several consecutive speech frame-blocks, which usually causes
359  clearly audible distortion in the reconstructed speech.
360  Interleaving of frame-blocks can improve the speech quality
361  in such cases by distributing the consecutive losses into a
362  series of single frame-block losses. However, interleaving
363  and bundling several frame-blocks per payload will also
364  increase end-to-end delay and is therefore not appropriate
365  for all types of applications. Streaming applications will
366  most likely be able to exploit interleaving to improve speech
367  quality in lossy transmission conditions.

369  The octet-aligned payload format supports the use of frame
370  interleaving as an option. For the encoder (speech sender) to
371  use frame interleaving in its outbound RTP packets for a
372  given session, the decoder (speech receiver) needs to
373  indicate its support via out-of-band means (see Section 10).

375  5. VMR-WB Voice over IP Scenarios

377  5.1 IP Terminal to IP Terminal

379  The primary scenario for this payload format is IP end-to-end
380  between two terminals incorporating the VMR-WB codec, as shown
381  in Figure 2. This payload format is expected to be useful for
382  both conversational and streaming services.
384  +----------+                         +----------+
385  |          |                         |          |
386  | TERMINAL |<----------------------->| TERMINAL |
387  |          |    VMR-WB/RTP/UDP/IP    |          |
388  +----------+                         +----------+

390  Figure 2: IP terminal to IP terminal scenario

392  A conversational service places requirements on the payload
393  format. Low delay is very important, which means fewer
394  speech frame-blocks per payload packet. Low overhead is also
395  required when the payload format traverses low-bandwidth
396  links, especially when the packet rate is
397  high.

399  Streaming service has less strict real-time requirements and
400  therefore can use a larger number of frame-blocks per packet
401  than conversational service. This reduces the overhead from
402  IP, UDP, and RTP headers. However, including several frame-
403  blocks per packet makes the transmission more vulnerable to
404  packet loss, so interleaving may be used to reduce the effect
405  of packet loss on speech quality. A streaming server handling
406  a large number of clients also needs a payload format that
407  requires as few resources as possible when doing
408  packetization.

410  Note that all modes of the VMR-WB codec can be used in this
411  scenario. Also both header-free and octet-aligned payload
412  formats can be utilized.

414  5.2 IP Terminal to GW to IP Terminal

416  A second scenario for this payload format is IP end-to-end
417  (through a gateway) between two terminals, one with the AMR-WB
418  codec and the other one with the VMR-WB codec using the
419  interoperable mode of VMR-WB, as shown in Figure 3. This
420  payload format is expected to be useful for both
421  conversational and streaming services.
423 +----------+ +------+ +----------+ 424 | | VMR-WB/RTP/UDP/IP | | AMR-WB/RTP/UDP/IP | | 425 | TERMINAL |<-------------------->| GW |<------------------->| TERMINAL | 426 | | | | | | 427 +----------+ +------+ +----------+ 428 VMR-WB enabled | AMR-WB enabled 429 | 430 | 431 <----VMR-WB Session----> <----AMR-WB Session----> 433 Figure 3: IP terminal to GW to IP terminal scenario 434 (AMR-WB <-> VMR-WB interoperable interconnection) 436 The VMR-WB mode 3 and octet-aligned payload format SHALL be 437 used for this scenario. Moreover, to avoid signaling 438 conflicts in the IP network, two sessions SHALL be 439 established using SIP/SDP, one between the VMR-WB enabled 440 terminal and the gateway and another session between the 441 gateway and the AMR-WB enabled terminal. Note that no 442 transcoding is involved since the VMR-WB payload is identical 443 to that of AMR-WB. 445 5.3 GW to IP Terminal 447 Another scenario occurs when VMR-WB encoded speech will be 448 transmitted from a non-IP system (e.g., 3GPP2/CDMA2000 449 network) to an RTP/UDP/IP VoIP terminal, and/or vice versa, 450 as depicted in Figure 4. 452 VMR-WB over 453 3GPP2/CDMA2000 network 454 +------+ +----------+ 455 | | | | 456 <-------------->| GW |<---------------------->| TERMINAL | 457 | | VMR-WB/RTP/UDP/IP | | 458 +------+ +----------+ 459 | 460 | IP network 461 | 463 Figure 4: GW to VoIP terminal scenario 465 VMR-WB's capability to seamlessly switch between operational 466 modes is exploited in CDMA (non-IP) networks to optimize 467 speech quality for a given traffic condition. To preserve 468 this functionality in scenarios including a gateway to an IP 469 network using the octet-aligned payload format, a codec mode 470 request (CMR) field is considered. The gateway will be 471 responsible for forwarding the CMR between the non-IP and IP 472 parts in both directions. The IP terminal should follow the 473 CMR forwarded by the gateway to optimize speech quality going 474 to the non-IP decoder. 
The mode control algorithm in the 475 gateway SHOULD accommodate the delay imposed by the IP 476 network on the response to CMR by the IP terminal. 478 The IP terminal should not set the CMR (see Section 6.3.2), 479 but the gateway can set the CMR value on frames going toward 480 the encoder in the non-IP part to optimize speech quality 481 from that encoder to the gateway. The gateway can 482 alternatively set a different CMR value, if desired, as one 483 means to control congestion on the IP network. 485 5.4 GW to GW (Between VMR-WB and AMR-WB Enabled Terminals) 487 A fourth likely scenario is that RTP/UDP/IP is used as 488 transport between two non-IP systems, i.e., IP is originated 489 and terminated in gateways on both sides of the IP transport, 490 as illustrated in Figure 5. This is the most likely scenario 491 for an interoperable interconnection between 492 3GPP/(GSM,WCDMA)/AMR-WB and 3GPP2/CDMA2000/VMR-WB. 494 VMR-WB over AMR-WB over 495 3GPP2/CDMA2000 network 3GPP/(GSM, WCDMA) network 497 +------+ +------+ 498 (VMR-WB Payload) | | AMR-WB/RTP/UDP/IP | | (AMR-WB Payload) 499 <------------------>| GW |<------------------->| GW |<------------------> 500 | | | | 501 +------+ +------+ 502 | | 503 | IP network | 504 | | 505 <---VMR-WB Session----> <---------------AMR-WB Session---------------> 507 Figure 5: GW to GW scenario (AMR-WB <-> VMR-WB 508 interoperable interconnection) 510 The VMR-WB mode 3 and octet-aligned payload format SHALL be 511 used for this scenario. Moreover, to avoid signaling 512 conflicts in the IP network, two sessions SHALL be 513 established using SIP/SDP, one between the VMR-WB enabled 514 terminal and the gateway and another session between the 515 gateway and the AMR-WB enabled terminal. Note that no 516 transcoding is involved since the VMR-WB payload is identical 517 to that of AMR-WB. 519 The CMR value may be set in packets received by the gateways 520 on the IP network side. 
     The gateway should forward to the
521  non-IP side a CMR value that is the minimum of two values: (1)
522  the CMR value it receives on the IP side; and (2) a CMR value
523  it may choose for congestion control of transmission on the
524  IP side.

526  The details of the traffic control algorithm are left to the
527  implementation.

529  During and upon initiation of an interoperable
530  interconnection between VMR-WB and AMR-WB, only VMR-WB mode 3
531  SHALL be used. There are three Frame Types (i.e., FT=0, 1, or
532  2; see Table 3) within this mode that are compatible with
533  AMR-WB codec modes 0, 1, and 2, respectively.

535  If the AMR-WB codec is engaged in an interoperable
536  interconnection with VMR-WB, the active AMR-WB codec mode set
537  SHALL be limited to 0, 1, and 2.

539  5.5 GW to GW (Between two VMR-WB Enabled Terminals)

541  The fifth example VoIP scenario comprises an RTP/UDP/IP
542  transport between two non-IP systems, i.e., IP is originated
543  and terminated in gateways on both sides of the IP transport,
544  as illustrated in Figure 6. This is the most likely scenario
545  for Mobile Station-to-Mobile Station (MS-to-MS) Transcoder-
546  Free (TrFO) interconnection between two 3GPP2/CDMA2000
547  terminals that both use the VMR-WB codec.

549      VMR-WB over                                            VMR-WB over
550  3GPP2/CDMA2000 network                               3GPP2/CDMA2000 network

552                     +------+                     +------+
553                     |      |                     |      |
554  <----------------->|  GW  |<------------------->|  GW  |<--------------->
555                     |      |  VMR-WB/RTP/UDP/IP  |      |
556                     +------+                     +------+
557                        |                            |
558                        |         IP network         |
559                        |                            |

561  Figure 6: GW to GW scenario (a CDMA2000 MS-to-MS
562  voice over IP scenario)

564  6. VMR-WB RTP Payload Formats

566  For a given session, the payload format can be either
567  header-free or octet-aligned, depending on the mode of operation
568  that is established for the session via out-of-band means and
569  the application.

571  The header-free payload format is designed for maximum
572  bandwidth efficiency, simplicity, and low latency.
Only one 573 codec data frame can be sent in each header-free payload 574 format. None of the payload header fields or ToC entries is 575 present [11]. 577 In the octet-aligned payload format, all the fields in a 578 payload, including payload header, table of contents entries, 579 and speech frames themselves, are individually aligned to 580 octet boundaries to make implementations efficient. 582 Note that octet alignment of a field or payload means that 583 the last octet is padded with zeroes in the least significant 584 bits to fill the octet. Also note that this padding is 585 separate from padding indicated by the P bit in the RTP 586 header. 588 Between the two payload formats, only the octet-aligned 589 format has the capability to use the interleaving to make the 590 speech transport robust to packet loss. 592 The VMR-WB octet-aligned payload format in the interoperable 593 mode is identical to that of AMR-WB (i.e., RFC 3267). 595 Implementations SHOULD support both header-free and octet- 596 aligned payload formats to increase interoperability. 598 6.1. RTP Header Usage 600 The format of the RTP header is specified in [3]. This 601 payload format uses the fields of the header in a manner 602 consistent with that specification. 604 The RTP timestamp corresponds to the sampling instant of the 605 first sample encoded for the first frame-block in the packet. 606 The timestamp clock frequency is the same as the sampling 607 frequency, so the timestamp unit is in samples. 609 The duration of one speech frame-block is 20 ms for VMR-WB. 610 For normal wideband operation of VMR-WB, the input/output 611 sampling frequency is 16 kHz, corresponding to 320 samples 612 per frame from each channel. Thus, the timestamp is increased 613 by 320 for VMR-WB for each consecutive frame-block. 615 For narrowband operation of VMR-WB, the input/output sampling 616 frequency is 8 kHz, corresponding to 160 encoded speech 617 samples per frame from each channel. 
Thus, the timestamp is 618 increased by 160 for VMR-WB for each consecutive frame- 619 block while processing narrowband input/output speech 620 signals. The choice of sampling frequency MUST be indicated 621 at the beginning of a session (see section 10). The default 622 input/output sampling rate is 16 kHz. Note that during a 623 session, the change of sampling rate is NOT RECOMMENDED. 625 A packet may contain multiple frame-blocks of encoded speech 626 or comfort noise parameters. If interleaving is employed, the 627 frame-blocks encapsulated into a payload are picked according 628 to the interleaving rules as defined in Section 629 6.3.2. Otherwise, each packet covers a period of one or more 630 contiguous 20 ms frame-block intervals. In case the data from 631 all the channels for a particular frame-block in the period 632 is missing, for example at a gateway from some other 633 transport format, it is possible to indicate that no data is 634 present for that frame-block rather than breaking a multi- 635 frame-block packet into two, as explained in Section 6.3.2. 637 The payload is always made an integral number of octets long 638 by padding with zero bits if necessary. If additional padding 639 is required to bring the payload length to a larger multiple 640 of octets or for some other purpose, then the P bit in the 641 RTP header MAY be set and padding appended as specified in 642 [3]. 644 The RTP header marker bit (M) SHALL always be set to 0 if the 645 VMR-WB codec operates in continuous transmission. When 646 operating in discontinuous transmission (DTX), the RTP header 647 marker bit SHALL be set to 1 if the first frame-block carried 648 in the packet contains a speech frame that is the first in 649 a talkspurt. For all other packets the marker bit SHALL be 650 set to zero (M=0). 652 The assignment of an RTP payload type for this new packet 653 format is outside the scope of this document, and will not be 654 specified here.
It is expected that the RTP profile under 655 which this payload format is being used will assign a payload 656 type for this encoding or specify that the payload type is to 657 be bound dynamically. 659 6.2. Header-Free Payload Format 661 The header-free Packet payload format is designed for maximum 662 bandwidth efficiency, simplicity, and minimum delay. Only one 663 speech data frame can be sent in each header-free payload 664 format. None of the payload header fields or ToC entries is 665 present. The encoding rate for the speech frame can be 666 determined from the length of the speech data frame, since 667 there is only one speech data frame in each header-free 668 payload format. 670 Use of the RTP header fields for header-free payload format 671 is the same as the corresponding one for the octet-aligned 672 payload format. The detailed bit mapping of speech data 673 packets permissible for this payload format is described in 674 Section 8 of [1]. 676 Since the header-free payload format is not compatible with 677 AMR-WB, it is RECOMMENDED that only VMR-WB modes 0, 1, and 2 678 be used with this payload format. 680 0 1 2 3 681 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 682 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 683 | RTP Header [3] | 684 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 685 | | 686 + ONLY one speech data frame +-+-+-+-+-+-+-+-+ 687 | | 688 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 690 Note that the mode of operation, using this payload format, 691 is decided by the transmitting (encoder) site. The default 692 mode of operation for VMR-WB encoder is mode 0 [1]. The mode 693 change request MAY also be sent through non-RTP means, which 694 is out of the scope of this specification. 696 6.3. 
Octet-Aligned Payload Format 698 6.3.1 Payload Structure 700 The complete payload consists of a payload header, a payload 701 table of contents, and speech data representing one or more 702 speech frame-blocks. The following diagram shows the general 703 payload format layout: 705 +----------------+-------------------+---------------- 706 | Payload header | Table of contents | Speech data ... 707 +----------------+-------------------+---------------- 709 6.3.2. The Payload Header 711 In octet-aligned payload format the payload header consists 712 of a 4-bit CMR, 4 reserved bits, and optionally, an 8 bit- 713 interleaving header, as shown below 715 0 1 716 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 717 +-+-+-+-+-+-+-+-+- - - - - - - - 718 | CMR |R|R|R|R| ILL | ILP | 719 +-+-+-+-+-+-+-+-+- - - - - - - - 721 CMR (4 bits): Indicates a codec mode request sent to the 722 speech encoder at the site of the receiver of this payload, 723 provided that the network allows the use of the requested 724 mode. 726 The value of the CMR field is set according to the following 727 Table 729 +-------+------------------------------------------------------------+ 730 | CMR | VMR-WB Operating Modes | 731 +-------+------------------------------------------------------------+ 732 | 0 | VMR-WB mode 3 (AMR-WB interoperable mode at 6.60 kbps) | 733 | 1 | VMR-WB mode 3 (AMR-WB interoperable mode at 8.85 kbps) | 734 | 2 | VMR-WB mode 3 (AMR-WB interoperable mode at 12.65 kbps) | 735 | 3 | VMR-WB mode 2 | 736 | 4 | VMR-WB mode 1 | 737 | 5 | VMR-WB mode 0 | 738 | 6-14 | (reserved) | 739 | 15 | No Preference (Operating mode SHOULD be set by the network)| 740 +-------+------------------------------------------------------------+ 741 Table 2: List of valid CMR values and their associated VMR-WB 742 operating modes. 744 R: is a reserved bit that MUST be set to zero. The receiver 745 MUST ignore all R bits. 
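As an illustration of the first payload header octet described above, the following minimal sketch splits that octet into the CMR value and looks the value up in Table 2. It assumes the usual convention that bit 0 in the diagram is the most significant bit of the octet; the names CMR_TO_MODE, parse_cmr_octet, and build_cmr_octet are hypothetical and are not defined by this specification.

```python
# Sketch only: the bit layout and the CMR values come from Table 2 above;
# all identifiers are hypothetical, and bit 0 is assumed to be the MSB.
CMR_TO_MODE = {
    0: "VMR-WB mode 3 (AMR-WB interoperable mode at 6.60 kbps)",
    1: "VMR-WB mode 3 (AMR-WB interoperable mode at 8.85 kbps)",
    2: "VMR-WB mode 3 (AMR-WB interoperable mode at 12.65 kbps)",
    3: "VMR-WB mode 2",
    4: "VMR-WB mode 1",
    5: "VMR-WB mode 0",
    15: "No Preference",
}


def parse_cmr_octet(octet):
    """Return (cmr, mode) for the first payload header octet.

    mode is None when the CMR value is reserved (6-14).
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("expected a single octet")
    cmr = octet >> 4  # CMR occupies the four most significant bits
    # The four least significant bits are the R bits; they are ignored here.
    return cmr, CMR_TO_MODE.get(cmr)


def build_cmr_octet(cmr):
    """Build the first payload header octet with all R bits set to zero."""
    if cmr not in CMR_TO_MODE:
        raise ValueError("reserved CMR value")
    return cmr << 4
```

For example, parse_cmr_octet(0x40) yields CMR=4 (VMR-WB mode 1), and the same result is obtained for 0x4F because the R bits are not examined.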
747 ILL (4 bits, unsigned integer): This is an OPTIONAL field 748 that is present only if interleaving is signaled out-of-band 749 for the session. ILL=L indicates to the receiver that the 750 interleaving length is L+1, in number of frame-blocks. 752 ILP (4 bits, unsigned integer): This is an OPTIONAL field 753 that is present only if interleaving is signaled. ILP MUST 754 take a value between 0 and ILL, inclusive, indicating the 755 interleaving index for frame-blocks in this payload in the 756 interleave group. If the value of ILP is found to be greater than 757 ILL, the payload SHOULD be discarded. 759 The ILL and ILP fields MUST be present in each packet in a 760 session if interleaving is signaled for the session. 762 The mode request received in the CMR field is valid until the 763 next CMR is received, i.e., a newly received CMR value 764 overrides the previous one. Therefore, if a terminal 765 wishes to continuously receive frames in the same mode x, it 766 needs to set CMR=x for all its outbound payloads, and if a 767 terminal has no preference in which mode to receive, it 768 SHOULD set CMR=15 in all its outbound payloads. 770 If a payload is received with a CMR value that is not valid, 771 the CMR MUST be ignored by the receiver. 773 In a multi-channel session, CMR SHOULD be interpreted by the 774 receiver of the payload as the desired encoding mode for all 775 the channels in the session, if the network allows. 777 An IP end-point SHOULD NOT set the CMR based on packet losses 778 or other congestion indications, for several reasons: 780 - The other end of the IP path may be a gateway to a non-IP 781 network (such as a radio link) that needs to set the CMR 782 field to optimize performance on that network. 784 - Congestion on the IP network is managed by the IP sender, 785 in this case at the other end of the IP path.
Feedback 786 about congestion SHOULD be provided to that IP sender 787 through RTCP or other means, and then the sender can 788 choose to avoid congestion using the most appropriate 789 mechanism. That may include adjusting the codec mode, but 790 also includes adjusting the level of redundancy or number 791 of frames per packet. 793 The encoder SHOULD follow a received mode request, but MAY 794 change to a different mode if the network necessitates it, 795 for example to control congestion. 797 The CMR field MUST be set to 15 for packets sent to a 798 multicast group. The encoder in the speech sender SHOULD 799 ignore mode requests when sending speech to a multicast 800 session but MAY use RTCP feedback information as a hint that 801 a mode change is needed. 803 If the interleaving option is utilized, it MUST be performed on a 804 frame-block basis as opposed to a frame basis in a multi- 805 channel session. 807 The following example illustrates the arrangement of speech 808 frame-blocks in an interleave group during an interleave 809 session. Here we assume ILL=L for the interleave group that 810 starts at speech frame-block n. We also assume that the 811 first payload packet of the interleave group is s and the 812 number of speech frame-blocks carried in each payload is N. 813 Then we will have: 815 Payload s (the first packet of this interleave group): 816 ILL=L, ILP=0, 817 Carry frame-blocks: n, n+(L+1), n+2*(L+1),..., n+(N-1)*(L+1) 819 Payload s+1 (the second packet of this interleave group): 820 ILL=L, ILP=1, 821 Carry frame-blocks: n+1, n+1+(L+1), n+1+2*(L+1),..., n+1+(N-1)*(L+1) 823 ... 825 Payload s+L (the last packet of this interleave group): 826 ILL=L, ILP=L, 827 Carry frame-blocks: n+L, n+L+(L+1), n+L+2*(L+1), ..., n+L+(N-1)*(L+1) 829 The next interleave group will start at frame-block n+N*(L+1). 831 There will be no interleaving effect unless the number of 832 frame-blocks per packet (N) is at least 2.
Moreover, the 833 number of frame-blocks per payload (N) and the value of ILL 834 MUST NOT be changed inside an interleave group. In other 835 words, all payloads in an interleave group MUST have the same 836 ILL and MUST contain the same number of speech frame-blocks. 838 The sender of the payload MUST only apply interleaving if the 839 receiver has signaled its use through out-of-band means. 840 Since interleaving will increase buffering requirements at 841 the receiver, the receiver uses MIME parameter 842 "interleaving=I" to set the maximum number of frame-blocks 843 allowed in an interleaving group to I. 845 When performing interleaving the sender MUST use a proper 846 number of frame-blocks per payload (N) and ILL so that the 847 resulting size of an interleave group is less than or equal 848 to I, i.e., N*(L+1)<=I. 850 6.3.3. The Payload Table of Contents 852 The table of contents (ToC) in octet-aligned payload format 853 consists of a list of ToC entries where each entry 854 corresponds to a speech frame carried in the payload, i.e., 856 +---------------------+ 857 | list of ToC entries | 858 +---------------------+ 860 When interleaving is used, the frame-blocks in the ToC will 861 almost never be placed consecutive in time. Instead, the 862 presence and order of the frame-blocks in a packet will 863 follow the pattern described in 6.3.2. 865 The following example shows the ToC of three consecutive 866 packets, each carrying 3 frame-blocks, in an interleaved two 867 channel session. Here, the two channels are left (L) and 868 right (R) with L coming before R, and the interleaving length 869 is 3 (i.e., ILL=2). This makes the interleave group 9 frame- 870 blocks large. 
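The interleaving pattern of Section 6.3.2 and the constraint N*(L+1)<=I can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the symbols n, L, N, and I are those used in Section 6.3.2, and channels are not modeled because interleaving operates on whole frame-blocks.

```python
# Sketch only: hypothetical helpers that enumerate the frame-blocks carried
# by each payload of one interleave group, following Section 6.3.2.
def interleave_group(n, ill, frames_per_payload):
    """Return one list of frame-block numbers per payload of the group.

    n: first frame-block of the group
    ill: value of the ILL field (the interleaving length is ill + 1)
    frames_per_payload: number of frame-blocks per payload (N)
    """
    if not 0 <= ill <= 15:
        raise ValueError("ILL is a 4-bit unsigned integer")
    if frames_per_payload < 1:
        raise ValueError("a payload carries at least one frame-block")
    length = ill + 1
    # The payload with index ILP carries n+ILP, n+ILP+(L+1), ...
    return [
        [n + ilp + k * length for k in range(frames_per_payload)]
        for ilp in range(length)
    ]


def fits_receiver_limit(ill, frames_per_payload, max_interleave):
    """Check the sender-side constraint N*(L+1) <= I."""
    return frames_per_payload * (ill + 1) <= max_interleave
```

With n=1, ILL=2, and N=3, interleave_group returns the frame-block groupings 1,4,7 and 2,5,8 and 3,6,9, which are the groupings used in the three-packet example of this section; the next interleave group would start at frame-block 10.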
872 Packet #1 873 --------- 875 ILL=2, ILP=0: 876 +----+----+----+----+----+----+ 877 | 1L | 1R | 4L | 4R | 7L | 7R | 878 +----+----+----+----+----+----+ 879 |<------->|<------->|<------->| 880 Frame- Frame- Frame- 881 Block 1 Block 4 Block 7 883 Packet #2 884 --------- 886 ILL=2, ILP=1: 887 +----+----+----+----+----+----+ 888 | 2L | 2R | 5L | 5R | 8L | 8R | 889 +----+----+----+----+----+----+ 890 |<------->|<------->|<------->| 891 Frame- Frame- Frame- 892 Block 2 Block 5 Block 8 894 Packet #3 895 --------- 897 ILL=2, ILP=2: 899 +----+----+----+----+----+----+ 900 | 3L | 3R | 6L | 6R | 9L | 9R | 901 +----+----+----+----+----+----+ 902 |<------->|<------->|<------->| 903 Frame- Frame- Frame- 904 Block 3 Block 6 Block 9 906 A ToC entry for the octet-aligned payload format is as follows: 908 0 1 2 3 4 5 6 7 909 +-+-+-+-+-+-+-+-+ 910 |F| FT |Q|P|P| 911 +-+-+-+-+-+-+-+-+ 913 The table of contents (ToC) consists of a list of ToC 914 entries, each representing a speech frame. 916 F (1 bit): If set to 1, indicates that this frame is followed 917 by another speech frame in this payload; if set to 0, 918 indicates that this frame is the last frame in this payload. 920 FT (4 bits): Frame type index whose value is chosen according 921 to the following Table. 
923 +----+--------------------------------------------+-------------------+ 924 | FT | Encoding Rate | Frame Size (Bits) | 925 +----+--------------------------------------------+-------------------+ 926 | 0 | Interoperable Full-Rate (AMR-WB 6.60 kbps) | 132 | 927 | 1 | Interoperable Full-Rate (AMR-WB 8.85 kbps) | 177 | 928 | 2 | Interoperable Full-Rate (AMR-WB 12.65 kbps)| 253 | 929 | 3 | Full-Rate 13.3 kbps | 266 | 930 | 4 | Half-Rate 6.2 kbps | 124 | 931 | 5 | Quarter-Rate 2.7 kbps | 54 | 932 | 6 | Eighth-Rate 1.0 kbps | 20 | 933 | 7 | (reserved) | | 934 | 8 | (reserved) | | 935 | 9 | CNG (AMR-WB SID) | 35 | 936 | 10 | (reserved) | | 937 | 11 | (reserved) | | 938 | 12 | (reserved) | | 939 | 13 | (reserved) | | 940 | 14 | Erasure (AMR-WB SPEECH_LOST) | 0 | 941 | 15 | Blank (AMR-WB NO_DATA) | 0 | 942 +----+--------------------------------------------+-------------------+ 943 Table 3: VMR-WB payload frame types for real-time 944 (or non real-time) transport and storage 946 During the interoperable mode, FT=14 (SPEECH_LOST) and FT=15 947 (NO_DATA) are used to indicate frames that are either lost or 948 not being transmitted in this payload, respectively. FT=14 or 949 15 MAY be used in the non-interoperable modes to indicate 950 frame erasure or blank frame, respectively (see Section 2.1 951 of [1]). 953 Note that for ToC entries with FT=14 or 15, there will be no 954 corresponding speech frame in the payload. 956 Q (1 bit): Frame quality indicator. If set to 0, indicates 957 the corresponding frame is corrupted. During the 958 interoperable mode, the receiver side (with AMR-WB codec) 959 should set the RX_TYPE to either SPEECH_BAD or SID_BAD 960 depending on the frame type (FT), if Q=0. The VMR-WB encoder 961 always sets the Q bit to 1. 963 P bits: Padding bits MUST be set to zero. 965 For multi-channel sessions, the ToC entries of all frames 966 from a frame-block are placed in the ToC in consecutive order.
968 Therefore, with N channels and K speech frame-blocks in a 969 packet, there MUST be N*K entries in the ToC, and the first N 970 entries will be from the first frame-block, the second N 971 entries will be from the second frame-block, and so on. 973 6.3.4. Speech Data 975 The speech data of a payload contains one or more speech frames as 976 described in the ToC of the payload. 978 Each speech frame represents 20 ms of speech encoded in one 979 of the available encoding rates depending on the operation 980 mode. The length of the speech frame is defined by the frame 981 type in the FT field with the following considerations: 983 - The last octet of each speech frame MUST be padded with 984 zeroes at the end if not all bits in the octet are used. 985 In other words, each speech frame MUST be octet-aligned. 987 - When multiple speech frames are present in the speech 988 data, the speech frames MUST be arranged one whole frame 989 after another. 991 The order and numbering notation of the speech data bits are 992 as specified in the VMR-WB standard specification [1]. 994 The payload begins with the payload header of one octet, or 995 two octets if frame interleaving is selected. The payload header is 996 followed by the table of contents consisting of a list of 997 one-octet ToC entries. 999 The speech data follows the table of contents. For 1000 packetization in the normal order, all of the octets 1001 comprising a speech frame are appended to the payload as a 1002 unit. The speech frames are packed in the same order as their 1003 corresponding ToC entries are arranged in the ToC list, with 1004 the exception that if a given frame has a ToC entry with 1005 FT=14 or 15, there will be no data octets present for that 1006 frame. 1008 6.3.5. Payload Example: Basic Single Channel Payload Carrying Multiple Frames 1010 The following diagram shows an octet-aligned payload format 1011 from a single channel session that carries two VMR-WB Full- 1012 Rate frames (FT=3).
In the payload, a codec mode request is 1013 sent (e.g., CMR=4), requesting the encoder at the receiver's 1014 side to use VMR-WB mode 1. No interleaving is used. 1016 0 1 2 3 1017 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1018 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1019 | CMR=4 |R|R|R|R|1|FT#1=3 |Q|P|P|0|FT#2=3 |Q|P|P| f1(0..7) | 1020 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1021 | f1(8..15) | f1(16..23) | ... | 1022 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1023 : ... : 1024 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1025 | r |P|P|P|P|P|P| f2(0..7) | f2(8..15) | f2(16..23) | 1026 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1027 : ... : 1028 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1029 | ... | l |P|P|P|P|P|P| 1030 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1031 r= f1(264,265) 1032 l= f2(264,265) 1034 Note, in above example the last octet in both speech frames 1035 is padded with zeros to make them octet-aligned. 1037 6.4. Implementation Considerations 1039 An application implementing this payload format MUST 1040 understand all the payload parameters in the out-of-band 1041 signaling used. For example, if an application uses SDP, all 1042 the SDP and MIME parameters in this document MUST be 1043 understood. This requirement ensures that an implementation 1044 always can decide if it is capable or not of communicating. 1046 7. VMR-WB Storage Format 1048 The storage format is used for storing VMR-WB encoded speech 1049 frames in a file or as an e-mail attachment. Multiple channel 1050 content is also supported. 1052 In general, VMR-WB file has the following structure: 1054 +------------------+ 1055 | Header | 1056 +------------------+ 1057 | Speech frame 1 | 1058 +------------------+ 1059 : ... : 1060 +------------------+ 1061 | Speech frame n | 1062 +------------------+ 1064 7.1. 
Single channel Header 1066 A single channel VMR-WB file header contains only a magic 1067 number. 1069 The magic number for single channel VMR-WB files containing 1070 speech data generated in the non-interoperable modes; i.e., 1071 VMR-WB modes 0, 1, or 2, MUST consist of ASCII character 1072 string 1074 "#!VMR-WB\n" 1075 (or 0x2321564d522d57420a in hexadecimal). 1077 Note, the "\n" is an important part of the magic numbers and 1078 MUST be included in the comparison; otherwise, the single 1079 channel magic number above will become indistinguishable from 1080 that of the multi-channel file defined in the next section. 1082 The magic number for single channel VMR-WB files containing 1083 speech data generated in the interoperable mode; i.e., VMR-WB 1084 mode 3, MUST consist of ASCII character string 1086 "#!VMR-WB_I\n" 1087 (or 0x2321564d522d57425F490a in hexadecimal). 1089 In the interoperable mode, a file generated by VMR-WB is 1090 decodable with AMR-WB (with the exception of different magic 1091 numbers). However, to ensure compatibility and because VMR-WB 1092 can only decode AMR-WB codec modes 0, 1, or 2, AMR-WB codec 1093 SHOULD be instructed not to generate the modes that are not 1094 in common so that files generated by AMR-WB can be decoded by 1095 VMR-WB. 1097 7.2. Multi-channel Header 1099 The multi-channel header consists of a magic number followed 1100 by a 32-bit channel description field, giving the multi- 1101 channel header the following structure: 1103 +----------------------------+ 1104 | Magic Number | 1105 +----------------------------+ 1106 | Channel Description Field | 1107 +----------------------------+ 1108 The magic number for multi-channel VMR-WB files containing 1109 speech data generated in the non-interoperable modes; i.e., 1110 VMR-WB modes 0, 1, or 2, MUST consist of the ASCII character 1111 string 1113 "#!VMR-WB_MC1.0\n" 1114 (or 0x2321564d522d57425F4D43312E300a in hexadecimal). 
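A reader of the storage format could distinguish the file types by comparing the start of the file against the complete magic numbers of Sections 7.1 and 7.2, including the trailing "\n". The following is a minimal sketch under that assumption; MAGIC_NUMBERS and identify_magic are hypothetical names, and the labels attached to each magic number are for illustration only.

```python
# Sketch only: the byte strings are the magic numbers given in Sections 7.1
# and 7.2; the dictionary layout and the function name are hypothetical.
MAGIC_NUMBERS = {
    b"#!VMR-WB\n": ("single channel", "non-interoperable"),
    b"#!VMR-WB_I\n": ("single channel", "interoperable"),
    b"#!VMR-WB_MC1.0\n": ("multi-channel", "non-interoperable"),
    b"#!VMR-WB_MCI1.0\n": ("multi-channel", "interoperable"),
}


def identify_magic(data):
    """Return (magic, (channels, mode)) for the start of a file, or None.

    The trailing newline is part of every comparison, so the single
    channel magic number cannot be mistaken for a multi-channel one.
    """
    for magic, kind in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return magic, kind
    return None
```

Because each key ends in "\n", no magic number is a prefix of another, so the order of comparison does not matter. For a multi-channel file, the 32-bit channel description field would start at offset len(magic).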
1116 The version number in the magic numbers refers to the version 1117 of the file format. 1119 The magic number for multi-channel VMR-WB files containing 1120 speech data generated in the interoperable mode; i.e., VMR-WB 1121 mode 3, MUST consist of the ASCII character string 1123 "#!VMR-WB_MCI1.0\n" 1124 (or 0x2321564d522d57425F4D4349312E300a in hexadecimal). 1126 The 32-bit channel description field is defined as 1128 0 1 2 3 1129 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1130 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1131 | Reserved bits | CHAN | 1132 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1134 Reserved bits: MUST be set to 0 when written, and a reader 1135 MUST ignore them. 1137 CHAN (4 bit unsigned integer): Indicates the number of audio 1138 channels contained in this storage file. The valid values and 1139 the order of the channels within a frame-block are specified 1140 in Section 4.1 in [10]. 1142 7.3. Speech Frames 1144 After the file header, speech frame-blocks consecutive in 1145 time are stored in the file. Each frame-block contains a 1146 number of octet-aligned speech frames equal to the number of 1147 channels, and stored in increasing order, starting with 1148 channel 1. 1150 Each stored speech frame starts with a one-octet frame header 1151 with the following format: 1153 0 1 2 3 4 5 6 7 1154 +-+-+-+-+-+-+-+-+ 1155 |P| FT |Q|P|P| 1156 +-+-+-+-+-+-+-+-+ 1158 The FT field is defined as shown in Table 3. The P bits are 1159 padding and MUST be set to 0. 1161 Q (1 bit): Frame quality indicator. If set to 0, indicates 1162 the corresponding frame is corrupted. The VMR-WB encoder 1163 always sets Q bit to 1. 1165 Following this one octet header, the speech bits are placed 1166 as defined in 6.3.4. The last octet of each frame is padded 1167 with zeroes, if needed, to achieve octet alignment. 
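The one-octet frame header and the octet alignment rule above determine how many octets each stored frame occupies. The sketch below illustrates this, assuming bit 0 in the diagram is the most significant bit of the octet; FRAME_BITS copies the frame sizes of Table 3, and the name parse_frame_header is hypothetical.

```python
# Sketch only: frame sizes in bits are taken from Table 3; reserved frame
# types are left out of the dictionary. Identifiers are hypothetical.
FRAME_BITS = {
    0: 132, 1: 177, 2: 253,      # interoperable full-rate frames
    3: 266, 4: 124, 5: 54, 6: 20,
    9: 35,                       # CNG (AMR-WB SID)
    14: 0, 15: 0,                # Erasure and Blank carry no speech bits
}


def parse_frame_header(octet):
    """Return (ft, q, data_octets) for a stored frame header octet.

    Layout, most significant bit first: P, FT (4 bits), Q, P, P.
    data_octets is the number of octets of speech data that follow,
    with the last octet zero-padded to reach octet alignment.
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("expected a single octet")
    ft = (octet >> 3) & 0x0F
    q = (octet >> 2) & 0x01
    if ft not in FRAME_BITS:
        raise ValueError("reserved frame type")
    data_octets = (FRAME_BITS[ft] + 7) // 8  # round up to whole octets
    return ft, q, data_octets
```

For a Half-Rate frame (FT=4, Q=1) the header octet is 0x24 and 16 octets of speech data follow, of which the last 4 bits are padding.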
1169 The following example shows a VMR-WB speech frame encoded at 1170 Half-Rate (with 124 speech bits) in the storage format. 1172 0 1 2 3 1173 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1174 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1175 |0| FT=4 |1|0|0| | 1176 +-+-+-+-+-+-+-+-+ + 1177 | | 1178 + Speech bits for frame-block n, channel k + 1179 | | 1180 + + 1181 | | 1182 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1183 | | 1184 +-+-+-+-+ 1186 Frame-blocks or speech frames that are lost in transmission 1187 and thereby not received MUST be stored as Blank/NO_DATA 1188 frames (FT=15) or Erasure/SPEECH_LOST (FT=14) in complete 1189 frame-blocks to keep synchronization with the original media. 1191 8. Congestion Control 1193 The general congestion control considerations for 1194 transporting RTP data apply to VMR-WB speech over RTP as 1195 well. However, the multimode capability of VMR-WB speech 1196 coding may provide an advantage over other payload formats 1197 for controlling congestion since the bandwidth demand can be 1198 adjusted by selecting a different operating mode (i.e., mode 1199 switching). 1201 Another parameter that may impact the bandwidth demand for 1202 VMR-WB is the number of frame-blocks that are encapsulated in 1203 each RTP payload. Packing more frame-blocks in each RTP 1204 payload can reduce the number of packets sent and hence the 1205 overhead from RTP/UDP/IP headers, at the expense of increased 1206 delay. 1208 If forward error correction (FEC) is used to alleviate the 1209 packet loss, the amount of redundancy added by FEC will need 1210 to be regulated so that the use of FEC itself does not cause 1211 a congestion problem. 1213 It is RECOMMENDED that VMR-WB applications using this payload 1214 format employ congestion control. The actual mechanism for 1215 congestion control is not specified but should be suitable 1216 for real-time transport of datagrams. 1218 9. 
Security Considerations 1220 RTP packets using the payload formats defined in this 1221 specification are subject to the general security 1222 considerations discussed in [3]. 1224 As this format transports encoded speech, the main security 1225 issues include confidentiality and authentication of the 1226 speech itself. The payload format itself does not have any 1227 built-in security mechanisms. External mechanisms, such as 1228 SRTP [8], MAY be used. 1230 This payload format does not exhibit any significant non- 1231 uniformity in the receiver side computational complexity for 1232 packet processing and thus is unlikely to pose a denial-of- 1233 service threat due to the receipt of pathological/corrupted 1234 data. 1236 9.1. Confidentiality 1238 To achieve confidentiality of the encoded VMR-WB speech, all 1239 speech data bits MAY be encrypted. There is no need to 1240 encrypt the payload header or the table of contents due to 1241 the following reasons: 1243 1) They only carry information about the requested speech 1244 mode, frame type, and frame quality 1246 2) This information could be useful to some third party, 1247 e.g., quality monitoring. 1249 As long as the VMR-WB payload is only packed and unpacked at 1250 either end, encryption may be performed after packet 1251 encapsulation so that there is no conflict between the two 1252 operations. 1254 Interleaving may affect encryption. Depending on the 1255 encryption scheme used, there may be restrictions on, for 1256 example, the time when keys can be changed. Specifically, the 1257 key change may need to occur at the boundary between 1258 interleave groups. 1260 The type of encryption method used may impact the error 1261 robustness of the payload data. The error robustness may be 1262 severely reduced when the data is encrypted unless an 1263 encryption method without error-propagation is used, e.g. a 1264 stream cipher. 1266 9.2. 
Authentication 1268 To authenticate the sender of the speech, an external 1269 mechanism MUST be used. It is RECOMMENDED that such a 1270 mechanism protect all the speech data bits. 1272 Data tampering by a man-in-the-middle attacker could result 1273 in erroneous depacketization/decoding that could lower the 1274 speech quality. For example, tampering with the CMR field may 1275 result in speech of a different quality than desired. 1277 To prevent a man-in-the-middle attacker from tampering with 1278 the payload packets, some additional information besides the 1279 speech bits SHOULD be protected. 1281 This may include the payload header, ToC, RTP timestamp, RTP 1282 sequence number, and the RTP marker bit. 1284 9.3. Decoding Validation and Provision for Lost or Late Packets 1286 When processing a received payload packet, if the receiver 1287 finds that the calculated payload length, based on the 1288 information of the session and the values found in the 1289 payload header fields, does not match the size of the received 1290 packet, the receiver SHOULD discard the packet to avoid 1291 potential degradation of speech quality and to invoke the 1292 VMR-WB built-in frame error concealment mechanism. Therefore, 1293 invalid packets SHALL be treated as lost packets. 1295 Late packets (i.e., packets that are unavailable when needed 1296 for decoding at the receiver) SHALL be treated as lost 1297 packets. Furthermore, if the late packet is part of an 1298 interleave group, depending upon the availability of the 1299 other packets in that interleave group, decoding MUST be 1300 resumed from the next (sequential order) available packet. In 1301 other words, the unavailability of a packet in an interleave 1302 group at a certain time SHOULD NOT invalidate the other 1303 packets within that interleave group that MAY arrive later. 1305 10. Payload Format Parameters 1307 This section defines the parameters that may be used to 1308 select optional features in the VMR-WB payload.
The 1309 parameters are defined here as part of the MIME subtype 1310 registration for the VMR-WB speech codec. A mapping of the 1311 parameters into the Session Description Protocol (SDP) [5] is 1312 also provided for those applications that use SDP. Equivalent 1313 parameters could be defined elsewhere for use with control 1314 protocols that do not use MIME or SDP. 1316 The data format and parameters are specified for both real- 1317 time transport in RTP and for storage type applications such as e-mail 1318 attachments. 1320 10.1. VMR-WB MIME Registration 1322 The MIME subtype for the Variable-Rate Multimode Wideband 1323 (VMR-WB) audio codec is allocated from the IETF tree since 1324 VMR-WB is expected to be a widely used speech codec in 1325 multimedia streaming and messaging as well as VoIP 1326 applications. This MIME registration covers both real-time 1327 transfer via RTP and non-real-time transfers via stored 1328 files. 1330 Note, the receiver MUST ignore any unspecified parameter and 1331 use the default values instead. 1333 Media Type name: audio 1335 Media subtype name: VMR-WB 1337 Required parameters: none 1339 Note that if no input parameters are defined, the default 1340 values will be used. 1342 Also note that "crc" and "robust-sorting" parameters from RFC 1343 3267 [4] are not applicable to VMR-WB RTP payload and storage 1344 file formats. To ensure compatibility between VMR-WB and 1345 AMR-WB in the interoperable sessions, one SHOULD make sure 1346 that AMR-WB does not utilize crc and robust-sorting (i.e., 1347 these options are deactivated in the session initiation). 1349 OPTIONAL parameters: 1350 These parameters apply to RTP transfer only. 1352 payload_format: Permissible values are 0 and 1. If 1, 1353 octet-aligned payload format SHALL be used. 1354 If 0 or if not present, header-free payload 1355 format is employed (default). 
1357 maxptime: The maximum amount of media that can be 1358 encapsulated in a payload packet, expressed 1359 as time in milliseconds. The time is 1360 calculated as the sum of the time the media 1361 present in the packet represents. The time 1362 SHALL be an integer multiple of the frame 1363 size. If this parameter is not present, the 1364 sender MAY encapsulate any number of speech 1365 frames into one RTP packet. 1367 interleaving: Indicates that frame-block level 1368 interleaving SHALL be used for the session 1369 and its value defines the maximum number of 1370 frame-blocks allowed in an interleaving 1371 group (see Section 6.3.2). If this 1372 parameter is not present, interleaving 1373 SHALL NOT be used. The presence of this 1374 parameter also automatically implies that 1375 octet-aligned operation SHALL be used. 1377 ptime: See RFC 2327 [5]. It SHALL be at least one 1378 frame size for VMR-WB. 1380 channels: The number of audio channels. The possible 1381 values and their respective channel order 1382 are specified in Section 4.1 of [10]. If 1383 omitted, it has the default value of 1. 1385 These parameters apply to both real-time and non-real-time 1386 transfers: 1388 dtx: Permissible values are 0 and 1. The default 1389 is 0 (i.e., No DTX) where VMR-WB normally 1390 operates as a continuous variable-rate 1391 codec. If dtx=1, the VMR-WB codec will 1392 operate in discontinuous transmission mode 1393 where silence descriptor (SID) frames are 1394 sent by the VMR-WB encoder during silence 1395 intervals with an adjustable update 1396 frequency. The selection of the SID update- 1397 rate depends on the implementation and 1398 other network considerations that are 1399 beyond the scope of this specification. 1401 Encoding considerations: 1402 This type is defined for transfer via both RTP (RFC 1403 3550) and stored-file methods as described in Sections 1404 6 and 7, respectively, of RFC XXXX.
Audio data is 1405 binary data, and must be encoded for non-binary 1406 transport; the Base64 encoding is suitable for e-mail. 1408 Security considerations: 1409 See Section 9 of RFC XXXX. 1411 Public specification: 1412 The VMR-WB speech codec is specified in the following 1413 3GPP2 specification: C.S0052-0 version 1.0. 1414 Transfer methods are specified in RFC XXXX. 1416 Additional information: 1417 The following applies to stored-file transfer methods: 1419 Magic numbers: 1420 Single channel (for the non-interoperable modes) 1421 ASCII character string "#!VMR-WB\n" 1422 (or 0x2321564d522d57420a in hexadecimal) 1424 Single channel (for the interoperable mode) 1425 ASCII character string "#!VMR-WB_I\n" 1426 (or 0x2321564d522d57425F490a in hexadecimal) 1428 Multi-channel (for the non-interoperable modes) 1429 ASCII character string "#!VMR-WB_MC1.0\n" 1430 (or 0x2321564d522d57425F4D43312E300a in hexadecimal) 1432 Multi-channel (for the interoperable mode) 1433 ASCII character string "#!VMR-WB_MCI1.0\n" 1434 (or 0x2321564d522d57425F4D4349312E300a in hexadecimal) 1436 File extensions for the non-interoperable modes: vmr, VMR 1437 Macintosh file type code: none 1438 Object identifier or OID: none 1440 File extensions for the interoperable mode: vmi, VMI 1441 Macintosh file type code: none 1442 Object identifier or OID: none 1444 Person & email address to contact for further information: 1445 Sassan Ahmadi, Ph.D. Nokia Inc. USA 1446 sassan.ahmadi@nokia.com 1448 Intended usage: COMMON. 1449 It is expected that many VoIP, multimedia messaging and 1450 streaming applications (as well as mobile applications) 1451 will use this type. 1453 Author/Change controller: 1454 Sassan Ahmadi, Ph.D. Nokia Inc. USA 1455 sassan.ahmadi@nokia.com 1456 IETF Audio/Video Transport Working Group 1458 10.2.
Mapping MIME Parameters into SDP

   The information carried in the MIME media type specification
   has a specific mapping to fields in the Session Description
   Protocol (SDP) [5], which is commonly used to describe RTP
   sessions. When SDP is used to specify sessions employing the
   VMR-WB codec, the mapping is as follows:

   - The MIME type ("audio") goes in SDP "m=" as the media
     name.
   - The MIME subtype (payload format name) goes in SDP
     "a=rtpmap" as the encoding name. The RTP clock rate in
     "a=rtpmap" MUST be 16000 for VMR-WB (note that 8000 is
     also supported by VMR-WB for narrowband I/O processing),
     and the encoding parameters (number of channels) MUST
     either be explicitly set to N or omitted, implying a
     default value of 1. The allowed values of N are
     specified in Section 4.1 of [10].

   - The parameters "ptime" and "maxptime" go in the SDP
     "a=ptime" and "a=maxptime" attributes, respectively.
   - Any remaining parameters go in the SDP "a=fmtp" attribute
     by copying them directly from the MIME media type string
     as a semicolon-separated list of parameter=value pairs.

   Some example SDP session descriptions utilizing VMR-WB
   encodings follow. In these examples, long a=fmtp lines are
   folded to meet the column width constraints of this document;
   the backslash ("\") at the end of a line and the
   carriage return that follows it should be ignored.
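
   As a rough illustration of the mapping rules listed above, the
   following Python sketch assembles SDP lines from a dictionary of
   MIME parameters. The function name mime_to_sdp and its arguments
   are hypothetical and are not defined by this specification; the
   order in which the attribute lines are emitted is likewise an
   arbitrary choice made for the sketch.

```python
def mime_to_sdp(port, payload_type, params, clock_rate=16000):
    """Illustrative only: map VMR-WB MIME parameters to SDP lines.

    mime_to_sdp and its arguments are hypothetical names, not part of
    the payload format specification.
    """
    remaining = dict(params)  # work on a copy; leave the caller's dict alone

    # MIME subtype and clock rate go in a=rtpmap; the channel count is
    # appended only when given explicitly (omitting it implies 1).
    rtpmap = f"a=rtpmap:{payload_type} VMR-WB/{clock_rate}"
    channels = remaining.pop("channels", None)
    if channels is not None:
        rtpmap += f"/{channels}"

    # The MIME type "audio" is the media name on the m= line.
    lines = [f"m=audio {port} RTP/AVP {payload_type}", rtpmap]

    # ptime and maxptime have dedicated SDP attributes.
    for name in ("ptime", "maxptime"):
        if name in remaining:
            lines.append(f"a={name}:{remaining.pop(name)}")

    # Any other parameter is copied into a=fmtp as parameter=value pairs.
    if remaining:
        pairs = "; ".join(f"{k}={v}" for k, v in remaining.items())
        lines.append(f"a=fmtp:{payload_type} {pairs}")
    return lines


sdp = mime_to_sdp(49120, 99, {"channels": 2, "interleaving": 30,
                              "dtx": 1, "maxptime": 100})
print("\n".join(sdp))
```

   The sketch does not validate parameter values (for example, that
   "maxptime" is an integer multiple of the frame size); a real
   implementation would need to add such checks.
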

   Example of usage of VMR-WB in a possible VoIP scenario
   (wideband audio):

      m=audio 49120 RTP/AVP 98
      a=rtpmap:98 VMR-WB/16000
      a=fmtp:98 payload_format=1

   Example of usage of VMR-WB in a possible VoIP scenario
   (narrowband audio):

      m=audio 49120 RTP/AVP 98
      a=rtpmap:98 VMR-WB/8000
      a=fmtp:98

   Example of usage of VMR-WB in a possible streaming scenario
   (two channel stereo):

      m=audio 49120 RTP/AVP 99
      a=rtpmap:99 VMR-WB/16000/2
      a=fmtp:99 interleaving=30
      a=maxptime:100 payload_format=1

10.3. Offer-Answer Model Considerations

   To achieve good interoperability for the VMR-WB RTP payload in
   an Offer-Answer negotiation in SDP, the following
   considerations SHOULD be made:

   - Both header-free and octet-aligned payload formats MAY be
     offered by a VMR-WB enabled terminal. However, for an
     interoperable interconnection with AMR-WB, only the
     octet-aligned payload format SHALL be used.

   - The parameters "maxptime" and "ptime" should in most cases
     not affect interoperability; however, the setting of these
     parameters can affect the performance of the application.

   - To maintain interoperability with AMR-WB in cases where
     negotiation is possible using the VMR-WB interoperable mode,
     a VMR-WB enabled terminal SHOULD also declare itself capable
     of AMR-WB with a limited mode set (i.e., only AMR-WB codec
     modes 0, 1, and 2 are allowed) and octet-aligned mode of
     operation. Example:

      m=audio 49120 RTP/AVP 98 99
      a=rtpmap:98 VMR-WB/16000/1
      a=rtpmap:99 AMR-WB/16000/1
      a=fmtp:99 octet-align=1; mode-set=0,1,2

11. IANA Considerations

   The new attributes "dtx" and "payload_format" need to be
   registered. The definition of the "maxptime" attribute used in
   this specification is consistent with the corresponding
   parameter in RFC 3267.

12.
Acknowledgements

   The author would like to thank Redwan Salami of VoiceAge
   Corporation, Ari Lakaniemi of Nokia Inc., and IETF/AVT chairs
   Colin Perkins and Magnus Westerlund for their technical
   comments to improve this document.

   Also, the author would like to acknowledge that some parts of
   RFC 3267 [4] and RFC 3558 [11] have been used in this
   document.

References

Normative References

   [1]  3GPP2 C.S0052-0, "Source-Controlled Variable-Rate
        Multimode Wideband Speech Codec (VMR-WB) Service Option
        62 for Wideband Spread Spectrum Communication Systems",
        3GPP2 Technical Specification, June 2004.

   [2]  S. Bradner, "Key words for use in RFCs to Indicate
        Requirement Levels", IETF RFC 2119, March 1997.

   [3]  H. Schulzrinne, S. Casner, R. Frederick, and V.
        Jacobson, "RTP: A Transport Protocol for Real-Time
        Applications", IETF RFC 3550, July 2003.

   [4]  J. Sjoberg, et al., "Real-Time Transport Protocol (RTP)
        Payload Format and File Storage Format for the Adaptive
        Multi-Rate (AMR) and Adaptive Multi-Rate Wideband
        (AMR-WB) Audio Codecs", IETF RFC 3267, June 2002.

   [5]  M. Handley and V. Jacobson, "SDP: Session Description
        Protocol", IETF RFC 2327, April 1998.

Informative References

   [6]  M. Handley, S. Floyd, J. Padhye, and J. Widmer, "TCP
        Friendly Rate Control (TFRC): Protocol Specification",
        IETF RFC 3448, January 2003.

   [7]  J. Rosenberg and H. Schulzrinne, "An RTP Payload Format
        for Generic Forward Error Correction", IETF RFC 2733,
        December 1999.

   [8]  Baugher, et al., "The Secure Real Time Transport
        Protocol", IETF Draft (Work in Progress), November 2001.

   [9]  C. Perkins, et al., "RTP Payload for Redundant Audio
        Data", IETF RFC 2198, September 1997.

   [10] H. Schulzrinne and S. Casner, "RTP Profile for Audio and
        Video Conferences with Minimal Control", IETF RFC 3551,
        July 2003.

   [11] A.
        Li, "RTP Payload Format for Enhanced Variable Rate
        Codecs (EVRC) and Selectable Mode Vocoders (SMV)", IETF
        RFC 3558, July 2003.

   [12] 3GPP TS 26.193, "AMR Wideband Speech Codec; Source
        Controlled Rate operation", version 5.0.0 (2001-03), 3rd
        Generation Partnership Project (3GPP).

   Any 3GPP2 document can be downloaded from the 3GPP2 web
   server, "http://www.3gpp2.org/", see specifications.

Author's Address

   The editor will serve as the point of contact for all
   technical matters related to this document.

   Dr. Sassan Ahmadi              Phone: 1 (858) 831-5916
                                  Fax:   1 (858) 831-4174
   Nokia Inc.                     Email: sassan.ahmadi@nokia.com
   12278 Scripps Summit Dr.
   San Diego, CA 92131 USA

   This Internet-Draft expires in six months from May 17, 2004.

Full Copyright Statement

   Copyright (C) The Internet Society (2004). All Rights
   Reserved.

   This document and translations of it may be copied and
   furnished to others, and derivative works that comment on or
   otherwise explain it or assist in its implementation may be
   prepared, copied, published and distributed, in whole or in
   part, without restriction of any kind, provided that the
   above copyright notice and this paragraph are included on all
   such copies and derivative works. However, this document
   itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or
   other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards
   process must be followed, or as required to translate it into
   languages other than English.

   The limited permissions granted above are perpetual and will
   not be revoked by the Internet Society or its successors or
   assignees.

   This document and the information contained herein is
   provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE
   INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES,
   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY
   THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY
   RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
   FITNESS FOR A PARTICULAR PURPOSE.