Internet Draft                                                Adam H. Li
draft-ietf-avt-evrc-smv-02.txt                                      UCLA
June 7, 2002                                                      Editor
Expires: December 7, 2002


    RTP Payload Format for Enhanced Variable Rate Codecs (EVRC) and
                     Selectable Mode Vocoders (SMV)


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as work in progress.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

ABSTRACT

   This document describes the RTP payload format for Enhanced Variable
   Rate Codec (EVRC) Speech and Selectable Mode Vocoder (SMV) Speech.
   Two sub-formats are specified for different application scenarios. A
   bundled/interleaved format is included to reduce the effect of packet
   loss on speech quality and amortize the overhead of the RTP header
   over more than one speech frame. A non-bundled format is also
   supported for conversational applications.

Table of Contents

   1. Introduction
   2. Background
   3. The Codecs Supported
   3.1. EVRC
   3.2. SMV
   3.3. Other Frame-Based Vocoders
   4. RTP/Vocoder Packet Format
   4.1. Interleaved/Bundled Packet Format
   4.2. Header-Free Packet Format
   4.3. Determining the Format of Packets
   5. Packet Table of Contents Entries and Codec Data Frame Format
   5.1. Packet Table of Contents entries
   5.2. Codec Data Frames
   6. Interleaving Codec Data Frames
   7. Bundling Codec Data Frames
   8. Handling Missing Codec Data Frames
   9. Implementation Issues
   9.1. Interleaving Length
   9.2. Validation of Received Packets
   9.3. Processing the Late Packets
   10.
Mode Request
   11. Storage Format
   12. IANA Considerations
   12.1. Registration of Media Type EVRC
   12.2. Registration of Media Type EVRC0
   12.3. Registration of Media Type SMV
   12.4. Registration of Media Type SMV0
   13. Mapping to SDP Parameters
   14. Security Considerations
   15. Adding Support of Other Frame-Based Vocoders
   16. Acknowledgements
   17. References
   18. Authors' Address

1. Introduction

   This document describes how speech compressed with EVRC [1] or SMV
   [2] may be formatted for use as an RTP payload type. The format is
   also extensible to other codecs that generate a similar set of frame
   types. Two methods are provided to packetize the codec data frames
   into RTP packets: an interleaved/bundled format and a header-free
   format. The sender may choose the best format for each application
   scenario, based on network conditions, bandwidth availability, delay
   requirements, and packet-loss tolerance.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].

2. Background

   The 3rd Generation Partnership Project 2 (3GPP2) has published two
   standards which define speech compression algorithms for CDMA
   applications: EVRC [1] and SMV [2]. EVRC is currently deployed in
   millions of first and second generation CDMA handsets.
   SMV is the preferred speech codec standard for CDMA2000, and will be
   deployed in third generation handsets in addition to EVRC.
   Improvements and new codecs will keep emerging as technology
   improves, and future handsets will likely support multiple codecs.

   The formats of the EVRC and SMV codec frames are very similar. Many
   other vocoders also share common characteristics, and have many
   similar application scenarios. This parallelism enables an RTP
   payload format to be designed for EVRC and SMV that may also support
   other, similar vocoders with minimal additional specification work.
   This can simplify the protocol for transporting vocoder data frames
   through RTP and reduce the complexity of implementations.

3. The Codecs Supported

3.1. EVRC

   The Enhanced Variable Rate Codec (EVRC) [1] compresses each 20
   milliseconds of 8000 Hz, 16-bit sampled speech input into output
   frames in one of three different sizes: Rate 1 (171 bits), Rate
   1/2 (80 bits), or Rate 1/8 (16 bits). In addition, there are two
   zero-bit codec frame types: null frames and erasure frames. Null
   frames are produced as a result of the vocoder running at rate 0.
   Null frames are zero bits long and are normally not transmitted.
   Erasure frames are the frames that the receiver passes to the codec
   in place of lost or damaged frames. Erasure frames are also zero
   bits long and are normally not transmitted.

   The codec chooses the output frame rate based on analysis of the
   input speech and the current operating mode (either normal or one of
   several reduced rate modes). For typical speech patterns, this
   results in an average output of 4.2 kilobits/second for normal mode
   and a lower average output for reduced rate modes.

3.2.
SMV

   The Selectable Mode Vocoder (SMV) [2] compresses each 20 milliseconds
   of 8000 Hz, 16-bit sampled speech input into output frames of one of
   four different sizes: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate
   1/4 (40 bits), or Rate 1/8 (16 bits). In addition, there are two
   zero-bit codec frame types: null frames and erasure frames. Null
   frames are produced as a result of the vocoder running at rate 0.
   Null frames are zero bits long and are normally not transmitted.
   Erasure frames are the frames that the receiver passes to the codec
   in place of lost or damaged frames. Erasure frames are also zero
   bits long and are normally not transmitted.

   The SMV codec can operate in four modes. Each mode may produce frames
   of any of the rates (full rate to 1/8 rate) for varying percentages
   of time, based on the characteristics of the speech samples and the
   selected mode. The SMV mode can change on a frame-by-frame basis. The
   SMV codec does not need additional information other than the codec
   data frames to correctly decode the data of various modes; therefore,
   the mode of the encoder does not need to be transmitted with the
   encoded frames.

   The SMV codec chooses the output frame rate based on analysis of the
   input speech and the current operating mode. For typical speech
   patterns, this results in an average output of 4.2 kilobits/second
   for Mode 0 in two-way conversation (approximately 50% active speech
   time and 50% in eighth rate while listening) and lower for other
   reduced rate modes. SMV is more bandwidth efficient than EVRC. EVRC
   is equivalent in performance to SMV mode 1.

3.3.
Other Frame-Based Vocoders

   Other frame-based vocoders can be carried in the packet format
   defined in this document, as long as they possess the following
   properties:

   o The codec is frame-based;
   o blank and erasure frames are supported;
   o the total number of rates is less than 17;
   o the maximum full rate frame can be transported in a single RTP
     packet using this specific format.

   Vocoders with the characteristics listed above can be transported
   using the packet format specified in this document with some
   additional specification work; the pieces that must be defined are
   listed in Section 15.

4. RTP/Vocoder Packet Format

   The vocoder speech data may be transmitted in either of the two RTP
   packet formats specified in the following two subsections, as
   appropriate for the application scenario. In the packet format
   diagrams shown in this document, bit 0 is the most significant bit.

4.1. Interleaved/Bundled Packet Format

   This format is used to send one or more vocoder frames per packet.
   Interleaving or bundling MAY be used. The RTP packet for this format
   is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |R|R| LLL | NNN | MMM |  Count  |  TOC  |  ...  |  TOC  |padding|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       one or more codec data frames, one per TOC entry        |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described in the RTP
   specification [4]. The RTP timestamp is in 1/8000 of a second units
   for EVRC and SMV. For any other vocoders that use this packet format,
   the timestamp unit needs to be defined explicitly.
   The M bit should be set as specified in the applicable RTP profile,
   for example, RFC 1890 [5]. Note that RFC 1890 [5] specifies that if
   the sender does not suppress silence, the M bit will always be zero.
   When multiple codec data frames are present in a single RTP packet,
   the timestamp is that of the oldest data represented in the RTP
   packet. The assignment of an RTP payload type for this packet format
   is outside the scope of this document; it is specified by the RTP
   profile under which this payload format is used.

   The first octet of an Interleaved/Bundled format packet is the
   Interleave Octet. The second octet contains the Mode Request and
   Frame Count fields. The Table of Contents (ToC) field then follows.
   The fields are specified as follows:

   Reserved (RR): 2 bits
      Reserved bits. MUST be set to zero by sender, SHOULD be ignored
      by receiver.

   Interleave Length (LLL): 3 bits
      Indicates the length of interleave; a value of 0 indicates
      bundling, a special case of interleaving. See Section 6 and
      Section 7 for more detailed discussion.

   Interleave Index (NNN): 3 bits
      Indicates the index within an interleave group. MUST have a value
      less than or equal to the value of LLL. Values of NNN greater
      than the value of LLL are invalid. Packets with invalid NNN
      values SHOULD be ignored by the receiver.

   Mode Request (MMM): 3 bits
      The Mode Request field is used to signal Mode Request
      information. See Section 10 for details.

   Frame Count (Count): 5 bits
      The number of ToC fields (and vocoder frames) present in the
      packet is the value of the frame count field plus one. A value of
      zero indicates that the packet contains one ToC field, while a
      value of 31 indicates that the packet contains 32 ToC fields.

   Padding (padding): 0 or 4 bits
      This padding ensures that codec data frames start on an octet
      boundary.
      When the number of frames in the packet (the value of the frame
      count field plus one) is odd, the sender MUST add 4 bits of
      padding following the last ToC entry. When the number of frames
      is even, the sender MUST NOT add padding bits. If padding is
      present, the padding bits MUST be set to zero by sender, and
      SHOULD be ignored by receiver.

   The Table of Contents field (ToC) provides information on the codec
   data frame(s) in the packet. There is one ToC entry for each codec
   data frame. The detailed formats of the ToC field and codec data
   frames are specified in Section 5.

   Multiple data frames may be included within an Interleaved/Bundled
   packet using interleaving or bundling as described in Section 6 and
   Section 7.

4.2. Header-Free Packet Format

   The Header-Free Packet Format is designed for maximum bandwidth
   efficiency and low latency. Only one codec data frame can be sent in
   each Header-Free format packet. Neither the payload header fields
   (LLL, NNN, MMM, Count) nor ToC entries are present. The codec rate
   for the data frame can be determined from the length of the codec
   data frame, since there is only one codec data frame in each
   Header-Free packet.

   Use of the RTP header fields for Header-Free RTP/Vocoder Packet
   Format is the same as described in Section 4.1 for
   Interleaved/Bundled RTP/Vocoder Packet Format. The detailed format of
   the codec data frame is specified in Section 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                                                               |
   +           ONLY one codec data frame           +-+-+-+-+-+-+-+-+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3. Determining the Format of Packets

   All receivers SHOULD be able to process both packet formats. The
   sender MAY choose to use one or both packet formats.
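
   As a non-normative illustration of the Interleaved/Bundled layout in
   Section 4.1, the payload header could be decoded roughly as in the
   Python sketch below. The function name and the dictionary keys are
   assumptions made for this example only. The sketch also assumes that
   4 bits of padding are present whenever the number of ToC entries is
   odd, which is what keeps the codec data on an octet boundary.

```python
def parse_interleaved_bundled_header(payload: bytes) -> dict:
    """Decode the payload header of an Interleaved/Bundled packet.

    `payload` is the RTP payload only (RTP header already removed).
    Illustrative sketch; names are not defined by the specification.
    """
    if len(payload) < 3:
        raise ValueError("payload too short for header and one ToC entry")

    # Octet 1: |R|R| LLL | NNN |  (bit 0 is the most significant bit)
    lll = (payload[0] >> 3) & 0x07
    nnn = payload[0] & 0x07

    # Octet 2: | MMM |  Count  |
    mmm = (payload[1] >> 5) & 0x07
    n_frames = (payload[1] & 0x1F) + 1      # Count field plus one

    # ToC entries are 4 bits each; an odd number is followed by padding.
    toc_octets = (n_frames + 1) // 2
    if len(payload) < 2 + toc_octets:
        raise ValueError("payload too short for the signalled ToC")

    toc = []
    for i in range(n_frames):
        octet = payload[2 + i // 2]
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)

    return {
        "LLL": lll,
        "NNN": nnn,
        "MMM": mmm,
        "toc": toc,
        "data_offset": 2 + toc_octets,   # first octet of codec data
    }
```

   For example, the four octets 0x08 0x42 0x43 0x10 decode to LLL=1,
   NNN=0, MMM=2 and three ToC entries (4, 3, 1), with the codec data
   starting at the fifth octet.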
   A receiver MUST have prior knowledge of the packet format to
   correctly decode the RTP packets. When packets of both formats are
   used within the same session, different RTP payload type values MUST
   be used for each format to distinguish the packet formats. The
   association of payload type number with the packet format is done
   out-of-band, for example by SDP during the setup of a session.

5. Packet Table of Contents Entries and Codec Data Frame Format

5.1. Packet Table of Contents entries

   Each codec data frame in an Interleaved/Bundled packet has a
   corresponding Table of Contents (ToC) entry. The ToC entry indicates
   the rate of the codec frame. (Header-Free packets MUST NOT have a ToC
   field.)

   Each ToC entry occupies four bits. The format of the bits is
   indicated below:

       0 1 2 3
      +-+-+-+-+
      |fr type|
      +-+-+-+-+

   Frame Type: 4 bits
      The frame type indicates the type of the corresponding codec data
      frame in the RTP packet.

   For EVRC and SMV codecs, the frame type values and size of the
   associated codec data frame are described in the table below:

   Value   Rate      Total codec data frame size (in octets)
   ---------------------------------------------------------
     0     Blank      0    (0 bits)
     1     1/8        2    (16 bits)
     2     1/4        5    (40 bits; not valid for EVRC)
     3     1/2       10    (80 bits)
     4     1         22    (171 bits; 5 zero bits padded at end)
     5     Erasure    0    (SHOULD NOT be transmitted by sender)

   All values not listed in the above table MUST be considered reserved.
   A ToC entry with a reserved Frame Type value SHOULD be considered
   invalid. Note that the EVRC codec does not have 1/4 rate frames, thus
   frame type value 2 MUST be considered a reserved value when the EVRC
   codec is in use.

   Other vocoders that use this packet format need to specify their own
   table of frame types and corresponding codec data frames.

5.2.
Codec Data Frames

   The output of the vocoder MUST be converted into codec data frames
   for inclusion in the RTP payload. The conversions for EVRC and SMV
   codecs are specified below. (Note: Because the EVRC codec does not
   have Rate 1/4 frames, the specifications for Rate 1/4 frames do not
   apply to EVRC codec data frames.) Other vocoders that use this packet
   format need to specify how to convert vocoder output data into
   frames.

   The codec output data bits as numbered in EVRC and SMV are packed
   into octets. The lowest numbered bit (bit 1 for Rate 1, Rate 1/2,
   Rate 1/4 and Rate 1/8) is placed in the most significant bit
   (internet bit 0) of octet 1 of the codec data frame, the second
   lowest bit is placed in the second most significant bit of the first
   octet, the third lowest in the third most significant bit of the
   first octet, and so on. This continues until all of the bits have
   been placed in the codec data frame.

   The remaining unused bits of the last octet of the codec data frame
   MUST be set to zero. Note that in EVRC and SMV this is only
   applicable to Rate 1 frames (171 bits) as the Rate 1/2 (80 bits),
   Rate 1/4 (40 bits, SMV only) and Rate 1/8 frames (16 bits) fit
   exactly into a whole number of octets.

   Following is a detailed listing showing a Rate 1 EVRC/SMV codec
   output frame converted into a codec data frame:

   The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 octets
   long. Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are
   placed as indicated, with bits marked with "Z" set to zero. EVRC/SMV
   codec Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted similarly,
   but do not require zero padding because they align on octet
   boundaries.
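
   The packing rule of this section can be sketched in Python as
   follows. This is only an illustration: the function name is
   hypothetical, and the codec output is assumed to be available as a
   sequence of 0/1 values in which element 0 holds codec bit 1.

```python
def pack_codec_frame(bits) -> bytes:
    """Pack codec output bits into a codec data frame.

    bits[0] is codec bit 1 and lands in the most significant bit of
    the first octet. Unused bits of the last octet stay zero.
    """
    n_octets = (len(bits) + 7) // 8
    frame = bytearray(n_octets)          # starts as all-zero octets
    for i, bit in enumerate(bits):
        if bit:
            frame[i // 8] |= 0x80 >> (i % 8)
    return bytes(frame)
```

   With this sketch a 171-bit Rate 1 frame yields 22 octets whose final
   octet carries three codec bits followed by five zero bits, while
   80-bit, 40-bit and 16-bit frames yield 10, 5 and 2 octets with no
   padding.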

                        Rate 1 codec data frame

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
   |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
   |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
   |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
   |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

6. Interleaving Codec Data Frames

   As indicated in Section 4.1, more than one codec data frame MAY be
   included in a single Interleaved/Bundled packet by a sender. This is
   accomplished by interleaving or bundling.

   Bundling is used to spread the transmission overhead of the RTP and
   payload header over multiple vocoder frames. Interleaving
   additionally reduces the listener's perception of data loss by
   spreading such loss over non-consecutive vocoder frames. EVRC, SMV,
   and similar vocoders are able to compensate for an occasional lost
   frame, but speech quality degrades exponentially with consecutive
   frame loss.

   Bundling is signaled by setting the LLL field to zero and the Count
   field to greater than zero. Interleaving is indicated by setting the
   LLL field to a value greater than zero.

   The general discussion of interleaving also applies to bundling,
   which can be viewed as a reduced case of interleaving with lower
   complexity. The bundling case is discussed in detail in Section 7.

   Senders MAY support interleaving and/or bundling.
   All receivers that support the Interleaved/Bundled packet format
   MUST support both interleaving and bundling.

   Given a time-ordered sequence of output frames from the codec
   numbered 0..n, a bundling value B (the value in the Count field plus
   one), and an interleave length L where n = B * (L+1) - 1, the output
   frames are placed into RTP packets as follows (the values of the
   fields LLL and NNN are indicated for each RTP packet):

   First RTP Packet in Interleave group:
      LLL=L, NNN=0
      Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total of
      B frames

   Second RTP Packet in Interleave group:
      LLL=L, NNN=1
      Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
      total of B frames

   This continues to the last RTP packet in the interleave group:

   (L+1)th RTP Packet in Interleave group:
      LLL=L, NNN=L
      Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
      total of B frames

   Within each interleave group, the RTP packets making up the
   interleave group MUST be transmitted in value-increasing order of the
   NNN field. While this does not guarantee reduced end-to-end delay on
   the receiving end, when packets are delivered in order by the
   underlying transport, delay will be reduced to the minimum possible.

   Receivers MAY signal the maximum number of codec data frames (i.e.,
   the maximum acceptable bundling value B) they can handle in a single
   RTP packet using the OPTIONAL maxptime RTP mode parameter identified
   in Section 12.

   Receivers MAY signal the maximum interleave length (i.e., the maximum
   acceptable LLL value in the Interleave Octet) they will accept
   using the OPTIONAL maxinterleave RTP mode parameter identified in
   Section 12.

   The parameters maxptime and maxinterleave are exchanged at the
   initial setup of the session.
   In one-to-one sessions, the sender MUST respect these values set by
   the receiver, and MUST NOT interleave/bundle more packets than the
   receiver signals that it can handle. This ensures that the receiver
   can allocate a known amount of buffer space that will be sufficient
   for all interleaving/bundling used in that session. During the
   session, the sender may decrease the bundling value or interleaving
   length (so that less buffer space is required at the receiver), but
   never exceed the maximum value set by the receiver. This prevents the
   situation where a receiver needs to allocate more buffer space in the
   middle of a session but is unable to do so.

   Additionally, senders have the following restrictions:

   o MUST NOT bundle more codec data frames in a single RTP packet than
     indicated by maxptime (see Section 12) if it is signaled.

   o SHOULD NOT bundle more codec data frames in a single RTP packet
     than will fit in the MTU of the underlying network.

   o Once beginning a session with a given maximum interleaving value
     set by maxinterleave in Section 12, MUST NOT increase the
     interleaving value (LLL) to exceed the maximum interleaving value
     that is signaled.

   o MAY change the interleaving value, but MUST do so only between
     interleave groups.

   o Silence suppression MUST only be used between interleave groups. A
     ToC with Frame Type 0 (Blank Frame, Section 5.1) MUST be used
     within interleaving groups if the codec outputs a blank frame.
     The M bit in the RTP header is not set for these blank frames, as
     the stream is continuous in time. Because there is only one time
     stamp for each RTP packet, silence suppression used within an
     interleave group would cause ambiguities when reconstructing the
     speech at the receiver side, and thus is prohibited.
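
   The frame placement described in this section can be sketched in
   Python as follows. The sketch is illustrative only: the function
   names and the dictionary representation of a packet are assumptions
   of this example, and a missing packet is represented by leaving None
   in the output, where a receiver would substitute an erasure frame.

```python
def interleave(frames, L, B):
    """Split B*(L+1) time-ordered frames into one interleave group.

    Packet N carries frames N, N+(L+1), N+2(L+1), ... (B frames).
    Packets are returned in increasing NNN order, which is also the
    required transmission order. L=0 is plain bundling.
    """
    if not 0 <= L <= 7:
        raise ValueError("LLL is a 3-bit field")
    if not 1 <= B <= 32:
        raise ValueError("a packet carries 1 to 32 frames")
    if len(frames) != B * (L + 1):
        raise ValueError("a group needs exactly B*(L+1) frames")
    return [
        {"LLL": L, "NNN": n, "frames": list(frames[n::L + 1])}
        for n in range(L + 1)
    ]


def deinterleave(packets):
    """Rebuild time order from the received packets of one group.

    Slots belonging to packets that never arrived stay None.
    """
    L = packets[0]["LLL"]
    B = len(packets[0]["frames"])
    out = [None] * (B * (L + 1))
    for p in packets:
        for k, frame in enumerate(p["frames"]):
            out[p["NNN"] + k * (L + 1)] = frame
    return out
```

   With L=2 and B=3, frames 0..8 are carried as (0, 3, 6), (1, 4, 7)
   and (2, 5, 8). If the middle packet is lost, the receiver is missing
   frames 1, 4 and 7, so no two consecutive frames are lost.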

   Given an RTP packet with sequence number S, interleave length (field
   LLL) L, interleave index value (field NNN) N, and bundling value B,
   the interleave group consists of this RTP packet and other RTP
   packets with sequence numbers from (S-N) mod 65536 to (S-N+L) mod
   65536 inclusive. In other words, the interleave group always consists
   of L+1 RTP packets with sequential sequence numbers. The bundling
   value for all RTP packets in an interleave group MUST be the same.

   The receiver determines the expected bundling value for all RTP
   packets in an interleave group by the number of codec data frames
   bundled in the first RTP packet of the interleave group received.
   Note that this may not be the first RTP packet of the interleave
   group if packets are delivered out of order by the underlying
   transport.

7. Bundling Codec Data Frames

   As discussed in Section 6, the bundling of codec data frames is a
   special reduced case of interleaving with LLL value in the Interleave
   Octet set to 0.

   Bundling codec data frames indicates that multiple data frames are
   included consecutively in a packet, because the interleaving length
   (LLL) is 0. The interleaving group is thus reduced to a single RTP
   packet, and the reconstruction of the codec data frames from RTP
   packets becomes a much simpler process.

   Furthermore, the additional restrictions on senders are reduced to:

   o MUST NOT bundle more codec data frames in a single RTP packet than
     indicated by maxptime (see Section 12) if it is signaled.

   o SHOULD NOT bundle more codec data frames in a single RTP packet
     than will fit in the MTU of the underlying network.

8. Handling Missing Codec Data Frames

   The vocoders covered by this payload format support erasure frames as
   an indication when frames are not available.
   The erasure frames are normally used internally by a receiver to
   advance the state of the voice decoder by exactly one frame time for
   each missing frame. Using the information from packet sequence
   number, time stamp, and the M bit, the receiver can detect missing
   codec data frames from RTP packet loss and/or silence suppression,
   and generate corresponding erasure frames. Erasure frames MUST also
   be used in the storage format to record missing frames.

9. Implementation Issues

9.1. Interleaving Length

   The vocoder interpolates the missing speech content when given an
   erasure frame. However, the best quality is perceived by the listener
   when erasure frames are not consecutive. This makes interleaving
   desirable as it increases speech quality when packet loss occurs.

   On the other hand, interleaving can greatly increase the end-to-end
   delay. Where an interactive session is desired, either
   Interleaved/Bundled packet format with interleaving length (field
   LLL) 0 or Header-Free packet format is RECOMMENDED.

   When end-to-end delay is not a primary concern, an interleaving
   length (field LLL) of 4 or 5 is RECOMMENDED as it offers a reasonable
   compromise between robustness and latency.

9.2. Validation of Received Packets

   When receiving an RTP packet, the receiver SHOULD check the validity
   of the ToC fields and match the length of the packet with what is
   indicated by the ToC fields. If any invalidity or mismatch is
   detected, it is RECOMMENDED to discard the received packet to avoid
   potential severe degradation of the speech quality. The discarded
   packet is treated following the same procedure as a lost packet, and
   the discarded data will be replaced with erasure frames.
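
   One possible form of this check is sketched below in Python, using
   the frame sizes from the table in Section 5.1. The function name,
   the codec argument, and the choice to return a boolean are
   assumptions of this example; the payload passed in is assumed to
   exclude the RTP header and any RTP-level padding.

```python
# Codec data frame size in octets for each ToC frame type (Section 5.1).
FRAME_OCTETS = {0: 0, 1: 2, 2: 5, 3: 10, 4: 22, 5: 0}


def is_valid_payload(payload: bytes, codec: str = "SMV") -> bool:
    """Return True if an Interleaved/Bundled payload is self-consistent."""
    if len(payload) < 3:
        return False

    lll = (payload[0] >> 3) & 0x07
    nnn = payload[0] & 0x07
    if nnn > lll:                         # NNN must not exceed LLL
        return False

    n_frames = (payload[1] & 0x1F) + 1    # Count field plus one
    toc_octets = (n_frames + 1) // 2      # includes 4-bit padding if odd
    if len(payload) < 2 + toc_octets:
        return False

    expected = 2 + toc_octets
    for i in range(n_frames):
        octet = payload[2 + i // 2]
        frame_type = octet >> 4 if i % 2 == 0 else octet & 0x0F
        if frame_type not in FRAME_OCTETS:
            return False                  # reserved frame type
        if frame_type == 2 and codec == "EVRC":
            return False                  # EVRC has no Rate 1/4 frames
        expected += FRAME_OCTETS[frame_type]

    return expected == len(payload)
```

   A receiver built along these lines would treat a payload for which
   the function returns False as lost and generate erasure frames in
   its place.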

   On receipt of an RTP packet with an invalid value of the LLL or NNN
   fields, the RTP packet SHOULD be treated as lost by the receiver for
   the purpose of generating erasure frames as described in Section 8.

   On receipt of an RTP packet in an interleave group with other than
   the expected frame count value, the receiver MAY discard codec data
   frames off the end of the RTP packet or add erasure codec data frames
   to the end of the packet in order to manufacture a substitute packet
   with the expected bundling value. The receiver MAY instead choose to
   discard the whole interleave group.

9.3. Processing the Late Packets

   Assume that the receiver has begun playing frames from an interleave
   group. The time has come to play frame x from packet n of the
   interleave group. Further assume that packet n of the interleave
   group has not been received. As described in Section 8, an erasure
   frame will be sent to the receiving vocoder.

   Now, assume that packet n of the interleave group arrives before
   frame x+1 of that packet is needed. Receivers should use frame x+1 of
   the newly received packet n rather than substituting an erasure
   frame. In other words, just because packet n was not available the
   first time it was needed to reconstruct the interleaved speech, the
   receiver should not assume it is not available when it is
   subsequently needed for interleaved speech reconstruction.

10. Mode Request

   The Mode Request signal requests a particular encoding mode for the
   speech encoding in the reverse direction. All implementations are
   RECOMMENDED to honor the Mode Request signal. The Mode Request signal
   SHOULD only be used in one-to-one sessions. In multiparty sessions,
   any received Mode Request signals SHOULD be ignored.

   In addition, the Mode Request signal MAY also be sent through non-RTP
   means, which is out of the scope of this specification.

   The three-bit Mode Request field is used to signal the receiver to
   set its audio encoder to a particular encoding mode. If the Mode
   Request field is set to a non-zero value in RTP packets from node A
   to node B, it is a request for node B to change its audio encoder to
   the requested encoding mode, and therefore to change the bit rate of
   the RTP stream from node B to node A. Once a node sets this field to
   a non-zero value, it SHOULD continue to set the field to the same
   value in subsequent packets until the requested mode changes. This
   design helps to avoid leaving the codec stuck in an unintended state
   if one of the packets that carries the Mode Request is lost. An
   otherwise silent node MAY send an RTP packet containing a blank
   frame in order to send a Mode Request.

   Each codec type using this format SHOULD define its own
   interpretation of the Mode Request field. Codecs SHOULD follow the
   convention that higher values of the three-bit field correspond to
   an equal or lower average output bit rate.

   For the EVRC codec, the Mode Request field MUST be interpreted
   according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec
   specifications [1]. Values above '100' (4) are currently reserved.
   If an unknown value above '100' (4) is received, it MUST be handled
   as if '100' (4) were received, for interoperability with potential
   future revisions.

   For the SMV codec, the Mode Request field MUST be interpreted
   according to Table 2.2-2 of the SMV codec specifications [2]. Values
   above '101' (5) are currently reserved. If an unknown value above
   '101' (5) is received, it MUST be handled as if '101' (5) were
   received, also for interoperability with potential future revisions.

11. Storage Format

   The storage format is used for storing speech frames, e.g., as a
   file or e-mail attachment.
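
   The file layout that this section defines (a magic number, then a
   one-octet ToC entry in front of each codec data frame, with lost
   frames recorded as erasure frames) can be sketched as follows. This
   is a minimal illustration, not a reference implementation; the
   function name write_storage and the convention of passing None for a
   missing frame are hypothetical.

```python
import io

# Magic numbers given in this section for each codec.
MAGIC = {"EVRC": b"#!EVRC\n", "SMV": b"#!SMV\n"}
ERASURE_TOC = 5  # frame type 5 records a lost or non-received frame


def write_storage(stream, codec, frames):
    """Write frames to a binary stream in the storage format.

    Each item of frames is either a (toc, data) pair or None for a
    frame that was lost or never received.
    """
    stream.write(MAGIC[codec])
    for frame in frames:
        if frame is None:
            stream.write(bytes([ERASURE_TOC]))
            continue
        toc, data = frame
        # ToC extended to one octet: four most significant bits are zero.
        stream.write(bytes([toc & 0x0F]))
        stream.write(data)
```

   For instance, writing one full-rate frame (ToC value 4) followed by
   a lost frame produces the magic number, the octet 0x04, the frame
   data, and then the single octet 0x05.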

   The file begins with a magic number to identify the vocoder that is
   used. The magic number for EVRC corresponds to the ASCII character
   string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A". The
   magic number for SMV corresponds to the ASCII character string
   "#!SMV\n", i.e., "0x23 0x21 0x53 0x4D 0x56 0x0A".

   The codec data frames are stored in consecutive order, with a single
   ToC entry field, extended to one octet, prefixing each codec data
   frame. The ToC field is extended to one octet by setting the four
   most significant bits of the octet to zero. For example, a ToC value
   of 4 (a full-rate frame) is stored as 0x04.

   Speech frames lost in transmission and non-received frames MUST be
   stored as erasure frames (frame type 5, see definition in Section
   5.1) to maintain synchronization with the original media.

12. IANA Considerations

   Four new MIME sub-types as described in this section are to be
   registered.

   The MIME-names for the EVRC and SMV codecs are allocated from the
   IETF tree since all the vocoders covered are expected to be widely
   used for Voice-over-IP applications.

12.1. Registration of Media Type EVRC

   Media Type Name: audio

   Media Subtype Name: EVRC

   Required parameters: none

   Optional parameters:
      The following parameters apply to RTP transfer only.

      ptime: Defined as usual for RTP audio (see RFC 2327).

      maxptime: The maximum amount of media which can be encapsulated
         in each packet, expressed as time in milliseconds. The time
         SHALL be calculated as the sum of the time the media present
         in the packet represents. The time SHOULD be a multiple of the
         duration of a single codec data frame (20 msec). If not
         signaled, the default maxptime value SHALL be 200
         milliseconds.

      maxinterleave: Maximum number for interleaving length (field LLL
         in the Interleaving Octet).
         The interleaving lengths used in the entire session MUST NOT
         exceed this maximum value. If not signaled, the maxinterleave
         length SHALL be 5.

   Encoding considerations:
      This type is defined for transfer of EVRC-encoded data via RTP
      using the Interleaved/Bundled packet format specified in Sections
      4.1, 6, and 7 of RFC xxxx. It is also defined for other transfer
      methods using the storage format specified in Section 11 of RFC
      xxxx.

   Security considerations:
      See Section 14 "Security Considerations" of RFC xxxx.

   Public specification:
      The EVRC vocoder is specified in 3GPP2 C.S0014.
      Transfer methods are specified in RFC xxxx.

   Additional information:
      The following information applies to the storage format only.

      Magic number: #!EVRC\n (see Section 11 of RFC xxxx)
      File extensions: evc, EVC
      Macintosh file type code: none
      Object identifier or OID: none

   Intended usage:
      COMMON. It is expected that many VoIP applications (as well as
      mobile applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

12.2. Registration of Media Type EVRC0

   Media Type Name: audio

   Media Subtype Name: EVRC0

   Required parameters: none

   Optional parameters: none

   Encoding considerations:
      This type is only defined for transfer of EVRC-encoded data via
      RTP using the Header-Free packet format specified in Section 4.2
      of RFC xxxx.

   Security considerations:
      See Section 14 "Security Considerations" of RFC xxxx.

   Public specification:
      The EVRC vocoder is specified in 3GPP2 C.S0014.
      Transfer methods are specified in RFC xxxx.

   Additional information: none

   Intended usage:
      COMMON. It is expected that many VoIP applications (as well as
      mobile applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

12.3. Registration of Media Type SMV

   Media Type Name: audio

   Media Subtype Name: SMV

   Required parameters: none

   Optional parameters:
      The following parameters apply to RTP transfer only.

      ptime: Defined as usual for RTP audio (see RFC 2327).

      maxptime: The maximum amount of media which can be encapsulated
         in each packet, expressed as time in milliseconds. The time
         SHALL be calculated as the sum of the time the media present
         in the packet represents. The time SHOULD be a multiple of the
         duration of a single codec data frame (20 msec). If not
         signaled, the default maxptime value SHALL be 200
         milliseconds.

      maxinterleave: Maximum number for interleaving length (field LLL
         in the Interleaving Octet). The interleaving lengths used in
         the entire session MUST NOT exceed this maximum value. If not
         signaled, the maxinterleave length SHALL be 5.

   Encoding considerations:
      This type is defined for transfer of SMV-encoded data via RTP
      using the Interleaved/Bundled packet format specified in Sections
      4.1, 6, and 7 of RFC xxxx. It is also defined for other transfer
      methods using the storage format specified in Section 11 of RFC
      xxxx.

   Security considerations:
      See Section 14 "Security Considerations" of RFC xxxx.

   Public specification:
      The SMV vocoder is specified in 3GPP2 C.S0030-0 v2.0.
      Transfer methods are specified in RFC xxxx.

   Additional information:
      The following information applies to the storage format only.

      Magic number: #!SMV\n (see Section 11 of RFC xxxx)
      File extensions: smv, SMV
      Macintosh file type code: none
      Object identifier or OID: none

   Intended usage:
      COMMON.
      It is expected that many VoIP applications (as well as mobile
      applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

12.4. Registration of Media Type SMV0

   Media Type Name: audio

   Media Subtype Name: SMV0

   Required parameters: none

   Optional parameters: none

   Encoding considerations:
      This type is only defined for transfer of SMV-encoded data via
      RTP using the Header-Free packet format specified in Section 4.2
      of RFC xxxx.

   Security considerations:
      See Section 14 "Security Considerations" of RFC xxxx.

   Public specification:
      The SMV vocoder is specified in 3GPP2 C.S0030-0 v2.0.
      Transfer methods are specified in RFC xxxx.

   Additional information: none

   Intended usage:
      COMMON. It is expected that many VoIP applications (as well as
      mobile applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

13. Mapping to SDP Parameters

   Please note that this section applies to the RTP transfer only.

   The information carried in the MIME media type specification has a
   specific mapping to fields in the Session Description Protocol (SDP)
   [6], which is commonly used to describe RTP sessions. When SDP is
   used to specify sessions employing the EVRC or SMV codec, the
   mapping is as follows:

   o  The MIME type ("audio") goes in SDP "m=" as the media name.

   o  The MIME subtype (payload format name) goes in SDP "a=rtpmap"
      as the encoding name.

   o  The parameters "ptime" and "maxptime" go in the SDP "a=ptime"
      and "a=maxptime" attributes, respectively.

   o  The parameter "maxinterleave" goes in the SDP "a=fmtp"
      attribute by copying it directly from the MIME media type string
      as "maxinterleave=value".

   Some examples of SDP session descriptions for EVRC and SMV encodings
   follow below.

   Example of usage of EVRC:

      m=audio 49120 RTP/AVP 97
      a=rtpmap:97 EVRC
      a=fmtp:97 maxinterleave=2
      a=maxptime:80

   Example of usage of SMV0:

      m=audio 49122 RTP/AVP 99
      a=rtpmap:99 SMV0
      a=fmtp:99

   Note that the payload format (encoding) names are commonly shown in
   upper case. MIME subtypes are commonly shown in lower case. These
   names are case-insensitive in both places. Similarly, parameter
   names are case-insensitive both in MIME types and in the default
   mapping to the SDP a=fmtp attribute.

14. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [4], and any appropriate profile (for example [5]).
   This implies that confidentiality of the media streams is achieved
   by encryption. Because the data compression used with this payload
   format is applied end-to-end, encryption may be performed after
   compression so there is no conflict between the two operations.

   A potential denial-of-service threat exists for data encoding using
   compression techniques that have non-uniform receiver-end
   computational load. The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver
   to become overloaded. However, the encodings covered in this
   document do not exhibit any significant non-uniformity.

   As with any IP-based protocol, in some circumstances, a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired.
   Network-layer authentication may be used to discard packets from
   undesired sources, but the processing cost of the authentication
   itself may be too high. In a multicast environment, pruning of
   specific sources may be implemented in future versions of IGMP [7]
   and in multicast routing protocols to allow a receiver to select
   which sources are allowed to reach it.

   Interleaving may affect encryption. Depending on the encryption
   scheme used, there may be restrictions on, for example, the time
   when keys can be changed. Specifically, the key change may need to
   occur at the boundary between interleave groups.

15. Adding Support of Other Frame-Based Vocoders

   As described above, the RTP packet format defined in this document
   is very flexible and designed to be usable by other frame-based
   vocoders.

   Additional vocoders using this format MUST have properties as
   described in Section 3.3.

   For an eligible vocoder to use the payload format mechanisms defined
   in this document, a new RTP payload format document needs to be
   published as a standards track RFC. That document can simply refer
   to this document and then specify the following parameters:

   o  Define the unit used for the RTP time stamp;
   o  Define the meaning of the Mode Request bits;
   o  Define corresponding codec data frame type values for ToC;
   o  Define the conversion procedure for the vocoder's output data
      frames;
   o  Define a magic number for storage format, and complete the
      corresponding MIME registration.

16. Acknowledgements

   The following authors have made significant contributions to this
   document: Adam H. Li, John D. Villasenor, Dong-Seek Park, Jeong-Hoon
   Park, Keith Miller, S. Craig Greer, David Leon, Nikolai Leung,
   Marcello Lioy, Kyle J. McKay, Magdalena L. Espelien, Randall
   Gellens, Tom Hiller, Peter J. McCann, Stinson S. Mathai, Michael D.
   Turner, Ajay Rajkumar, Dan Gal, Magnus Westerlund, Lars-Erik
   Jonsson, Greg Sherwood, and Thomas Zeng.

17. References

   [1] 3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service
       Option 3 for Wideband Spread Spectrum Digital Systems", January
       1997.

   [2] 3GPP2 C.S0030-0 v2.0, "Selectable Mode Vocoder, Service Option
       for Wideband Spread Spectrum Communication Systems", May 2002.

   [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

   [4] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
       "RTP: A Transport Protocol for Real-Time Applications", RFC
       1889, January 1996.

   [5] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
       with Minimal Control", RFC 1890, January 1996.

   [6] Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

   [7] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC
       1112, August 1989.

18. Authors' Address

   The editor will serve as the point of contact for technical issues.

   Adam H. Li
   Image Communication Lab
   Electrical Engineering Department
   University of California
   Los Angeles, CA 90095
   USA
   Phone: +1 310 825 5178
   Email: adamli@icsl.ucla.edu