idnits 2.17.1

draft-ietf-avt-evrc-smv-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 400 instances of too long lines in the document, the longest
     one being 6 characters in excess of 72.

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document. If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  ** Obsolete normative reference: RFC 1889 (ref. '4') (Obsoleted by RFC 3550)

  ** Obsolete normative reference: RFC 1890 (ref. '5') (Obsoleted by RFC 3551)

  ** Obsolete normative reference: RFC 2327 (ref. '6') (Obsoleted by RFC 4566)


     Summary: 7 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Internet Draft                                            Adam H. Li
3    draft-ietf-avt-evrc-smv-01.txt                                  UCLA
4    May 16, 2002                                                  Editor
5    Expires: November 16, 2002

7                RTP Payload Format for EVRC and SMV Vocoders

9    STATUS OF THIS MEMO

11      This document is an Internet-Draft and is in full conformance with
12      all provisions of Section 10 of RFC 2026.

14      Internet-Drafts are working documents of the Internet Engineering
15      Task Force (IETF), its areas, and its working groups. Note that other
16      groups may also distribute working documents as Internet-Drafts.

18      Internet-Drafts are draft documents valid for a maximum of six months
19      and may be updated, replaced, or obsoleted by other documents at any
20      time. It is inappropriate to use Internet-Drafts as reference
21      material or to cite them other than as work in progress.

23      The list of current Internet-Drafts can be accessed at
24         http://www.ietf.org/ietf/1id-abstracts.txt

26      The list of Internet-Draft Shadow Directories can be accessed at
27         http://www.ietf.org/shadow.html.

29   ABSTRACT

31      This document describes the RTP payload format for Enhanced Variable
32      Rate Codec (EVRC) Speech and Selectable Mode Vocoder (SMV) Speech.
33      Two sub-formats are specified for different application scenarios. A
34      bundled/interleaved format is included to reduce the effect of packet
35      loss on speech quality and amortize the overhead of the RTP header
36      over more than one speech frame. A non-bundled format is also
37      supported for conversational applications.

39   Table of Contents

41      1. Introduction ................................................... 2
42      2. Background ..................................................... 2
43      3. The Codecs Supported ........................................... 3
44      3.1. EVRC ......................................................... 3
45      3.2. SMV .......................................................... 3
46      3.3. Other Frame-Based Vocoders ................................... 4
47      4. RTP/Vocoder Packet Format ...................................... 4
48      4.1. Type 1 Interleaved/Bundled Packet Format ..................... 4
49      4.2. Type 2 Header-Free Packet Format ............................. 6
50      4.3. Determining the Format of Packets ............................ 6
51      5. Packet Table of Contents Entries and Codec Data Frame Format ... 7
52      5.1. Packet Table of Contents entries ............................. 7
53      5.2. Codec Data Frames ............................................ 8
54      6. Interleaving Codec Data Frames in Type 1 Packets ............... 9
55      6.1. Finding Interleave Group Boundaries ......................... 11
56      6.2. Additional Receiver Responsibilities ........................ 11
57      7. Bundling Codec Data Frames in Type 1 Packets .................. 11
58      8. Handling Missing Codec Data Frames ............................ 12
59      9. Implementation Issues ......................................... 12
60      9.1. Interleaving Length ......................................... 12
61      9.2. Validation of Received Packets .............................. 12
62      10. Mode Request ................................................. 13
63      11.
Storage Mode ................................................. 13
64      12. IANA Considerations .......................................... 14
65      12.1. Registration of Media Type EVRC ............................ 14
66      12.2. Registration of Media Type EVRC0 ........................... 15
67      12.3. Registration of Media Type SMV ............................. 16
68      12.4. Registration of Media Type SMV0 ............................ 17
69      13. Mapping to SDP Parameters .................................... 17
70      14. Security Considerations ...................................... 18
71      15. Adding Support of Other Frame-Based Vocoders ................. 19
72      16. Acknowledgements ............................................. 19
73      17. References ................................................... 20
74      18. Authors' Address ............................................. 20

76   1. Introduction

78      This document describes how speech compressed with EVRC [1] or SMV
79      [2] may be formatted for use as an RTP payload type. The format is
80      also extensible to other codecs that generate a similar set of frame
81      types. Two methods are provided to packetize the codec data frames
82      into RTP packets: an interleaved/bundled format and a zero-header
83      format. The sender may choose the best format for each application
84      scenario, based on network conditions, bandwidth availability, delay
85      requirements, and packet-loss tolerance.

87      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
88      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
89      document are to be interpreted as described in RFC 2119 [3].

91   2. Background

93      The 3rd Generation Partnership Project 2 (3GPP2) has published two
94      standards which define speech compression algorithms for CDMA
95      applications: EVRC [1] and SMV [2]. EVRC is currently deployed in
96      millions of first and second generation CDMA handsets.
SMV is the 97 preferred speech codec standard for CDMA2000, and will be deployed in 98 third generation handsets in addition to EVRC. Improvements and new 99 codecs will keep emerging as technology improves, and future handsets 100 will likely support multiple codecs. 102 The formats of the EVRC and SMV codec frames are very similar. Many 103 other vocoders also share common characteristics, and have many 104 similar application scenarios. This parallelism enables an RTP 105 payload format to be designed for EVRC and SMV that may also support 106 other, similar vocoders with minimal additional specification work. 107 This can simplify the protocol for transporting vocoder data frames 108 through RTP and reduce the complexity of implementations. 110 3. The Codecs Supported 112 3.1. EVRC 114 The Enhanced Variable Rate Codec (EVRC) [1] compresses each 20 115 milliseconds of 8000 Hz, 16-bit sampled speech input into output 116 frames in one of the three different sizes: Rate 1 (171 bits), Rate 117 1/2 (80 bits), or Rate 1/8 (16 bits). In addition, there are two zero 118 bit codec frame types: null frames and erasure frames. Null frames 119 are produced as a result of the vocoder running at rate 0. Null 120 frames are zero bits long and are normally not transmitted. Erasure 121 frames are the frames substituted by the receiver to the codec for 122 the lost or damaged frames. Erasure frames are also zero bits long 123 and are normally not transmitted. 125 The codec chooses the output frame rate based on analysis of the 126 input speech and the current operating mode (either normal or one of 127 several reduced rate modes). For typical speech patterns, this 128 results in an average output of 4.2 kilobits/second for normal mode 129 and a lower average output for reduced rate modes. 131 3.2. 
SMV

133     The Selectable Mode Vocoder (SMV) [2] compresses each 20 milliseconds
134     of 8000 Hz, 16-bit sampled speech input into output frames of one of
135     the four different sizes: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate
136     1/4 (40 bits), or Rate 1/8 (16 bits). In addition, there are two zero
137     bit codec frame types: null frames and erasure frames. Null frames
138     are produced as a result of the vocoder running at rate 0. Null
139     frames are zero bits long and are normally not transmitted. Erasure
140     frames are the frames substituted by the receiver to the codec for
141     the lost or damaged frames. Erasure frames are also zero bits long
142     and are normally not transmitted.

144     The SMV codec can operate in four modes. Each mode may produce frames
145     of any of the rates (full rate to 1/8 rate) for varying percentages
146     of time, based on the characteristics of the speech samples and the
147     selected mode. The SMV mode can change on a frame-by-frame basis. The
148     SMV codec does not need additional information other than the codec
149     data frames to correctly decode the data of various modes; therefore,
150     the mode of the encoder does not need to be transmitted with the
151     encoded frames.

153     The percentages of the different frame rates for the four SMV modes
154     are shown in the table below.

156                 Mode 0      Mode 1      Mode 2      Mode 3
157     -------------------------------------------------------------
158     Rate 1      68.90%      38.14%      15.43%      07.49%
159     Rate 1/2    06.03%      15.82%      38.34%      46.28%
160     Rate 1/4    00.00%      17.37%      16.38%      16.38%
161     Rate 1/8    25.07%      28.67%      29.85%      29.85%

163     The SMV codec chooses the output frame rate based on an analysis of
164     the input speech and the current operating mode. For typical speech
165     patterns, this results in an average output of 4.2 kilobits/second
166     for Mode 0 in two-way conversation (assuming 50% active speech time
167     and 50% in eighth rate while listening) and lower for other reduced
168     rate modes.

170     SMV is more bandwidth efficient than EVRC.
EVRC is equivalent in
171     performance to SMV mode 1.

173  3.3. Other Frame-Based Vocoders

175     Other frame-based vocoders can be carried in the packet format
176     defined in this document, as long as they possess the following
177     properties:

179     o  The codec is frame-based;
180     o  blank and erasure frames are supported;
181     o  the total number of rates is less than 17;
182     o  the maximum full rate frame can be transported in a single RTP
183        packet using this specific format.

185     Vocoders with the characteristics listed above can be transported
186     using the packet format specified in this document with some
187     additional specification work; the pieces that must be defined are
188     listed in Section 15.

190  4. RTP/Vocoder Packet Format

192     In the packet format diagrams shown in this document, bit 0 is the
193     most significant bit. The vocoder speech data MUST be transmitted in
194     RTP packets of one of the following two types.

196  4.1. Type 1 Interleaved/Bundled Packet Format

198     This format is used to send one or more vocoder frames per packet.
199     Interleaving or bundling MAY be used. The RTP packet for this format
200     is as follows:

202      0                   1                   2                   3
203      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
204     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
205     |                        RTP Header [4]                         |
206     +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
207     |R|R| LLL | NNN | FFF |  Count  |  TOC  |  ...  |  TOC  |padding|
208     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
209     |       one or more codec data frames, one per TOC entry        |
210     |                             ....                              |
211     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

213     The RTP header has the expected values as described in the RTP
214     specification [4]. The RTP timestamp is in 1/8000 of a second units
215     for EVRC and SMV. For any other vocoders that use this packet format,
216     the timestamp unit needs to be defined explicitly.
The M bit should
217     be set as specified in the applicable RTP profile, for example, RFC
218     1890 [5]. Note that RFC 1890 [5] specifies that if the sender does
219     not suppress silence, the M bit will always be zero. When multiple
220     codec data frames are present in a single RTP packet, the timestamp
221     is that of the oldest data represented in the RTP packet. The
222     assignment of an RTP payload type for this new packet format is
223     outside the scope of this document; it is specified by the RTP
224     profile under which this payload format is used.

225     The first octet of a Type 1 Interleaved/Bundled format packet is the
226     Interleave Octet. The second octet contains the Mode Request and
227     Frame Count fields. The Table of Contents (ToC) field then follows.
228     The fields are specified as follows:

230     Reserved (RR): 2 bits
231        Reserved bits. MUST be set to zero by sender, SHOULD be ignored
232        by receiver.

234     Interleave Length (LLL): 3 bits
235        Indicates the length of interleave; a value of 0 indicates
236        bundling, a special case of interleaving. See Section 6 and
237        Section 7 for more detailed discussion.

239     Interleave Index (NNN): 3 bits
240        Indicates the index within an interleave group. MUST have a value
241        less than or equal to the value of LLL. Values of NNN greater
242        than the value of LLL are invalid. Packets with invalid NNN values
243        SHOULD be ignored by the receiver.

245     Mode Request (FFF): 3 bits
246        The Mode Request field is used to signal Mode Request
247        information. See Section 10 for details.

249     Frame Count (Count): 5 bits
250        The number of ToC fields (and vocoder frames) present in the
251        packet is the value of the frame count field plus one. A value of
252        zero indicates that the packet contains one ToC field, while a
253        value of 31 indicates that the packet contains 32 ToC fields.

255     Padding (padding): 0 or 4 bits
256        This padding ensures that codec data frames start on an octet
257        boundary.
When the number of ToC entries (the Frame Count field value plus one)
258        is odd, the sender MUST add 4 bits of padding following the last
259        TOC. When the number of ToC entries is even, the sender MUST NOT
260        add padding bits. If padding is present, the padding bits MUST be
261        set to zero by sender, and SHOULD be ignored by receiver.

263     The Table of Contents field (ToC) provides information on the codec
264     data frame(s) in the packet. There is one ToC entry for each codec
265     data frame. The detailed formats of the ToC field and codec data
266     frames are specified in Section 5.

268     Multiple data frames may be included within a Type 1
269     Interleaved/Bundled packet using interleaving or bundling as
270     described in Section 6 and Section 7.

272  4.2. Type 2 Header-Free Packet Format

274     The Type 2 Header-Free Packet Format is designed for maximum
275     bandwidth efficiency and low latency. Only one codec data frame can
276     be sent in each Type 2 Header-Free format packet. Neither the payload
277     header fields (LLL, NNN, FFF, Count) nor ToC entries are present. The
278     codec rate for the data frame can be determined from the length of
279     the codec data frame, since there is only one codec data frame in
280     each Type 2 Header-Free packet.

282     Use of the RTP header fields for Type 2 Header-Free RTP/Vocoder
283     Packet Format is the same as described in Section 4.1 for Type 1
284     Interleaved/Bundled RTP/Vocoder Packet Format. The detailed format of
285     the codec data frame is specified in Section 5.

287      0                   1                   2                   3
288      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
289     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
290     |                        RTP Header [4]                         |
291     +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
292     |                                                               |
293     +           ONLY one codec data frame           +-+-+-+-+-+-+-+-+
294     |                                               |
295     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

297  4.3. Determining the Format of Packets

299     All receivers SHOULD be able to process both types of packets. The
300     sender MAY choose to use one or both types of packets.
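
As an illustration of what processing both packet types involves, the following is a minimal Python sketch, not part of the payload format. It parses the Type 1 payload header laid out in Section 4.1 and infers the frame type of a Type 2 payload from its length as described in Section 4.2. The function names and the returned dictionary are hypothetical. The frame sizes are the EVRC/SMV values from the table in Section 5.1; an EVRC-only receiver would additionally treat frame type 2 as reserved. Raising an error for an invalid packet is a choice of this sketch; a receiver would normally just ignore or discard such a packet.

```python
# Sketch only: function and key names are illustrative assumptions.

# Codec data frame size in octets for each EVRC/SMV ToC frame type
# (values from the table in Section 5.1).
FRAME_OCTETS = {0: 0, 1: 2, 2: 5, 3: 10, 4: 22, 5: 0}


def parse_type1(payload: bytes) -> dict:
    """Parse a Type 1 Interleaved/Bundled payload (RTP header removed)."""
    if len(payload) < 3:
        raise ValueError("too short for a Type 1 payload header")
    lll = (payload[0] >> 3) & 0x07       # Interleave Length, bits 2-4
    nnn = payload[0] & 0x07              # Interleave Index, bits 5-7
    fff = payload[1] >> 5                # Mode Request, first 3 bits
    n_frames = (payload[1] & 0x1F) + 1   # Count field plus one
    if nnn > lll:
        raise ValueError("NNN greater than LLL is invalid")
    toc_octets = (n_frames + 1) // 2     # 4-bit entries, padded to an octet
    if len(payload) < 2 + toc_octets:
        raise ValueError("too short for the signalled ToC")
    toc = []
    for i in range(n_frames):
        octet = payload[2 + i // 2]
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    frames, pos = [], 2 + toc_octets
    for frame_type in toc:
        if frame_type not in FRAME_OCTETS:
            raise ValueError("reserved frame type")
        size = FRAME_OCTETS[frame_type]
        frames.append(payload[pos:pos + size])
        pos += size
    if pos != len(payload):
        raise ValueError("payload length does not match the ToC")
    return {"lll": lll, "nnn": nnn, "fff": fff, "toc": toc, "frames": frames}


def type2_frame_type(payload: bytes) -> int:
    """Infer the frame type of a Type 2 Header-Free payload from its length."""
    by_length = {size: ft for ft, size in FRAME_OCTETS.items() if size}
    if len(payload) not in by_length:
        raise ValueError("length matches no EVRC/SMV frame size")
    return by_length[len(payload)]
```

Which of the two functions applies to a given packet cannot be decided from the payload itself; it follows from the RTP payload type, whose meaning is signaled out of band.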
302     A receiver MUST have prior knowledge of the packet type to correctly
303     decode the RTP packets. The packet types used in an RTP session MUST
304     be specified by the sender, and signaled through out-of-band means,
305     for example by SDP during the setup of a session.

307     When packets of both formats are used within the same session,
308     different RTP payload type values MUST be used for each format to
309     distinguish the packet formats. The association of payload type
310     number with the packet format is done out-of-band, for example by SDP
311     during the setup of a session.

313  5. Packet Table of Contents Entries and Codec Data Frame Format

315  5.1. Packet Table of Contents entries

317     Each codec data frame in a Type 1 Interleaved/Bundled packet has a
318     corresponding Table of Contents (ToC) entry. The ToC entry indicates
319     the rate of the codec frame. (Type 2 Header-Free packets MUST NOT
320     have a ToC field, and there is always only one codec data frame in
321     each Type 2 Header-Free packet.)

323     Each ToC entry occupies four bits. The format of the bits is
324     indicated below:

326      0 1 2 3
327     +-+-+-+-+
328     |fr type|
329     +-+-+-+-+

331     Frame Type: 4 bits
332        The frame type indicates the type of the corresponding codec data
333        frame in the RTP packet.

335     For EVRC and SMV codecs, the frame type values and size of the
336     associated codec data frame are described in the table below:

338     Value   Rate      Total codec data frame size (in octets)
339     ---------------------------------------------------------
340       0     Blank      0    (0 bits)
341       1     1/8        2    (16 bits)
342       2     1/4        5    (40 bits; not valid for EVRC)
343       3     1/2       10    (80 bits)
344       4     1         22    (171 bits; 5 padded at end with zeros)
345       5     Erasure    0    (SHOULD NOT be transmitted by sender)

347     All values not listed in the above table MUST be considered reserved.
348     A ToC entry with a reserved Frame Type value SHOULD be considered
349     invalid.
Note that the EVRC codec does not have 1/4 rate frames, thus
350     frame type value 2 MUST be considered a reserved value when the EVRC
351     codec is in use.

353     Other vocoders that use this packet format need to specify their own
354     table of frame types and corresponding codec data frames.

356  5.2. Codec Data Frames

358     The output of the vocoder MUST be converted into codec data frames
359     for inclusion in the RTP payload. The conversions for EVRC and SMV
360     codecs are specified below. (Note: Because the EVRC codec does not
361     have Rate 1/4 frames, the specification of Rate 1/4 frames does not
362     apply to EVRC codec data frames.) Other vocoders that use this packet
363     format need to specify how to convert vocoder output data into
364     frames.

366     The codec output data bits as numbered in EVRC and SMV are packed
367     into octets. The lowest numbered bit (bit 1 for Rate 1, Rate 1/2,
368     Rate 1/4 and Rate 1/8) is placed in the most significant bit
369     (internet bit 0) of octet 1 of the codec data frame, the second
370     lowest bit is placed in the second most significant bit of the first
371     octet, the third lowest in the third most significant bit of the
372     first octet, and so on. This continues until all of the bits have
373     been placed in the codec data frame.

375     The remaining unused bits of the last octet of the codec data frame
376     MUST be set to zero. Note that in EVRC and SMV this is only
377     applicable to Rate 1 frames (171 bits) as the Rate 1/2 (80 bits),
378     Rate 1/4 (40 bits, SMV only) and Rate 1/8 frames (16 bits) fit
379     exactly into a whole number of octets.

381     Following is a detailed listing showing a Rate 1 EVRC/SMV codec
382     output frame converted into a codec data frame:

384     The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 octets
385     long. Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are
386     placed as indicated, with bits marked with "Z" set to zero.
EVRC/SMV
387     codec Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted similarly,
388     but do not require zero padding because they align on octet
389     boundaries.

391                          Rate 1 codec data frame
392      0                   1                   2                   3
393      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
394     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
395     |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
396     |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
397     |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
398     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
399     :                                                               :
400     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
401     |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
402     |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
403     |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
404     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

406  6. Interleaving Codec Data Frames in Type 1 Packets

408     As indicated in Section 4.1, more than one codec data frame MAY be
409     included in a single Type 1 Interleaved/Bundled packet by a sender.
410     This is accomplished by interleaving or bundling.

412     Bundling is used to spread the transmission overhead of the RTP and
413     payload header over multiple vocoder frames. Interleaving
414     additionally reduces the listener's perception of data loss by
415     spreading such loss over non-consecutive vocoder frames. EVRC, SMV,
416     and similar vocoders are able to compensate for an occasional lost
417     frame, but speech quality degrades exponentially with consecutive
418     frame loss.

420     Bundling is signaled by setting the LLL field to zero and the Count
421     field to greater than zero. Interleaving is indicated by setting the
422     LLL field to a value greater than zero.
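
The signaling rule in the preceding paragraph amounts to a small decision table over the LLL and Count fields. The Python sketch below is one way to express it; the function name and the returned labels are hypothetical, and the label "single" (LLL = 0 and Count = 0, one frame per packet) is a name chosen here for the case that is neither bundling nor interleaving.

```python
def packing_mode(lll: int, count: int) -> str:
    """Classify a Type 1 packet from its LLL and Count field values."""
    if not 0 <= lll <= 7 or not 0 <= count <= 31:
        raise ValueError("LLL is a 3-bit field and Count a 5-bit field")
    if lll > 0:
        return "interleaved"   # LLL greater than zero
    if count > 0:
        return "bundled"       # LLL zero, more than one frame in the packet
    return "single"            # LLL zero, exactly one frame in the packet
```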
424     The discussions on general interleaving apply to bundling (which
425     can be viewed as a reduced case of interleaving) with reduced
426     complexity. The bundling case is discussed in detail in Section 7.

428     Senders MAY support interleaving and/or bundling. All receivers MUST
429     support interleaving and bundling.

431     Given a time-ordered sequence of output frames from the codec
432     numbered 0..n, a bundling value B (the value in the Count field plus
433     one), and an interleave length L where n = B * (L+1) - 1, the output
434     frames are placed into RTP packets as follows (the values of the
435     fields LLL and NNN are indicated for each RTP packet):

437     First RTP Packet in Interleave group:
438        LLL=L, NNN=0
439        Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total of
440        B frames

442     Second RTP Packet in Interleave group:
443        LLL=L, NNN=1
444        Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
445        total of B frames

447     This continues to the last RTP packet in the interleave group:

449     (L+1)th RTP Packet in Interleave group:
450        LLL=L, NNN=L
451        Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
452        total of B frames

454     Within each interleave group, the RTP packets making up the
455     interleave group MUST be transmitted in value-increasing order of the
456     NNN field. While this does not guarantee reduced end-to-end delay on
457     the receiving end, when packets are delivered in order by the
458     underlying transport, delay will be reduced to the minimum possible.

460     Receivers MAY signal the maximum number of codec data frames (i.e.,
461     the maximum acceptable bundling value B) they can handle in a single
462     RTP packet using the OPTIONAL maxptime RTP mode parameter identified
463     in Section 12.

465     Receivers MAY signal the maximum interleave length (i.e., the maximum
466     acceptable LLL value in the Interleaving Octet) they will accept
467     using the OPTIONAL maxinterleave RTP mode parameter identified in
468     Section 12.
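
The frame placement rule given earlier in this section (packet NNN carries frames NNN, NNN+(L+1), NNN+2(L+1), ...) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names and the dictionary representation of a packet are hypothetical, frames may be any Python objects, and the receiver side simply leaves a placeholder where a packet of the group is missing.

```python
def interleave(frames, bundle, length):
    """Place B*(L+1) time-ordered frames into L+1 packets of B frames each."""
    if len(frames) != bundle * (length + 1):
        raise ValueError("an interleave group holds exactly B * (L + 1) frames")
    return [
        {"lll": length, "nnn": nnn, "count": bundle - 1,
         "frames": frames[nnn::length + 1]}   # every (L+1)th frame from nnn
        for nnn in range(length + 1)          # transmitted in NNN order
    ]


def deinterleave(packets, bundle, length, erasure=None):
    """Restore time order; slots of missing packets keep the placeholder."""
    out = [erasure] * (bundle * (length + 1))
    for packet in packets:
        for k, frame in enumerate(packet["frames"]):
            out[packet["nnn"] + k * (length + 1)] = frame
    return out
```

For example, with B = 2 and L = 2, frames 0..5 are sent as (0, 3), (1, 4), (2, 5). If the second packet is lost, the receiver is left with isolated gaps at positions 1 and 4 instead of two consecutive missing frames.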
470     The parameters maxptime and maxinterleave are exchanged at the
471     initial setup of the session. In one-to-one sessions, the sender MUST
472     respect these values set by the receiver, and MUST NOT
473     interleave/bundle more packets than the receiver signals that it
474     can handle. This ensures that the receiver can allocate a known
475     amount of buffer space that will be sufficient for all
476     interleaving/bundling used in that session. During the session, the
477     sender may decrease the bundling value or interleaving length (so
478     that less buffer space is required at the receiver), but never exceed
479     the maximum value set by the receiver. This prevents the situation
480     where a receiver needs to allocate more buffer space in the middle of
481     a session but is unable to do so.

483     Additionally, senders have the following restrictions:

485     o  MUST NOT bundle more codec data frames in a single RTP packet than
486        indicated by maxptime (see Section 12) if it is signaled.

488     o  SHOULD NOT bundle more codec data frames in a single RTP packet
489        than will fit in the MTU of the underlying network.

491     o  Once beginning a session with a given maximum interleaving value
492        set by maxinterleave in Section 12, MUST NOT increase the
493        interleaving value (LLL) to exceed the maximum interleaving value
494        that is signaled.

496     o  MAY change the interleaving value, but MUST do so only between
497        interleave groups.

499     o  Silence suppression MAY only be used between interleave groups. A
500        ToC with Frame Type 0 (Blank Frame, Section 5.1) MUST be used
501        within interleaving groups if the codec outputs a blank frame.
502        The M bit in the RTP header is not set for these blank frames,
503        as the stream is continuous in time. Because there is only one
504        time stamp for each RTP packet, silence suppression used within
505        an interleave group would cause ambiguities when reconstructing
506        the speech at the receiver side, and thus is prohibited.

508  6.1.
Finding Interleave Group Boundaries 510 Given an RTP packet with sequence number S, interleave length (field 511 LLL) L, interleave index value (field NNN) N, and bundling value B, 512 the interleave group consists of this RTP packet and other RTP 513 packets with sequence numbers from S-N mod 65536 to S-N+L mod 65536 514 inclusive. In other words, the interleave group always consists of 515 L+1 RTP packets with sequential sequence numbers. The bundling value 516 for all RTP packets in an interleave group MUST be the same. 518 The receiver determines the expected bundling value for all RTP 519 packets in an interleave group by the number of codec data frames 520 bundled in the first RTP packet of the interleave group received. 521 Note that this may not be the first RTP packet of the interleave 522 group if packets are delivered out of order by the underlying 523 transport. 525 6.2. Additional Receiver Responsibilities 527 Assume that the receiver has begun playing frames from an interleave 528 group. The time has come to play frame x from packet n of the 529 interleave group. Further assume that packet n of the interleave 530 group has not been received. As described in Section 8, an erasure 531 frame will be sent to the receiving vocoder. 533 Now, assume that packet n of the interleave group arrives before 534 frame x+1 of that packet is needed. Receivers SHOULD use frame x+1 of 535 the newly received packet n rather than substituting an erasure 536 frame. In other words, just because packet n was not available the 537 first time it was needed to reconstruct the interleaved speech, the 538 receiver SHOULD NOT assume it is not available when it is 539 subsequently needed for interleaved speech reconstruction. 541 7. Bundling Codec Data Frames in Type 1 Packets 543 As discussed in Section 6, the bundling of codec data frames is a 544 special reduced case of interleaving with LLL value in the Interleave 545 Octet set to 0. 
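
Because bundling is interleaving with LLL = 0, the group-boundary rule of Section 6.1 covers both cases: for a bundled packet the group is the packet itself. The following Python sketch computes the first and last sequence numbers of the group a packet belongs to. The function name is hypothetical; the arithmetic is the modulo-65536 rule from Section 6.1.

```python
def interleave_group_range(seq: int, lll: int, nnn: int) -> tuple:
    """Return (first, last) RTP sequence numbers of the interleave group."""
    if not 0 <= nnn <= lll <= 7:
        raise ValueError("NNN must not exceed LLL, and LLL is a 3-bit field")
    first = (seq - nnn) % 65536          # packet with NNN = 0
    last = (seq - nnn + lll) % 65536     # packet with NNN = LLL
    return first, last
```

Note that the range can wrap around the 16-bit sequence number space, in which case the first value is numerically larger than the last.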
547     Bundling codec data frames means that multiple data frames are
548     included consecutively in a packet, because the interleaving length
549     (LLL) is 0. The interleaving group is thus reduced to a single RTP
550     packet, and the reconstruction of the codec data frames from RTP
551     packets becomes a much simpler process.

553     Furthermore, the additional restrictions on senders are reduced to:

555     o  MUST NOT bundle more codec data frames in a single RTP packet than
556        indicated by maxptime (see Section 12) if it is signaled.

558     o  SHOULD NOT bundle more codec data frames in a single RTP packet
559        than will fit in the MTU of the underlying network.

561  8. Handling Missing Codec Data Frames

563     The vocoders covered by this payload format support erasure frames as
564     an indication that frames are not available. The erasure frames are
565     normally used internally by a receiver to advance the state of the
566     voice decoder by exactly one frame time for each missing frame. Using
567     the information from packet sequence number, time stamp, and the M
568     bit, the receiver can detect missing codec data frames from RTP
569     packet loss and/or silence suppression, and generate corresponding
570     erasure frames. Erasure frames MUST also be used in storage mode to
571     record missing frames.

573  9. Implementation Issues

575  9.1. Interleaving Length

577     The vocoder interpolates the missing speech content when given an
578     erasure frame. However, the best quality is perceived by the listener
579     when erasure frames are not consecutive. This makes interleaving
580     desirable as it increases speech quality when packet loss occurs.

582     On the other hand, interleaving can greatly increase the end-to-end
583     delay. Where an interactive session is desired, either Type 1
584     Interleaved/Bundled with interleaving length (field LLL) 0 or Type 2
585     Header-Free RTP payload types are RECOMMENDED.
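
To give a rough sense of the delay cost, the amount of speech covered by one interleave group follows directly from the group size of B * (L+1) frames given in Section 6 and the 20 millisecond frame duration of EVRC and SMV. The short Python sketch below computes that span; the function name is hypothetical, and the result is only the speech time represented by a group, not a complete end-to-end delay model (network and playout buffering are not considered).

```python
FRAME_MS = 20  # duration of one EVRC or SMV frame in milliseconds


def group_span_ms(bundle: int, length: int) -> int:
    """Speech time covered by one interleave group of B*(L+1) frames."""
    if bundle < 1 or not 0 <= length <= 7:
        raise ValueError("need B >= 1 and 0 <= L <= 7")
    return bundle * (length + 1) * FRAME_MS
```

Under these assumptions a single-frame packet spans 20 ms, whereas a group with B = 2 and L = 4 spans 200 ms of speech.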
587 When end-to-end delay is not a primary concern, an interleaving 588 length (field LLL) of 4 or 5 is RECOMMENDED as it offers a reasonable 589 compromise between robustness and latency. 591 9.2. Validation of Received Packets 593 When receiving an RTP packet, the receiver SHOULD check the validity 594 of the ToC fields and match the length of the packet with what is 595 indicated by the ToC fields. If any invalidity or mismatch is 596 detected, it is RECOMMENDED to discard the received packet to avoid 597 potential severe degradation of the speech quality. The discarded 598 packet is treated following the same procedure as a lost packet, and 599 the discarded data will be replaced with erasure frames. 601 On receipt of an RTP packet with an invalid value of the LLL or NNN 602 fields, the RTP packet SHOULD be treated as lost by the receiver for 603 the purpose of generating erasure frames as described in Section 8. 605 On receipt of an RTP packet in an interleave group with other than 606 the expected frame count value, the receiver MAY discard codec data 607 frames off the end of the RTP packet or add erasure codec data frames 608 to the end of the packet in order to manufacture a substitute packet 609 with the expected bundling value. The receiver MAY instead choose to 610 discard the whole interleave group. 612 10. Mode Request 614 The Mode Request signal requests a particular encoding mode for the 615 speech encoding in the reverse direction. All implementations are 616 RECOMMENDED to honor the Mode Request signal. The Mode Request signal 617 SHOULD only be used in one-to-one sessions. In multiparty sessions, 618 any received Mode Request signals SHOULD be ignored. 620 In addition, the Mode Request signal MAY also be sent through non-RTP 621 means, which is out of the scope of this specification. 623 The three-bit Mode Request field is used to signal the receiver to 624 set a particular encoding mode to its audio encoder. 
If the Mode Request field is set to a non-zero value in RTP packets
from node A to node B, it is a request for node B to change to the
requested encoding mode for its audio encoder and therefore the bit
rate of the RTP stream from node B to node A. Once a node sets this
field to a non-zero value it SHOULD continue to set the field to the
same value in subsequent packets until the requested mode has
changed. This design helps to eliminate the scenario of getting the
codec stuck in an unintended state if one of the packets that carries
the Mode Request is lost. An otherwise silent node MAY send an RTP
packet containing a blank frame in order to send a Mode Request.

Each codec type using this format SHOULD define its own
interpretation of the Mode Request field. Codecs SHOULD follow the
convention that higher values of the three-bit field correspond to an
equal or lower average output bit rate.

For the EVRC codec, the Mode Request field MUST be interpreted
according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec
specifications [1]. Values above '100' (4) are currently reserved.
If an unknown value above '100' (4) is received, it MUST be handled
as if '100' (4) were received, for interoperability with potential
future revisions.

For the SMV codec, the Mode Request field MUST be interpreted
according to Table 2.2-2 of the SMV codec specifications [2]. Values
above '101' (5) are currently reserved. If an unknown value above
'101' (5) is received, it MUST be handled as if '101' (5) were
received, also for interoperability with potential future revisions.

11. Storage Mode

The storage mode is used for storing speech frames, e.g., as a file
or e-mail attachment.

The file begins with a magic number to identify the vocoder that is
used.
The magic number for EVRC corresponds to the ASCII character
string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A" in
network byte order. The magic number for SMV corresponds to the ASCII
character string "#!SMV\n", i.e., "0x23 0x21 0x53 0x4D 0x56 0x0A" in
network byte order.

The codec data frames are stored in consecutive order, with a single
ToC entry field, extended to one octet, prefixing each codec data
frame. The ToC field is extended to one octet by setting the four
most significant bits of the octet to zero. For example, a ToC value
of 4 (a full-rate frame) is stored as 0x04.

Speech frames lost in transmission and non-received frames MUST be
stored as erasure frames (frame type 5, see definition in Section
5.1) to maintain synchronization with the original media.

12. IANA Considerations

Four new MIME sub-types as described in this section are to be
registered.

The MIME-names for the EVRC and SMV codecs are allocated from the
IETF tree since all the vocoders covered are expected to be widely
used for Voice-over-IP applications.

12.1. Registration of Media Type EVRC

Media Type Name: audio

Media Subtype Name: EVRC
   Type 1 Interleaved/Bundled packet format for EVRC

Required Parameter: none

Optional parameters:
   The following parameters apply to RTP mode only.

   ptime: Defined as usual for RTP audio [6].

   maxptime: The maximum amount of media which can be encapsulated
      in each packet, expressed as time in milliseconds. The time
      SHALL be calculated as the sum of the time the media present
      in the packet represents. The time SHOULD be a multiple of the
      duration of a single codec data frame (20 msec). If not
      signaled, the default maxptime value SHALL be 200
      milliseconds.

   maxinterleave: Maximum number for interleaving length (field LLL
      in the Interleaving Octet).
      The interleaving lengths used in the entire session MUST NOT
      exceed this maximum value. If not signaled, the maxinterleave
      length SHALL be 5.

Encoding considerations:
   For RTP mode, see Section 6 and Section 7 of RFC xxxx.
   For storage mode, see Section 11 of RFC xxxx.

Security considerations:
   See Section 14 "Security Considerations" of RFC xxxx.

Public specification:
   RFC xxxx.

Additional information:
   The following information applies to storage mode only.

   Magic number: #!EVRC\n
   File extensions: evc, EVC
   Macintosh file type code: none
   Object identifier or OID: none

Intended usage:
   COMMON. It is expected that many VoIP applications (as well as
   mobile applications) will use this type.

Person & email address to contact for further information:
   Adam Li
   adamli@icsl.ucla.edu

Author/Change controller:
   Adam Li
   adamli@icsl.ucla.edu
   IETF Audio/Video Transport Working Group

12.2. Registration of Media Type EVRC0

Media Type Name: audio

Media Subtype Name: EVRC0
   Type 2 Header-Free packet format for EVRC

Required Parameter: none

Optional parameters: none

Encoding considerations: none

Security considerations:
   See Section 14 "Security Considerations" of RFC xxxx.

Public specification:
   RFC xxxx.

Additional information: none

Intended usage:
   COMMON. It is expected that many VoIP applications (as well as
   mobile applications) will use this type.

Person & email address to contact for further information:
   Adam Li
   adamli@icsl.ucla.edu

Author/Change controller:
   Adam Li
   adamli@icsl.ucla.edu
   IETF Audio/Video Transport Working Group

12.3. Registration of Media Type SMV

Media Type Name: audio

Media Subtype Name: SMV
   Type 1 Interleaved/Bundled packet format for SMV

Required Parameter: none

Optional parameters:
   The following parameters apply to RTP mode only.

   ptime: Defined as usual for RTP audio [6].

   maxptime: The maximum amount of media which can be encapsulated
      in each packet, expressed as time in milliseconds. The time
      SHALL be calculated as the sum of the time the media present
      in the packet represents. The time SHOULD be a multiple of the
      duration of a single codec data frame (20 msec). If not
      signaled, the default maxptime value SHALL be 200
      milliseconds.

   maxinterleave: Maximum number for interleaving length (field LLL
      in the Interleaving Octet). The interleaving lengths used in
      the entire session MUST NOT exceed this maximum value. If not
      signaled, the maxinterleave length SHALL be 5.

Encoding considerations:
   For RTP mode, see Section 6 and Section 7 of RFC xxxx.
   For storage mode, see Section 11 of RFC xxxx.

Security considerations:
   See Section 14 "Security Considerations" of RFC xxxx.

Public specification:
   RFC xxxx.

Additional information:
   The following information applies to storage mode only.

   Magic number: #!SMV\n
   File extensions: smv, SMV
   Macintosh file type code: none
   Object identifier or OID: none

Intended usage:
   COMMON. It is expected that many VoIP applications (as well as
   mobile applications) will use this type.

Person & email address to contact for further information:
   Adam Li
   adamli@icsl.ucla.edu

Author/Change controller:
   Adam Li
   adamli@icsl.ucla.edu
   IETF Audio/Video Transport Working Group

12.4. Registration of Media Type SMV0

Media Type Name: audio

Media Subtype Name: SMV0
   Type 2 Header-Free packet format for SMV

Required Parameter: none

Optional parameters: none

Encoding considerations: none

Security considerations:
   See Section 14 "Security Considerations" of RFC xxxx.

Public specification:
   RFC xxxx.

Additional information: none

Intended usage:
   COMMON.
   It is expected that many VoIP applications (as well as
   mobile applications) will use this type.

Person & email address to contact for further information:
   Adam Li
   adamli@icsl.ucla.edu

Author/Change controller:
   Adam Li
   adamli@icsl.ucla.edu
   IETF Audio/Video Transport Working Group

13. Mapping to SDP Parameters

Please note that this section applies to the RTP mode only.

The information carried in the MIME media type specification has a
specific mapping to fields in the Session Description Protocol (SDP)
[6], which is commonly used to describe RTP sessions. When SDP is
used to specify sessions employing the EVRC or SMV codec, the mapping
is as follows:

o The MIME type ("audio") goes in SDP "m=" as the media name.

o The MIME subtype (payload format name) goes in SDP "a=rtpmap"
  as the encoding name.

o The parameters "ptime" and "maxptime" go in the SDP "a=ptime"
  and "a=maxptime" attributes, respectively.

o Any remaining parameters go in the SDP "a=fmtp" attribute by
  copying them directly from the MIME media type string as a
  semicolon separated list of parameter=value pairs.

Some examples of SDP session descriptions for EVRC and SMV encodings
follow below.

Example of usage of EVRC:

   m = audio 49120 RTP/AVP 97
   a = rtpmap:97 EVRC
   a = fmtp:97 maxinterleave=2
   a = maxptime:80

Example of usage of SMV:

   m = audio 49122 RTP/AVP 99
   a = rtpmap:99 SMV0
   a = fmtp:99

Note that the payload format (encoding) names are commonly shown in
upper case. MIME subtypes are commonly shown in lower case. These
names are case-insensitive in both places. Similarly, parameter names
are case-insensitive both in MIME types and in the default mapping to
the SDP a=fmtp attribute.

14. Security Considerations

RTP packets using the payload format defined in this specification
are subject to the security considerations discussed in the RTP
specification [4], and any appropriate profile (for example [5]).
This implies that confidentiality of the media streams is achieved by
encryption. Because the data compression used with this payload
format is applied end-to-end, encryption may be performed after
compression so there is no conflict between the two operations.

A potential denial-of-service threat exists for data encoding using
compression techniques that have non-uniform receiver-end
computational load. The attacker can inject pathological datagrams
into the stream which are complex to decode and cause the receiver to
become overloaded. However, the encodings covered in this document do
not exhibit any significant non-uniformity.

As with any IP-based protocol, in some circumstances, a receiver may
be overloaded simply by the receipt of too many packets, either
desired or undesired. Network-layer authentication may be used to
discard packets from undesired sources, but the processing cost of
the authentication itself may be too high. In a multicast
environment, pruning of specific sources may be implemented in
future versions of IGMP [7] and in multicast routing protocols to
allow a receiver to select which sources are allowed to reach it.

Interleaving MAY affect encryption. Depending on the encryption
scheme used, there MAY be restrictions on, for example, the time when
keys can be changed. Specifically, the key change may need to occur
at the boundary between interleave groups.

15. Adding Support of Other Frame-Based Vocoders

As described above, the RTP packet format defined in this document is
very flexible and designed to be usable by other frame-based
vocoders.

Additional vocoders using this format MUST have properties as
described in Section 3.3.

For an eligible vocoder to use the payload format mechanisms defined
in this document, a new RTP payload format document needs to be
published as an RFC. That document can simply refer to this document
and then specify the following parameters:

o Define the unit used for RTP time stamp;
o Define the meaning of the Mode Request bits;
o Define corresponding codec data frame type values for ToC;
o Define the conversion procedure for the vocoder's output data
  frames;
o Define a magic number for storage mode, and complete the
  corresponding MIME registration.

16. Acknowledgements

The following authors have made significant contributions to this
document: Adam H. Li, John D. Villasenor, Dong-Seek Park, Jeong-Hoon
Park, Keith Miller, S. Craig Greer, David Leon, Nikolai Leung,
Marcello Lioy, Kyle J. McKay, Magdalena L. Espelien, Randall Gellens,
Tom Hiller, Peter J. McCann, Stinson S. Mathai, Michael D. Turner,
Ajay Rajkumar, Dan Gal, Magnus Westerlund, Lars-Erik Jonsson, Greg
Sherwood, and Thomas Zeng.

17. References

[1] 3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service
    Option 3 for Wideband Spread Spectrum Digital Systems", January
    1997.

[2] C.S0030-0 v2.0, "Selectable Mode Vocoder, Service Option for
    Wideband Spread Spectrum Communication Systems", May 2002.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
    Levels", BCP 14, RFC 2119, March 1997.

[4] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
    "RTP: A Transport Protocol for Real-Time Applications", RFC
    1889, January 1996.

[5] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
    with Minimal Control", RFC 1890, January 1996.

[6] Handley, M. and V. Jacobson, "SDP: Session Description Protocol",
    RFC 2327, April 1998.

[7] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC
    1112, August 1989.

18. Authors' Address

The editor will serve as the point of contact for technical issues.

   Adam H. Li
   Image Communication Lab
   Electrical Engineering Department
   University of California
   Los Angeles, CA 90095
   USA
   Phone: +1 310 825 5178
   Email: adamli@icsl.ucla.edu