idnits 2.17.1

draft-ietf-avt-evrc-smv-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  ** Obsolete normative reference: RFC 1889 (ref. '4') (Obsoleted by RFC 3550)

  ** Obsolete normative reference: RFC 1890 (ref. '5') (Obsoleted by RFC 3551)

  ** Obsolete normative reference: RFC 2327 (ref. '6') (Obsoleted by RFC 4566)


     Summary: 6 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Internet Draft                                              Adam H. Li
draft-ietf-avt-evrc-smv-00.txt                                    UCLA
February 4, 2002                                                Editor
Expires: August 4, 2002


          An RTP Payload Format for EVRC and SMV Vocoders


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as work in progress.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

ABSTRACT

   This document describes the RTP payload format for Enhanced Variable
   Rate Codec (EVRC) Speech and Selectable Mode Vocoder (SMV) Speech.
   Two sub-formats are specified for different application scenarios. A
   bundled/interleaved format is included to reduce the effect of packet
   loss on speech quality and amortize the overhead of the RTP header
   over more than one speech frame. A non-bundled format is also
   supported for conversational applications.

Table of Contents

   1. Introduction
   2. Background
   3. The Codecs Supported
      3.1. EVRC
      3.2. SMV
      3.3. Other Frame-Based Vocoders
   4. RTP/Vocoder Packet Format
      4.1. Type 1 Interleaved/Bundled Packet Format
      4.2. Type 2 Header-Free Packet Format
      4.3. Detecting the Format of Packets
   5. Packet Table of Contents Entries and Codec Data Frame Format
      5.1. Packet Table of Contents entries
      5.2. Codec Data Frames
   6. Interleaving Codec Data Frames in Type 1 Packets
      6.1. Finding Interleave Group Boundaries
      6.2. Reconstructing Interleaved Speech
      6.3. Receiving Invalid Interleaving Values
      6.4. Additional Receiver Responsibilities
   7. Bundling Codec Data Frames in Type 1 Packets
   8. Handling Missing Codec Data Frames
   9. Implementation Issues
      9.1. Interleaving Length
      9.2.
      Mode Request
   10. IANA Considerations
      10.1. Storage Mode
      10.2. EVRC MIME Registration
      10.3. SMV MIME Registration
   11. Mapping to SDP Parameters
   12. Security Considerations
   13. Adding Support of Other Frame-Based Vocoders
   14. Acknowledgements
   15. References
   16. Authors' Address

1. Introduction

   This document describes how speech compressed with EVRC [1] or SMV
   [2] may be formatted for use as an RTP payload type. The format is
   also extensible to other codecs that generate a similar set of frame
   types. Two methods are provided to packetize the codec data frames
   into RTP packets: an interleaved/bundled format and a zero-header
   format. The sender may choose the best format for each application
   scenario, based on network conditions, bandwidth availability, delay
   requirements, and packet-loss tolerance.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].

2. Background

   The 3rd Generation Partnership Project 2 (3GPP2) has published two
   standards which define speech compression algorithms for CDMA
   applications: EVRC [1] and SMV [2]. EVRC is currently deployed in
   millions of first and second generation CDMA handsets. SMV is the
   preferred speech codec standard for CDMA2000, and will be deployed in
   third generation handsets in addition to EVRC.
   Improvements and new codecs will keep emerging as technology
   improves, and future handsets will likely support multiple codecs.

   The formats of the EVRC and SMV codec frames are very similar. Many
   other vocoders also share common characteristics, and have many
   similar application scenarios. This parallelism enables an RTP
   payload format to be designed for EVRC and SMV that may also support
   other, similar vocoders with minimal additional specification work.
   This can simplify the protocol for transporting vocoder data frames
   through RTP and reduce the complexity of implementations.

3. The Codecs Supported

3.1. EVRC

   The Enhanced Variable Rate Codec (EVRC) [1] compresses each 20
   milliseconds of 8000 Hz, 16-bit sampled speech input into an output
   frame of one of three sizes: Rate 1 (171 bits), Rate 1/2 (80 bits),
   or Rate 1/8 (16 bits). In addition, there are two zero-bit codec
   frame types: null frames and erasure frames. Null frames are
   produced as a result of the vocoder running at rate 0. Null frames
   are zero bits long and are normally not transmitted. Erasure frames
   are frames that the receiver substitutes, as input to the codec, for
   lost or damaged frames. Erasure frames are also zero bits long and
   are normally not transmitted.

   The codec chooses the output frame rate based on analysis of the
   input speech and the current operating mode (either normal or one of
   several reduced rate modes). For typical speech patterns, this
   results in an average output of 4.2 kilobits/second for normal mode
   and a lower average output for reduced rate modes.

3.2. SMV

   The Selectable Mode Vocoder (SMV) [2] compresses each 20 milliseconds
   of 8000 Hz, 16-bit sampled speech input into an output frame of one
   of four sizes: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate 1/4 (40
   bits), or Rate 1/8 (16 bits).
   In addition, there are two zero-bit codec frame types: null frames
   and erasure frames. Null frames are produced as a result of the
   vocoder running at rate 0. Null frames are zero bits long and are
   normally not transmitted. Erasure frames are frames that the
   receiver substitutes, as input to the codec, for lost or damaged
   frames. Erasure frames are also zero bits long and are normally not
   transmitted.

   The SMV codec can operate in four modes. Each mode may produce frames
   of any of the rates (full rate to 1/8 rate) for varying percentages
   of time, based on the characteristics of the speech samples and the
   selected mode. The SMV mode can change on a frame-by-frame basis. The
   SMV codec does not need additional information other than the codec
   data frames to correctly decode the data of various modes; therefore,
   the mode of the encoder does not need to be transmitted with the
   encoded frames.

   The percentage of frames at each rate and the average data rate
   (ADR) for the four SMV modes are shown in the table below.

                Mode 0      Mode 1      Mode 2      Mode 3
   -------------------------------------------------------------
   Rate 1       68.90%      38.14%      15.43%      07.49%
   Rate 1/2     06.03%      15.82%      38.34%      46.28%
   Rate 1/4     00.00%      17.37%      16.38%      16.38%
   Rate 1/8     25.07%      28.67%      29.85%      29.85%
   -------------------------------------------------------------
   ADR          7205 bps    5182 bps    4073 bps    3692 bps

   The SMV codec chooses the output frame rate based on an analysis of
   the input speech and the current operating mode. For typical speech
   patterns, this results in an average output of 4.2 kilobits/second
   for Mode 0 and lower for other reduced rate modes.

   SMV is more bandwidth efficient than EVRC. EVRC is equivalent in
   performance to SMV mode 1.

3.3.
Other Frame-Based Vocoders

   Other frame-based vocoders can be carried in the packet format
   defined in this document, as long as they possess the following
   properties:

   o  The codec is frame-based;
   o  blank and erasure frames are supported;
   o  the total number of rates is less than 17;
   o  the maximum full rate frame can be transported in a single RTP
      packet using this specific format.

   Vocoders with the characteristics listed above can be transported
   using the packet format specified in this document with some
   additional specification work; the pieces that must be defined are
   listed in Section 13.

4. RTP/Vocoder Packet Format

   The RTP payload data MUST be transmitted in packets of one of the
   following two types.

4.1. Type 1 Interleaved/Bundled Packet Format

   This format is used to send one or more vocoder frames per packet.
   Interleaving or bundling MAY be used. The RTP packet for this format
   is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |R|R| LLL | NNN | FFF |  Count  |  TOC  |  ...  |  TOC  |padding|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       one or more codec data frames, one per TOC entry        |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described in the RTP
   specification [4]. The RTP timestamp is in units of 1/8000 of a
   second for EVRC and SMV. For any other vocoders that use this packet
   format, the timestamp unit needs to be defined explicitly. The M bit
   should be set as specified in the applicable RTP profile, for
   example, RFC 1890 [5]. Note that RFC 1890 [5] specifies that if the
   sender does not suppress silence, the M bit will always be zero.
   When multiple codec data frames are present in a single RTP packet,
   the timestamp is, as always, that of the oldest data represented in
   the RTP packet. The assignment of an RTP payload type for this new
   packet format is outside the scope of this document, and will not be
   specified here. It is expected that the RTP profile for a particular
   class of applications will assign a payload type for this encoding,
   or if that is not done, then a payload type in the dynamic range
   shall be chosen by the sender.

   The first octet of a Type 1 Interleaved/Bundled format packet is the
   Interleave Octet. The second octet contains the Mode Request and
   Frame Count fields. The Table of Contents (ToC) field then follows.
   The fields are specified as follows:

   Reserved (RR): 2 bits
      Reserved bits. MUST be set to zero by sender, SHOULD be ignored
      by receiver.

   Interleave Length (LLL): 3 bits
      Indicates the length of interleave; a value of 0 indicates
      bundling, a special case of interleaving. See Section 6 and
      Section 7 for more detailed discussion.

   Interleave Index (NNN): 3 bits
      Indicates the index within an interleave group. MUST have a value
      less than or equal to the value of LLL. Values of NNN greater
      than the value of LLL are invalid. Packets with invalid NNN
      values SHOULD be ignored by the receiver.

   Mode Request (FFF): 3 bits
      The Mode Request field is used to signal Mode Request
      information. See Section 9.2 for details.

   Frame Count (Count): 5 bits
      Indicates the number of ToC fields (and therefore vocoder frames)
      present. A value of zero indicates that the packet contains one
      ToC field (and vocoder frame). A value of 31 indicates 32 ToC
      fields (and vocoder frames) are in the packet. The number of ToC
      fields (and vocoder frames) present is the value of the frame
      count field plus one.

   Padding (padding): 0 or 4 bits
      This padding ensures that codec data frames start on an octet
      boundary. When the number of ToC entries (the value of the Count
      field plus one) is odd, the sender MUST add 4 bits of padding
      following the last ToC entry. When the number of ToC entries is
      even, the sender MUST NOT add padding bits. If padding is
      present, the padding bits MUST be set to zero by sender, and
      SHOULD be ignored by receiver.

   The Table of Contents field (ToC) provides information on the codec
   data frame(s) in the packet. There is one ToC entry for each codec
   data frame. The detailed formats of the ToC field and codec data
   frames are specified in Section 5.

   Multiple data frames may be included within a Type 1
   Interleaved/Bundled packet using interleaving or bundling as
   described in Section 6 and Section 7.

4.2. Type 2 Header-Free Packet Format

   The Type 2 Header-Free Packet Format is designed for maximum
   bandwidth efficiency and low latency. Only one codec data frame can
   be sent in each Type 2 Header-Free format packet. Neither the payload
   header fields (LLL, NNN, FFF, Count) nor ToC entries are present. The
   codec rate for the data frame can be determined from the length of
   the codec data frame, since there is only one codec data frame in
   each Type 2 Header-Free packet.

   Use of the RTP header fields for Type 2 Header-Free RTP/Vocoder
   Packet Format is the same as described in Section 4.1 for Type 1
   Interleaved/Bundled RTP/Vocoder Packet Format. The detailed format of
   the codec data frame is specified in Section 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                                                               |
   +           ONLY one codec data frame           +-+-+-+-+-+-+-+-+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.
Detecting the Format of Packets

   All receivers MUST be able to process both types of packets. The
   sender MAY choose to use one or both types of packets.

   A receiver MUST have prior knowledge of the packet type to correctly
   decode the RTP packets. The packet types used in an RTP session MUST
   be specified by the sender, and signaled through out-of-band means,
   for example by SDP during the setup of a session.

   When packets of both formats are used within the same session,
   different RTP payload type values MUST be used for each format to
   distinguish the packet formats. The association of payload type
   number with the packet format is done out-of-band, for example by SDP
   during the setup of a session.

5. Packet Table of Contents Entries and Codec Data Frame Format

5.1. Packet Table of Contents entries

   Each codec data frame in a Type 1 Interleaved/Bundled packet has a
   corresponding Table of Contents (ToC) entry. The ToC entry indicates
   the rate of the codec frame. (Type 2 Header-Free packets MUST NOT
   have a ToC field, and there is always only one codec data frame in
   each Type 2 Header-Free packet.)

   Each ToC entry occupies four bits. The format of the bits is
   indicated below:

    0 1 2 3
   +-+-+-+-+
   |fr type|
   +-+-+-+-+

   Frame Type: 4 bits
      The frame type indicates the type of the corresponding codec data
      frame in the RTP packet.

   For EVRC and SMV codecs, the frame type values and size of the
   associated codec data frame are described in the table below:

   Value   Rate      Total codec data frame size (in octets)
   ---------------------------------------------------------
     0     Blank      0    (0 bits)
     1     1/8        2    (16 bits)
     2     1/4        5    (40 bits; not valid for EVRC)
     3     1/2       10    (80 bits)
     4     1         22    (171 bits; 5 zero bits padded at end)
     5     Erasure    0    (SHOULD NOT be transmitted by sender)

   All values not listed in the above table MUST be considered
   reserved.
   A ToC entry with a reserved Frame Type value SHOULD be considered
   invalid and substituted with an erasure frame. Note that the EVRC
   codec does not have 1/4 rate frames, thus frame type value 2 MUST be
   considered a reserved value when the EVRC codec is in use.

   Other vocoders that use this packet format need to specify their
   own table of frame types and corresponding codec data frames.

5.2. Codec Data Frames

   The output of the vocoder MUST be converted into codec data frames
   for inclusion in the RTP payload. The conversions for EVRC and SMV
   codecs are specified below. (Note: Because the EVRC codec does not
   have Rate 1/4 frames, the specification of Rate 1/4 frames does not
   apply to EVRC codec data frames.) Other vocoders that use this packet
   format need to specify how to convert vocoder output data into
   frames.

   The codec output data bits as numbered in EVRC and SMV are packed
   into octets. The lowest numbered bit (bit 1 for Rate 1, Rate 1/2,
   Rate 1/4 and Rate 1/8) is placed in the most significant bit
   (internet bit 0) of octet 1 of the codec data frame, the second
   lowest bit is placed in the second most significant bit of the first
   octet, the third lowest in the third most significant bit of the
   first octet, and so on. This continues until all of the bits have
   been placed in the codec data frame.

   The remaining unused bits of the last octet of the codec data frame
   MUST be set to zero. Note that in EVRC and SMV this is only
   applicable to Rate 1 frames (171 bits) as the Rate 1/2 (80 bits),
   Rate 1/4 (40 bits, SMV only) and Rate 1/8 frames (16 bits) fit
   exactly into a whole number of octets.

   Following is a detailed listing showing a Rate 1 EVRC/SMV codec
   output frame converted into a codec data frame:

   The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 octets
   long.
   Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are placed
   as indicated, with bits marked with "Z" set to zero. EVRC/SMV codec
   Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted similarly, but
   do not require zero padding because they align on octet boundaries.

   Rate 1 codec data frame (octets 0 - 3)

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
   |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
   |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Rate 1 codec data frame (octets 18 - 21)

    1           1                   1                   1
    4           5                   6                   7
    4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
   |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
   |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

6. Interleaving Codec Data Frames in Type 1 Packets

   As indicated in Section 4.1, more than one codec data frame MAY be
   included in a single Type 1 Interleaved/Bundled packet by a sender.
   This is accomplished by interleaving or bundling.

   Bundling is used to spread the transmission overhead of the RTP and
   payload header over multiple vocoder frames. Interleaving
   additionally reduces the listener's perception of data loss by
   spreading such loss over non-consecutive vocoder frames. EVRC, SMV,
   and similar vocoders are able to compensate for an occasional lost
   frame, but speech quality degrades exponentially with consecutive
   frame loss.

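   The amortization that bundling provides can be illustrated with a
   short calculation. The following is only a sketch, not part of the
   payload format: the function and constant names are hypothetical,
   the frame sizes are taken from the table in Section 5.1, and the
   12-octet RTP header assumes no CSRC entries and no header extension.

```python
# Sketch only. Frame sizes follow the ToC table in Section 5.1.
# Assumption: a 12-octet RTP header (no CSRC list, no header extension).
FRAME_OCTETS = {0: 0, 1: 2, 2: 5, 3: 10, 4: 22, 5: 0}
RTP_HEADER_OCTETS = 12
PAYLOAD_HEADER_OCTETS = 2  # Interleave Octet + Mode Request/Count octet


def type1_payload_octets(frame_types):
    """Return the RTP payload size in octets of one Type 1 packet."""
    n = len(frame_types)
    if not 1 <= n <= 32:
        raise ValueError("a Type 1 packet carries 1 to 32 frames")
    # Each ToC entry is 4 bits; an odd number of entries gets 4 pad bits.
    toc_octets = (n + 1) // 2
    data_octets = sum(FRAME_OCTETS[t] for t in frame_types)
    return PAYLOAD_HEADER_OCTETS + toc_octets + data_octets


def overhead_per_frame(frame_types):
    """Return the header octets (RTP + payload header + ToC) per frame."""
    data_octets = sum(FRAME_OCTETS[t] for t in frame_types)
    total = RTP_HEADER_OCTETS + type1_payload_octets(frame_types)
    return (total - data_octets) / len(frame_types)
```

   Under these assumptions, a packet carrying a single Rate 1 frame
   spends 15 octets on headers, while a packet bundling four Rate 1
   frames spends 4 octets per frame. The sketch does not reject frame
   type 2 for EVRC; a real implementation would need to.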
   Bundling is signaled by setting the LLL field to zero and the Count
   field to greater than zero. Interleaving is indicated by setting the
   LLL field to a value greater than zero.

   The discussions on general interleaving apply to the bundling (which
   can be viewed as a reduced case of interleaving) with reduced
   complexity. The bundling case is discussed in detail in Section 7.

   Senders MAY support interleaving and/or bundling. All receivers MUST
   support interleaving and bundling.

   Given a time-ordered sequence of output frames from the EVRC or SMV
   codec numbered 0..n, a bundling value B (the number of frames per
   packet, carried as B-1 in the Count field), and an interleave length
   L where n = B * (L+1) - 1, the output frames are placed into RTP
   packets as follows (the values of the fields LLL and NNN are
   indicated for each RTP packet):

   First RTP Packet in Interleave group:
      LLL=L, NNN=0
      Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total of
      B frames

   Second RTP Packet in Interleave group:
      LLL=L, NNN=1
      Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
      total of B frames

   This continues to the last RTP packet in the interleave group:

   (L+1)th RTP Packet in Interleave group:
      LLL=L, NNN=L
      Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
      total of B frames

   Within each interleave group, the RTP packets making up the
   interleave group MUST be transmitted in value-increasing order of the
   NNN field. While this does not guarantee reduced end-to-end delay on
   the receiving end, when packets are delivered in order by the
   underlying transport, delay will be reduced to the minimum possible.

   Receivers MAY signal the maximum number of codec data frames (i.e.,
   the maximum acceptable bundling value B) they can handle in a single
   RTP packet using the OPTIONAL maxptime RTP mode parameter identified
   in Section 10.

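   The frame placement rule given above for an interleave group can be
   sketched in a few lines. This is an illustration only: the function
   names and the (LLL, NNN, frames) tuple are hypothetical stand-ins
   for real packet structures, and the receiver-side function assumes
   that every packet of the group has arrived.

```python
def interleave(frames, interleave_length):
    """Split one interleave group into L+1 (LLL, NNN, frames) tuples.

    Hypothetical helper: `frames` is the time-ordered frame list 0..n.
    """
    if not 0 <= interleave_length <= 7:
        raise ValueError("LLL is a 3-bit field")
    group_size = interleave_length + 1
    if not frames or len(frames) % group_size:
        raise ValueError("frame count must be a non-zero multiple of L+1")
    if len(frames) // group_size > 32:
        raise ValueError("at most 32 frames fit in one packet")
    # Packet NNN=k carries frames k, k+(L+1), k+2(L+1), ...
    return [
        (interleave_length, index, frames[index::group_size])
        for index in range(group_size)
    ]


def reconstruct(packets):
    """Restore time order from a complete interleave group."""
    ordered = sorted(packets, key=lambda packet: packet[1])  # by NNN
    bundles = [packet[2] for packet in ordered]
    # Take the first frame of every packet, then the second, and so on.
    return [frame for row in zip(*bundles) for frame in row]
```

   With L = 2 and six frames, the three packets carry frames (0, 3),
   (1, 4) and (2, 5), so the loss of any single packet never removes
   two consecutive frames. With L = 0 the group collapses to a single
   packet holding the frames in order, which is the bundling case.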
   Receivers MAY signal the maximum interleave length (i.e., the maximum
   acceptable LLL value in the Interleave Octet) they will accept
   using the OPTIONAL maxinterleave RTP mode parameter identified in
   Section 10.

   Additionally, senders have the following restrictions:

   o  MUST NOT bundle more codec data frames in a single RTP packet than
      indicated by maxptime (see Section 10) if it is signaled.

   o  SHOULD NOT bundle more codec data frames in a single RTP packet
      than will fit in the MTU of the underlying network.

   o  Once beginning a session with a given maximum interleaving value
      set by maxinterleave in Section 10, MUST NOT increase the
      interleaving value (LLL) to exceed the maximum interleaving value
      that is signaled.

   o  MAY change the interleaving value only between interleave groups.

   o  Silence suppression MAY only be used between interleave groups. A
      ToC with Frame Type 0 (Blank Frame, Section 5.1) MUST be used
      within interleaving groups if the codec outputs a blank frame.
      The M bit in the RTP header MUST NOT be set, as the stream is
      continuous in time. Because there is only one time stamp for each
      RTP packet, silence suppression used within an interleave group
      will cause ambiguities when reconstructing the speech at the
      receiver side, and thus is prohibited.

6.1. Finding Interleave Group Boundaries

   Given an RTP packet with sequence number S, interleave length (field
   LLL) L, interleave index value (field NNN) N, and bundling value B,
   the interleave group consists of this RTP packet and other RTP
   packets with sequence numbers from S-N to S-N+L inclusive. (The
   sequence numbers used here are for illustrative purposes. When
   wrapping around happens, the sequence numbers need to be adjusted
   accordingly.) In other words, the interleave group always consists of
   L+1 RTP packets with sequential sequence numbers.
   The bundling value for all RTP packets in an interleave group MUST
   be the same.

   The receiver determines the expected bundling value for all RTP
   packets in an interleave group by the number of codec data frames
   bundled in the first RTP packet of the interleave group received.
   Note that this may not be the first RTP packet of the interleave
   group if packets are delivered out of order by the underlying
   transport.

   On receipt of an RTP packet in an interleave group with other than
   the expected bundling value, the receiver MAY discard codec data
   frames off the end of the RTP packet or add erasure codec data frames
   to the end of the packet in order to manufacture a substitute packet
   with the expected bundling value. The receiver MAY instead choose to
   discard the whole interleave group.

6.2. Reconstructing Interleaved Speech

   Given an RTP sequence number ordered set of RTP packets in an
   interleave group numbered 0..L, where L is the interleave length and
   B is the bundling value, and codec data frames within each RTP packet
   that are numbered in order from first to last with the numbers
   0..B-1, the original, time-ordered sequence of output frames from the
   EVRC or SMV codec may be reconstructed as follows:

   First L+1 frames:
      Frame 0 from packet 0 of interleave group
      Frame 0 from packet 1 of interleave group
      And so on up to...
      Frame 0 from packet L of interleave group

   Second L+1 frames:
      Frame 1 from packet 0 of interleave group
      Frame 1 from packet 1 of interleave group
      And so on up to...
      Frame 1 from packet L of interleave group

   And so on up to...

   Bth L+1 frames:
      Frame B-1 from packet 0 of interleave group
      Frame B-1 from packet 1 of interleave group
      And so on up to...
      Frame B-1 from packet L of interleave group

6.3.
Receiving Invalid Interleaving Values

   On receipt of an RTP packet with an invalid value of the LLL or NNN
   fields, the RTP packet SHOULD be treated as lost by the receiver for
   the purpose of generating erasure frames as described in Section 8.

6.4. Additional Receiver Responsibilities

   Assume that the receiver has begun playing frames from an interleave
   group. The time has come to play frame x from packet n of the
   interleave group. Further assume that packet n of the interleave
   group has not been received. As described in Section 8, an erasure
   frame will be sent to the receiving vocoder.

   Now, assume that packet n of the interleave group arrives before
   frame x+1 of that packet is needed. Receivers SHOULD use frame x+1 of
   the newly received packet n rather than substituting an erasure
   frame. In other words, just because packet n was not available the
   first time it was needed to reconstruct the interleaved speech, the
   receiver SHOULD NOT assume it is not available when it is
   subsequently needed for interleaved speech reconstruction.

7. Bundling Codec Data Frames in Type 1 Packets

   As discussed in Section 6, the bundling of codec data frames is a
   special reduced case of interleaving with the LLL value in the
   Interleave Octet set to 0.

   Bundling codec data frames means that multiple data frames are
   included consecutively in a packet, because the interleaving length
   (LLL) is 0. The interleaving group is thus reduced to a single RTP
   packet, and the reconstruction of the codec data frames from RTP
   packets becomes a much simpler process.

   Furthermore, the additional restrictions on senders are reduced to:

   o  MUST NOT bundle more codec data frames in a single RTP packet than
      indicated by maxptime (see Section 10) if it is signaled.

   o  SHOULD NOT bundle more codec data frames in a single RTP packet
      than will fit in the MTU of the underlying network.

8.
Handling Missing Codec Data Frames

   The vocoders covered by this payload format support erasure frames
   as an indication that frames are not available. While an erasure
   frame MUST NOT be transmitted by an RTP sender, it MAY be used
   internally by a receiver to advance the state of the voice decoder by
   exactly one frame time for each missing frame. Using the information
   from packet sequence number, time stamp, and the M bit, the receiver
   can detect missing codec data frames from RTP packet loss and/or
   silence suppression, and generate corresponding erasure frames.
   Erasure frames SHOULD also be used in storage mode to record missing
   frames.

9. Implementation Issues

9.1. Interleaving Length

   The vocoder interpolates the missing speech content when given an
   erasure frame. However, the best quality is perceived by the listener
   when erasure frames are not consecutive. This makes interleaving
   desirable as it increases speech quality when packet loss occurs.

   On the other hand, interleaving can greatly increase the end-to-end
   delay. Where an interactive session is desired, either Type 1
   Interleaved/Bundled with interleaving length (field LLL) 0 or Type 2
   Header-Free RTP payload types are RECOMMENDED.

   When end-to-end delay is not a concern, an interleaving length (field
   LLL) of 4 or 5 is RECOMMENDED.

   The parameters maxptime and maxinterleave are exchanged at the
   initial setup of the session so that the receiver can allocate a
   known amount of buffer space that will be sufficient for all future
   reception in that session. During the session, the sender may
   decrease the bundling value or interleaving length (so that less
   buffer space is required at the receiver), but can never require more
   buffer space. This prevents the situation where a receiver needs to
   allocate more buffer space in the middle of a session but is unable
   to do so.

9.2.
Mode Request 642 The Mode Request signal requests a particular encoding mode for the 643 speech encoding in the reverse direction. All implementations are 644 RECOMMENDED to honor the Mode Request signal. The Mode Request signal 645 SHOULD only be used in one-to-one sessions. In multiparty sessions, 646 any received Mode Request signals SHOULD be ignored. 648 In addition, the Mode Request signal MAY also be sent through non-RTP 649 means, which is out of the scope of this specification. 651 The three-bit Mode Request field is used to signal the receiver to 652 set a particular encoding mode to its audio encoder. If the Mode 653 Request field is set to a non-zero value in RTP packets from node A 654 to node B, it is a request for node B to change to the requested 655 encoding mode for its audio encoder and therefore the bit rate of the 656 RTP stream from node B to node A. Once a node sets this field to a 657 non-zero value it SHOULD continue to set the field to the same value 658 in subsequent packets until the requested mode has changed. This 659 design helps to eliminate the scenario of getting the codec stuck in 660 an unintended state if one of the packets that carries the Mode 661 Request is lost. An otherwise silent node MAY send an RTP packet 662 containing a blank frame in order to send a Mode Request. 664 Each codec type using this format SHOULD define its own 665 interpretation of the Mode Request field. Codecs SHOULD follow the 666 convention that higher values of the three-bit field correspond to an 667 equal or lower average output bit rate. 669 For the EVRC codec, the Mode Request field MUST be interpreted 670 according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec 671 specifications [1]. Values above '100' (4) are currently reserved. 672 If an unknown value above '100' (4) is received, it MUST be handled 673 as if '100' (4) were received. 
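
   As a non-normative illustration, the receiver-side handling of a
   Mode Request value for EVRC might be sketched as follows. The
   function name, its parameters, and the use of None to mean "no
   request to act on" are assumptions made for this sketch and are not
   defined by this document. The sketch assumes the three-bit field
   has already been extracted from the packet.

```python
from typing import Optional

# Highest Mode Request value currently defined for EVRC ('100'),
# taken from the paragraph above.
EVRC_MAX_MODE_REQUEST = 0b100


def effective_mode_request(
    mmm: int,
    one_to_one: bool = True,
    max_defined: int = EVRC_MAX_MODE_REQUEST,
) -> Optional[int]:
    """Return the mode to apply to the local encoder, or None.

    mmm         -- received three-bit Mode Request value (0..7)
    one_to_one  -- False for multiparty sessions (hypothetical flag)
    max_defined -- highest value the codec currently defines
    """
    if not 0 <= mmm <= 0b111:
        raise ValueError("Mode Request is a three-bit field (0..7)")
    # Multiparty sessions: received Mode Requests SHOULD be ignored.
    if not one_to_one:
        return None
    # Only a non-zero value is treated as a request in this sketch.
    if mmm == 0:
        return None
    # Reserved values are handled as the highest defined value.
    return min(mmm, max_defined)
```

   A codec with a different highest defined value would pass that
   value as max_defined rather than rely on the EVRC default.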

   For the SMV codec, the Mode Request field MUST be interpreted
   according to Table 2.2-2 of the SMV codec specifications [2]. Values
   above '101' (5) are currently reserved. If an unknown value above
   '101' (5) is received, it MUST be handled as if '101' (5) were
   received.

10. IANA Considerations

   Two new MIME sub-types as described in this section are to be
   registered.

   The MIME-names for the EVRC and SMV codecs are allocated from the
   IETF tree since all the vocoders covered are expected to be widely
   used for Voice-over-IP applications.

   The RTP mode has been described in the previous sections.

10.1. Storage Mode

   The storage mode is used for storing speech frames, e.g., as a file
   or e-mail attachment.

   The file begins with a magic number to identify the vocoder that is
   used. The magic number for EVRC corresponds to the ASCII character
   string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A" in
   network byte order. The magic number for SMV corresponds to the ASCII
   character string "#!SMV\n", i.e., "0x23 0x21 0x53 0x4D 0x56 0x0A" in
   network byte order.

   The codec data frames are stored in consecutive order, with a single
   ToC entry field, expanded to one octet, prefixing each codec data
   frame. The ToC field is expanded to one octet by setting the
   left-most four bits of the octet to zero. For example, a ToC value
   of 4 (a full-rate frame) is stored as 0x04.

   Speech frames lost in transmission and non-received frames MUST be
   stored as erasure frames (frame type 5, see definition in Section
   5.1) to maintain synchronization with the original media.

10.2. EVRC MIME Registration

   Media Type Name: audio

   Media Subtype Name: EVRC

   Required Parameter for RTP mode:

      ptype: Indicates the Type of the RTP/Vocoder packets. The
         valid values are 1 (Type 1 Interleaved/Bundled) or 2 (Type 2
         Header-Free).

   Optional parameters for RTP mode:

      ptime: Defined as usual for RTP audio [6].

      maxptime: The maximum amount of media which can be encapsulated
         in each packet, expressed as time in milliseconds. The time
         SHALL be calculated as the sum of the time the media present
         in the packet represents. The time SHOULD be a multiple of the
         duration of a single codec data frame (20 msec). If not
         signaled, the default maxptime value SHALL be 200
         milliseconds.

      maxinterleave: Maximum number for interleaving length (field LLL
         in the Interleaving Octet). The interleaving lengths used in
         the entire session MUST NOT exceed this maximum value. If not
         signaled, the maxinterleave length SHALL be 5.

   Optional parameters for storage mode: none

   Encoding considerations for RTP mode: see Section 6 and Section 7 of
   RFC xxxx.

   Encoding considerations for storage mode: see Section 10.1 of RFC
   xxxx.

   Security considerations: see Section 12 "Security Considerations" of
   RFC xxxx.

   Public specification: RFC xxxx.

   Additional information for storage mode:
      Magic number: #!EVRC\n
      File extensions: evc, EVC
      Macintosh file type code: none
      Object identifier or OID: none

   Intended usage: COMMON. It is expected that many VoIP applications
   (as well as mobile applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

10.3. SMV MIME Registration

   Media Type Name: audio

   Media Subtype Name: SMV

   Required Parameter for RTP mode:

      ptype: Indicates the Type of the RTP/Vocoder packets. The
         valid values are 1 (Type 1 Interleaved/Bundled) or 2 (Type 2
         Header-Free).

   Optional parameters for RTP mode:

      ptime: Defined as usual for RTP audio [6].

      maxptime: The maximum amount of media which can be encapsulated
         in each packet, expressed as time in milliseconds. The time
         SHALL be calculated as the sum of the time the media present
         in the packet represents. The time SHOULD be a multiple of the
         duration of a single codec data frame (20 msec). If not
         signaled, the default maxptime value SHALL be 200
         milliseconds.

      maxinterleave: Maximum number for interleaving length (field LLL
         in the Interleaving Octet). The interleaving lengths used in
         the entire session MUST NOT exceed this maximum value. If not
         signaled, the maxinterleave length SHALL be 5.

   Optional parameters for storage mode: none

   Encoding considerations for RTP mode: see Section 6 and Section 7 of
   RFC xxxx.

   Encoding considerations for storage mode: see Section 10.1 of RFC
   xxxx.

   Security considerations: see Section 12 "Security Considerations" of
   RFC xxxx.

   Public specification: RFC xxxx.

   Additional information for storage mode:
      Magic number: #!SMV\n
      File extensions: smv, SMV
      Macintosh file type code: none
      Object identifier or OID: none

   Intended usage: COMMON. It is expected that many VoIP applications
   (as well as mobile applications) will use this type.

   Person & email address to contact for further information:
      Adam Li
      adamli@icsl.ucla.edu

   Author/Change controller:
      Adam Li
      adamli@icsl.ucla.edu
      IETF Audio/Video Transport Working Group

11. Mapping to SDP Parameters

   Please note that this section applies to the RTP mode only.

   Parameters are mapped to SDP [6] as usual.
   Example usage in SDP:
      m=audio 49120 RTP/AVP 97
      a=rtpmap:97 EVRC
      a=fmtp:97 ptype=1; maxinterleave=2
      a=maxptime:80

12. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [4], and any appropriate profile (for example [5]).
   This implies that confidentiality of the media streams is achieved by
   encryption. Because the data compression used with this payload
   format is applied end-to-end, encryption may be performed after
   compression so there is no conflict between the two operations.

   A potential denial-of-service threat exists for data encoding using
   compression techniques that have non-uniform receiver-end
   computational load. The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver to
   become overloaded. However, the encodings covered in this document do
   not exhibit any significant non-uniformity.

   As with any IP-based protocol, in some circumstances, a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired. Network-layer authentication may be used to
   discard packets from undesired sources, but the processing cost of
   the authentication itself may be too high. In a multicast
   environment, pruning of specific sources may be implemented in
   future versions of IGMP [7] and in multicast routing protocols to
   allow a receiver to select which sources are allowed to reach it.

   Interleaving MAY affect encryption. Depending on the encryption
   scheme used, there MAY be restrictions on, for example, the time
   when keys can be changed.

13. Adding Support of Other Frame-Based Vocoders

   As described above, the RTP packet format defined in this document is
   very flexible and designed to be usable by other frame-based
   vocoders.

   Additional vocoders using this format MUST have properties as
   described in Section 3.3.

   The following need to be done in order for any eligible vocoders to
   use the RTP payload format defined in this document:

   o Define the unit used for RTP time stamp;
   o Define the meaning of the Mode Request bits;
   o Define corresponding codec data frame type values for ToC;
   o Define the conversion procedure for the vocoder's output data
     frames;
   o Define a magic number for storage mode, and complete the
     corresponding MIME registration.

14. Acknowledgements

   The following authors have made significant contributions to this
   document: Adam H. Li, John D. Villasenor, Dong-Seek Park, Jeong-Hoon
   Park, Keith Miller, S. Craig Greer, David Leon, Nikolai Leung,
   Marcello Lioy, Kyle J. McKay, Magdalena L. Espelien, Randall Gellens,
   Tom Hiller, Peter J. McCann, Stinson S. Mathai, Michael D. Turner,
   Ajay Rajkumar, Dan Gal, Magnus Westerlund, Lars-Erik Jonsson, Greg
   Sherwood, and Thomas Zeng.

15. References

   [1] 3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service
       Option 3 for Wideband Spread Spectrum Digital Systems", January
       1997.

   [2] 3GPP2 C.S0030, "Selectable Mode Vocoder", August 2001.

   [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", BCP 14, RFC 2119, March 1997.

   [4] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
       "RTP: A Transport Protocol for Real-Time Applications", RFC
       1889, January 1996.

   [5] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
       with Minimal Control", RFC 1890, January 1996.

   [6] Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

   [7] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC
       1112, August 1989.

16. Authors' Address

   The editor will serve as the point of contact for technical issues.

   Adam H. Li
   Image Communication Lab
   Electrical Engineering Department
   University of California
   Los Angeles, CA 90095
   USA
   Phone: +1 310 825 5178
   Email: adamli@icsl.ucla.edu