idnits 2.17.1

draft-ietf-avt-rtp-dsr-codecs-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3667, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 806.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 783.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 790.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 796.

  ** The document seems to lack an RFC 3978 Section 5.1 IPR Disclosure
     Acknowledgement -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document uses RFC 3667 boilerplate or RFC 3978-like boilerplate
     instead of verbatim RFC 3978 boilerplate.  After 6 May 2005, submission
     of drafts without verbatim RFC 3978 boilerplate is not accepted.

     The following non-3978 patterns matched text found in the document.
     That text should be removed or replaced:

       By submitting this Internet-Draft, I certify that any applicable
       patent or other IPR claims of which I am aware have been disclosed, or
       will be disclosed, and any of which I become aware will be disclosed,
       in accordance with RFC 3668.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 82 instances of too long lines in the document, the longest
     one being 1 character in excess of 72.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment.  If not, you may need to add the pre-RFC5378
     disclaimer.  (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 17, 2004) is 7246 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: '4' is defined on line 716, but no explicit reference
     was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  ** Obsolete normative reference: RFC 2327 (ref. '6') (Obsoleted by RFC 4566)

  ** Obsolete normative reference: RFC 3267 (ref. '8') (Obsoleted by RFC 4867)

  Summary: 8 errors (**), 0 flaws (~~), 4 warnings (==), 10 comments (--).

  Run idnits with the --verbose option for more detailed information about
  the items above.

--------------------------------------------------------------------------------

2    Audio Video Transport WG                                          Q. Xie
3    Internet-Draft                                                 D. Pearce
4    Expires: December 16, 2004                                      Motorola
5                                                               June 17, 2004

7         RTP Payload Formats for European Telecommunications Standards
8      Institute (ETSI) European Standard ES 202 050, ES 202 211, and ES 202
9                   212 Distributed Speech Recognition Encoding
10                     draft-ietf-avt-rtp-dsr-codecs-03.txt

12   Status of this Memo

14      By submitting this Internet-Draft, I certify that any applicable
15      patent or other IPR claims of which I am aware have been disclosed,
16      and any of which I become aware will be disclosed, in accordance with
17      RFC 3668.

19      Internet-Drafts are working documents of the Internet Engineering
20      Task Force (IETF), its areas, and its working groups.  Note that
21      other groups may also distribute working documents as
22      Internet-Drafts.

24      Internet-Drafts are draft documents valid for a maximum of six months
25      and may be updated, replaced, or obsoleted by other documents at any
26      time.  It is inappropriate to use Internet-Drafts as reference
27      material or to cite them other than as "work in progress."

29      The list of current Internet-Drafts can be accessed at
30      http://www.ietf.org/ietf/1id-abstracts.txt.

32      The list of Internet-Draft Shadow Directories can be accessed at
33      http://www.ietf.org/shadow.html.

35      This Internet-Draft will expire on December 16, 2004.
37   Abstract

39      This document specifies RTP payload formats for encapsulating ETSI
40      Standard ES 202 050 DSR Advanced Front-end (AFE), ES 202 211 DSR
41      Extended Front-end (XFE), and ES 202 212 DSR Extended Advanced
42      Front-end (XAFE) signal processing feature streams for distributed
43      speech recognition (DSR) systems.

45   Table of Contents

47      1.  Conventions
48      2.  Introduction
49        2.1   ETSI ES 202 050 Advanced DSR Front-end Codec
50        2.2   ETSI ES 202 211 Extended DSR Front-end Codec
51        2.3   ETSI ES 202 212 Extended Advanced DSR Front-end Codec
52      3.  DSR RTP Payload Formats
53        3.1   Common Considerations of the Three DSR RTP Payload
54              Formats
55          3.1.1   Number of FPs in Each RTP Packet
56          3.1.2   Support for Discontinuous Transmission
57          3.1.3   RTP header usage
58        3.2   Payload Format for ES 202 050 DSR
59          3.2.1   Frame Pair Formats
60        3.3   Payload Format for ES 202 211 DSR
61          3.3.1   Frame Pair Formats
62        3.4   Payload Format ES 202 212 DSR
63          3.4.1   Frame Pair Formats
64      4.  IANA Considerations
65        4.1   Mapping MIME Parameters into SDP
66        4.2   Usage in Offer/Answer
67      5.  Security Considerations
68      6.  Acknowledgments
69      7.  References
70        7.1   Normative References
71        7.2   Informative References
72          Authors' Addresses
73          Intellectual Property and Copyright Statements

75   1.  Conventions

77      The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
78      SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
79      they appear in this document, are to be interpreted as described in
80      RFC 2119 [5].

82      The following acronyms are used in this document:

84         DSR  - Distributed Speech Recognition
85         ETSI - the European Telecommunications Standards Institute
86         FP   - Frame Pair
87         DTX  - Discontinuous Transmission
88         VAD  - Voice Activity Detection

90   2.  Introduction

92      Distributed speech recognition (DSR) technology is intended for a
93      remote device acting as a thin client, also known as the front-end,
94      to communicate with a speech recognition server, also called a speech
95      engine, over a network connection to obtain speech recognition
96      services.  More details on DSR over the Internet can be found in
97      RFC 3557 [11].

99      To achieve interoperability with different client devices and speech
100     engines, the first ETSI standard DSR front-end ES 201 108 was
101     published in early 2000 [12], and an RTP packetization for ES 201 108
102     frames is defined in RFC 3557 [11] by the IETF.

104     In ES 202 050 [1], ETSI issues another standard for an Advanced DSR
105     front-end that provides substantially improved recognition
106     performance when background noise is present.  The codec in ES 202
107     050 uses a slightly different frame format from that of ES 201 108
108     and thus the two do not inter-operate with each other.

110     The RTP packetization for ES 202 050 front-end defined in this
111     document uses the same RTP packet format layout as that defined in
112     RFC 3557 [11].  The differences are in the DSR codec frame bit
113     definition and the payload type MIME registration.
115     Two further standards, ES 202 211 and ES 202 212, provide
116     extensions to each of the DSR front-end standards.  The extensions
117     allow the speech waveform to be reconstructed for human audition and
118     can also be used to improve recognition performance for tonal
119     languages.  This is done by sending additional pitch and voicing
120     information for each frame along with the recognition features.

122     The RTP packet formats for these extended standards are also defined
123     in this document.

125     It is worthwhile to note that the performance of most speech
126     recognizers is extremely sensitive to consecutive frame losses and
127     the DSR speech recognizers are no exception.  If a DSR over RTP
128     session is expected to endure a high packet loss ratio between the
129     front-end and the speech engine, one should consider limiting the
130     maximum number of DSR frames allowed in a packet, or employing other
131     loss management techniques, such as FEC or interleaving, to minimize
132     the chance of losing consecutive frames.

134  2.1  ETSI ES 202 050 Advanced DSR Front-end Codec

136     Some relevant characteristics of the ES 202 050 Advanced DSR
137     front-end codec are summarized below.

139     The front-end calculation is a frame-based scheme that produces an
140     output vector every 10 ms.  In the front-end feature extraction,
141     noise reduction by two stages of Wiener filtering is performed first.
142     Then, waveform processing is applied to the de-noised signal and
143     mel-cepstral features are calculated.  At the end, blind equalization
144     is applied to the cepstral features.  The front-end algorithm
145     produces at its output a mel-cepstral representation in the same
146     format as ES 201 108, i.e., 12 cepstral coefficients [C1 - C12], C0
147     and log Energy.  Voice activity detection (VAD) for the
148     classification of each frame as speech or non-speech is also
149     implemented in Feature Extraction.  The VAD information is included
150     in the payload format for each frame pair to be sent to the remote
151     recognition engine.  This information may optionally be used by the
152     receiving recognition engine to drop non-speech frames.  The
153     front-end supports three raw sampling rates: 8 kHz, 11 kHz, and 16
154     kHz.  (It is worthwhile to note that, unlike some other speech codecs,
155     the feature frame size of DSR presented to RTP packetization is not
156     dependent on the number of speech samples used in each 10 ms sample
157     frame.  This will become more evident in the following sections.)

159     After calculation of the mel-cepstral representation, the
160     representation is first quantized via split-vector quantization to
161     reduce the data rate of the encoded stream.  Then, the quantized
162     vectors from two consecutive frames are put into a frame pair (FP),
163     as described in more detail in Section 3.2 below.

165  2.2  ETSI ES 202 211 Extended DSR Front-end Codec

167     Some relevant characteristics of the ES 202 211 Extended DSR
168     front-end codec are summarized below.

170     ES 202 211 is an extension of the mel-cepstrum DSR Front-end standard
171     ES 201 108 [12].  The mel-cepstrum front-end provides the features
172     for speech recognition but these are not available for human
173     listening.  The purpose of the extension is to allow the
174     reconstruction of the speech waveform from these features so that it
175     can be replayed.  The front-end feature extraction part of the
176     processing is exactly the same as for ES 201 108.  To allow speech
177     reconstruction, additional fundamental frequency (perceived as pitch)
178     and voicing class (e.g. non-speech, voiced, unvoiced and mixed)
179     information is needed.  This is the extra information that is
180     provided by the extended front-end processing algorithms at the
181     device side and that is compressed and transmitted along with the
182     front-end features to the server.  This extra information may also be
183     useful for improved speech recognition performance with tonal
184     languages such as Mandarin, Cantonese and Thai.

186     The client-side signal processing algorithms used in the standard
187     are fully described in the specification ES 202 211 [2].

190     The additional fundamental frequency and voicing class information is
191     compressed for each frame pair.  The pitch for the first frame of the
192     FP is quantized to 7 bits and the second frame is differentially
193     quantized with 5 bits.  The voicing class is indicated with one bit
194     for each frame.  The extension information for a frame pair
195     therefore consists of 14 bits plus an additional 2 bits of CRC
196     error protection computed over these extension bits only.

198     The total information for the frame pair is made up of 92 bits for
199     the two compressed front-end feature frames (including 4 bits for
200     their CRC) plus 16 bits for the extension (including 2 bits for their
201     CRC) and 4 bits of null padding to give a total of 14 octets per
202     frame pair.  As for ES 201 108, the extended frame pair also
203     corresponds to 20 ms of speech.  The extended front-end supports
204     three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz.

206     The quantized vectors from two consecutive frames are put into an FP,
207     as described in more detail in Section 3.3 below.

209     The parameters received at the remote server from the RTP extended
210     DSR payload specified here can be used to synthesize an intelligible
211     speech waveform for replay.  The algorithms to do this are described
212     in the specification ES 202 211 [2].

214  2.3  ETSI ES 202 212 Extended Advanced DSR Front-end Codec

216     ES 202 212 is the extension for the DSR Advanced Front-end ES 202 050
217     [1].  It provides the same capabilities as the extended mel-cepstrum
218     front-end described in Section 2.2 but for the DSR Advanced
219     Front-end.

221  3.  DSR RTP Payload Formats

223  3.1  Common Considerations of the Three DSR RTP Payload Formats

225     The three DSR RTP payload formats defined in this document share the
226     following considerations and behaviours.

228  3.1.1  Number of FPs in Each RTP Packet

230     Any number of FPs MAY be aggregated together in an RTP payload and
231     they MUST be consecutive in time.  However, one SHOULD always keep
232     the RTP payload size smaller than the MTU in order to avoid IP
233     fragmentation and SHOULD follow the recommendations given in Section
234     3.1 in RFC 3557 [11] when determining the proper number of FPs in an
235     RTP payload.

237  3.1.2  Support for Discontinuous Transmission

239     The same considerations described in Section 3.2 of RFC 3557 [11]
240     apply to all three DSR RTP payloads defined in this document.

242  3.1.3  RTP header usage

244     The format of the RTP header is specified in RFC 3550 [9].  The three
245     payload formats defined here use the fields of the header in a manner
246     consistent with that specification.

248     The RTP timestamp corresponds to the sampling instant of the first
249     sample encoded for the first FP in the packet.  The timestamp clock
250     frequency is the same as the sampling frequency, so the timestamp
251     unit is in samples.

253     As defined by all three front-end codecs, the duration of one FP
254     is 20 ms, corresponding to 160, 220, or 320 encoded samples with a
255     sampling rate of 8, 11, or 16 kHz being used at the front-end,
256     respectively.  Thus, the timestamp is increased by 160, 220, or 320
257     for each consecutive FP, respectively.

259     The DSR payload for all three front-end codecs is always an
260     integral number of octets.  If additional padding is required for
261     some other purpose, then the P bit in the RTP header may be
262     set and padding appended as specified in RFC 3550 [9].

264     The RTP header marker bit (M) MUST be set following the general rules
265     for audio codecs as defined in Section 4.1 in RFC 3551 [10].
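
The timestamp rule in Section 3.1.3 can be illustrated with a minimal Python sketch. The function and constant names below are hypothetical and not part of this specification; the only facts taken from the text are the 20 ms FP duration and the three sampling rates, written here as the 8000, 11000, and 16000 clock rates used for the "rate" parameter. The modulo 2**32 wrap reflects the 32-bit RTP timestamp field of RFC 3550.

```python
# Hypothetical helper names; only the 20 ms FP duration and the three
# sampling rates come from the specification text.
FP_DURATION_MS = 20
VALID_RATES_HZ = (8000, 11000, 16000)


def samples_per_fp(rate_hz: int) -> int:
    """Return the RTP timestamp increment for one frame pair (FP)."""
    if rate_hz not in VALID_RATES_HZ:
        raise ValueError(f"unsupported sampling rate: {rate_hz}")
    return rate_hz * FP_DURATION_MS // 1000


def packet_timestamp(initial_ts: int, fp_intervals_elapsed: int, rate_hz: int) -> int:
    """Timestamp of a packet whose first FP starts `fp_intervals_elapsed`
    20 ms intervals after the sampling instant that produced `initial_ts`.

    RTP timestamps are 32-bit unsigned values, so the result wraps.
    """
    increment = fp_intervals_elapsed * samples_per_fp(rate_hz)
    return (initial_ts + increment) % 2**32


if __name__ == "__main__":
    for rate in VALID_RATES_HZ:
        print(rate, samples_per_fp(rate))
```

Counting elapsed 20 ms intervals, rather than FPs actually sent, keeps the timestamp tied to the sampling instant even when discontinuous transmission leaves gaps in the stream.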
267     The assignment of an RTP payload type for these three new packet
268     formats is outside the scope of this document, and will not be
269     specified here.  It is expected that the RTP profile under which any
270     of these payload formats is being used will assign a payload type for
271     this encoding or specify that the payload type is to be bound
272     dynamically.

274  3.2  Payload Format for ES 202 050 DSR

276     An ES 202 050 DSR RTP payload datagram uses exactly the same layout
277     as defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
278     followed by a DSR payload containing a series of DSR FPs.

280     The size of each ES 202 050 FP is still 96 bits or 12 octets (defined
281     in the following sections).  This ensures that a DSR RTP payload will
282     always end on an octet boundary.

284  3.2.1  Frame Pair Formats

286  3.2.1.1  Format of Speech and Non-speech FPs

288     The following mel-cepstral frame MUST be used, as defined in [1]:

290     As defined in [1], pairs of the quantized 10ms mel-cepstral frames
291     MUST be grouped together and protected with a 4-bit CRC, forming a
292     92-bit long FP.  At the end, each FP MUST be padded with 4 zeros to
293     the MSB 4 bits of the last octet in order to make the FP aligned to
294     the octet boundary.
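
Because every ES 202 050 FP occupies exactly 12 octets (92 bits of frame data and CRC plus 4 padding bits), a receiver can split a DSR payload into FPs by length alone. The following Python sketch illustrates this under stated assumptions: the function names are hypothetical, and the input is taken to be the DSR payload with the RTP header and any RTP-level padding already removed.

```python
# Hypothetical helper names. The 12-octet FP size and the rule that the
# 4 most significant bits of the last octet are zero come from the text.
FP_OCTETS_ES202050 = 12


def split_frame_pairs(payload: bytes, fp_octets: int = FP_OCTETS_ES202050) -> list[bytes]:
    """Split a DSR payload into consecutive fixed-size frame pairs."""
    if len(payload) % fp_octets != 0:
        raise ValueError("payload is not a whole number of frame pairs")
    return [payload[i:i + fp_octets] for i in range(0, len(payload), fp_octets)]


def padding_is_zero(fp: bytes) -> bool:
    """True if the 4 most significant bits of the last octet are all zero."""
    return (fp[-1] & 0xF0) == 0


if __name__ == "__main__":
    demo = bytes(11) + b"\x0a" + bytes(11) + b"\x05"
    for fp in split_frame_pairs(demo):
        print(fp.hex(), padding_is_zero(fp))
```

This check only validates the framing. It does not verify the 4-bit CRC, which has to be computed as defined in [1].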
296     The following diagram shows a complete ES 202 050 FP:

298     Frame #1 in FP:
299     ===============
300        (MSB)                                     (LSB)
301          0     1     2     3     4     5     6     7
302       +-----+-----+-----+-----+-----+-----+-----+-----+
303       : idx(2,3)  |             idx(0,1)              |  Octet 1
304       +-----+-----+-----+-----+-----+-----+-----+-----+
305       :       idx(4,5)        |    idx(2,3) (cont)    :  Octet 2
306       +-----+-----+-----+-----+-----+-----+-----+-----+
307       |             idx(6,7)              |idx(4,5)(cont)  Octet 3
308       +-----+-----+-----+-----+-----+-----+-----+-----+
309   idx(10,11)| VAD |             idx(8,9)              |  Octet 4
310       +-----+-----+-----+-----+-----+-----+-----+-----+
311       :      idx(12,13)       |   idx(10,11) (cont)   :  Octet 5
312       +-----+-----+-----+-----+-----+-----+-----+-----+
313                               |   idx(12,13) (cont)   :  Octet 6/1
314                               +-----+-----+-----+-----+

316     Frame #2 in FP:
317     ===============
318        (MSB)                                     (LSB)
319          0     1     2     3     4     5     6     7
320       +-----+-----+-----+-----+
321       :       idx(0,1)        |                          Octet 6/2
322       +-----+-----+-----+-----+-----+-----+-----+-----+
323       |             idx(2,3)              |idx(0,1)(cont)  Octet 7
324       +-----+-----+-----+-----+-----+-----+-----+-----+
325       : idx(6,7)  |             idx(4,5)              |  Octet 8
326       +-----+-----+-----+-----+-----+-----+-----+-----+
327       :       idx(8,9)        |    idx(6,7) (cont)    :  Octet 9
328       +-----+-----+-----+-----+-----+-----+-----+-----+
329       |         idx(10,11)          | VAD |idx(8,9)(cont)  Octet 10
330       +-----+-----+-----+-----+-----+-----+-----+-----+
331       |                  idx(12,13)                   |  Octet 11
332       +-----+-----+-----+-----+-----+-----+-----+-----+

334     CRC for Frame #1 and Frame #2 and padding in FP:
335     ================================================
336        (MSB)                                     (LSB)
337          0     1     2     3     4     5     6     7
338       +-----+-----+-----+-----+-----+-----+-----+-----+
339       |  0  |  0  |  0  |  0  |          CRC          |  Octet 12
340       +-----+-----+-----+-----+-----+-----+-----+-----+

342     The 4-bit CRC in the FP MUST be calculated using the formula
343     (including the bit-order rules) defined in 7.2 in [1].

345     Therefore, each FP represents 20 ms of original speech.  Note, as
346     shown above, each FP MUST be padded with 4 zeros to the MSB 4 bits of
347     the last octet in order to make the FP aligned to the octet boundary.
348     This makes the total size of an FP 96 bits, or 12 octets.  Note, this
349     padding is separate from padding indicated by the P bit in the RTP
350     header.

352     The definitions of the indices and the 'VAD' flag are given in [1],
353     and their values are only set and examined by the codecs in the
354     front-end client and the recognizer.

356  3.2.1.2  Format of Null FP

358     Null FPs are sent to mark the end of a transmission segment.  Details
359     on transmission segments and the use of Null FPs can be found in RFC
360     3557 [11].

362     A Null FP for the ES 202 050 front-end codec is defined by setting
363     the content of the first and second frame in the FP to null (i.e.,
364     filling the first 88 bits of the FP with 0's).  The 4-bit CRC MUST be
365     calculated the same way as described in 7.2.4 in [1], and 4 zeros
366     MUST be padded to the end of the Null FP to make it octet aligned.

368  3.3  Payload Format for ES 202 211 DSR

370     An ES 202 211 DSR RTP payload datagram is very similar to that
371     defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
372     followed by a DSR payload containing a series of DSR FPs.

374     The size of each ES 202 211 FP is 112 bits or 14 octets (defined in
375     the following sections).  This ensures that a DSR RTP payload will
376     always end on an octet boundary.

378  3.3.1  Frame Pair Formats

380  3.3.1.1  Format of Speech and Non-speech FPs

382     The following mel-cepstral frame MUST be used, as defined in Section
383     6.2.4 in [2]:

385     As defined in Section 6.2.4 in [2], after two frames (Frame #1 and
386     Frame #2) worth of codebook indices, or 88 bits, a 4-bit CRC
387     calculated on these 88 bits immediately follows.  The pitch
388     indices of the first frame (Pidx1: 7 bits) and the second frame
389     (Pidx2: 5 bits) of the frame pair then follow.  The class indices of
390     the two frames in the frame pair, worth 1 bit each (Cidx1 and Cidx2),
391     follow next.  Finally, a 2-bit CRC calculated on the pitch and class
392     bits (total: 14 bits) of the frame pair is included (PC-CRC).  The
393     total number of bits in a frame pair packet is therefore 44 + 44 + 4
394     + 7 + 5 + 1 + 1 + 2 = 108.  At the end, each FP MUST be padded with 4
395     zeros to the MSB 4 bits of the last octet in order to make the FP
396     aligned to the octet boundary.

398     The following diagram shows a complete ES 202 211 FP:

400     Frame #1 in FP:
401     ===============
402        (MSB)                                     (LSB)
403          0     1     2     3     4     5     6     7
404       +-----+-----+-----+-----+-----+-----+-----+-----+
405       : idx(2,3)  |             idx(0,1)              |  Octet 1
406       +-----+-----+-----+-----+-----+-----+-----+-----+
407       :       idx(4,5)        |    idx(2,3) (cont)    :  Octet 2
408       +-----+-----+-----+-----+-----+-----+-----+-----+
409       |             idx(6,7)              |idx(4,5)(cont)  Octet 3
410       +-----+-----+-----+-----+-----+-----+-----+-----+
411        idx(10,11) |             idx(8,9)              |  Octet 4
412       +-----+-----+-----+-----+-----+-----+-----+-----+
413       :      idx(12,13)       |   idx(10,11) (cont)   :  Octet 5
414       +-----+-----+-----+-----+-----+-----+-----+-----+
415                               |   idx(12,13) (cont)   :  Octet 6/1
416                               +-----+-----+-----+-----+

418     Frame #2 in FP:
419     ===============
420        (MSB)                                     (LSB)
421          0     1     2     3     4     5     6     7
422       +-----+-----+-----+-----+
423       :       idx(0,1)        |                          Octet 6/2
424       +-----+-----+-----+-----+-----+-----+-----+-----+
425       |             idx(2,3)              |idx(0,1)(cont)  Octet 7
426       +-----+-----+-----+-----+-----+-----+-----+-----+
427       : idx(6,7)  |             idx(4,5)              |  Octet 8
428       +-----+-----+-----+-----+-----+-----+-----+-----+
429       :       idx(8,9)        |    idx(6,7) (cont)    :  Octet 9
430       +-----+-----+-----+-----+-----+-----+-----+-----+
431       |            idx(10,11)             |idx(8,9)(cont)  Octet 10
432       +-----+-----+-----+-----+-----+-----+-----+-----+
433       |                  idx(12,13)                   |  Octet 11
434       +-----+-----+-----+-----+-----+-----+-----+-----+

436     CRC for Frame #1 and Frame #2 in FP:
437     ====================================
438        (MSB)                                     (LSB)
439          0     1     2     3     4     5     6     7
440                               +-----+-----+-----+-----+
441                               |          CRC          |  Octet 12/1
442                               +-----+-----+-----+-----+

444     Extension information and padding in FP:
445     ========================================
446        (MSB)                                     (LSB)
447          0     1     2     3     4     5     6     7
448       +-----+-----+-----+-----+
449       :         Pidx1         |                          Octet 12/2
450       +-----+-----+-----+-----+-----+-----+-----+-----+
451       |            Pidx2            |  Pidx1 (cont)   :  Octet 13
452       +-----+-----+-----+-----+-----+-----+-----+-----+
453       |  0  |  0  |  0  |  0  |  PC-CRC   |Cidx2|Cidx1|  Octet 14
454       +-----+-----+-----+-----+-----+-----+-----+-----+

456     The 4-bit CRC and the 2-bit PC-CRC in the FP MUST be calculated using
457     the formula (including the bit-order rules) defined in 6.2.4 in [2].

459     Therefore, each FP represents 20 ms of original speech.  Note, as
460     shown above, each FP MUST be padded with 4 zeros to the MSB 4 bits of
461     the last octet in order to make the FP aligned to the octet boundary.
462     This makes the total size of an FP 112 bits, or 14 octets.  Note,
463     this padding is separate from padding indicated by the P bit in the
464     RTP header.

466  3.3.1.2  Format of Null FP

468     A Null FP for the ES 202 211 front-end codec is defined by setting
469     all 112 bits of the FP to 0.  Null FPs are sent to mark the
470     end of a transmission segment.  Details on transmission segments and
471     the use of Null FPs can be found in RFC 3557 [11].

473  3.4  Payload Format ES 202 212 DSR

475     Similar to other ETSI DSR front-end encoding schemes, the encoded DSR
476     feature stream of ES 202 212 is transmitted in a sequence of frame
477     pairs (FPs), where each FP represents two consecutive original voice
478     frames.

480     An ES 202 212 DSR RTP payload datagram is very similar to that
481     defined in Section 3 of RFC 3557 [11], i.e., a standard RTP header
482     followed by a DSR payload containing a series of DSR FPs.

484     The size of each ES 202 212 FP is 112 bits or 14 octets (defined in
485     the following sections).  This ensures that an ES 202 212 DSR RTP
486     payload will always end on an octet boundary.
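
Since the extended FPs have a fixed size of 14 octets and a Null FP is defined as all 112 bits set to 0 (Section 3.3.1.2 for ES 202 211, and Section 3.4.1.2 for ES 202 212), the end of a transmission segment can be located with a plain byte comparison. The Python sketch below is illustrative only; its names are hypothetical, and it assumes the input is the DSR payload without the RTP header or RTP-level padding. It does not apply to ES 202 050, whose Null FP carries a computed CRC.

```python
# Hypothetical helper names. The 14-octet FP size and the all-zero
# Null FP convention come from the specification text.
EXT_FP_OCTETS = 14
NULL_FP = bytes(EXT_FP_OCTETS)  # 112 bits, all set to 0


def iter_frame_pairs(payload: bytes):
    """Yield successive 14-octet frame pairs from a DSR payload."""
    if len(payload) % EXT_FP_OCTETS != 0:
        raise ValueError("payload is not a whole number of 14-octet FPs")
    for offset in range(0, len(payload), EXT_FP_OCTETS):
        yield payload[offset:offset + EXT_FP_OCTETS]


def fps_before_null(payload: bytes) -> tuple[int, bool]:
    """Return (count of FPs before the first Null FP, whether one was found)."""
    count = 0
    for fp in iter_frame_pairs(payload):
        if fp == NULL_FP:
            return count, True
        count += 1
    return count, False


if __name__ == "__main__":
    speech_fp = bytes([0x01] * EXT_FP_OCTETS)  # placeholder, not real codec output
    print(fps_before_null(speech_fp * 2 + NULL_FP))
```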
488  3.4.1  Frame Pair Formats

489  3.4.1.1  Format of Speech and Non-speech FPs

491     The following mel-cepstral frame MUST be used, as defined in Section
492     7.2.4 in [3]:

494     As defined in Section 7.2.4 in [3], after two frames (Frame #1 and
495     Frame #2) worth of codebook indices, or 88 bits, a 4-bit CRC
496     calculated on these 88 bits immediately follows.  The pitch
497     indices of the first frame (Pidx1: 7 bits) and the second frame
498     (Pidx2: 5 bits) of the frame pair then follow.  The class indices of
499     the two frames in the frame pair, worth 1 bit each (Cidx1 and Cidx2),
500     follow next.  Finally, a 2-bit CRC (PC-CRC) calculated on the pitch
501     and class bits (total: 14 bits) of the frame pair is included.  The
502     total number of bits in a frame pair packet is therefore 44 + 44 + 4
503     + 7 + 5 + 1 + 1 + 2 = 108.  At the end, each FP MUST be padded with 4
504     zeros to the MSB 4 bits of the last octet in order to make the FP
505     aligned to the octet boundary.  The padding brings the total size of
506     an FP to 112 bits, or 14 octets.  Note, this padding is separate from
507     padding indicated by the P bit in the RTP header.
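
The bit budget in the paragraph above can be checked mechanically, and the last octet of an extended FP can be unpacked. The Python sketch below is a minimal illustration, not a decoder: all names are hypothetical, and the bit positions assume Octet 14 is laid out as drawn in the FP diagram of this section (four zero padding bits in the most significant positions, then the 2-bit PC-CRC, then Cidx2, with Cidx1 in the least significant bit). The PC-CRC is returned as a raw 2-bit value; its bit order and verification are defined in [3].

```python
from typing import NamedTuple

# Field widths in bits, taken from the paragraph above.
FIELD_BITS = {
    "frame1": 44, "frame2": 44, "crc": 4,
    "pidx1": 7, "pidx2": 5, "cidx1": 1, "cidx2": 1, "pc_crc": 2,
}
PADDING_BITS = 4


class Octet14(NamedTuple):
    """Fields carried in the last octet of an extended FP (hypothetical type)."""
    cidx1: int
    cidx2: int
    pc_crc: int


def total_fp_bits() -> int:
    """Total FP size in bits, including the 4 zero padding bits."""
    return sum(FIELD_BITS.values()) + PADDING_BITS


def unpack_octet14(octet: int) -> Octet14:
    """Unpack Octet 14, assuming the layout shown in the FP diagram."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("octet out of range")
    if octet & 0xF0:
        raise ValueError("the 4 MSB padding bits must be zero")
    return Octet14(
        cidx1=octet & 0x01,
        cidx2=(octet >> 1) & 0x01,
        pc_crc=(octet >> 2) & 0x03,
    )


if __name__ == "__main__":
    print(total_fp_bits(), total_fp_bits() // 8)
    print(unpack_octet14(0b00001101))
```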
509     The following diagram shows a complete ES 202 212 FP:

511     Frame #1 in FP:
512     ===============
513        (MSB)                                     (LSB)
514          0     1     2     3     4     5     6     7
515       +-----+-----+-----+-----+-----+-----+-----+-----+
516       : idx(2,3)  |             idx(0,1)              |  Octet 1
517       +-----+-----+-----+-----+-----+-----+-----+-----+
518       :       idx(4,5)        |    idx(2,3) (cont)    :  Octet 2
519       +-----+-----+-----+-----+-----+-----+-----+-----+
520       |             idx(6,7)              |idx(4,5)(cont)  Octet 3
521       +-----+-----+-----+-----+-----+-----+-----+-----+
522   idx(10,11)| VAD |             idx(8,9)              |  Octet 4
523       +-----+-----+-----+-----+-----+-----+-----+-----+
524       :      idx(12,13)       |   idx(10,11) (cont)   :  Octet 5
525       +-----+-----+-----+-----+-----+-----+-----+-----+
526                               |   idx(12,13) (cont)   :  Octet 6/1
527                               +-----+-----+-----+-----+

529     Frame #2 in FP:
530     ===============
531        (MSB)                                     (LSB)
532          0     1     2     3     4     5     6     7
533       +-----+-----+-----+-----+
534       :       idx(0,1)        |                          Octet 6/2
535       +-----+-----+-----+-----+-----+-----+-----+-----+
536       |             idx(2,3)              |idx(0,1)(cont)  Octet 7
537       +-----+-----+-----+-----+-----+-----+-----+-----+
538       : idx(6,7)  |             idx(4,5)              |  Octet 8
539       +-----+-----+-----+-----+-----+-----+-----+-----+
540       :       idx(8,9)        |    idx(6,7) (cont)    :  Octet 9
541       +-----+-----+-----+-----+-----+-----+-----+-----+
542       |         idx(10,11)          | VAD |idx(8,9)(cont)  Octet 10
543       +-----+-----+-----+-----+-----+-----+-----+-----+
544       |                  idx(12,13)                   |  Octet 11
545       +-----+-----+-----+-----+-----+-----+-----+-----+

547     CRC for Frame #1 and Frame #2 in FP:
548     ====================================
549        (MSB)                                     (LSB)
550          0     1     2     3     4     5     6     7
551                               +-----+-----+-----+-----+
552                               |          CRC          |  Octet 12/1
553                               +-----+-----+-----+-----+

555     Extension information and padding in FP:
556     ========================================
557        (MSB)                                     (LSB)
558          0     1     2     3     4     5     6     7
559       +-----+-----+-----+-----+
560       :         Pidx1         |                          Octet 12/2
561       +-----+-----+-----+-----+-----+-----+-----+-----+
562       |            Pidx2            |  Pidx1 (cont)   :  Octet 13
563       +-----+-----+-----+-----+-----+-----+-----+-----+
564       |  0  |  0  |  0  |  0  |  PC-CRC   |Cidx2|Cidx1|  Octet 14
565       +-----+-----+-----+-----+-----+-----+-----+-----+

567     The codebook indices, VAD flag, pitch index, and class index are
568     specified in Section 6 of [3].  The 4-bit CRC and the 2-bit PC-CRC in
569     the FP MUST be calculated using the formula (including the bit-order
570     rules) defined in 7.2.4 in [3].

572  3.4.1.2  Format of Null FP

574     A Null FP for the ES 202 212 front-end codec is defined by setting
575     all 112 bits of the FP to 0.  Null FPs are sent to mark the
576     end of a transmission segment.  Details on transmission segments and
577     the use of Null FPs can be found in RFC 3557 [11].

579  4.  IANA Considerations

581     For each of the three ETSI DSR front-end codecs covered in this
582     document, a new MIME subtype registration is required for the
583     corresponding payload type, as described below.

585     Media Type name: audio

587     Media subtype names:

589        dsr-es202050 (for ES 202 050 front-end)

591        dsr-es202211 (for ES 202 211 front-end)

593        dsr-es202212 (for ES 202 212 front-end)

595     Required parameters: none

597     Optional parameters:

599     rate: Indicates the sample rate of the speech.  Valid values include:
600        8000, 11000, and 16000.  If this parameter is not present, a
601        sample rate of 8000 is assumed.

603     maxptime: see RFC 3267 [8].  If this parameter is not present,
604        maxptime is assumed to be 80ms.

606        Note, since the performance of most speech recognizers is
607        extremely sensitive to consecutive FP losses, if the user of the
608        payload format expects a high packet loss ratio for the session,
609        it MAY explicitly choose a maxptime value for the
610        session that is shorter than the default value.

612     ptime: see RFC 2327 [6].

614     Encoding considerations: These types are defined for transfer via RTP
615        [9] as described in Section 3 of RFC XXXX.

617     Security considerations: See Section 5 of RFC XXXX.

619     Person & email address to contact for further information:
620        Qiaobing.Xie@motorola.com

622     Intended usage: COMMON.  It is expected that many VoIP applications
623        (as well as mobile applications) will use this type.

625     Author/Change controller:

627     *  Qiaobing.Xie@motorola.com

629     *  IETF Audio/Video transport working group

631  4.1  Mapping MIME Parameters into SDP

633     The information carried in the MIME media type specification has a
634     specific mapping to fields in the Session Description Protocol (SDP)
635     [6], which is commonly used to describe RTP sessions.  When SDP is
636     used to specify sessions employing the ES 202 050, ES 202 211, or
637     ES 202 212 DSR codec, the mapping is as follows:

639     o  The MIME type ("audio") goes in SDP "m=" as the media name.

641     o  The MIME subtype ("dsr-es202050", "dsr-es202211", or
642        "dsr-es202212") goes in SDP "a=rtpmap" as the encoding name.

644     o  The optional parameter "rate" also goes in "a=rtpmap" as the clock
645        rate.  If no rate is given, then the default value (i.e., 8000) is
646        used in SDP.

648     o  The optional parameters "ptime" and "maxptime" go in the SDP
649        "a=ptime" and "a=maxptime" attributes, respectively.

651     Example of usage of ES 202 050 DSR:

653        m=audio 49120 RTP/AVP 101
654        a=rtpmap:101 dsr-es202050/8000
655        a=maxptime:40

657     Example of usage of ES 202 211 DSR:

659        m=audio 49120 RTP/AVP 101
660        a=rtpmap:101 dsr-es202211/8000
661        a=maxptime:40

663     Example of usage of ES 202 212 DSR:

665        m=audio 49120 RTP/AVP 101
666        a=rtpmap:101 dsr-es202212/8000
667        a=maxptime:40

669  4.2  Usage in Offer/Answer

671     All SDP parameters in this payload format are declarative, and all
672     reasonable values are expected to be supported.  Thus, the standard
673     usage of Offer/Answer as described in RFC 3264 [7] should be
674     followed.

676  5.  Security Considerations

678     Implementations using the payload defined in this specification are
679     subject to the security considerations discussed in the RTP
680     specification RFC 3550 [9] and any RTP profile, e.g. RFC 3551 [10].
681     This payload does not specify any different security services.
683    Congestion control for RTP MUST be used in accordance with RFC 3550
684    [9], and any applicable RTP profile, e.g. RFC 3551 [10].

686 6. Acknowledgments

688    The design presented here is based on that of RFC 3557 [11]. The
689    authors wish to thank Magnus Westerlund and others for their review
690    and comments.

692 7. References

694 7.1 Normative References

696    [1]   European Telecommunications Standards Institute (ETSI) Standard
697          ES 202 050, "Speech Processing, Transmission and Quality
698          Aspects (STQ); Distributed Speech Recognition; Front-end
699          Feature Extraction Algorithm; Compression Algorithms",
700          (http://pda.etsi.org/pda/), October 2002.

702    [2]   European Telecommunications Standards Institute (ETSI) Standard
703          ES 202 211, "Speech Processing, Transmission and Quality
704          Aspects (STQ); Distributed Speech Recognition; Extended
705          front-end feature extraction algorithm; Compression algorithms;
706          Back-end speech reconstruction algorithm",
707          (http://pda.etsi.org/pda/), November 2003.

709    [3]   European Telecommunications Standards Institute (ETSI) Standard
710          ES 202 212, "Speech Processing, Transmission and Quality
711          aspects (STQ); Distributed speech recognition; Extended
712          advanced front-end feature extraction algorithm; Compression
713          algorithms; Back-end speech reconstruction algorithm",
714          (http://pda.etsi.org/pda/), November 2003.

716    [4]   Bradner, S., "The Internet Standards Process -- Revision 3",
717          BCP 9, RFC 2026, October 1996.

719    [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
720          Levels", BCP 14, RFC 2119, March 1997.

722    [6]   Handley, M. and V. Jacobson, "SDP: Session Description
723          Protocol", RFC 2327, April 1998.

725    [7]   Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with
726          the Session Description Protocol (SDP)", RFC 3264, June 2002.

728    [8]   Sjoberg, J., Westerlund, M., Lakaniemi, A. and Q.
Xie,
729          "Real-Time Transport Protocol (RTP) Payload Format and File
730          Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive
731          Multi-Rate Wideband (AMR-WB) Audio Codecs", RFC 3267, June
732          2002.

734    [9]   Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
735          "RTP: A Transport Protocol for Real-Time Applications", RFC
736          3550, July 2003.

738    [10]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
739          Conferences with Minimal Control", RFC 3551, July 2003.

741    [11]  Xie, Q., "RTP Payload Format for European Telecommunications
742          Standards Institute (ETSI) European Standard ES 201 108
743          Distributed Speech Recognition Encoding", RFC 3557, July 2003.

745 7.2 Informative References

747    [12]  European Telecommunications Standards Institute (ETSI) Standard
748          ES 201 108, "Speech Processing, Transmission and Quality
749          Aspects (STQ); Distributed Speech Recognition; Front-end
750          Feature Extraction Algorithm; Compression Algorithms",
751          (http://webapp.etsi.org/pda/), April 2000.

753 Authors' Addresses

755    Qiaobing Xie
756    Motorola, Inc.
757    1501 W. Shure Drive, 2-F9
758    Arlington Heights, IL 60004
759    US

761    Phone: +1-847-632-3028
762    EMail: qxie1@email.mot.com

763    David Pearce
764    Motorola Labs
765    UK Research Laboratory
766    Jays Close
767    Viables Industrial Estate
768    Basingstoke, HANTS RG22 4PD
769    UK

771    Phone: +44 (0)1256 484 436
772    EMail: bdp003@motorola.com

774 Intellectual Property Statement

776    The IETF takes no position regarding the validity or scope of any
777    Intellectual Property Rights or other rights that might be claimed to
778    pertain to the implementation or use of the technology described in
779    this document or the extent to which any license under such rights
780    might or might not be available; nor does it represent that it has
781    made any independent effort to identify any such rights. Information
782    on the procedures with respect to rights in RFC documents can be
783    found in BCP 78 and BCP 79.
785    Copies of IPR disclosures made to the IETF Secretariat and any
786    assurances of licenses to be made available, or the result of an
787    attempt made to obtain a general license or permission for the use of
788    such proprietary rights by implementers or users of this
789    specification can be obtained from the IETF on-line IPR repository at
790    http://www.ietf.org/ipr.

792    The IETF invites any interested party to bring to its attention any
793    copyrights, patents or patent applications, or other proprietary
794    rights that may cover technology that may be required to implement
795    this standard. Please address the information to the IETF at
796    ietf-ipr@ietf.org.

798 Disclaimer of Validity

800    This document and the information contained herein are provided on an
801    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
802    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
803    ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
804    INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
805    INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
806    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

808 Full Copyright Statement

810    Copyright (C) The Internet Society (2004). This document is subject
811    to the rights, licenses and restrictions contained in BCP 78, and
812    except as set forth therein, the authors retain all their rights.

814 Acknowledgment

816    Funding for the RFC Editor function is currently provided by the
817    Internet Society.