Audio Video Transport WG                                          Q. Xie
Internet-Draft                                                 D. Pearce
Expires: April 16, 2004                                         Motorola
                                                        October 17, 2003


      RTP Payload Format for European Telecommunications Standards
    Institute (ETSI) European Standard ES 202 050 Distributed Speech
                          Recognition Encoding
                   draft-ietf-avt-dsr-es202050-01.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 16, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This document specifies an RTP payload format for encapsulating ETSI
   Standard ES 202 050 advanced front-end signal processing feature
   streams for distributed speech recognition (DSR) systems.

Table of Contents

   1.   Conventions
   2.   Introduction
   2.1  ETSI ES 202 050 Advanced DSR Front-end Codec
   3.   ES 202 050 DSR RTP Payload Format
   3.1  Consideration on Number of FPs in Each RTP Packet
   3.2  Support for Discontinuous Transmission
   4.   Frame Pair Formats
   4.1  Format of Speech and Non-speech FPs
   4.2  Format of Null FP
   4.3  RTP header usage
   5.   IANA Considerations
   5.1  Mapping MIME Parameters into SDP
   6.   Security Considerations
   7.   Acknowledgments
        Normative References
        Informative References
        Authors' Addresses
        Intellectual Property and Copyright Statements

1. Conventions

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
   SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
   they appear in this document, are to be interpreted as described in
   [3].

   The following acronyms are used in this document:

   DSR  - Distributed Speech Recognition
   ETSI - the European Telecommunications Standards Institute
   FP   - Frame Pair
   DTX  - Discontinuous Transmission
   VAD  - Voice Activity Detection

2. Introduction

   Distributed speech recognition (DSR) technology is intended for a
   remote device acting as a thin client, also known as the front-end,
   to communicate with a speech recognition server, also called a speech
   engine, over a network connection to obtain speech recognition
   services. More details on DSR over the Internet can be found in [7].

   To achieve interoperability with different client devices and speech
   engines, the first ETSI standard DSR front-end, ES 201 108, was
   published in early 2000 [8], and an RTP packetization for ES 201 108
   frames has been defined in the IETF [7].

   In ES 202 050 [1], ETSI issued another standard for an Advanced DSR
   front-end that provides substantially improved recognition
   performance when background noise is present. The codecs in ES 202
   050 use a slightly different frame format from that of ES 201 108,
   and thus the two do not interoperate.

   The RTP packetization for the ES 202 050 front-end defined in this
   document uses the same RTP packet format layout as that defined in
   [7]. The differences are in the DSR codec frame bit definition and
   the payload type MIME registration.

2.1 ETSI ES 202 050 Advanced DSR Front-end Codec

   Some relevant characteristics of the ES 202 050 Advanced DSR
   front-end codec are summarized below.

   The front-end calculation is a frame-based scheme that produces an
   output vector every 10 ms. In the front-end feature extraction, noise
   reduction by two stages of Wiener filtering is performed first. Then,
   waveform processing is applied to the de-noised signal and
   mel-cepstral features are calculated. At the end, blind equalization
   is applied to the cepstral features.
   The front-end algorithm produces at its output a mel-cepstral
   representation in the same format as ES 201 108, i.e., 12 cepstral
   coefficients [C1 - C12], C0, and log Energy. Voice activity detection
   (VAD) for the classification of each frame as speech or non-speech is
   also implemented in Feature Extraction. The VAD information is
   included in the payload format for each frame pair sent to the remote
   recognition engine. This information may optionally be used by the
   receiving recognition engine to drop non-speech frames. The front-end
   supports three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz. (It is
   worthwhile to note that, unlike some other speech codecs, the feature
   frame size of DSR presented to RTP packetization does not depend on
   the number of speech samples used in each 10 ms sample frame. This
   will become more evident in the following sections.)

   After calculation of the mel-cepstral representation, the
   representation is first quantized via split-vector quantization to
   reduce the data rate of the encoded stream. Then, the quantized
   vectors from two consecutive frames are put into an FP, as described
   in more detail in Section 4.1 below.

3. ES 202 050 DSR RTP Payload Format

   An ES 202 050 DSR RTP payload datagram uses exactly the same layout
   as defined in Section 3 of [7], i.e., a standard RTP header followed
   by a DSR payload containing a series of DSR FPs.

   The size of each ES 202 050 FP is still 96 bits or 12 octets (see
   Section 4 below). This ensures that a DSR RTP payload will always
   end on an octet boundary.

3.1 Consideration on Number of FPs in Each RTP Packet

   The same considerations described in Section 3.1 of [7] apply to the
   ES 202 050 RTP payload.

3.2 Support for Discontinuous Transmission

   The same considerations described in Section 3.2 of [7] apply to the
   ES 202 050 RTP payload.

4. Frame Pair Formats

4.1 Format of Speech and Non-speech FPs

   The following mel-cepstral frame MUST be used, as defined in [1]:

   As defined in [1], pairs of the quantized 10 ms mel-cepstral frames
   MUST be grouped together and protected with a 4-bit CRC, forming a
   92-bit long FP:

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    +                                                               +
    |         Frame Pair (88 bits) = Frame #1 + Frame #2            |
    +                                               +-+-+-+-+-+-+-+-+
    |                                               |  CRC  |0|0|0|0|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Here Frame #1 and Frame #2 above MUST use the following mel-cepstral
   frame formats:

   Frame #1 in FP:
   ===============
     (MSB)                                     (LSB)
       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    :  idx(2,3) |            idx(0,1)               |     Octet 1
    +-----+-----+-----+-----+-----+-----+-----+-----+
    :       idx(4,5)        |     idx(2,3) (cont)   :     Octet 2
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |             idx(6,7)              |idx(4,5)(cont)   Octet 3
    +-----+-----+-----+-----+-----+-----+-----+-----+
      idx(10,11)| VAD |         idx(8,9)            |     Octet 4
    +-----+-----+-----+-----+-----+-----+-----+-----+
    :       idx(12,13)      |  idx(10,11) (cont)    :     Octet 5
    +-----+-----+-----+-----+-----+-----+-----+-----+
                            |   idx(12,13) (cont)   :     Octet 6/1
                            +-----+-----+-----+-----+

   Frame #2 in FP:
   ===============
     (MSB)                                     (LSB)
       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+
    :       idx(0,1)        |                             Octet 6/2
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |             idx(2,3)              |idx(0,1)(cont)   Octet 7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    :  idx(6,7) |             idx(4,5)              |     Octet 8
    +-----+-----+-----+-----+-----+-----+-----+-----+
    :       idx(8,9)        |    idx(6,7) (cont)    :     Octet 9
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |            idx(10,11)             | VAD |idx(8,9)(cont)  Octet 10
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |                   idx(12,13)                  |     Octet 11
    +-----+-----+-----+-----+-----+-----+-----+-----+

   The 4-bit CRC in the FP MUST be calculated using the formula
   (including the bit-order rules) defined in 7.2 in [1].

   Therefore, each FP represents 20 ms of original speech. Note that, as
   shown above, each FP MUST be padded with 4 zeros in the LSB 4 bits of
   the last octet in order to align the FP to the 32-bit word boundary.
   This makes the total size of an FP 96 bits, or 12 octets. Note that
   this padding is separate from the padding indicated by the P bit in
   the RTP header.

   The definitions of the indices and the 'VAD' flag are given in [1],
   and their values are only set and examined by the codecs in the
   front-end client and the recognizer.

   Any number of FPs MAY be aggregated together in an RTP payload, and
   they MUST be consecutive in time. However, one SHOULD always keep the
   RTP payload size smaller than the MTU in order to avoid IP
   fragmentation and SHOULD follow the recommendations given in Section
   3.1 in [7] when determining the proper number of FPs in an RTP
   payload.

4.2 Format of Null FP

   A Null FP for the ES 202 050 front-end codec is defined by setting
   the content of the first and second frame in the FP to null (i.e.,
   filling the first 88 bits of the FP with 0's). The 4-bit CRC MUST be
   calculated the same way as described in 7.2.4 in [1], and 4 zeros
   MUST be padded to the end of the Null FP to make it 32-bit word
   aligned.

4.3 RTP header usage

   The format of the RTP header is specified in [5]. This payload format
   uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first FP in the packet. The timestamp clock
   frequency is the same as the sampling frequency, so the timestamp
   unit is in samples.
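
   As an illustration only, the FP size and timestamp arithmetic implied
   by this document can be sketched in Python as follows. The function
   and constant names are hypothetical and are not defined by this
   document or by [1]. The sketch assumes a 20 ms FP duration, the
   sampling rates 8000, 11000, and 16000 Hz, and the default maxptime of
   80 ms. The CRC value is taken as an input, because its formula and
   bit-order rules are defined in [1] and are not reproduced here.

```python
FP_OCTETS = 12          # 88 frame-pair bits + 4-bit CRC + 4 zero padding bits
FP_DURATION_MS = 20     # one FP carries two 10 ms frames
VALID_RATES = (8000, 11000, 16000)  # sampling rates named in this document


def samples_per_fp(rate):
    """Timestamp increment for one FP at the given sampling rate."""
    if rate not in VALID_RATES:
        raise ValueError("unsupported sampling rate: %r" % (rate,))
    return rate * FP_DURATION_MS // 1000


def payload_octets(num_fps):
    """Size of a DSR payload carrying num_fps frame pairs."""
    return num_fps * FP_OCTETS


def max_fps_for_maxptime(maxptime_ms=80):
    """Largest whole number of FPs that fits in maxptime (default 80 ms)."""
    return maxptime_ms // FP_DURATION_MS


def fp_timestamps(first_ts, num_fps, rate):
    """Timestamps of consecutive FPs; RTP timestamps wrap modulo 2**32."""
    step = samples_per_fp(rate)
    return [(first_ts + i * step) % 2**32 for i in range(num_fps)]


def assemble_fp(frame_pair, crc_nibble):
    """Append the CRC (upper 4 bits) and zero padding (lower 4 bits).

    frame_pair is the 11-octet (88-bit) Frame #1 + Frame #2 field, already
    packed by the codec; crc_nibble is the 4-bit CRC computed per [1].
    """
    if len(frame_pair) != 11:
        raise ValueError("frame pair must be exactly 11 octets")
    if not 0 <= crc_nibble <= 0xF:
        raise ValueError("CRC must fit in 4 bits")
    return bytes(frame_pair) + bytes([crc_nibble << 4])
```

   For instance, under these assumptions a packet carrying four FPs at
   8000 Hz has a 48-octet payload, and the timestamps of its FPs advance
   in steps of 160.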

   As defined by the ES 202 050 front-end codec, the duration of one FP
   is 20 ms, corresponding to 160, 220, or 320 encoded samples when a
   sampling rate of 8, 11, or 16 kHz, respectively, is used at the
   front-end. Thus, the timestamp is increased by 160, 220, or 320,
   respectively, for each consecutive FP.

   The DSR payload for the ES 202 050 front-end codec is always an
   integral number of octets. If additional padding is required for some
   other purpose, then the P bit in the RTP header may be set and
   padding appended as specified in [5].

   The RTP header marker bit (M) should be set following the general
   rules for audio codecs as defined in Section 4.1 in [6].

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.

5. IANA Considerations

   One new MIME subtype registration is required for this payload type,
   as described below.

   Media Type name: audio

   Media subtype name: dsr-es202050

   Required parameters: none

   Optional parameters:

   rate: Indicates the sample rate of the speech. Valid values include:
      8000, 11000, and 16000. If this parameter is not present, a sample
      rate of 8000 is assumed.

   maxptime: The maximum amount of media which can be encapsulated in
      each packet, expressed as time in milliseconds. The time shall be
      calculated as the sum of the time the media present in the packet
      represents. The time SHOULD be a multiple of the frame pair size
      (i.e., one FP => 20 ms).

      If this parameter is not present, maxptime is assumed to be 80 ms.

      Note, since the performance of most speech recognizers is
      extremely sensitive to consecutive FP losses, if the user of the
      payload format expects a high packet loss ratio for the session,
      it MAY consider explicitly choosing a maxptime value for the
      session that is shorter than the default value.

   ptime: see RFC2327 [4].

   Encoding considerations: This type is defined for transfer via RTP
      [5] as described in Sections 3 and 4 of RFC XXXX.

   Security considerations: See Section 6 of RFC XXXX.

   Person & email address to contact for further information:
      Qiaobing.Xie@motorola.com

   Intended usage: COMMON. It is expected that many VoIP applications
      (as well as mobile applications) will use this type.

   Author/Change controller:

      * Qiaobing.Xie@motorola.com

      * IETF Audio/Video transport working group

5.1 Mapping MIME Parameters into SDP

   The information carried in the MIME media type specification has a
   specific mapping to fields in the Session Description Protocol (SDP)
   [4], which is commonly used to describe RTP sessions. When SDP is
   used to specify sessions employing the ES 202 050 DSR codec, the
   mapping is as follows:

   o  The MIME type ("audio") goes in SDP "m=" as the media name.

   o  The MIME subtype ("dsr-es202050") goes in SDP "a=rtpmap" as the
      encoding name.

   o  The optional parameter "rate" also goes in "a=rtpmap" as the clock
      rate.

   o  The optional parameters "ptime" and "maxptime" go in the SDP
      "a=ptime" and "a=maxptime" attributes, respectively.

   Example of usage of ES 202 050 DSR:

      m=audio 49120 RTP/AVP 101
      a=rtpmap:101 dsr-es202050/8000
      a=maxptime:40

6. Security Considerations

   Implementations using the payload defined in this specification are
   subject to the security considerations discussed in the RTP
   specification [5] and the RTP profile [6]. This payload does not
   specify any different security services.

7. Acknowledgments

   The design presented here is based on that of [7].

Normative References

   [1]  European Telecommunications Standards Institute (ETSI) Standard
        ES 202 050, "Speech Processing, Transmission and Quality Aspects
        (STQ); Distributed Speech Recognition; Front-end Feature
        Extraction Algorithm; Compression Algorithms",
        (http://pda.etsi.org/pda/home.asp?wki_id=6402), October 2002.

   [2]  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
        9, RFC 2026, October 1996.

   [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [4]  Handley, M. and V. Jacobson, "SDP: Session Description
        Protocol", RFC 2327, April 1998.

   [5]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications", RFC
        3550, July 2003.

   [6]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
        Conferences with Minimal Control", RFC 3551, July 2003.

   [7]  Xie, Q., "RTP Payload Format for European Telecommunications
        Standards Institute (ETSI) European Standard ES 201 108
        Distributed Speech Recognition Encoding", RFC 3557, July 2003.

Informative References

   [8]  European Telecommunications Standards Institute (ETSI) Standard
        ES 201 108, "Speech Processing, Transmission and Quality Aspects
        (STQ); Distributed Speech Recognition; Front-end Feature
        Extraction Algorithm; Compression Algorithms",
        http://webapp.etsi.org/pda/home.asp?wki_id=9948, April 2000.

Authors' Addresses

   Qiaobing Xie
   Motorola, Inc.
   1501 W. Shure Drive, 2-F9
   Arlington Heights, IL 60004
   US

   Phone: +1-847-632-3028
   EMail: qxie1@email.mot.com

   David Pearce
   Motorola Labs
   UK Research Laboratory
   Jays Close
   Viables Industrial Estate
   Basingstoke, HANTS RG22 4PD
   UK

   Phone: +44 (0)1256 484 436
   EMail: bdp003@motorola.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.