Audio Video Transport WG                                          Q. Xie
Internet-Draft                                                 D. Pearce
Expires: June 11, 2003                                          Motorola
                                                       December 11, 2002

      RTP Payload Format for ETSI ES 202 050 Distributed Speech
                        Recognition Encoding
                  draft-xie-avt-dsr-es202050-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on June 11, 2003.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This document specifies an RTP payload format for encapsulating ETSI
   Standard ES 202 050 advanced front-end signal processing feature
   streams for distributed speech recognition (DSR) systems.

Table of Contents

   1.   Conventions
   2.   Introduction
   2.1  ETSI ES 202 050 DSR Front-end Codec
   3.   ES 202 050 DSR RTP Payload Format
   3.1  Consideration on Number of FPs in Each RTP Packet
   3.2  Support for Discontinuous Transmission
   4.   Frame Pair Formats
   4.1  Format of Speech and Non-speech FPs
   4.2  Format of Null FP
   4.3  RTP header usage
   5.   DSR MIME Type Registration
   5.1  Mapping MIME Parameters into SDP
   6.   Security Considerations
   7.   Acknowledgments
        Normative References
        Informative References
        Authors' Addresses
        Full Copyright Statement

1.
Conventions

   The keywords MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD,
   SHOULD NOT, RECOMMENDED, NOT RECOMMENDED, MAY, and OPTIONAL, when
   they appear in this document, are to be interpreted as described in
   [4].

   The following acronyms are used in this document:

   DSR  - Distributed Speech Recognition
   ETSI - the European Telecommunications Standards Institute
   FP   - Frame Pair
   DTX  - Discontinuous Transmission

2.  Introduction

   Distributed speech recognition (DSR) technology is intended for a
   remote device acting as a thin client, also known as the front-end,
   to communicate with a speech recognition server, also called a speech
   engine, over a network connection to obtain speech recognition
   services.  More details on DSR over the Internet can be found in [7].

   To achieve interoperability with different client devices and speech
   engines, the first ETSI standard DSR front-end, ES 201 108, was
   published in early 2000 [8], and an RTP packetization for ES 201 108
   frames is defined by the IETF in [7].

   In ES 202 050 [1], ETSI issues another standard for an Advanced DSR
   front-end that is meant to provide substantially improved recognition
   performance in background noise.  The codecs in ES 202 050 use a
   different frame format from that of ES 201 108, and the two do not
   inter-operate with each other.  Thus, this document defines a
   separate RTP packetization for the ES 202 050 front-end.

2.1  ETSI ES 202 050 DSR Front-end Codec

   Some relevant characteristics of the ES 202 050 DSR front-end codec
   are summarized below.

   The coding algorithm, a standard mel-cepstral technique common to
   many speech recognition systems, supports three raw sampling rates: 8
   kHz, 11 kHz, and 16 kHz.  The mel-cepstral calculation is a
   frame-based scheme that produces an output vector every 10 ms.
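
   As a rough illustration of the timing just described, the sketch
   below computes how many input samples the front-end advances between
   two output vectors at each supported rate.  It assumes that 8, 11,
   and 16 kHz mean exactly 8000, 11000, and 16000 samples per second,
   as the values of the "rate" parameter in Section 5 suggest.  The
   function and constant names are hypothetical.

```python
# Sketch only: names are illustrative and not taken from ES 202 050.
VECTOR_INTERVAL_MS = 10  # one output vector every 10 ms
SUPPORTED_RATES_HZ = (8000, 11000, 16000)  # assumed exact sample rates


def samples_per_vector(rate_hz: int) -> int:
    """Input samples the front-end advances between two output vectors."""
    if rate_hz not in SUPPORTED_RATES_HZ:
        raise ValueError(f"unsupported sampling rate: {rate_hz}")
    return rate_hz * VECTOR_INTERVAL_MS // 1000


for rate in SUPPORTED_RATES_HZ:
    print(rate, samples_per_vector(rate))
```

   Two such vectors cover 20 ms of speech, which is consistent with the
   per-FP timestamp increments of 160, 220, and 320 given in Section
   4.3.
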

   After calculation of the mel-cepstral representation, the
   representation is first quantized via split-vector quantization to
   reduce the data rate of the encoded stream.  Then, the quantized
   vectors from two consecutive frames are put into an FP, as described
   in more detail in Section 4.1.

3.  ES 202 050 DSR RTP Payload Format

   An ES 202 050 DSR RTP payload datagram consists of a standard RTP
   header [2] followed by a DSR payload.  The DSR payload itself is
   formed by concatenating a series of ES 202 050 DSR FPs (defined in
   Section 4).

   FPs are always packed bit-contiguously into the payload octets
   beginning with the most significant bit.  For ES 202 050 front-end,
   the size of each FP is 96 bits or 12 octets (see Sections 4.1 and
   4.2).  This ensures that a DSR payload will always end on an octet
   boundary.

   The following example shows a DSR RTP datagram carrying a DSR payload
   containing three 96-bit-long FPs (bit 0 is the MSB):

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                    RTP header in [RFC1889]                    /
   \                                                               \
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                                                               |
   +                                                               +
   |                        FP #1 (96 bits)                        |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                        FP #2 (96 bits)                        |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                        FP #3 (96 bits)                        |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3.1  Consideration on Number of FPs in Each RTP Packet

   The number of FPs per payload packet should be determined by the
   latency and bandwidth requirements of the DSR application using this
   payload format.
   In particular, using a smaller number of FPs per payload packet in a
   session will result in lowered bandwidth efficiency due to the
   RTP/UDP/IP header overhead, while using a larger number of FPs per
   packet will cause longer end-to-end delay and hence increased
   recognition latency.  Furthermore, carrying a larger number of FPs
   per packet will increase the possibility of catastrophic packet loss;
   the loss of a large number of consecutive FPs is a situation most
   speech recognizers have difficulty dealing with.

   It is therefore RECOMMENDED that the number of FPs per DSR payload
   packet be minimized, subject to meeting the application's
   requirements on network bandwidth efficiency.  RTP header compression
   techniques, such as those defined in [9] and [10], should be
   considered to improve network bandwidth efficiency.

3.2  Support for Discontinuous Transmission

   The DSR RTP payloads may be used to support discontinuous
   transmission (DTX) of speech, in which DSR FPs are sent only when
   speech has been detected at the terminal equipment.

   In DTX, a set of DSR frames coding an unbroken speech segment
   transmitted from the terminal to the server is called a transmission
   segment.  A DSR frame inside such a transmission segment can be
   either a speech frame or a non-speech frame, depending on the nature
   of the section of the speech signal it represents.

   The end of a transmission segment is determined at the sending end
   equipment when the number of consecutive non-speech frames exceeds a
   pre-set threshold, called the hangover time.  A typical value used
   for the hangover time is 1.5 seconds.

   After all FPs in a transmission segment are sent, the front-end
   SHOULD indicate the end of the current transmission segment by
   sending one or more Null FPs (defined in Section 4.2).

4.
Frame Pair Formats

4.1  Format of Speech and Non-speech FPs

   Similar to the frame pairing format defined in Section 7.2.4 of [1],
   pairs of the quantized 10 ms mel-cepstral front-end frames, of 44
   bits each, MUST be grouped together and protected with a 4-bit CRC.

   Together, these two front-end frames and the CRC field form a
   92-bit-long Frame Pair (FP):

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Frame #1 (44 bits)                       |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |          Frame #2 (44 bits)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
   |                                               | CRC   |0|0|0|0|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Therefore, each FP represents 20 ms of original speech.  Note that,
   as shown above, each FP MUST be padded with 4 zeros in the 4 least
   significant bits of the last octet so that the FP is aligned to a
   32-bit word boundary.  This makes the total size of an FP 96 bits, or
   12 octets.  Note that this padding is separate from the padding
   indicated by the P bit in the RTP header.

   The 4-bit CRC MUST be calculated using the formula defined in Section
   7.2.4 of [1].
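
   The FP layout above can be checked mechanically on the receiving
   side.  The following is a minimal sketch, not a reference
   implementation: it cuts a payload into 12-octet FPs, separates the
   11 octets that carry the two frames from the last octet, reads the
   CRC from the 4 most significant bits of that octet, and verifies the
   4 zero padding bits.  It does not compute the CRC, whose formula is
   defined in [1] and is not reproduced here.  The names split_payload
   and split_fp are hypothetical.

```python
from typing import List, Tuple

FP_OCTETS = 12  # 2 x 44 frame bits + 4 CRC bits + 4 padding bits = 96 bits


def split_payload(payload: bytes) -> List[bytes]:
    """Cut a DSR payload into consecutive 12-octet FPs."""
    if len(payload) % FP_OCTETS != 0:
        raise ValueError("payload length is not a multiple of 12 octets")
    return [payload[i:i + FP_OCTETS] for i in range(0, len(payload), FP_OCTETS)]


def split_fp(fp: bytes) -> Tuple[bytes, int]:
    """Return (the 11 octets holding both frames, the 4-bit CRC value)."""
    if len(fp) != FP_OCTETS:
        raise ValueError("an FP is exactly 12 octets long")
    last = fp[-1]
    if last & 0x0F:
        raise ValueError("padding bits in the last octet are not zero")
    return fp[:-1], last >> 4


# Three FPs whose frame bits are all zero and whose CRC nibble is 0xA
# (an arbitrary value chosen for illustration, not a computed CRC).
payload = (bytes(11) + bytes([0xA0])) * 3
for fp in split_payload(payload):
    frames, crc = split_fp(fp)
    print(len(frames), hex(crc))
```

   Treating the first 11 octets as opaque keeps the sketch independent
   of the per-field bit order inside each frame.
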

   To be consistent with the bit-order used in [1], the following
   mel-cepstral frame formats MUST be used when forming an FP:

   Frame #1 in FP:
   ===============
      (MSB)                                     (LSB)
        0     1     2     3     4     5     6     7
     +-----+-----+-----+-----+-----+-----+-----+-----+
     :  idx(2,3) |             idx(0,1)              |    Octet 1
     +-----+-----+-----+-----+-----+-----+-----+-----+
     :       idx(4,5)        |    idx(2,3) (cont)    :    Octet 2
     +-----+-----+-----+-----+-----+-----+-----+-----+
     |             idx(6,7)              |idx(4,5)(cont)  Octet 3
     +-----+-----+-----+-----+-----+-----+-----+-----+
 idx(10,11)| VAD |             idx(8,9)              |    Octet 4
     +-----+-----+-----+-----+-----+-----+-----+-----+
     :      idx(12,13)       |   idx(10,11) (cont)   :    Octet 5
     +-----+-----+-----+-----+-----+-----+-----+-----+
                             |   idx(12,13) (cont)   :    Octet 6/1
                             +-----+-----+-----+-----+

   Frame #2 in FP:
   ===============
      (MSB)                                     (LSB)
        0     1     2     3     4     5     6     7
     +-----+-----+-----+-----+
     :       idx(0,1)        |                            Octet 6/2
     +-----+-----+-----+-----+-----+-----+-----+-----+
     |             idx(2,3)              |idx(0,1)(cont)  Octet 7
     +-----+-----+-----+-----+-----+-----+-----+-----+
     :  idx(6,7) |             idx(4,5)              |    Octet 8
     +-----+-----+-----+-----+-----+-----+-----+-----+
     :       idx(8,9)        |    idx(6,7) (cont)    :    Octet 9
     +-----+-----+-----+-----+-----+-----+-----+-----+
     |         idx(10,11)          | VAD |idx(8,9)(cont)  Octet 10
     +-----+-----+-----+-----+-----+-----+-----+-----+
     |                  idx(12,13)                   |    Octet 11
     +-----+-----+-----+-----+-----+-----+-----+-----+

   The usage of the index fields and the 'VAD' flag is defined in [1],
   and their values are only set and examined by the codecs in the
   front-end client and the recognizer.

4.2  Format of Null FP

   A Null FP for the ES 202 050 front-end codec is defined by setting
   the content of the first and second frame in the FP to null (i.e.,
   filling the first 88 bits of the FP with 0's).
   The 4-bit CRC MUST be calculated the same way as described in Section
   7.2.4 of [1], and 4 zeros MUST be padded to the end of the Null FP to
   make it 32-bit word aligned.

4.3  RTP header usage

   The format of the RTP header is specified in [2].  This payload
   format uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first FP in the packet.  The timestamp clock
   frequency is the same as the sampling frequency, so the timestamp
   unit is in samples.

   As defined by the ES 202 050 front-end codec, the duration of one FP
   is 20 ms, corresponding to 160, 220, or 320 encoded samples when a
   sampling rate of 8, 11, or 16 kHz is used at the front-end,
   respectively.  Thus, the timestamp is increased by 160, 220, or 320
   for each consecutive FP, respectively.

   The DSR payload for ES 202 050 front-end codecs is always an integral
   number of octets.  If additional padding is required for some other
   purpose, then the P bit in the RTP header may be set and padding
   appended as specified in [2].

   The RTP header marker bit (M) should be set following the general
   rules defined in [6].

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.

5.  DSR MIME Type Registration

   Media Type name: audio

   Media subtype name: dsr-es202050

   Required parameters: none

   Optional parameters for RTP mode:

      rate: Indicates the sample rate of the speech.  Valid values
         include: 8000, 11000, and 16000.  If this parameter is not
         present, a sample rate of 8000 is assumed.

      maxptime: The maximum amount of media which can be encapsulated in
         each packet, expressed as time in milliseconds.  The time shall
         be calculated as the sum of the time the media present in the
         packet represents.  The time SHOULD be a multiple of the frame
         pair size (i.e., one FP == 20 ms).

         If this parameter is not present, maxptime is assumed to be
         80 ms.

         Note that, since the performance of most speech recognizers is
         extremely sensitive to consecutive FP losses, if the user of
         the payload format expects a high packet loss ratio for the
         session, it MAY explicitly choose a maxptime value for the
         session that is shorter than the default value.

      ptime: see RFC2327 [5].

   Encoding considerations: This type is defined for transfer via RTP
      [2] as described in Sections 3 and 4 of RFC XXXX.

   Security considerations: See Section 6 of RFC XXXX.

   Person & email address to contact for further information:
      Qiaobing.Xie@motorola.com

   Intended usage: COMMON.  It is expected that many VoIP applications
      (as well as mobile applications) will use this type.

   Author/Change controller:

      *  Qiaobing.Xie@motorola.com

      *  IETF Audio/Video transport working group

5.1  Mapping MIME Parameters into SDP

   The information carried in the MIME media type specification has a
   specific mapping to fields in the Session Description Protocol (SDP)
   [5], which is commonly used to describe RTP sessions.  When SDP is
   used to specify sessions employing the ES 202 050 DSR codec, the
   mapping is as follows:

   o  The MIME type ("audio") goes in SDP "m=" as the media name.

   o  The MIME subtype ("dsr-es202050") goes in SDP "a=rtpmap" as the
      encoding name.

   o  The optional parameter "rate" also goes in "a=rtpmap" as clock
      rate.

   o  The optional parameters "ptime" and "maxptime" go in the SDP
      "a=ptime" and "a=maxptime" attributes, respectively.
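
   The mapping rules above can be sketched in code.  The following is a
   minimal illustration and not part of this payload format: the
   function name and its arguments are hypothetical, and the port and
   payload type values passed in at the end are arbitrary.  Rejecting a
   maxptime that is not a multiple of 20 ms is a choice made by this
   sketch; the registration itself only says SHOULD.

```python
from typing import List, Optional

VALID_RATES = (8000, 11000, 16000)  # values listed for the "rate" parameter
FP_MS = 20  # one FP covers 20 ms of speech


def dsr_sdp_lines(port: int, payload_type: int, rate: int = 8000,
                  maxptime: Optional[int] = None,
                  ptime: Optional[int] = None) -> List[str]:
    """Build the SDP media-level lines for one dsr-es202050 stream."""
    if rate not in VALID_RATES:
        raise ValueError(f"unsupported rate: {rate}")
    if maxptime is not None and (maxptime <= 0 or maxptime % FP_MS != 0):
        # Stricter than the text, which only says SHOULD.
        raise ValueError(f"maxptime should be a positive multiple of {FP_MS} ms")
    lines = [
        f"m=audio {port} RTP/AVP {payload_type}",         # MIME type
        f"a=rtpmap:{payload_type} dsr-es202050/{rate}",   # subtype and rate
    ]
    if ptime is not None:
        lines.append(f"a=ptime:{ptime}")
    if maxptime is not None:
        lines.append(f"a=maxptime:{maxptime}")
    return lines


print("\n".join(dsr_sdp_lines(49120, 101, rate=8000, maxptime=40)))
```

   Omitting "maxptime" leaves the attribute out of the description, in
   which case the default of 80 ms stated in Section 5 applies.
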

   Example of usage of ES 202 050 DSR:

      m=audio 49120 RTP/AVP 101
      a=rtpmap:101 dsr-es202050/8000
      a=maxptime:40

6.  Security Considerations

   Implementations using the payload defined in this specification are
   subject to the security considerations discussed in the RTP
   specification [2] and the RTP profile [6].  This payload does not
   specify any different security services.

7.  Acknowledgments

   The design presented here is based on that of [7].

Normative References

   [1]  European Telecommunications Standards Institute (ETSI) Standard
        ES 202 050, "Speech Processing, Transmission and Quality Aspects
        (STQ); Distributed Speech Recognition; Front-end Feature
        Extraction Algorithm; Compression Algorithms", Ver. 1.1.1
        (http://pda.etsi.org/pda/home.asp?wki_id=6402), October 2002.

   [2]  Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson,
        "RTP: A Transport Protocol for Real-Time Applications", RFC
        1889, January 1996.

   [3]  Bradner, S., "The Internet Standards Process -- Revision 3", BCP
        9, RFC 2026, October 1996.

   [4]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [5]  Handley, M. and V. Jacobson, "SDP: Session Description
        Protocol", RFC 2327, April 1998.

   [6]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video
        Conferences with Minimal Control",
        draft-ietf-avt-profile-new-12.txt (work in progress), November
        2001.

Informative References

   [7]  Xie, Q., "RTP Payload Format for ETSI ES 201 108 Distributed
        Speech Recognition Encoding", draft-ietf-avt-dsr-04 (work in
        progress), October 2002.

   [8]  European Telecommunications Standards Institute (ETSI) Standard
        ES 201 108, "Speech Processing, Transmission and Quality
        Aspects (STQ); Distributed Speech Recognition; Front-end
        Feature Extraction Algorithm; Compression Algorithms", Ver.
        1.1.2, http://webapp.etsi.org/pda/home.asp?wki_id=9948, April
        2000.

   [9]  Casner, S. and V. Jacobson, "Compressing IP/UDP/RTP Headers for
        Low-Speed Serial Links", RFC 2508, February 1999.

   [10] Bormann, C., Burmeister, C., Degermark, M., Fukushima, H.,
        Hannu, H., Jonsson, L-E., Hakenberg, R., Koren, T., Le, K.,
        Liu, Z., Martensson, A., Miyazaki, A., Svanbro, K., Wiebke, T.,
        Yoshimura, T. and H. Zheng, "RObust Header Compression (ROHC):
        Framework and four profiles: RTP, UDP, ESP, and uncompressed",
        RFC 3095, July 2001.

Authors' Addresses

   Qiaobing Xie
   Motorola, Inc.
   1501 W. Shure Drive, 2-F9
   Arlington Heights, IL  60004
   US

   Phone: +1-847-632-3028
   EMail: qxie1@email.mot.com

   David Pearce
   Motorola Labs
   UK Research Laboratory
   Jays Close
   Viables Industrial Estate
   Basingstoke, HANTS  RG22 4PD
   UK

   Phone: +44 (0)1256 484 436
   EMail: bdp003@motorola.com

Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.