idnits 2.17.1

draft-xie-avt-dsr-00.txt:

  ** The Abstract section seems to be numbered

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-25) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:

        This Internet-Draft is submitted in full conformance with the
        provisions of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:

        Copyright (c) 2024 IETF Trust and the persons identified as the
        document authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:

        This document is subject to BCP 78 and the IETF Trust's Legal
        Provisions Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with respect
        to this document.  Code Components extracted from this document must
        include Simplified BSD License text as described in Section 4.e of
        the Trust Legal Provisions and are provided without warranty as
        described in the Simplified BSD License.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 424 lines

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 13 instances of lines with control characters in the document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to
     grant the BCP78 rights to the IETF Trust, then this is fine, and you can
     ignore this comment.  If not, you may need to add the pre-RFC5378
     disclaimer.  (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 6, 2001) is 8329 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2026' is mentioned on line 18, but not defined

  == Missing Reference: 'RFC-2119' is mentioned on line 45, but not defined

  == Unused Reference: 'RFC2016' is defined on line 342, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2119' is defined on line 345, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ES201108'

  ** Obsolete normative reference: RFC 1889 (Obsoleted by RFC 3550)

  == Outdated reference: A later version (-13) exists of
     draft-ietf-avt-profile-new-08

  -- Possible downref: Normative reference to a draft: ref. 'RFC1890'


     Summary: 7 errors (**), 0 flaws (~~), 7 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Internet Engineering Task Force                   Q. Xie, Motorola
3    Audio Video Transport WG                          D. Pearce, Motorola
4    INTERNET-DRAFT                                    S. Balasuriya, Motorola
5                                                      Y. Kim, VerbalTek
6                                                      S. H. Maes, IBM
7                                                      Hari Garudadri, Qualcomm

9    Expires in six months                             July 6, 2001

11           RTP Payload Format for Distributed Speech Recognition
12

14   Status of this Memo

16      This document is an Internet-Draft and is in full conformance with
17      all provisions of Section 10 of [RFC2026].

19      Internet-Drafts are working documents of the Internet Engineering
20      Task Force (IETF), its areas, and its working groups.  Note that
21      other groups may also distribute working documents as Internet-
22      Drafts.  Internet-Drafts are draft documents valid for a maximum of
23      six months and may be updated, replaced, or obsoleted by other
24      documents at any time.
        It is inappropriate to use Internet-Drafts
25      as reference material or to cite them other than as "work in
26      progress."

28      The list of current Internet-Drafts can be accessed at
29      http://www.ietf.org/ietf/1id-abstracts.txt
30      The list of Internet-Draft Shadow Directories can be accessed at
31      http://www.ietf.org/shadow.html.

33   1. Abstract

35      This document specifies an RTP payload format for encapsulating
36      front-end signal processing feature streams for distributed speech
37      recognition (DSR) systems, with the ETSI Standard ES 201 108 front-end
38      being the default codec.

40   2. Conventions

42      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
43      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
44      this document are to be interpreted as described in [RFC-2119].

46   3. Introduction

48      Motivated by technology advances in the field of speech recognition,
49      voice interfaces to a variety of services (such as airline
50      information systems, unified messaging, and the like) are becoming
51      more and more prevalent.  In parallel, the popularity of mobile
52      computing and communications devices has also increased
53      dramatically.  However, the voice codecs typically employed in mobile
54      systems were designed to optimize audible voice quality and not
55      speech recognition accuracy, and using these codecs with speech
56      recognizers can result in poor recognition performance.  For systems
57      that can be accessed from multiple networks using multiple speech
58      codecs, recognition system designers are further challenged to
59      accommodate the characteristics of these differences in a robust
60      manner.  Channel errors and lost data packets in these networks result
61      in further degradation of the speech signal.

63      In traditional systems as described above, the entire speech
64      recognizer lies on the server appliance.  It is forced to use
65      incoming speech in whatever condition it arrives in after the
66      network decodes the vocoded speech.
        A solution that combats this problem
67      uses a scheme called "distributed speech recognition" (DSR).  In this
68      system, the remote device acts as a thin client in communication
69      with a speech recognition server, also called a speech engine (SE).  The
70      remote device processes the speech, then compresses and error-protects
71      the bitstream in a manner optimal for speech recognition.  The speech
72      engine then uses this representation directly, minimizing the signal
73      processing necessary and benefiting from enhanced error concealment.

75      To achieve interoperability with different client devices and speech
76      engines, a common format is needed.  Within the "Aurora" DSR working
77      group of the European Telecommunications Standards Institute (ETSI), a
78      payload has been defined and was published as a standard in February
79      2000 [ES201108].

81      For interactive voice user interface dialogues between a caller and a
82      voice service, low latency is also a high priority along with accurate
83      speech recognition.  While jitter in the speech recognizer input is not
84      particularly important, many issues related to speech interaction over
85      an IP-based connection are still relevant.  Therefore, it will be
86      desirable to use the DSR payload in an RTP-based session.

88   3.1 Typical Scenarios for Using DSR Payload Format

90      The following diagrams show some typical use scenarios of the DSR RTP
91      payload format.

93      +--------+                     +----------+
94      |IP USER |   IP/UDP/RTP/DSR    |IP SPEECH |
95      |TERMINAL|-------------------->|  ENGINE  |
96      |        |                     |          |
97      +--------+                     +----------+

99      +--------+   DSR over     +-------+                +----------+
100     | Non-IP |  circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
101     |  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
102     |TERMINAL|  ETSI payload  |       |                |          |
103     +--------+    format      +-------+                +----------+

105     +--------+                  +-------+    DSR over     +----------+
106     |IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
107     |TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
108     |        |                  |       |  ETSI payload   |  ENGINE  |
109     +--------+                  +-------+     format      +----------+

111     Figure 1: Typical Scenarios for Using DSR Payload Format.

113     For the different scenarios in Figure 1, the speech recognizer resides
114     in the speech engine, while a DSR front-end encoder inside the User
115     Terminal performs front-end speech processing and sends the resultant
116     data to the speech engine in the form of "frame-pairs" (FPs).  Each
117     frame-pair normally contains two sets of encoded speech vectors
118     representing 20ms of original speech.

120  4. DSR RTP Payload Format

122  4.1 Payload Header

124     Each DSR payload MUST begin with the following payload header of one
125     octet length:

127      0
128      0 1 2 3 4 5 6 7
129     +-+-+-+-+-+-+-+-+
130     |  FPC  |E|R|R|R|
131     +-+-+-+-+-+-+-+-+

133     Figure 2: Payload header.

135     FPC - Frame-Pair Count, indicating the number of frame-pairs (FPs)
136           included in this payload packet.

138     E   - End of speech segment flag.  When set to 1, it indicates that
139           the last frame-pair in this payload packet is the end of the
140           current speech segment.

142     R   - Reserved bits.  Must be set to 0 by the sender of the payload
143           and ignored by the receiver.

145  4.2 Payload Body

147     The DSR payload is formed by concatenating the above payload header
148     and FPC number of frame-pairs.
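
        The one-octet header of Figure 2 can be sketched in a few lines.
        This is a minimal illustration, not part of the specification: the
        function names pack_header and parse_header are hypothetical, and
        the sketch assumes the usual convention that bit 0 in the figure is
        the most significant bit of the octet, so that FPC occupies the
        upper four bits, E the next bit, and the three reserved bits the
        low end.

```python
from typing import Tuple


def pack_header(fpc: int, end_of_segment: bool) -> bytes:
    """Build the one-octet DSR payload header: FPC (4 bits), E (1 bit), R (3 bits)."""
    if not 0 <= fpc <= 15:
        raise ValueError("FPC must fit in 4 bits (0..15)")
    # Assumption: bit 0 of Figure 2 is the most significant bit of the octet.
    # The three reserved bits are left at 0, as required of the sender.
    octet = (fpc << 4) | (int(end_of_segment) << 3)
    return bytes([octet])


def parse_header(payload: bytes) -> Tuple[int, bool]:
    """Return (FPC, E) from the first octet of a DSR payload."""
    if not payload:
        raise ValueError("empty payload")
    octet = payload[0]
    fpc = octet >> 4            # upper four bits
    e_flag = bool(octet & 0x08)  # fifth bit from the top
    # The low three bits are reserved and deliberately ignored.
    return fpc, e_flag
```

        For example, pack_header(3, True) yields the single octet 0x38, and
        parse_header returns (3, True) for any payload starting with that
        octet, whatever the reserved bits contain.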

150     Each DSR payload MUST be octet-aligned at the end, i.e., if a DSR
151     payload does not end on an octet boundary, it MUST be padded at
152     the end with zeros to the next octet boundary.

154     The following example shows a DSR payload carrying 3 frame-pairs:

156      0                   1                   2                   3
157      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
158     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
159     | FPC=3 |E|0|0|0|                                               |
160     +-+-+-+-+-+-+-+-+                                               +
161     |                             FP #1                             |
162     +                                                       +-+-+-+-+
163     |                                                       |       |
164     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+       +
165     |                                                               |
166     +                                                               +
167     |                             FP #2                             |
168     +                                               +-+-+-+-+-+-+-+-+
169     |                                               |               |
170     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
171     |                                                               |
172     +                             FP #3                             +
173     |                                                               |
174     +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
175     |                       |0|0|0|0|
176     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

178     In this example, the payload is shown with 4 zeros padded at the end
179     to make it octet-aligned.

181     The number of FPs per payload packet should be determined by the
182     latency and bandwidth requirements of the DSR application.

184     Decreasing the number of FPs per payload packet reduces the bandwidth
185     efficiency due to the RTP header overhead, while increasing the number
186     of FPs per packet causes longer end-to-end delay and hence greater
187     recognition latency.

189     Furthermore, increasing the number of FPs per packet raises the
190     risk of losing a large number of consecutive frame-pairs, which is
191     a situation most speech recognizers have difficulty dealing
192     with.

194     Therefore, it is RECOMMENDED that the number of FPs per DSR
195     payload packet be minimized, subject to meeting the application's
196     requirements on network bandwidth efficiency.

198     RTP header compression [RFC2508] SHOULD be considered to improve
199     network bandwidth efficiency.

201  5. Frame-pair Format

203     Depending on the type of the DSR front-end encoder to be used in the
204     present DSR RTP session, the frame-pair format may be different.
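
        The octet-alignment rule and the packetization trade-off of Section
        4.2 can be made concrete with a little arithmetic.  The sketch below
        is illustrative only.  The function names are hypothetical, the
        frame-pair length is a parameter because it depends on the front-end
        type (92 bits is the value given in Section 5.1 for the ES 201 108
        front-end), and the 12-octet figure assumes a fixed RTP header with
        no CSRC entries or header extension and ignores IP and UDP overhead.

```python
from typing import Tuple


def payload_layout(fpc: int, fp_bits: int = 92) -> Tuple[int, int]:
    """Return (payload length in octets, zero-padding bits) for one DSR payload."""
    total_bits = 8 + fpc * fp_bits      # one header octet plus FPC frame-pairs
    padding_bits = (-total_bits) % 8    # zeros needed to reach an octet boundary
    return (total_bits + padding_bits) // 8, padding_bits


def speech_ms(fpc: int) -> int:
    """Speech duration carried by one payload, at 20 ms per frame-pair."""
    return fpc * 20


def rtp_header_share(fpc: int, fp_bits: int = 92, rtp_header_octets: int = 12) -> float:
    """Fraction of the RTP packet taken by the RTP header (assumed 12 octets)."""
    payload_octets, _ = payload_layout(fpc, fp_bits)
    return rtp_header_octets / (rtp_header_octets + payload_octets)


for n in (1, 2, 3):
    octets, pad = payload_layout(n)
    print(n, octets, pad, speech_ms(n), round(rtp_header_share(n), 2))
```

        With three 92-bit frame-pairs the payload holds 8 + 276 = 284 bits,
        so 4 padding bits bring it to 36 octets, which agrees with the four
        zeros in the example above.  Under the stated assumptions the RTP
        header share falls from 0.48 at one frame-pair per packet to 0.25 at
        three, while the speech carried per packet grows from 20 ms to 60 ms.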

206     When setting up a DSR RTP session, the user terminal will inform the
207     speech engine of the type of the front-end encoder, using the
208     front-end-type MIME parameter as defined in Section 6.

210     In this memo, we only define the frame-pair format that MUST be used
211     when the ETSI ES 201 108 Front-end Codec [ES201108] is used.
212     Frame-pair formats for future DSR front-end codecs may be defined in
213     separate IETF documents.

215  5.1. Frame-Pair Format For ETSI ES 201 108 Front-end Codec

217     The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal
218     processing front-end and compression scheme for speech input to a
219     speech recognition system.  Some relevant characteristics of this ETSI
220     DSR front-end codec are summarized below.

222     The coding algorithm, a standard mel-cepstral technique common to many
223     speech recognition systems, supports three raw sampling rates: 8 kHz,
224     11 kHz, and 16 kHz.  The mel-cepstral calculation is a frame-based
225     scheme that produces an output vector every 10 ms.

227     After calculation of the mel-cepstral representation, the
228     representation is quantized via split-vector quantization to reduce
229     the data rate of the encoded stream.  This is a lossy compression, with
230     the output being a frame containing an integer representation of the
231     encoded speech.

233     For ES 201 108 Front-end Codec, the following mel-cepstral frame MUST
234     be used, as defined in [ES201108]:

236      0                   1                   2                   3
237      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
238     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
239     | idx(0,1)  | idx(2,3)  | idx(4,5)  | idx(6,7)  | idx(8,9)  |idx
240     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
241     (10,11) |  idx(12,13)   |
242     +-+-+-+-+-+-+-+-+-+-+-+-+

244     The length of a frame is 44 bits representing 10ms of voice.
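
        As a rough illustration of how seven codebook indices add up to a
        44-bit frame, the sketch below packs them in the order shown in the
        figure.  The field widths (six 6-bit indices followed by one 8-bit
        index) are read off the bit ruler of the figure, and the
        most-significant-bit-first ordering is an assumption made for this
        example.  The authoritative bit ordering is the one defined in
        [ES201108], and the names FIELD_WIDTHS, pack_frame and unpack_frame
        are hypothetical.

```python
from typing import Sequence, Tuple

# Widths in bits of idx(0,1), idx(2,3), ..., idx(10,11), idx(12,13),
# as read from the frame figure (assumption: 6+6+6+6+6+6+8 = 44).
FIELD_WIDTHS = (6, 6, 6, 6, 6, 6, 8)


def pack_frame(indices: Sequence[int]) -> int:
    """Pack seven quantizer indices into one 44-bit integer, first field highest."""
    if len(indices) != len(FIELD_WIDTHS):
        raise ValueError("expected exactly 7 indices")
    value = 0
    for idx, width in zip(indices, FIELD_WIDTHS):
        if not 0 <= idx < (1 << width):
            raise ValueError("index does not fit in its field")
        value = (value << width) | idx
    return value


def unpack_frame(value: int) -> Tuple[int, ...]:
    """Inverse of pack_frame: split a 44-bit integer back into seven indices."""
    fields = []
    for width in reversed(FIELD_WIDTHS):
        fields.append(value & ((1 << width) - 1))
        value >>= width
    return tuple(reversed(fields))
```

        At one such frame every 10 ms, the quantized features alone amount
        to 4400 bit/s before the CRC and packet headers are added.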

246     As defined in [ES201108], pairs of the quantized 10ms mel-cepstral
247     frames MUST be grouped together and protected with a 4-bit CRC,
248     forming a 92-bit long frame-pair:

250      0                   1                   2                   3
251      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
252     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
253     |                      Frame #1 (44 bits)                       |
254     +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
255     |                       |          Frame #2 (44 bits)           |
256     +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
257     |                                               |  CRC  |
258     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

260     Therefore, each frame-pair represents 20ms of original speech.

262     The 4-bit CRC MUST be calculated using the formula defined in
263     Section 6.2.4 of [ES201108].

265  6. DSR MIME Type Registration

267     Media Type name: audio

269     Media subtype name: DSR

271     Required parameters: none

273     Optional parameters for RTP mode:

275     sample-rate: Indicates the sample rate of the speech.  Valid values
276                  include: 8k, 11k, and 16k.

278                  If this parameter is not present, 8k sample rate is
279                  assumed.

281     front-end-type: Indicates the type of the front-end codec to be used
282                  for this DSR session.  Valid values are:

284                  etsi_mfcc - indicates that ETSI ES 201 108 Front-end
285                         Codec as defined in [ES201108] will be used.

287                  unspecified - indicates that another front-end codec
288                         will be used.

290                  If this parameter is absent, ETSI ES 201 108
291                  Front-end will be assumed.

293     maxptime: The maximum amount of media which can be encapsulated in
294               each packet, expressed as time in milliseconds.  The time
295               shall be calculated as the sum of the time the media
296               present in the packet represents.  The time SHOULD be a
297               multiple of the frame-pair size (i.e., one FP <-> 20ms).

299               If this parameter is not present, maxptime will be assumed
300               to be 60ms.

302     Encoding considerations :

304     Security considerations :

306     Interoperability considerations :

308     Person & email address to contact for further information:

310     Intended usage: COMMON.
        It is expected that many VoIP applications
311     (as well as mobile applications) will use this type.

313     Author/Change controller:
314
315        IETF Audio/Video transport working group

317  7. Security Considerations

319     Implementations using the payload defined in this specification are
320     subject to the security considerations discussed in the RTP
321     specification [RFC1889] and the RTP profile [RFC1890].  This payload
322     does not specify any different security services.

324  8. References

326     [ES201108] European Telecommunications Standards Institute (ETSI)
327        Standard ES 201 108, "Speech Processing, Transmission and Quality
328        Aspects (STQ); Distributed Speech Recognition; Front-end Feature
329        Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April
330        11, 2000.  http://webapp.etsi.org/pda/home.asp?wki_id=9948

332     [RFC1889] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson,
333        "RTP: A transport protocol for real-time applications," Internet
334        Draft, Internet Engineering Task Force, Feb. 1999, Work in progress,
335        revision to RFC 1889.

337     [RFC1890] H. Schulzrinne and S. Casner, "RTP Profile for Audio and
338        Video Conferences with Minimal Control," Internet Draft
339        draft-ietf-avt-profile-new-08.txt, Work in Progress, January 14,
340        2000, revision to RFC 1890.

342     [RFC2016] Bradner, S., "The Internet Standards Process -- Revision 3",
343        BCP 9, RFC 2026, October 1996.

345     [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
346        Requirement Levels", BCP 14, RFC 2119, March 1997.

348     [RFC2508] S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers
349        for Low-Speed Serial Links," RFC 2508, February 1999.

351  9. Acknowledgments

353     The design presented here benefits greatly from earlier work on DSR
354     RTP payload design by Jeff Meunier.

356  10. Authors' Addresses

358     Qiaobing Xie                    Tel: +1-847-632-3028
359     Motorola, Inc.                  EMail: qxie1@email.mot.com
360     1501 W.
        Shure Drive, 2-F9
361     Arlington Heights, IL 60004, USA

363     David Pearce                    Tel: +44 (0)1256 484 436
364     Motorola Labs                   EMail: bdp003@motorola.com
365     UK Research Laboratory
366     Jays Close
367     Viables Industrial Estate
368     Basingstoke, HANTS, RG22 4PD

370     Senaka Balasuriya               Tel: +1-630-353-8347
371     Motorola, Inc.                  EMail: Senaka.Balasuriya@motorola.com
372     1411 Opus Place, Suite 350
373     Downers Grove, IL 60515, USA

375     Yoon Kim                        Tel: +1-408-768-4974
376     VerbalTek, Inc.                 EMail: yoonie@verbaltek.com
377     2921 Copper Rd.
378     Santa Clara, CA 95051

380     Stephane H. Maes                Tel: +1-914-945-2908
381     IBM                             EMail: smaes@us.ibm.com
382     TJ Watson Research Center
383     P.O. Box 218,
384     Yorktown Heights, NY 10598, USA.

386     Hari Garudadri                  Tel:
387     Qualcomm                        EMail: hgarudad@qualcomm.com

389  This Internet Draft expires in 6 months from July 2001.