Internet Engineering Task Force                         Q. Xie, Motorola
Audio Video Transport WG                             D. Pearce, Motorola
INTERNET-DRAFT                                   S. Balasuriya, Motorola
                                                       Y. Kim, VerbalTek
                                                         S. H. Maes, IBM
                                                Hari Garudadri, Qualcomm
Expires in six months                                      Nov. 12, 2001

         RTP Payload Format for Distributed Speech Recognition

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC2026].

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

1. Abstract

This document specifies an RTP payload format for encapsulating front-end signal processing feature streams for distributed speech recognition (DSR) systems, with the ETSI Standard ES 201 108 front-end being the default codec.

2. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. Introduction

Motivated by technology advances in the field of speech recognition, voice interfaces to a variety of services (such as airline information systems, unified messaging, and the like) are becoming more and more prevalent. In parallel, the popularity of mobile computing and communications devices has also increased dramatically.
However, the voice codecs typically employed in mobile systems were designed to optimize audible voice quality and not speech recognition accuracy, and using these codecs with speech recognizers can result in poor recognition performance. For systems that can be accessed from multiple networks using multiple speech codecs, recognition system designers are further challenged to accommodate these differing characteristics in a robust manner. Channel errors and lost data packets in these networks result in further degradation of the speech signal.

In traditional systems as described above, the entire speech recognizer lies on the server appliance. It is forced to use incoming speech in whatever condition it arrives in after the network decodes the vocoded speech. A solution that combats this uses a scheme called "distributed speech recognition" (DSR). In this system, the remote device acts as a thin client in communication with a speech recognition server, also called a speech engine (SE). The remote device processes the speech, compresses it, and error-protects the bitstream in a manner optimal for speech recognition. The speech engine then uses this representation directly, minimizing the signal processing necessary and benefiting from enhanced error concealment.

To achieve interoperability with different client devices and speech engines, a common format is needed. Within the "Aurora" DSR working group of the European Telecommunications Standards Institute (ETSI), a payload has been defined and was published as a standard in February 2000 [ES201108].

For interactive voice user interface dialogues between a caller and a voice service, low latency is also a high priority along with accurate speech recognition. While jitter in the speech recognizer input is not particularly important, many issues related to speech interaction over an IP-based connection are still relevant. Therefore, it is desirable to use the DSR payload in an RTP-based session.

3.1 Typical Scenarios for Using DSR Payload Format

The following diagrams show some typical use scenarios of the DSR RTP payload format.

   +--------+                     +----------+
   |IP USER |   IP/UDP/RTP/DSR    |IP SPEECH |
   |TERMINAL|-------------------->|  ENGINE  |
   |        |                     |          |
   +--------+                     +----------+

   +--------+  DSR over      +-------+                +----------+
   | Non-IP |  Circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
   |  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
   |TERMINAL|  ETSI payload  |       |                |          |
   +--------+  format        +-------+                +----------+

   +--------+                  +-------+  DSR over       +----------+
   |IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
   |TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
   |        |                  |       |  ETSI payload   |  ENGINE  |
   +--------+                  +-------+  format         +----------+

   Figure 1: Typical Scenarios for Using DSR Payload Format.

For the different scenarios in Figure 1, the speech recognizer resides in the speech engine, while a DSR front-end encoder inside the User Terminal performs front-end speech processing and sends the resultant data to the speech engine in the form of "frame-pairs" (FPs). Each frame-pair normally contains two sets of encoded speech vectors representing 20ms of original speech.

4. DSR RTP Payload Format

The DSR RTP payload is formed by concatenating a series of DSR frame-pairs. The format of the DSR frame-pair may vary from one front-end type to another; see Section 5 for details.

Each DSR payload MUST be octet-aligned at the end, i.e., if a DSR payload does not end on an octet boundary, it MUST be padded at the end with zeros to the next octet boundary.
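
The octet-alignment rule above can be illustrated with a minimal Python sketch. It is not part of the payload definition: the function name pack_frame_pairs, the representation of each frame-pair as an integer, and the most-significant-bit-first ordering are assumptions made only for illustration.

```python
FP_BITS = 92  # frame-pair length used by the ES 201 108 front-end


def pack_frame_pairs(frame_pairs, fp_bits=FP_BITS):
    """Concatenate frame-pairs and zero-pad the result to an octet boundary.

    Assumption (not stated by the draft): each frame-pair is passed as a
    non-negative integer of fp_bits bits and is written MSB first.
    Returns (payload_bytes, number_of_padding_bits).
    """
    acc = 0
    for fp in frame_pairs:
        if not 0 <= fp < (1 << fp_bits):
            raise ValueError("frame-pair does not fit in %d bits" % fp_bits)
        # Append this frame-pair after the bits accumulated so far.
        acc = (acc << fp_bits) | fp

    total_bits = fp_bits * len(frame_pairs)
    pad_bits = -total_bits % 8      # zeros needed to reach the next octet
    acc <<= pad_bits                # padding goes at the end of the payload
    return acc.to_bytes((total_bits + pad_bits) // 8, "big"), pad_bits


if __name__ == "__main__":
    payload, pad = pack_frame_pairs([(1 << FP_BITS) - 1] * 3)
    print(len(payload), pad)
```

With three 92-bit frame-pairs the payload carries 276 bits, so 4 zero bits are appended and the payload occupies 35 octets.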

The following example shows a DSR payload carrying three 92-bit frame-pairs:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                        FP #1 (92 bits)                        |
   +                                                       +-+-+-+-+
   |                                                       |       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+       +
   |                                                               |
   +                                                               +
   |                        FP #2 (92 bits)                        |
   +                                               +-+-+-+-+-+-+-+-+
   |                                               |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               +
   |                                                               |
   +                        FP #3 (92 bits)                        +
   |                                                               |
   +                                       +-+-+-+-+-+-+-+-+-+-+-+-+
   |                                       |0|0|0|0|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

In this example, the payload is shown with 4 zeros padded at the end to make it octet-aligned.

The number of FPs per payload packet should be determined by the latency and bandwidth requirements of the DSR application. Fewer FPs per payload packet reduce bandwidth efficiency because of the RTP header overhead, while more FPs per packet cause longer end-to-end delay and hence greater recognition latency. Furthermore, more FPs per packet raise the risk of losing a large number of consecutive frame-pairs, a situation that most speech recognizers have difficulty dealing with.

Therefore, it is RECOMMENDED that the number of FPs per DSR payload packet be minimized, subject to meeting the application's requirements on network bandwidth efficiency. RTP header compression [RFC2508] SHOULD be considered to improve network bandwidth efficiency.

4.1. Support for Discontinuous Transmission

The DSR RTP payloads may be used for discontinuous transmission so that DSR FPs are only sent when speech has been detected at the terminal equipment.

A contiguous segment of DSR frames to be transmitted from the terminal to the server is called a Transmission Segment. A DSR frame inside a transmission segment can be either a speech frame or a non-speech frame, depending on the nature of the section of the speech signal it represents.

The end of a transmission segment is determined at the sending end equipment when the run of consecutive non-speech frames exceeds the hangover time. A typical value used for the hangover time is 1.5 seconds.

After all FPs in a transmission segment are sent, the front-end SHOULD indicate the end of the current transmission segment by sending one or more Null FPs.

5. Frame-pair Format

Depending on the type of the DSR front-end encoder to be used in the present DSR RTP session, the frame-pair format may be different. When setting up a DSR RTP session, the user terminal will inform the speech engine of the type of the front-end encoder, using the front-end-type MIME parameter as defined in Section 6.

In this memo, we only define the frame-pair formats that MUST be used when the ETSI ES 201 108 Front-end Codec [ES201108] is used. Frame-pair formats for future DSR front-end codecs may be defined in separate IETF documents.

5.1. Frame-Pair Formats For ETSI ES 201 108 Front-end Codec

The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal processing front-end and compression scheme for speech input to a speech recognition system. Some relevant characteristics of this ETSI DSR front-end codec are summarized below.

The coding algorithm, a standard mel-cepstral technique common to many speech recognition systems, supports three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based scheme that produces an output vector every 10 ms.

After calculation of the mel-cepstral representation, the representation is quantized via split-vector quantization to reduce the data rate of the encoded stream. This is a lossy compression, with the output being a frame containing an integer representation of the encoded speech.

5.1.1 Format of Speech and Non-speech FPs

For the ES 201 108 Front-end Codec, the following mel-cepstral frame MUST be used, as defined in [ES201108]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | idx(0,1)  | idx(2,3)  | idx(4,5)  | idx(6,7)  | idx(8,9)  |idx
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    (10,11)|  idx(12,13)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+

The length of a frame is 44 bits, representing 10ms of voice.

As defined in [ES201108], pairs of the quantized 10ms mel-cepstral frames MUST be grouped together and protected with a 4-bit CRC, forming a 92-bit long frame-pair:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Frame #1 (44 bits)                       |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |          Frame #2 (44 bits)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
   |                                               |  CRC  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Therefore, each frame-pair represents 20ms of original speech. The 4-bit CRC MUST be calculated using the formula defined in Section 6.2.4 of [ES201108].

5.1.2 Format of Null FPs

A Null FP MUST be defined by setting the content of the first and second frame in the FP to null (i.e., filling the first 88 bits of the FP with 0's). The 4-bit CRC MUST be calculated the same way as described for speech FPs in 5.1.1.

6. DSR MIME Type Registration

Media Type name: audio

Media subtype name: DSR

Required parameters: none

Optional parameters for RTP mode:

   sample-rate: Indicating the sample rate of the speech. Valid values include: 8k, 11k, and 16k. If this parameter is not present, 8k sample rate is assumed.

   front-end-type: Indicating the type of the front-end codec to be used for this DSR session. Valid values are:

      etsi_mfcc - indicates that ETSI ES 201 108 Front-end Codec as defined in [ES201108] will be used.

      unspecified - indicates that another front-end codec will be used.
   If this parameter is absent, ETSI ES 201 108 Front-end will be assumed.

   maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time shall be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the frame-pair size (i.e., one FP <-> 20ms). If this parameter is not present, maxptime will be assumed to be 60ms.

Encoding considerations :

Security considerations :

Interoperability considerations :

Person & email address to contact for further information:

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Author/Change controller: IETF Audio/Video transport working group

7. Security Considerations

Implementations using the payload defined in this specification are subject to the security considerations discussed in the RTP specification [RFC1889] and the RTP profile [RFC1890]. This payload does not specify any different security services.

8. References

[ES201108] European Telecommunications Standards Institute (ETSI) Standard ES 201 108, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April 11, 2000. http://webapp.etsi.org/pda/home.asp?wki_id=9948

[RFC1889] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," Internet Draft, Internet Engineering Task Force, Feb. 1999, Work in progress, revision to RFC 1889.

[RFC1890] H. Schulzrinne and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control," Internet Draft draft-ietf-avt-profile-new-08.txt, Work in Progress, January 14, 2000, revision to RFC 1890.

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2508] S. Casner and V. Jacobson, "Compressing IP/UDP/RTP Headers for Low-Speed Serial Links," RFC 2508, February 1999.

9. Acknowledgments

The design presented here benefits greatly from an earlier work on DSR RTP payload design by Jeff Meunier.

10. Authors' Addresses

Qiaobing Xie                        Tel: +1-847-632-3028
Motorola, Inc.                      EMail: qxie1@email.mot.com
1501 W. Shure Drive, 2-F9
Arlington Heights, IL 60004, USA

David Pearce                        Tel: +44 (0)1256 484 436
Motorola Labs                       EMail: bdp003@motorola.com
UK Research Laboratory
Jays Close
Viables Industrial Estate
Basingstoke, HANTS, RG22 4PD

Senaka Balasuriya                   Tel: +1-630-353-8347
Motorola, Inc.                      EMail: Senaka.Balasuriya@motorola.com
1411 Opus Place, Suite 350
Downers Grove, IL 60515, USA

Yoon Kim                            Tel: +1-408-768-4974
VerbalTek, Inc.                     EMail: yoonie@verbaltek.com
2921 Copper Rd.
Santa Clara, CA 95051

Stephane H. Maes                    Tel: +1-914-945-2908
IBM                                 EMail: smaes@us.ibm.com
TJ Watson Research Center
P.O. Box 218,
Yorktown Heights, NY 10598, USA.

Hari Garudadri                      Tel:
Qualcomm                            EMail: hgarudad@qualcomm.com

This Internet Draft expires in 6 months from Nov. 2001.