SPEECHSC                                                   S. Shanmugham
Internet-Draft                                       Cisco Systems, Inc.
Expires: April 26, 2006                                 October 23, 2005

Media Resource Control Protocol Version 2 (MRCPv2)
draft-ietf-speechsc-mrcpv2-08

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on April 26, 2006.

Copyright Notice

Copyright (C) The Internet Society (2005).

Abstract

The MRCPv2 protocol allows client hosts to control media service resources such as speech synthesizers, recognizers, verifiers and identifiers residing in servers on the network. MRCPv2 is not a "stand-alone" protocol - it relies on a session management protocol such as the Session Initiation Protocol (SIP) to establish the MRCPv2 control session between the client and the server, and for rendezvous and capability discovery. It also depends on SIP and SDP to establish the media sessions and associated parameters between the media source or sink and the media server.
Once this is done, the MRCPv2 protocol exchange operates over the control session established above, allowing the client to control the media processing resources on the speech resource server.

Table of Contents

1. Introduction
2. Document Conventions
  2.1. Definitions
3. Architecture
  3.1. MRCPv2 Media Resource Types
  3.2. Server and Resource Addressing
4. MRCPv2 Protocol Basics
  4.1. Connecting to the Server
  4.2. Managing Resource Control Channels
  4.3. Media Streams and RTP Ports
  4.4. MRCPv2 Message Transport
5. MRCPv2 Specification
  5.1. Common Protocol Elements
  5.2. Request
  5.3. Response
  5.4. Status Codes
  5.5. Events
6. MRCPv2 Generic Methods and Headers
  6.1. Generic Methods
    6.1.1. SET-PARAMS
    6.1.2. GET-PARAMS
  6.2. Generic Message Headers
    6.2.1. Channel-Identifier
    6.2.2. Active-Request-Id-List
    6.2.3. Proxy-Sync-Id
    6.2.4. Accept-Charset
    6.2.5. Content-Type
    6.2.6. Content-ID
    6.2.7. Content-Base
    6.2.8. Content-Encoding
    6.2.9. Content-Location
    6.2.10. Content-Length
    6.2.11. Cache-Control
    6.2.12. Logging-Tag
    6.2.13. Set-Cookie and Set-Cookie2
    6.2.14. Vendor Specific Parameters
7. Resource Discovery
8. Speech Synthesizer Resource
  8.1. Synthesizer State Machine
  8.2. Synthesizer Methods
  8.3. Synthesizer Events
  8.4. Synthesizer Header Fields
    8.4.1. Jump-Size
    8.4.2. Kill-On-Barge-In
    8.4.3. Speaker Profile
    8.4.4. Completion Cause
    8.4.5. Completion Reason
    8.4.6. Voice-Parameters
    8.4.7. Prosody-Parameters
    8.4.8. Speech Marker
    8.4.9. Speech Language
    8.4.10. Fetch Hint
    8.4.11. Audio Fetch Hint
    8.4.12. Fetch Timeout
    8.4.13. Failed URI
    8.4.14. Failed URI Cause
    8.4.15. Speak Restart
    8.4.16. Speak Length
    8.4.17. Load-Lexicon
    8.4.18. Lexicon-Search-Order
  8.5. Synthesizer Message Body
    8.5.1. Synthesizer Speech Data
    8.5.2. Lexicon Data
  8.6. SPEAK Method
  8.7. STOP
  8.8. BARGE-IN-OCCURED
  8.9. PAUSE
  8.10. RESUME
  8.11. CONTROL
  8.12. SPEAK-COMPLETE
  8.13. SPEECH-MARKER
  8.14. DEFINE-LEXICON
9. Speech Recognizer Resource
  9.1. Recognizer State Machine
  9.2. Recognizer Methods
  9.3. Recognizer Events
  9.4. Recognizer Header Fields
    9.4.1. Confidence Threshold
    9.4.2. Sensitivity Level
    9.4.3. Speed Vs Accuracy
    9.4.4. N Best List Length
    9.4.5. Input Type
    9.4.6. No Input Timeout
    9.4.7. Recognition Timeout
    9.4.8. Waveform URI
    9.4.9. Media Type
    9.4.10. Input-Waveform-URI
    9.4.11. Completion Cause
    9.4.12. Completion Reason
    9.4.13. Recognizer Context Block
    9.4.14. Start Input Timers
    9.4.15. Speech Complete Timeout
    9.4.16. Speech Incomplete Timeout
    9.4.17. DTMF Interdigit Timeout
    9.4.18. DTMF Term Timeout
    9.4.19. DTMF-Term-Char
    9.4.20. Fetch Timeout
    9.4.21. Failed URI
    9.4.22. Failed URI Cause
    9.4.23. Save Waveform
    9.4.24. New Audio Channel
    9.4.25. Speech-Language
    9.4.26. Ver-Buffer-Utterance
    9.4.27. Recognition-Mode
    9.4.28. Cancel-If-Queue
    9.4.29. Hotword-Max-Duration
    9.4.30. Hotword-Min-Duration
    9.4.31. Interpret-Text
    9.4.32. DTMF-Buffer-Time
    9.4.33. Clear-DTMF-Buffer
    9.4.34. Num-Min-Consistent-Pronunciations
    9.4.35. Consistency-Threshold
    9.4.36. Clash-Threshold
    9.4.37. Personal-Grammar-URI
    9.4.38. Enroll-Utterance
    9.4.39. Phrase-Id
    9.4.40. Phrase-NL
    9.4.41. Weight
    9.4.42. Save-Best-Waveform
    9.4.43. New-Phrase-Id
    9.4.44. Confusable-Phrases-URI
    9.4.45. Abort-Phrase-Enrollment
  9.5. Recognizer Message Body
    9.5.1. Recognizer Grammar Data
    9.5.2. Recognizer Result Data
    9.5.3. Enrollment Result Data
    9.5.4. Recognizer Context Block
  9.6. Natural Language Semantic Markup Language
    9.6.1. Markup Functions
    9.6.2. Overview of NLSML Elements and their Relationships
    9.6.3. Elements and Attributes
  9.7. Enrollment Results
    9.7.1. Num-Clashes
    9.7.2. Num-Good-Repetitions
    9.7.3. Num-Repetitions-Still-Needed
    9.7.4. Consistency-Status
    9.7.5. Clash-Phrase-Ids
    9.7.6. Transcriptions
    9.7.7. Confusable-Phrases
  9.8. DEFINE-GRAMMAR
  9.9. RECOGNIZE
  9.10. STOP
  9.11. GET-RESULT
  9.12. START-OF-INPUT
  9.13. INPUT-TIMERS
  9.14. RECOGNITION-COMPLETE
  9.15. START-PHRASE-ENROLLMENT
  9.16. ENROLLMENT-ROLLBACK
  9.17. END-PHRASE-ENROLLMENT
  9.18. MODIFY-PHRASE
  9.19. DELETE-PHRASE
  9.20. INTERPRET
  9.21. INTERPRETATION-COMPLETE
  9.22. DTMF Detection
10. Recorder Resource
  10.1. Recorder State Machine
  10.2. Recorder Methods
  10.3. Recorder Events
  10.4. Recorder Header Fields
    10.4.1. Sensitivity Level
    10.4.2. No Input Timeout
    10.4.3. Completion Cause
    10.4.4. Completion Reason
    10.4.5. Failed URI
    10.4.6. Failed URI Cause
    10.4.7. Record URI
    10.4.8. Media Type
    10.4.9. Max Time
    10.4.10. Trim-Length
    10.4.11. Final Silence
    10.4.12. Capture On Speech
    10.4.13. Ver-Buffer-Utterance
    10.4.14. Start Input Timers
    10.4.15. New Audio Channel
  10.5. Recorder Message Body
  10.6. RECORD
  10.7. STOP
  10.8. RECORD-COMPLETE
  10.9. START-INPUT-TIMERS
  10.10. START-OF-INPUT
11. Speaker Verification and Identification
  11.1. Speaker Verification State Machine
  11.2. Speaker Verification Methods
  11.3. Verification Events
  11.4. Verification Header Fields
    11.4.1. Repository-URI
    11.4.2. Voiceprint-Identifier
    11.4.3. Verification-Mode
    11.4.4. Adapt-Model
    11.4.5. Abort-Model
    11.4.6. Min-Verification-Score
    11.4.7. Num-Min-Verification-Phrases
    11.4.8. Num-Max-Verification-Phrases
    11.4.9. No-Input-Timeout
    11.4.10. Save-Waveform
    11.4.11. Media Type
    11.4.12. Waveform-URI
    11.4.13. Voiceprint-Exists
    11.4.14. Ver-Buffer-Utterance
    11.4.15. Input-Waveform-Uri
    11.4.16. Completion-Cause
    11.4.17. Completion Reason
    11.4.18. Speech Complete Timeout
    11.4.19. New Audio Channel
    11.4.20. Abort-Verification
    11.4.21. Start Input Timers
  11.5. Verification Result Elements
    11.5.1. VoicePrint
    11.5.2. Cumulative
    11.5.3. Incremental
    11.5.4. Decision
    11.5.5. Utterance-Length
    11.5.6. Device
    11.5.7. Gender
    11.5.8. Adapted
    11.5.9. Verification-Score
    11.5.10. Vendor-Specific-Results
  11.6. START-SESSION
  11.7. END-SESSION
  11.8. QUERY-VOICEPRINT
  11.9. DELETE-VOICEPRINT
  11.10. VERIFY
  11.11. VERIFY-FROM-BUFFER
  11.12. VERIFY-ROLLBACK
  11.13. STOP
  11.14. START-INPUT-TIMERS
  11.15. VERIFICATION-COMPLETE
  11.16. START-OF-INPUT
  11.17. CLEAR-BUFFER
  11.18. GET-INTERMEDIATE-RESULT
12. Security Considerations
  12.1. Rendezvous and Session Establishment
  12.2. Control channel protection
  12.3. Media session protection
  12.4. Indirect Content Access
  12.5. Protection of stored media
13. IANA Considerations
  13.1. New registries
    13.1.1. MRCPv2 resource types
    13.1.2. MRCPv2 methods and events
    13.1.3. MRCPv2 headers
    13.1.4. MRCPv2 status codes
    13.1.5. Grammar Reference List Parameters
    13.1.6. MRCPv2 vendor-specific parameters
  13.2. NLSML-related registrations
    13.2.1. application/nlsml+xml MIME type registration
  13.3. NLSML XML DTD registration
  13.4. NLSML XML Schema registration
  13.5. NLSML XML Name space registration
  13.6. text/grammar-ref-list Mime Type Registration
  13.7. session URL scheme registration
  13.8. SDP parameter registrations
14. Examples
  14.1. Message Flow
  14.2. Recognition Result Examples
    14.2.1. Simple ASR Ambiguity
    14.2.2. Mixed Initiative
    14.2.3. DTMF Input
    14.2.4. Interpreting Meta-Dialog and Meta-Task Utterances
    14.2.5. Anaphora and Deixis
    14.2.6. Distinguishing Individual Items from Sets with One Member
    14.2.7. Extensibility
15. ABNF Normative Definition
16. XML Schemas
  16.1. NLSML Schema Definition
  16.2. Enrollment Results Schema Definition
  16.3. Verification Results Schema Definition
17. References
  17.1. Normative References
  17.2. Informative References
Appendix A. Contributors
Appendix B. Acknowledgements
Author's Address
Intellectual Property and Copyright Statements

1. Introduction

The MRCPv2 protocol is designed for a client device to control media processing resources on the network. Some of these media processing resources include speech recognition engines, speech synthesis engines, speaker verification and speaker identification engines. MRCPv2 enables the implementation of distributed Interactive Voice Response platforms using VoiceXML [12] browsers or other client applications while maintaining separate back-end speech processing capabilities on specialized speech processing servers. MRCPv2 is based on the earlier Media Resource Control Protocol (MRCP) [30] developed jointly by Cisco Systems, Inc., Nuance Communications, and Speechworks Inc.

The protocol requirements of SPEECHSC [1] dictate that the solution be capable of reaching a media processing server and setting up communication channels to the media resources, to send/receive control messages and media streams to/from the server. The Session Initiation Protocol (SIP) [3] meets these requirements. MRCPv2 is therefore designed to leverage and build upon SIP and the Session Description Protocol (SDP) [4]. MRCPv2 uses SIP to set up and tear down media and control sessions with the server.
In addition, the client can use the SIP re-INVITE method to change the characteristics of these media and control sessions while maintaining the SIP dialog between the client and server. SDP is used to describe the parameters of the media sessions associated with that dialog. It is mandatory to support SIP as the session establishment protocol to ensure interoperability. Other protocols can be used for session establishment by prior agreement.

MRCPv2 uses SIP and SDP to create the client/server dialog and set up the media channels to the server. It also uses SIP and SDP to establish MRCPv2 control sessions between the client and the server for each media processing resource required for that dialog. The MRCPv2 protocol exchange between the client and the media resource is carried on that control session. MRCPv2 protocol exchanges do not change the state of the SIP dialog, the media sessions, or other parameters of the dialog that SIP initiated. MRCPv2 controls and affects the state of the media processing resource associated with the MRCPv2 session(s).

MRCPv2 defines the messages to control the different media processing resources and the state machines required to guide their operation. It also describes how these messages are carried over a transport layer protocol such as TCP or TLS. (Note: SCTP is a viable transport for MRCPv2 as well, but the mapping onto SCTP is not described in this specification.)

2. Document Conventions

RFC2119 [5] provides the interpretations for the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" found in this document.

Since many of the definitions and syntax are identical to HTTP/1.1 (RFC2616 [6]), this specification refers to the section where they are defined rather than copying it. For brevity, [HX.Y] is to be taken to refer to Section X.Y of RFC2616.
All the mechanisms specified in this document are described in both prose and an augmented Backus-Naur form (ABNF [9]). The complete message format in ABNF form is provided in Section 15 and is the normative format definition.

2.1. Definitions

Media Resource
  An entity on the speech processing server that can be controlled through the MRCPv2 protocol.

MRCP Server
  Aggregate of one or more "Media Resource" entities on a Server, exposed through the MRCPv2 protocol ("Server" for short).

MRCP Client
  An entity controlling one or more Media Resources through the MRCPv2 protocol ("Client" for short).

DTMF
  Dual Tone Multi-Frequency; a method of transmitting key presses in-band, either as actual tones (Q.23 [28]) or as named tone events (RFC2833 [29]).

Hotword Mode
  A mode of speech recognition where a stream of utterances is evaluated for match against a small set of command words. This is generally employed either to trigger some action or to control the subsequent grammar to be used for further recognition.

3. Architecture

A system using MRCPv2 consists of a client that requires the generation and/or consumption of media streams and a media resource server that has the resources or "engines" to process these streams as input or generate these streams as output. The client uses SIP and SDP to establish an MRCPv2 control channel with the server to use its media processing resources. MRCPv2 servers are addressed using SIP URIs.

The session management protocol (SIP) uses SDP with the offer/answer model described in RFC3264 [7] to set up the MRCPv2 control channels and describe their characteristics. A separate MRCPv2 session is needed to control each of the media processing resources associated with the SIP dialog between the client and server.
Within a SIP dialog, the individual resource control channels for the different resources are added or removed through SDP offer/answer carried in a SIP re-INVITE transaction.

The server, through the SDP exchange, provides the client with an unambiguous channel identifier and a TCP port number. The client MAY then open a new TCP connection with the server using this port number. Multiple MRCPv2 channels can share a TCP connection between the client and the server. All MRCPv2 messages exchanged between the client and the server carry the specified channel identifier, which the server MUST ensure is unambiguous among all MRCPv2 control channels that are active on that server. The client uses this channel identifier to indicate the media processing resource associated with that channel.

The session management protocol (SIP) also establishes the media sessions between the client (or other source/sink of media) and the MRCPv2 server using SDP m-lines. One or more media processing resources may share a media session under a SIP session, or each media processing resource may have its own media session.
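
The channel-identifier rules above can be illustrated with a minimal sketch. It is not part of the specification: the class name ChannelRegistry, its methods, and the identifier and resource strings ("chan-1", "synthesizer", and so on) are hypothetical placeholders chosen for illustration, and the actual identifier syntax is defined elsewhere in this document. The sketch only shows why an identifier that is unambiguous across the server lets several control channels share one TCP connection.

```python
from typing import Dict


class ChannelRegistry:
    """Illustrative server-side table of active control channels.

    Hypothetical helper, not defined by MRCPv2. It maps each active
    channel identifier to the media processing resource it controls.
    """

    def __init__(self) -> None:
        self._channels: Dict[str, str] = {}

    def allocate(self, channel_id: str, resource: str) -> None:
        # Identifiers must be unambiguous among all channels that are
        # currently active on the server, so reject a duplicate.
        if channel_id in self._channels:
            raise ValueError("channel identifier already active: " + channel_id)
        self._channels[channel_id] = resource

    def release(self, channel_id: str) -> None:
        # Called when the channel is removed; the identifier is no
        # longer active afterwards.
        self._channels.pop(channel_id, None)

    def route(self, channel_id: str) -> str:
        # Every message carries its channel identifier, so messages for
        # different resources can arrive on one shared TCP connection
        # and still be dispatched to the right resource.
        return self._channels[channel_id]


registry = ChannelRegistry()
registry.allocate("chan-1", "synthesizer")
registry.allocate("chan-2", "recognizer")
print(registry.route("chan-2"))
```

In this sketch the lookup key is the identifier alone, never the connection, which is what allows the TCP connection to be shared.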
    MRCPv2 client                   MRCPv2 Media Resource Server
|--------------------|             |-----------------------------|
||------------------||             ||---------------------------||
|| Application Layer||             || TTS  | ASR  | SV   | SI   ||
||------------------||             ||Engine|Engine|Engine|Engine||
||Media Resource API||             ||---------------------------||
||------------------||             || Media Resource Management ||
|| SIP  |  MRCPv2   ||             ||---------------------------||
||Stack |           ||             ||   SIP  |    MRCPv2        ||
||      |           ||             ||  Stack |                  ||
||------------------||             ||---------------------------||
||   TCP/IP Stack   ||----MRCPv2---||       TCP/IP Stack        ||
||                  ||             ||                           ||
||------------------||-----SIP-----||---------------------------||
|--------------------|             |-----------------------------|
          |                             /
         SIP                           /
          |                           /
|-------------------|               RTP
|                   |               /
| Media Source/Sink |--------------/
|                   |
|-------------------|

Figure 1: Architectural Diagram

3.1. MRCPv2 Media Resource Types

An MRCPv2 server may offer one or more of the following media processing resources to its clients.

Basic Synthesizer
  A speech synthesizer resource with very limited capabilities that can generate its media stream exclusively from concatenated audio clips. The speech data is described using a limited subset of SSML [25] elements. A basic synthesizer MUST support the SSML tags ,