SPEECHSC                                                   S. Shanmugham
Internet-Draft                                       Cisco Systems, Inc.
Intended status: Standards Track                              D. Burnett
Expires: February 12, 2010                                         Voxeo
                                                         August 11, 2009

           Media Resource Control Protocol Version 2 (MRCPv2)
                     draft-ietf-speechsc-mrcpv2-20

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.  This document may contain material
   from IETF Documents or IETF Contributions published or made publicly
   available before November 10, 2008.  The person(s) controlling the
   copyright in some of this material may not have granted the IETF
   Trust the right to allow modifications of such material outside the
   IETF Standards Process.  Without obtaining an adequate license from
   the person(s) controlling the copyright in such materials, this
   document may not be modified outside the IETF Standards Process, and
   derivative works of it may not be created outside the IETF Standards
   Process, except to format it for publication as an RFC or to
   translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 12, 2010.

Copyright Notice

   Copyright (c) 2009 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents in effect on the date of
   publication of this document (http://trustee.ietf.org/license-info).
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.

Abstract

   The MRCPv2 protocol allows client hosts to control media service
   resources such as speech synthesizers, recognizers, verifiers and
   identifiers residing in servers on the network.  MRCPv2 is not a
   "stand-alone" protocol; it relies on other protocols, such as the
   Session Initiation Protocol (SIP) to rendezvous MRCPv2 clients and
   servers and manage sessions between them, and the Session Description
   Protocol (SDP) to describe, discover and exchange capabilities.  It
   also depends on SIP and SDP to establish the media sessions and
   associated parameters between the media source or sink and the media
   server.  Once this is done, the MRCPv2 protocol exchange operates
   over the control session established above, allowing the client to
   control the media processing resources on the speech resource server.

Table of Contents

   1.  Introduction
   2.  Document Conventions
     2.1.  Definitions
     2.2.  State-Machine Diagrams
   3.  Architecture
     3.1.  MRCPv2 Media Resource Types
     3.2.  Server and Resource Addressing
   4.  MRCPv2 Protocol Basics
     4.1.  Connecting to the Server
     4.2.  Managing Resource Control Channels
     4.3.  Media Streams and RTP Ports
     4.4.  MRCPv2 Message Transport
   5.  MRCPv2 Specification
     5.1.  Common Protocol Elements
     5.2.  Request
     5.3.  Response
     5.4.  Status Codes
     5.5.  Events
   6.  MRCPv2 Generic Methods, Headers, and Result Structure
     6.1.  Generic Methods
       6.1.1.  SET-PARAMS
       6.1.2.  GET-PARAMS
     6.2.  Generic Message Headers
       6.2.1.  Channel-Identifier
       6.2.2.  Accept
       6.2.3.  Active-Request-Id-List
       6.2.4.  Proxy-Sync-Id
       6.2.5.  Accept-Charset
       6.2.6.  Content-Type
       6.2.7.  Content-ID
       6.2.8.  Content-Base
       6.2.9.  Content-Encoding
       6.2.10. Content-Location
       6.2.11. Content-Length
       6.2.12. Fetch Timeout
       6.2.13. Cache-Control
       6.2.14. Logging-Tag
       6.2.15. Set-Cookie and Set-Cookie2
       6.2.16. Vendor Specific Parameters
     6.3.  Generic Result Structure
       6.3.1.  Natural Language Semantics Markup Language
   7.  Resource Discovery
   8.  Speech Synthesizer Resource
     8.1.  Synthesizer State Machine
     8.2.  Synthesizer Methods
     8.3.  Synthesizer Events
     8.4.  Synthesizer Header Fields
       8.4.1.  Jump-Size
       8.4.2.  Kill-On-Barge-In
       8.4.3.  Speaker Profile
       8.4.4.  Completion Cause
       8.4.5.  Completion Reason
       8.4.6.  Voice-Parameter
       8.4.7.  Prosody-Parameters
       8.4.8.  Speech Marker
       8.4.9.  Speech Language
       8.4.10. Fetch Hint
       8.4.11. Audio Fetch Hint
       8.4.12. Failed URI
       8.4.13. Failed URI Cause
       8.4.14. Speak Restart
       8.4.15. Speak Length
       8.4.16. Load-Lexicon
       8.4.17. Lexicon-Search-Order
     8.5.  Synthesizer Message Body
       8.5.1.  Synthesizer Speech Data
       8.5.2.  Lexicon Data
     8.6.  SPEAK Method
     8.7.  STOP
     8.8.  BARGE-IN-OCCURED
     8.9.  PAUSE
     8.10. RESUME
     8.11. CONTROL
     8.12. SPEAK-COMPLETE
     8.13. SPEECH-MARKER
     8.14. DEFINE-LEXICON
   9.  Speech Recognizer Resource
     9.1.  Recognizer State Machine
     9.2.  Recognizer Methods
     9.3.  Recognizer Events
     9.4.  Recognizer Header Fields
       9.4.1.  Confidence Threshold
       9.4.2.  Sensitivity Level
       9.4.3.  Speed Vs Accuracy
       9.4.4.  N Best List Length
       9.4.5.  Input Type
       9.4.6.  No Input Timeout
       9.4.7.  Recognition Timeout
       9.4.8.  Waveform URI
       9.4.9.  Media Type
       9.4.10. Input-Waveform-URI
       9.4.11. Completion Cause
       9.4.12. Completion Reason
       9.4.13. Recognizer Context Block
       9.4.14. Start Input Timers
       9.4.15. Speech Complete Timeout
       9.4.16. Speech Incomplete Timeout
       9.4.17. DTMF Interdigit Timeout
       9.4.18. DTMF Term Timeout
       9.4.19. DTMF-Term-Char
       9.4.20. Failed URI
       9.4.21. Failed URI Cause
       9.4.22. Save Waveform
       9.4.23. New Audio Channel
       9.4.24. Speech-Language
       9.4.25. Ver-Buffer-Utterance
       9.4.26. Recognition-Mode
       9.4.27. Cancel-If-Queue
       9.4.28. Hotword-Max-Duration
       9.4.29. Hotword-Min-Duration
       9.4.30. Interpret-Text
       9.4.31. DTMF-Buffer-Time
       9.4.32. Clear-DTMF-Buffer
       9.4.33. Early-No-Match
       9.4.34. Num-Min-Consistent-Pronunciations
       9.4.35. Consistency-Threshold
       9.4.36. Clash-Threshold
       9.4.37. Personal-Grammar-URI
       9.4.38. Enroll-Utterance
       9.4.39. Phrase-Id
       9.4.40. Phrase-NL
       9.4.41. Weight
       9.4.42. Save-Best-Waveform
       9.4.43. New-Phrase-Id
       9.4.44. Confusable-Phrases-URI
       9.4.45. Abort-Phrase-Enrollment
     9.5.  Recognizer Message Body
       9.5.1.  Recognizer Grammar Data
       9.5.2.  Recognizer Result Data
       9.5.3.  Enrollment Result Data
       9.5.4.  Recognizer Context Block
     9.6.  Recognizer Results
       9.6.1.  Markup Functions
       9.6.2.  Overview of Recognizer Result Elements and their
               Relationships
       9.6.3.  Elements and Attributes
     9.7.  Enrollment Results
       9.7.1.  NUM-CLASHES Element
       9.7.2.  NUM-GOOD-REPETITIONS Element
       9.7.3.  NUM-REPETITIONS-STILL-NEEDED Element
       9.7.4.  CONSISTENCY-STATUS Element
       9.7.5.  CLASH-PHRASE-IDS Element
       9.7.6.  TRANSCRIPTIONS Element
       9.7.7.  CONFUSABLE-PHRASES Element
     9.8.  DEFINE-GRAMMAR
     9.9.  RECOGNIZE
     9.10. STOP
     9.11. GET-RESULT
     9.12. START-OF-INPUT
     9.13. START-INPUT-TIMERS
     9.14. RECOGNITION-COMPLETE
     9.15. START-PHRASE-ENROLLMENT
     9.16. ENROLLMENT-ROLLBACK
     9.17. END-PHRASE-ENROLLMENT
     9.18. MODIFY-PHRASE
     9.19. DELETE-PHRASE
     9.20. INTERPRET
     9.21. INTERPRETATION-COMPLETE
     9.22. DTMF Detection
   10. Recorder Resource
     10.1.  Recorder State Machine
     10.2.  Recorder Methods
     10.3.  Recorder Events
     10.4.  Recorder Header Fields
       10.4.1.  Sensitivity Level
       10.4.2.  No Input Timeout
       10.4.3.  Completion Cause
       10.4.4.  Completion Reason
       10.4.5.  Failed URI
       10.4.6.  Failed URI Cause
       10.4.7.  Record URI
       10.4.8.  Media Type
       10.4.9.  Max Time
       10.4.10. Trim-Length
       10.4.11. Final Silence
       10.4.12. Capture On Speech
       10.4.13. Ver-Buffer-Utterance
       10.4.14. Start Input Timers
       10.4.15. New Audio Channel
     10.5.  Recorder Message Body
     10.6.  RECORD
     10.7.  STOP
     10.8.  RECORD-COMPLETE
     10.9.  START-INPUT-TIMERS
     10.10. START-OF-INPUT
   11. Speaker Verification and Identification
     11.1.  Speaker Verification State Machine
     11.2.  Speaker Verification Methods
     11.3.  Verification Events
     11.4.  Verification Header Fields
       11.4.1.  Repository-URI
       11.4.2.  Voiceprint-Identifier
       11.4.3.  Verification-Mode
       11.4.4.  Adapt-Model
       11.4.5.  Abort-Model
       11.4.6.  Min-Verification-Score
       11.4.7.  Num-Min-Verification-Phrases
       11.4.8.  Num-Max-Verification-Phrases
       11.4.9.  No-Input-Timeout
       11.4.10. Save-Waveform
       11.4.11. Media Type
       11.4.12. Waveform-URI
       11.4.13. Voiceprint-Exists
       11.4.14. Ver-Buffer-Utterance
       11.4.15. Input-Waveform-Uri
       11.4.16. Completion-Cause
       11.4.17. Completion Reason
       11.4.18. Speech Complete Timeout
       11.4.19. New Audio Channel
       11.4.20. Abort-Verification
       11.4.21. Start Input Timers
     11.5.  Verification Message Body
       11.5.1.  Verification Result Data
       11.5.2.  Verification Result Elements
     11.6.  START-SESSION
     11.7.  END-SESSION
     11.8.  QUERY-VOICEPRINT
     11.9.  DELETE-VOICEPRINT
     11.10. VERIFY
     11.11. VERIFY-FROM-BUFFER
     11.12. VERIFY-ROLLBACK
     11.13. STOP
     11.14. START-INPUT-TIMERS
     11.15. VERIFICATION-COMPLETE
     11.16. START-OF-INPUT
     11.17. CLEAR-BUFFER
     11.18. GET-INTERMEDIATE-RESULT
   12. Security Considerations
     12.1.  Rendezvous and Session Establishment
     12.2.  Control channel protection
     12.3.  Media session protection
     12.4.  Indirect Content Access
     12.5.  Protection of stored media
   13. IANA Considerations
     13.1.  New registries
       13.1.1.  MRCPv2 resource types
       13.1.2.  MRCPv2 methods and events
       13.1.3.  MRCPv2 headers
       13.1.4.  MRCPv2 status codes
       13.1.5.  Grammar Reference List Parameters
       13.1.6.  MRCPv2 vendor-specific parameters
     13.2.  NLSML-related registrations
       13.2.1.  application/nlsml+xml Media Type registration
     13.3.  NLSML XML Schema registration
     13.4.  MRCPv2 XML Namespace registration
     13.5.  text Media Type Registrations
       13.5.1.  text/grammar-ref-list
     13.6.  session URL scheme registration
     13.7.  SDP parameter registrations
       13.7.1.  sub-registry "proto"
       13.7.2.  sub-registry "att-field (session-level)"
       13.7.3.  sub-registry "att-field (media-level)"
   14. Examples
     14.1.  Message Flow
     14.2.  Recognition Result Examples
       14.2.1.  Simple ASR Ambiguity
       14.2.2.  Mixed Initiative
       14.2.3.  DTMF Input
       14.2.4.  Interpreting Meta-Dialog and Meta-Task Utterances
       14.2.5.  Anaphora and Deixis
       14.2.6.  Distinguishing Individual Items from Sets with
                One Member
       14.2.7.  Extensibility
   15. ABNF Normative Definition
   16. XML Schemas
     16.1.  NLSML Schema Definition
     16.2.  Enrollment Results Schema Definition
     16.3.  Verification Results Schema Definition
   17. References
     17.1.  Normative References
     17.2.  Informative References
   Appendix A.  Contributors
   Appendix B.  Acknowledgements
   Authors' Addresses

1.  Introduction

   The MRCPv2 protocol is designed to allow a client device to control
   media processing resources on the network.
   These media processing resources include speech recognition engines,
   speech synthesis engines, and speaker verification and speaker
   identification engines.  MRCPv2 enables the implementation of
   distributed Interactive Voice Response platforms using VoiceXML
   [W3C.REC-voicexml20-20040316] browsers or other client applications
   while maintaining separate back-end speech processing capabilities on
   specialized speech processing servers.  MRCPv2 is based on the
   earlier Media Resource Control Protocol (MRCP) [RFC4463] developed
   jointly by Cisco Systems, Inc., Nuance Communications, and
   Speechworks Inc.

   The protocol requirements of SPEECHSC [RFC4313] include that the
   solution be capable of reaching a media processing server and setting
   up communication channels to the media resources, and sending and
   receiving control messages and media streams to/from the server.  The
   Session Initiation Protocol (SIP) [RFC3261] meets these requirements.

   Note that the above-mentioned requirements document, RFC 4313,
   discusses alternatives to SIP, such as RTSP [RFC2326], in detail and
   explains why MRCPv2 does not use RTSP, even though the proprietary
   version of MRCP did run over RTSP.

   MRCPv2 leverages these capabilities by building upon SIP and the
   Session Description Protocol (SDP) [RFC4566].  MRCPv2 uses SIP to
   set up and tear down media and control sessions with the server.  In
   addition, the client can use a SIP re-INVITE method (an INVITE
   request sent within an existing SIP dialog) to change the
   characteristics of these media and control sessions while maintaining
   the SIP dialog between the client and server.  SDP is used to
   describe the parameters of the media sessions associated with that
   dialog.  It is mandatory to support SIP as the session establishment
   protocol to ensure interoperability.  Other protocols can be used for
   session establishment by prior agreement.
   This document describes only the use of SIP and SDP.

   MRCPv2 uses SIP and SDP to create the speech client/server dialog and
   set up the media channels to the server.  It also uses SIP and SDP to
   establish MRCPv2 control sessions between the client and the server
   for each media processing resource required for that dialog.  The
   MRCPv2 protocol exchange between the client and the media resource is
   carried on that control session.  MRCPv2 protocol exchanges do not
   change the state of the SIP dialog, the media sessions, or other
   parameters of the dialog initiated via SIP.  Instead, they control
   and affect the state of the media processing resource associated with
   the MRCPv2 session(s).

   MRCPv2 defines the messages to control the different media processing
   resources and the state machines required to guide their operation.
   It also describes how these messages are carried over a transport
   layer protocol such as TCP or TLS (Note: SCTP is a viable transport
   for MRCPv2 as well, but the mapping onto SCTP is not described in
   this specification).

2.  Document Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC2119 [RFC2119].

   Since many of the definitions and syntax are identical to HTTP/1.1
   (RFC2616 [RFC2616]), this specification refers to the section where
   they are defined rather than copying them.  For brevity, [HX.Y]
   refers to Section X.Y of RFC2616.

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF [RFC5234]).

   The complete message format in ABNF form is provided in Section 15
   and is the normative format definition.

2.1.  Definitions

   Media Resource
                  An entity on the speech processing server that can be
                  controlled through the MRCPv2 protocol.
   MRCP Server
                  Aggregate of one or more "Media Resource" entities on
                  a Server, exposed through the MRCPv2 protocol
                  ("Server" for short).
   MRCP Client
                  An entity controlling one or more Media Resources
                  through the MRCPv2 protocol ("Client" for short).
   DTMF
                  Dual Tone Multi-Frequency; a method of transmitting
                  key presses in-band, either as actual tones (Q.23
                  [Q.23]) or as named tone events (RFC4733 [RFC4733]).
   Endpointing
                  The process of automatically detecting the beginning
                  and end of speech in an audio stream.  This is
                  critical both for speech recognition and for automated
                  recording as one would find in voice mail systems.
   Hotword Mode
                  A mode of speech recognition where a stream of
                  utterances is evaluated for a match against a small
                  set of command words.  This is generally employed
                  either to trigger some action or to control the
                  subsequent grammar to be used for further recognition.

2.2.  State-Machine Diagrams

   The state-machine diagrams in this document do not show every
   possible method call.  Rather, they reflect the state of the resource
   based on the methods that have moved to IN-PROGRESS or COMPLETE
   states (see Section 5.3).  Because PENDING requests have not yet
   affected the resource and are still queued for processing, they are
   not reflected in the state-machine diagrams.

3.  Architecture

   A system using MRCPv2 consists of a client that requires the
   generation and/or consumption of media streams and a media resource
   server that has the resources or "engines" to process these streams
   as input or generate these streams as output.  The client uses SIP
   and SDP to establish an MRCPv2 control channel with the server to use
   its media processing resources.  MRCPv2 servers are addressed using
   SIP URIs.
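
   As an illustration of the SIP/SDP step just described, the following
   Python sketch builds the kind of SDP offer a client might place in a
   SIP INVITE to request one control channel and one audio stream.  It
   is a minimal sketch, not a normative definition: the function name
   and its parameters are hypothetical, the origin line uses placeholder
   values, and the attribute lines (TCP/MRCPv2, setup, connection,
   resource, cmid) reflect one reading of the control channel setup in
   Section 4, which remains the authority on their exact form.

```python
def build_control_offer(client_ip: str, resource: str, cmid: int = 1,
                        rtp_port: int = 49170,
                        direction: str = "recvonly") -> str:
    """Return an SDP offer for one MRCPv2 control channel plus one audio stream.

    Hypothetical helper for illustration only; check every line against
    Section 4 of this specification before relying on it.
    """
    lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {client_ip}",   # placeholder origin values
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # Control channel m-line. The client is the active (connecting)
        # side, so it offers port 9 (discard); the server's answer is
        # expected to carry the real TCP port and the channel identifier.
        "m=application 9 TCP/MRCPv2 1",
        "a=setup:active",
        "a=connection:new",
        f"a=resource:{resource}",
        f"a=cmid:{cmid}",                # ties this channel to the audio m-line
        # Media stream carrying audio to or from the resource.
        f"m=audio {rtp_port} RTP/AVP 0",
        "a=rtpmap:0 pcmu/8000",
        f"a={direction}",
        f"a=mid:{cmid}",
    ]
    # SDP lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # 192.0.2.12 is an IPv4 documentation address.
    print(build_control_offer("192.0.2.12", "speechsynth"), end="")
```

   A client driving a recognizer instead of a synthesizer would pass a
   different resource name and typically "sendonly" as the direction,
   since audio then flows from the client side toward the server.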

   The session management protocol (SIP) uses SDP with the offer/answer
   model described in RFC3264 [RFC3264] to set up the MRCPv2 control
   channels and describe their characteristics.  A separate MRCPv2
   session is needed to control each of the media processing resources
   associated with the SIP dialog between the client and server.  Within
   a SIP dialog, the individual resource control channels for the
   different resources are added or removed through SDP offer/answer
   carried in a SIP re-INVITE transaction.

   The server, through the SDP exchange, provides the client with an
   unambiguous channel identifier and a TCP port number.  The client MAY
   then open a new TCP connection with the server on this port number.
   Multiple MRCPv2 channels can share a TCP connection between the
   client and the server.  All MRCPv2 messages exchanged between the
   client and the server carry the specified channel identifier, which
   the server MUST ensure is unambiguous among all MRCPv2 control
   channels that are active on that server.  The client uses this
   channel identifier to indicate the media processing resource
   associated with that channel.  For information on message framing,
   see Section 5.

   The session management protocol (SIP) also establishes the media
   sessions between the client (or other source/sink of media) and the
   MRCPv2 server using SDP m-lines.  One or more media processing
   resources may share a media session under a SIP session, or each
   media processing resource may have its own media session.

   The following diagram shows the general architecture of a system that
   uses MRCPv2.  To simplify the diagram, only a few resources are
   shown.

    MRCPv2 client                      MRCPv2 Media Resource Server
|--------------------|            |------------------------------------|
||------------------||            ||----------------------------------||
|| Application Layer||            ||Synthesis|Recognition|Verification||
||------------------||            || Engine  |  Engine   |   Engine   ||
||Media Resource API||            ||    ||   |    ||     |    ||      ||
||------------------||            ||Synthesis|Recognizer |  Verifier  ||
|| SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
||Stack |           ||            ||     Media Resource Management    ||
||      |           ||            ||----------------------------------||
||------------------||            ||   SIP  |        MRCPv2           ||
||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
||                  ||            ||----------------------------------||
||------------------||----SIP-----||           TCP/IP Stack           ||
|--------------------|            ||                                  ||
         |                        ||----------------------------------||
        SIP                       |------------------------------------|
         |                          /
|-------------------|             RTP
|                   |             /
| Media Source/Sink |------------/
|                   |
|-------------------|

                    Figure 1: Architectural Diagram

3.1.  MRCPv2 Media Resource Types

   An MRCPv2 server may offer one or more of the following media
   processing resources to its clients.
   Basic Synthesizer
                  A speech synthesizer resource with very limited
                  capabilities that can generate its media stream
                  exclusively from concatenated audio clips.  The speech
                  data is described using a limited subset of SSML
                  [W3C.REC-speech-synthesis-20040907] elements.  A basic
                  synthesizer MUST support the SSML tags ,