SPEECHSC                                                      D. Burnett
Internet-Draft                                                     Voxeo
Intended status: Standards Track                           S. Shanmugham
Expires: March 1, 2013                               Cisco Systems, Inc.
                                                         August 28, 2012


           Media Resource Control Protocol Version 2 (MRCPv2)
                     draft-ietf-speechsc-mrcpv2-28

Abstract

   The MRCPv2 protocol allows client hosts to control media service
   resources such as speech synthesizers, recognizers, verifiers, and
   identifiers residing in servers on the network. MRCPv2 is not a
   "stand-alone" protocol; it relies on other protocols, such as the
   Session Initiation Protocol (SIP), to rendezvous MRCPv2 clients and
   servers and manage sessions between them, and the Session Description
   Protocol (SDP) to describe, discover, and exchange capabilities. It
   also depends on SIP and SDP to establish the media sessions and
   associated parameters between the media source or sink and the media
   server. Once this is done, the MRCPv2 protocol exchange operates
   over the control session established above, allowing the client to
   control the media processing resources on the speech resource server.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 1, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document. Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
   2.  Document Conventions
     2.1.  Definitions
     2.2.  State-Machine Diagrams
     2.3.  URI Schemes
   3.  Architecture
     3.1.  MRCPv2 Media Resource Types
     3.2.  Server and Resource Addressing
   4.  MRCPv2 Protocol Basics
     4.1.  Connecting to the Server
     4.2.  Managing Resource Control Channels
     4.3.  SIP session example
     4.4.  Media Streams and RTP Ports
     4.5.  MRCPv2 Message Transport
     4.6.  MRCPv2 Session Termination
   5.  MRCPv2 Specification
     5.1.  Common Protocol Elements
     5.2.  Request
     5.3.  Response
     5.4.  Status Codes
     5.5.  Events
   6.  MRCPv2 Generic Methods, Headers, and Result Structure
     6.1.  Generic Methods
       6.1.1.  SET-PARAMS
       6.1.2.  GET-PARAMS
     6.2.  Generic Message Headers
       6.2.1.  Channel-Identifier
       6.2.2.  Accept
       6.2.3.  Active-Request-Id-List
       6.2.4.  Proxy-Sync-Id
       6.2.5.  Accept-Charset
       6.2.6.  Content-Type
       6.2.7.  Content-ID
       6.2.8.  Content-Base
       6.2.9.  Content-Encoding
       6.2.10. Content-Location
       6.2.11. Content-Length
       6.2.12. Fetch Timeout
       6.2.13. Cache-Control
       6.2.14. Logging-Tag
       6.2.15. Set-Cookie
       6.2.16. Vendor Specific Parameters
     6.3.  Generic Result Structure
       6.3.1.  Natural Language Semantics Markup Language
   7.  Resource Discovery
   8.  Speech Synthesizer Resource
     8.1.  Synthesizer State Machine
     8.2.  Synthesizer Methods
     8.3.  Synthesizer Events
     8.4.  Synthesizer Header Fields
       8.4.1.  Jump-Size
       8.4.2.  Kill-On-Barge-In
       8.4.3.  Speaker Profile
       8.4.4.  Completion Cause
       8.4.5.  Completion Reason
       8.4.6.  Voice-Parameter
       8.4.7.  Prosody-Parameters
       8.4.8.  Speech Marker
       8.4.9.  Speech Language
       8.4.10. Fetch Hint
       8.4.11. Audio Fetch Hint
       8.4.12. Failed URI
       8.4.13. Failed URI Cause
       8.4.14. Speak Restart
       8.4.15. Speak Length
       8.4.16. Load-Lexicon
       8.4.17. Lexicon-Search-Order
     8.5.  Synthesizer Message Body
       8.5.1.  Synthesizer Speech Data
       8.5.2.  Lexicon Data
     8.6.  SPEAK Method
     8.7.  STOP
     8.8.  BARGE-IN-OCCURRED
     8.9.  PAUSE
     8.10. RESUME
     8.11. CONTROL
     8.12. SPEAK-COMPLETE
     8.13. SPEECH-MARKER
     8.14. DEFINE-LEXICON
   9.  Speech Recognizer Resource
     9.1.  Recognizer State Machine
     9.2.  Recognizer Methods
     9.3.  Recognizer Events
     9.4.  Recognizer Header Fields
       9.4.1.  Confidence Threshold
       9.4.2.  Sensitivity Level
       9.4.3.  Speed Vs Accuracy
       9.4.4.  N Best List Length
       9.4.5.  Input Type
       9.4.6.  No Input Timeout
       9.4.7.  Recognition Timeout
       9.4.8.  Waveform URI
       9.4.9.  Media Type
       9.4.10. Input-Waveform-URI
       9.4.11. Completion Cause
       9.4.12. Completion Reason
       9.4.13. Recognizer Context Block
       9.4.14. Start Input Timers
       9.4.15. Speech Complete Timeout
       9.4.16. Speech Incomplete Timeout
       9.4.17. DTMF Interdigit Timeout
       9.4.18. DTMF Term Timeout
       9.4.19. DTMF-Term-Char
       9.4.20. Failed URI
       9.4.21. Failed URI Cause
       9.4.22. Save Waveform
       9.4.23. New Audio Channel
       9.4.24. Speech-Language
       9.4.25. Ver-Buffer-Utterance
       9.4.26. Recognition-Mode
       9.4.27. Cancel-If-Queue
       9.4.28. Hotword-Max-Duration
       9.4.29. Hotword-Min-Duration
       9.4.30. Interpret-Text
       9.4.31. DTMF-Buffer-Time
       9.4.32. Clear-DTMF-Buffer
       9.4.33. Early-No-Match
       9.4.34. Num-Min-Consistent-Pronunciations
       9.4.35. Consistency-Threshold
       9.4.36. Clash-Threshold
       9.4.37. Personal-Grammar-URI
       9.4.38. Enroll-Utterance
       9.4.39. Phrase-Id
       9.4.40. Phrase-NL
       9.4.41. Weight
       9.4.42. Save-Best-Waveform
       9.4.43. New-Phrase-Id
       9.4.44. Confusable-Phrases-URI
       9.4.45. Abort-Phrase-Enrollment
     9.5.  Recognizer Message Body
       9.5.1.  Recognizer Grammar Data
       9.5.2.  Recognizer Result Data
       9.5.3.  Enrollment Result Data
       9.5.4.  Recognizer Context Block
     9.6.  Recognizer Results
       9.6.1.  Markup Functions
       9.6.2.  Overview of Recognizer Result Elements and their
               Relationships
       9.6.3.  Elements and Attributes
     9.7.  Enrollment Results
       9.7.1.  NUM-CLASHES Element
       9.7.2.  NUM-GOOD-REPETITIONS Element
       9.7.3.  NUM-REPETITIONS-STILL-NEEDED Element
       9.7.4.  CONSISTENCY-STATUS Element
       9.7.5.  CLASH-PHRASE-IDS Element
       9.7.6.  TRANSCRIPTIONS Element
       9.7.7.  CONFUSABLE-PHRASES Element
     9.8.  DEFINE-GRAMMAR
     9.9.  RECOGNIZE
     9.10. STOP
     9.11. GET-RESULT
     9.12. START-OF-INPUT
     9.13. START-INPUT-TIMERS
     9.14. RECOGNITION-COMPLETE
     9.15. START-PHRASE-ENROLLMENT
     9.16. ENROLLMENT-ROLLBACK
     9.17. END-PHRASE-ENROLLMENT
     9.18. MODIFY-PHRASE
     9.19. DELETE-PHRASE
     9.20. INTERPRET
     9.21. INTERPRETATION-COMPLETE
     9.22. DTMF Detection
   10. Recorder Resource
     10.1.  Recorder State Machine
     10.2.  Recorder Methods
     10.3.  Recorder Events
     10.4.  Recorder Header Fields
       10.4.1.  Sensitivity Level
       10.4.2.  No Input Timeout
       10.4.3.  Completion Cause
       10.4.4.  Completion Reason
       10.4.5.  Failed URI
       10.4.6.  Failed URI Cause
       10.4.7.  Record URI
       10.4.8.  Media Type
       10.4.9.  Max Time
       10.4.10. Trim-Length
       10.4.11. Final Silence
       10.4.12. Capture On Speech
       10.4.13. Ver-Buffer-Utterance
       10.4.14. Start Input Timers
       10.4.15. New Audio Channel
     10.5.  Recorder Message Body
     10.6.  RECORD
     10.7.  STOP
     10.8.  RECORD-COMPLETE
     10.9.  START-INPUT-TIMERS
     10.10. START-OF-INPUT
   11. Speaker Verification and Identification
     11.1.  Speaker Verification State Machine
     11.2.  Speaker Verification Methods
     11.3.  Verification Events
     11.4.  Verification Header Fields
       11.4.1.  Repository-URI
       11.4.2.  Voiceprint-Identifier
       11.4.3.  Verification-Mode
       11.4.4.  Adapt-Model
       11.4.5.  Abort-Model
       11.4.6.  Min-Verification-Score
       11.4.7.  Num-Min-Verification-Phrases
       11.4.8.  Num-Max-Verification-Phrases
       11.4.9.  No-Input-Timeout
       11.4.10. Save-Waveform
       11.4.11. Media Type
       11.4.12. Waveform-URI
       11.4.13. Voiceprint-Exists
       11.4.14. Ver-Buffer-Utterance
       11.4.15. Input-Waveform-Uri
       11.4.16. Completion-Cause
       11.4.17. Completion Reason
       11.4.18. Speech Complete Timeout
       11.4.19. New Audio Channel
       11.4.20. Abort-Verification
       11.4.21. Start Input Timers
     11.5.  Verification Message Body
       11.5.1.  Verification Result Data
       11.5.2.  Verification Result Elements
     11.6.  START-SESSION
     11.7.  END-SESSION
     11.8.  QUERY-VOICEPRINT
     11.9.  DELETE-VOICEPRINT
     11.10. VERIFY
     11.11. VERIFY-FROM-BUFFER
     11.12. VERIFY-ROLLBACK
     11.13. STOP
     11.14. START-INPUT-TIMERS
     11.15. VERIFICATION-COMPLETE
     11.16. START-OF-INPUT
     11.17. CLEAR-BUFFER
     11.18. GET-INTERMEDIATE-RESULT
   12. Security Considerations
     12.1.  Rendezvous and Session Establishment
     12.2.  Control channel protection
     12.3.  Media session protection
     12.4.  Indirect Content Access
     12.5.  Protection of stored media
     12.6.  DTMF and recognition buffers
     12.7.  Client-set server parameters
     12.8.  DELETE-VOICEPRINT and authorization
   13. IANA Considerations
     13.1.  New registries
       13.1.1.  MRCPv2 resource types
       13.1.2.  MRCPv2 methods and events
       13.1.3.  MRCPv2 header fields
       13.1.4.  MRCPv2 status codes
       13.1.5.  Grammar Reference List Parameters
       13.1.6.  MRCPv2 vendor-specific parameters
     13.2.  NLSML-related registrations
       13.2.1.  application/nlsml+xml Media Type registration
     13.3.  NLSML XML Schema registration
     13.4.  MRCPv2 XML Namespace registration
     13.5.  text Media Type Registrations
       13.5.1.  text/grammar-ref-list
     13.6.  session URI scheme registration
     13.7.  SDP parameter registrations
       13.7.1.  sub-registry "proto"
       13.7.2.  sub-registry "att-field (media-level)"
   14. Examples
     14.1.  Message Flow
     14.2.  Recognition Result Examples
       14.2.1.  Simple ASR Ambiguity
       14.2.2.  Mixed Initiative
       14.2.3.  DTMF Input
       14.2.4.  Interpreting Meta-Dialog and Meta-Task Utterances
       14.2.5.  Anaphora and Deixis
       14.2.6.  Distinguishing Individual Items from Sets with
                One Member
       14.2.7.  Extensibility
   15. ABNF Normative Definition
   16. XML Schemas
     16.1.  NLSML Schema Definition
     16.2.  Enrollment Results Schema Definition
     16.3.  Verification Results Schema Definition
   17. References
     17.1.  Normative References
     17.2.  Informative References
   Appendix A.  Contributors
   Appendix B.  Acknowledgements
   Authors' Addresses

1.  Introduction

   The MRCPv2 protocol is designed to allow a client device to control
   media processing resources on the network. These resources include
   speech recognition engines, speech synthesis engines, and speaker
   verification and speaker identification engines. MRCPv2 enables the
   implementation of distributed Interactive Voice Response platforms
   using VoiceXML [W3C.REC-voicexml20-20040316] browsers or other client
   applications while maintaining separate back-end speech processing
   capabilities on specialized speech processing servers. MRCPv2 is
   based on the earlier Media Resource Control Protocol (MRCP) [RFC4463]
   developed jointly by Cisco Systems, Inc., Nuance Communications, and
   Speechworks Inc. Although some of the method names are similar, the
   way in which these methods are communicated is different. There are
   also more resources and more methods for each resource. The first
   version of MRCP was essentially taken only as input to the
   development of this protocol. There is no expectation that an MRCPv2
   client will work with an MRCPv1 server or vice versa. There is no
   migration plan or gateway definition between the two protocols.

   The protocol requirements of SPEECHSC [RFC4313] include that the
   solution be capable of reaching a media processing server, setting
   up communication channels to the media resources, and sending and
   receiving control messages and media streams to and from the server.
   The Session Initiation Protocol (SIP) [RFC3261] meets these
   requirements.

   The proprietary version of MRCP ran over the Real Time Streaming
   Protocol (RTSP) [RFC2326]. At the time work on MRCPv2 was begun, the
   consensus was that this use of RTSP would break the RTSP protocol or
   cause backward-compatibility problems, something forbidden by Section
   3.2 of the above-mentioned requirements document. This is the reason
   why MRCPv2 does not run over RTSP.

   MRCPv2 leverages the capabilities of SIP by building upon SIP and
   the Session Description Protocol (SDP) [RFC4566]. MRCPv2 uses SIP to
   set up and tear down media and control sessions with the server. In
   addition, the client can use a SIP re-INVITE (an INVITE request sent
   within an existing SIP dialog) to change the characteristics of
   these media and control sessions while maintaining the SIP dialog
   between the client and server. SDP is used to describe the
   parameters of the media sessions associated with that dialog. It is
   mandatory to support SIP as the session establishment protocol to
   ensure interoperability. Other protocols can be used for session
   establishment by prior agreement. This document only describes the
   use of SIP and SDP.

   MRCPv2 uses SIP and SDP to create the speech client/server dialog and
   set up the media channels to the server. It also uses SIP and SDP to
   establish MRCPv2 control sessions between the client and the server
   for each media processing resource required for that dialog. The
   MRCPv2 protocol exchange between the client and the media resource is
   carried on that control session. MRCPv2 protocol exchanges do not
   change the state of the SIP dialog, the media sessions, or other
   parameters of the dialog initiated via SIP. They control and affect
   only the state of the media processing resource associated with the
   MRCPv2 session(s).
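
   The relationship described above (one SIP dialog, its media
   sessions, and one control session per media processing resource) can
   be pictured with a small, non-normative data model. The sketch below
   is only an illustration in Python: the names SipDialog,
   ControlSession, reinvite_add_resource, and mrcp_exchange, as well as
   the state strings, are hypothetical and are not defined by this
   document.

```python
from dataclasses import dataclass, field


@dataclass
class ControlSession:
    """One MRCPv2 control session; it drives exactly one media resource."""
    resource: str
    resource_state: str = "idle"  # state names here are illustrative only


@dataclass
class SipDialog:
    """Hypothetical view of one client/server SIP dialog."""
    dialog_id: str
    media_sessions: list = field(default_factory=list)
    control_sessions: dict = field(default_factory=dict)

    def reinvite_add_resource(self, resource):
        # A re-INVITE adds a control session; the dialog itself persists.
        self.control_sessions[resource] = ControlSession(resource)
        return self.control_sessions[resource]


def mrcp_exchange(session, new_state):
    """An MRCPv2 exchange touches only the resource behind its session."""
    session.resource_state = new_state


dialog = SipDialog("dialog-1", media_sessions=["audio"])
synth = dialog.reinvite_add_resource("synthesizer")
mrcp_exchange(synth, "speaking")
# The SIP dialog and its media sessions are unchanged by the exchange.
assert dialog.media_sessions == ["audio"]
assert dialog.control_sessions["synthesizer"].resource_state == "speaking"
```

   The point of the model is the separation of concerns: SIP and SDP
   own the dialog and the session lists, while MRCPv2 messages only
   change the state of the resource behind a control session.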

   MRCPv2 defines the messages to control the different media processing
   resources and the state machines required to guide their operation.
   It also describes how these messages are carried over a transport
   layer protocol such as the Transmission Control Protocol (TCP)
   [RFC0793] or the Transport Layer Security (TLS) Protocol [RFC5246].
   (Note: The Stream Control Transmission Protocol (SCTP) [RFC4960] is a
   viable transport for MRCPv2 as well, but the mapping onto SCTP is not
   described in this specification.)

2.  Document Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   Since many of the definitions and syntax are identical to those of
   the Hypertext Transfer Protocol (HTTP/1.1) [RFC2616], this
   specification refers to the section where they are defined rather
   than copying them. For brevity, [HX.Y] refers to Section X.Y of
   RFC 2616.

   All the mechanisms specified in this document are described in both
   prose and an augmented Backus-Naur form (ABNF [RFC5234]).

   The complete message format in ABNF form is provided in Section 15
   and is the normative format definition. Note that productions may be
   duplicated within the main body of the document for reading
   convenience. If a production in the body of the text conflicts with
   one in the normative definition, the latter takes precedence.

2.1.  Definitions

   Media Resource
      An entity on the speech processing server that can be
      controlled through the MRCPv2 protocol.

   MRCP Server
      Aggregate of one or more "Media Resource" entities on
      a Server, exposed through the MRCPv2 protocol
      ("Server" for short).

   MRCP Client
      An entity controlling one or more Media Resources
      through the MRCPv2 protocol ("Client" for short).

   DTMF
      Dual Tone Multi-Frequency; a method of transmitting
      key presses in-band, either as actual tones (Q.23
      [Q.23]) or as named tone events (RFC 4733 [RFC4733]).

   Endpointing
      The process of automatically detecting the beginning
      and end of speech in an audio stream. This is
      critical both for speech recognition and for automated
      recording as one would find in voice mail systems.

   Hotword Mode
      A mode of speech recognition where a stream of
      utterances is evaluated for a match against a small set
      of command words. This is generally employed either
      to trigger some action or to control the
      subsequent grammar to be used for further recognition.

2.2.  State-Machine Diagrams

   The state-machine diagrams in this document do not show every
   possible method call. Rather, they reflect the state of the resource
   based on the methods that have moved to IN-PROGRESS or COMPLETE
   states (see Section 5.3). Because PENDING requests are still queued
   for processing and have not yet affected the resource, they are not
   reflected in the state-machine diagrams.

2.3.  URI Schemes

   This document defines many protocol headers that contain Uniform
   Resource Identifiers (URIs) [RFC3986] or lists of URIs for
   referencing media. The entire document, including the Security
   Considerations section (Section 12), assumes that HTTP or HTTP over
   TLS (HTTPS) [RFC2818] will be used as the URI addressing scheme
   unless otherwise stated. However, implementations MAY support other
   schemes (such as "file") provided they have addressed any security
   considerations described in this document and any others particular
   to the specific scheme.
       For example, implementations where the
489    client and server both reside on the same physical hardware and the
490    file system is secured by traditional user-level file access controls
491    could be reasonable candidates for supporting the "file" scheme.

493 3.  Architecture

495    A system using MRCPv2 consists of a client that requires the
496    generation and/or consumption of media streams and a media resource
497    server that has the resources or "engines" to process these streams
498    as input or generate these streams as output.  The client uses SIP
499    and SDP to establish an MRCPv2 control channel with the server to use
500    its media processing resources.  MRCPv2 servers are addressed using
501    SIP URIs.

503    The Session Initiation Protocol (SIP) uses SDP with the offer/answer
504    model described in RFC 3264 [RFC3264] to set up the MRCPv2 control
505    channels and describe their characteristics.  A separate MRCPv2
506    session is needed to control each of the media processing resources
507    associated with the SIP dialog between the client and server.  Within
508    a SIP dialog, the individual resource control channels for the
509    different resources are added or removed through SDP offer/answer
510    carried in a SIP re-INVITE transaction.

512    The server, through the SDP exchange, provides the client with a
513    difficult-to-guess, unambiguous channel identifier and a TCP port
514    number (see Section 4.2).  The client MAY then open a new TCP
515    connection with the server on this port number.  Multiple MRCPv2
516    channels can share a TCP connection between the client and the
517    server.  All MRCPv2 messages exchanged between the client and the
518    server carry the specified channel identifier that the server MUST
519    ensure is unambiguous among all MRCPv2 control channels that are
520    active on that server.  The client uses this channel identifier to
521    indicate the media processing resource associated with that channel.
522    For information on message framing, see Section 5.
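
The relationship between channel identifiers and shared TCP connections described above can be illustrated with a small client-side bookkeeping sketch. This is not part of the protocol definition: the class and method names (SharedConnection, ChannelTable, add_channel, and so on) are hypothetical, the address, port, and channel identifier values are illustrative samples only, and no socket is actually opened. The sketch also assumes a single server, so identifiers are looked up without a per-server key.

```python
from dataclasses import dataclass, field


@dataclass
class SharedConnection:
    """Stands in for one TCP connection to (host, port); no socket is opened."""
    host: str
    port: int
    channels: dict = field(default_factory=dict)  # channel identifier -> resource


class ChannelTable:
    """Hypothetical client-side bookkeeping for MRCPv2 control channels."""

    def __init__(self):
        self._connections = {}  # (host, port) -> SharedConnection
        self._by_channel = {}   # channel identifier -> SharedConnection

    def add_channel(self, host, port, channel_id, resource):
        # The server is responsible for keeping identifiers unambiguous; a
        # duplicate seen by the client is reported instead of overwritten.
        if channel_id in self._by_channel:
            raise ValueError(f"duplicate channel identifier: {channel_id}")
        # Reuse an existing connection to the same port when there is one,
        # since multiple channels can share a TCP connection.
        conn = self._connections.setdefault(
            (host, port), SharedConnection(host, port)
        )
        conn.channels[channel_id] = resource
        self._by_channel[channel_id] = conn
        return conn

    def resource_for(self, channel_id):
        # Every message carries the channel identifier, so the identifier
        # alone selects the resource the message is addressed to.
        return self._by_channel[channel_id].channels[channel_id]

    def remove_channel(self, channel_id):
        conn = self._by_channel.pop(channel_id)
        del conn.channels[channel_id]
        if not conn.channels:
            # Last channel removed: the shared connection is no longer needed.
            del self._connections[(conn.host, conn.port)]

    def connection_count(self):
        return len(self._connections)


# Illustrative values only (documentation address, arbitrary port).
table = ChannelTable()
a = table.add_channel("192.0.2.10", 32416, "32AECB234338@speechsynth", "synthesizer")
b = table.add_channel("192.0.2.10", 32416, "32AECB234338@speechrecog", "recognizer")
```

In this sketch both channels end up on the same SharedConnection object, and removing the last channel drops the connection entry. A real client would additionally tie each entry to the SIP dialog that negotiated it.
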
524    The Session Initiation Protocol (SIP) also establishes the media
525    sessions between the client (or other source/sink of media) and the
526    MRCPv2 server using SDP m-lines.  One or more media processing
527    resources may share a media session under a SIP session, or each
528    media processing resource may have its own media session.

530    The following diagram shows the general architecture of a system that
531    uses MRCPv2.  To simplify the diagram, only a few resources are shown.

533      MRCPv2 client                     MRCPv2 Media Resource Server
534 |--------------------|            |------------------------------------|
535 ||------------------||            ||----------------------------------||
536 || Application Layer||            ||Synthesis|Recognition|Verification||
537 ||------------------||            || Engine  |  Engine   |   Engine   ||
538 ||Media Resource API||            ||    ||   |    ||     |    ||      ||
539 ||------------------||            ||Synthesis|Recognizer |  Verifier  ||
540 || SIP  |  MRCPv2   ||            ||Resource | Resource  |  Resource  ||
541 ||Stack |           ||            ||     Media Resource Management    ||
542 ||      |           ||            ||----------------------------------||
543 ||------------------||            ||   SIP  |        MRCPv2           ||
544 ||   TCP/IP Stack   ||---MRCPv2---||  Stack |                         ||
545 ||                  ||            ||----------------------------------||
546 ||------------------||----SIP-----||           TCP/IP Stack           ||
547 |--------------------|            ||                                  ||
548          |                        ||----------------------------------||
549         SIP                       |------------------------------------|
550          |                           /
551 |-------------------|             RTP
552 |                   |             /
553 | Media Source/Sink |------------/
554 |                   |
555 |-------------------|

557                      Figure 1: Architectural Diagram

559 3.1.  MRCPv2 Media Resource Types

561    An MRCPv2 server may offer one or more of the following media
562    processing resources to its clients.
563    Basic Synthesizer
564                   A speech synthesizer resource with very limited
565                   capabilities, that can generate its media stream
566                   exclusively from concatenated audio clips.
The speech 567 data is described using a limited subset of the Speech 568 Synthesis Markup Language (SSML) 569 [W3C.REC-speech-synthesis-20040907] elements. A basic 570 synthesizer MUST support the SSML tags , 571