Hi Dan,

I understand your explanation about all these "vendor
specific" parameters. I think that since this is a standards track document
there should be some text explaining the usage of these parameters, as well as
a note that since these carry vendor-specific information you cannot
compare the values coming from different vendors.

As for my comment number 5 on payload type 96: my comment was
that if the m-line has a payload type number of 96 you must have an a=rtpmap
line mapping 96 to a specific subtype name, while for pcmu it is not mandatory
to have a=rtpmap like you have in your examples, since payload type number 0 is
a static payload type number assigned to pcmu.

Roni Even

From: Dan Burnett [mailto:dburnett at voxeo.com]

On Jul 7, 2009, at 3:40 PM, Roni Even wrote:
Hi,

I was assigned to do a RAI review of the draft. The draft looks ready for
publication to me. I have some comments, mostly editorial. The
only issue I see that is not purely editorial is the issue of the different
parameters like confidence threshold, sensitivity level (see comments 11, 13,
15, 16 and 17). I think that some clarification on the semantics and the scale
(for example, are the values linearly spaced) as well as when they are useful
will be helpful to implementers.

1. In figure 1
Expand the abbreviations TTS, ASR, SV, SI and how they are related to the
media resource types in 3.1
Done. Added some text explaining Figure 1 and enhanced
Figure 1 slightly for clarification.
2. In figure 1
there is a SIP dialog between the MRCPv2 client and the media source/sink. What
is this dialog? I only saw in section 4 a dialog between the client and server.
Clarified in the first example of section 4.2 that the
SIP dialog with the media source/sink is not shown.
Fixed.
4. In the example in section 4.2
you have “a=cmid:1”; cmid is specified later in the document, so maybe
you can add some reference to where it is specified.
Done.
5. In the example in section 4.2
and in following examples you have “m=audio 49170 RTP/AVP 0 96” but
do not have an rtpmap parameter for mapping 96 (dynamic payload type number) to
a media encoding name.
It is not in the first or third examples (Synthesizer only),
but it is in the second example (Recognizer). I have removed 96 as an
option for the Synthesizer-only examples but let it remain as an addition for
the Recognizer example.
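The distinction in comment 5 (static payload type 0 needs no a=rtpmap, dynamic payload type 96 does) can be checked mechanically. The following is a minimal Python sketch, not part of the draft. The function name `missing_rtpmaps` and the sample SDP fragments, including the `telephone-event/8000` mapping, are illustrative assumptions.

```python
import re

# Payload types 96-127 are dynamic and carry no pre-assigned encoding name.
DYNAMIC = range(96, 128)


def missing_rtpmaps(sdp):
    """Return dynamic payload types offered on RTP m-lines without an a=rtpmap."""
    sections = []  # one (offered dynamic types, mapped types) pair per m-line
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            fields = line[2:].split()
            is_rtp = len(fields) > 3 and fields[2].startswith("RTP/")
            fmts = fields[3:] if is_rtp else []
            offered = [int(f) for f in fmts if f.isdigit() and int(f) in DYNAMIC]
            sections.append((offered, set()))
        elif line.startswith("a=rtpmap:") and sections:
            match = re.match(r"a=rtpmap:(\d+)\s", line)
            if match:
                sections[-1][1].add(int(match.group(1)))
    return [pt for offered, mapped in sections for pt in offered if pt not in mapped]


# Hypothetical SDP fragments for illustration only.
without_map = "m=audio 49170 RTP/AVP 0 96\r\na=recvonly\r\n"
with_map = "m=audio 49170 RTP/AVP 0 96\r\na=rtpmap:96 telephone-event/8000\r\n"

print(missing_rtpmaps(without_map))
print(missing_rtpmaps(with_map))
```

Payload type 0 is never reported because it is outside the dynamic range, which matches the point that an a=rtpmap line for pcmu is optional.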
6. In section 4.3 “Also note
that more that one media session can be associated with a single resource if
need be, but this scenario is not useful for the current set of
resources”. There is a typo: the second “that” should be
“than”. I am also not sure if the current syntax in this document
can support the mode.
Fixed the typo.
7. In section 4.3 “The
formatting of the"cmid" attribute in SDP RFC3388 [RFC4566]”. I
think you meant SDP grouping and need the reference to RFC 3388.
I removed the reference altogether because it already exists
(correctly) earlier in the paragraph.
8. In section 5.1 “The
message-length field specifies the length of the message, including the
start-line”: is the length in bytes? There is no unit specified.
Changed "length of the message" to "length of
the message in bytes".
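Because the message-length counts the start-line that contains it, a sender has to account for the digits of the length value itself. Below is a minimal Python sketch of one way to do that. It assumes a request start-line of the form `MRCP/2.0 <message-length> <method-name> <request-id>`; the helper name `build_request` and the header values are hypothetical.

```python
def build_request(method, request_id, headers, body=b""):
    """Assemble a request whose message-length covers the whole message in bytes."""
    header_block = "".join(f"{k}: {v}\r\n" for k, v in headers.items()).encode("utf-8")
    rest = header_block + b"\r\n" + body

    # The length field is part of the start-line it measures, so iterate until
    # the value written into the start-line equals the total byte count.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(rest)
        if total == length:
            return start_line + rest
        length = total


# Hypothetical header values for illustration only.
message = build_request(
    "SPEAK",
    543257,
    {"Channel-Identifier": "32AECB23433802@speechsynth", "Content-Length": "2"},
    b"hi",
)
declared = int(message.split(b" ")[1])
print(declared == len(message))
```

The loop settles after a few passes because the total only grows when the number of digits in the length grows.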
9. In section 6.3.1, typo: you have
“Verfication” instead of verification. It appears twice in the
section.
Fixed.
10. In the example in section 7 you
have “m=audio 0 RTP/AVP 0 1 3”; payload type 1 was deleted from the
IANA registry, so maybe use another payload type number.
I just removed that payload type. It is not germane to
the example.
11. In section 9.4.1, 9.4.2 and
9.4.3 you specify confidence threshold, sensitivity level and speed vs
accuracy. What is the scale here; is it linear between 0 and 1? What is the
absolute value of the number: if you receive the same confidence level from two
recognizers, are they the same (e.g. when using context block to switch
servers)? For the speed vs accuracy, how does the client know what is the
relation between the value and the number of available sessions, since this
seems to be the reason for using this parameter?
The interpretation of all of these parameters is
implementation-specific because the underlying technologies used to implement
them vary and can even be proprietary. In practice the speech recognition
and synthesis and speaker authentication communities have lived with this state
of affairs for many years, and users of other APIs for this technology are well
aware of and have built applications that accommodate this variability in
interpretation. It is outside the scope of this specification to attempt
to standardize interpretations of these values.
12. In 9.4.9 and in 10.4.8, 11.4.11
what are the values for media-type-value? You also mention audio and video but
it looks to me that this document only discusses voice.
Yes. Although the original intent was to record
speech, application authors today are beginning to look at ways to incorporate
other audio or video. The intent of the sentences in these sections is to
clarify that the specification itself imposes no restriction on the types of media
that are allowed.
13. In 9.4.35 and 9.4.36 what is
the scale for the consistency here? How does one know what close means? What is
the consistency between different recognizers?
The answer to question 11, above, applies here as well.
14. In section 9.6.3.3 in the
example (figure 2) confidence should be 0.75 and not 75.
Fixed.
15. In section 10.4.1 it is not
clear how you measure the sensitivity in order to specify it; is it based on some
SNR translated to a 0 to 1 scale?
The answer to question 11, above, applies here as well.
16. In 11.4.6 the same issue with
the scale: how does the client know how to set a value when working with
different speaker verification servers?
Ditto. I should point out that in all of these cases
the parameters are typically passed directly to the engine, and their
interpretations are defined (and described) in the vendors' documentation.
The most common MRCPv2 server implementations are by the technology
vendors themselves (the providers of the synthesis, recognition, and
verification engines). This is commonly understood in this technology
industry (meaning those who use this technology regularly).
17. In 11.5.2.9 you state that the
verification-score is not a probability, so what is it? How can the client
decide if, for example, 0 is a good score for specifying the threshold? I
also noticed that the values in the example in section 11.5.2.10 are very
precise, like 0.98514; is this the expected precision? The examples here and in
section 11.11 do not show the threshold; if the threshold is required for this
flow why not show it in the example?
This parameter, as others mentioned above, has only a
vendor-specific interpretation. In practice authors interpret these
values based both on guidance from the technology vendors and via
experimentation on large sets of recorded data.
The Min-Verification-Score threshold is not required to be
set. In many cases the technology vendor has a fairly good understanding
of what the default threshold should be. The verification-score is
returned, however, in case the application author determines (through
experimentation, as described above) that the default threshold is not
producing optimal results for the application. In that case the author
can set the threshold to a different value or can set it to -1 and make the
determination within the application itself based on the verification-score
values.
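As an illustration of the experimentation described above, an application author might derive an application-side threshold from verification-score values collected on recorded data. The sketch below is only one possible approach, not vendor guidance and not part of the draft. The function names and the sample scores are hypothetical, and it simply picks the candidate threshold with the fewest combined false accepts and false rejects.

```python
def pick_threshold(genuine_scores, impostor_scores):
    """Return the observed score that minimizes false accepts plus false rejects."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best_threshold, best_errors = None, None
    for threshold in candidates:
        false_rejects = sum(1 for s in genuine_scores if s < threshold)
        false_accepts = sum(1 for s in impostor_scores if s >= threshold)
        errors = false_rejects + false_accepts
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accept_speaker(verification_score, app_threshold):
    """Application-side decision, made when the server-side threshold is not relied on."""
    return verification_score >= app_threshold


# Hypothetical scores from recorded trials; real values are vendor-specific.
genuine = [0.6, 0.7, 0.9]
impostor = [-0.2, 0.1, 0.65]

threshold = pick_threshold(genuine, impostor)
print(threshold, accept_speaker(0.7, threshold))
```

Because the scores have only a vendor-specific interpretation, a threshold tuned this way would need to be re-derived for each verification engine.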
18. In section 12.3 the suggestion
is to use SRTP as the mandatory interoperability mode. If the reason for
mandating SRTP is for a common mode you should also decide on a key exchange
mechanism. I suggest you look at http://tools.ietf.org/html/draft-ietf-avt-srtp-not-mandatory-02 for discussion on media security.
Based on the discussion between you and Dan York on the
list, I will change this:

12.3. Media session protection

Sensitive data is also carried on media sessions terminating on MRCPv2
servers (the other end of a media channel may or may not be on the MRCPv2
client). This data includes the user's spoken utterances and the output of
text-to-speech operations. MRCPv2 servers MUST support SRTP for protection
of audio media sessions. MRCPv2 clients that originate or consume audio
similarly MUST support SRTP. Alternative media channel protection MAY be
used if desired (e.g. IPSEC).

to this:

12.3. Media session protection

Sensitive data is also carried on media sessions terminating on MRCPv2
servers (the other end of a media channel may or may not be on the MRCPv2
client). This data includes the user's spoken utterances and the output of
text-to-speech operations. MRCPv2 servers MUST support a security mechanism
for protection of audio media sessions. MRCPv2 clients that originate or
consume audio similarly MUST support a security mechanism for protection of
the audio. If appropriate, usage of the Secure Real-time Transport Protocol
(SRTP) [RFC3711] is recommended.

I have corrected both in section 13.7.2 to be media-level.
Thanks
Roni Even