
    OnCall IVR Designer v7.11.2, part of OnCall IVR Suite v3.3.x

    Speech Recognition

    Speech recognition allows the IVR system to understand and process spoken language. The recognizer matches spoken words and phrases against defined grammars, and the recognition result is then used to control the flow of the IVR system.

    To enable speech recognition on a node or template, you must define grammar rules that the speech engine will use to recognize spoken words and phrases. These grammar rules are specified in the Grammars management section of the IVR Designer.

    Note

    The language of the speech recognizer engine can be specified on the Start node of the flow.

    Before selecting a language, make sure that the associated language package is installed on the IVR system's speech recognizer engine.

    Defining Grammars

    To add a new grammar, click the Add grammar icon in the Grammars section. Fill in the input fields in the Add or edit grammar dialog.

    The grammar's name should be a unique and descriptive identifier, used for easy identification in the list of grammars.

    The content of a grammar is a list of words and phrases that the speech engine will recognize. Each word or phrase should be on a separate line.

    The value of the grammar contains the grammar rules in XML form, adhering to the SRGS 1.0 standard.

    The defined grammar value is inserted into a template and must therefore start at the rule level, with a primary rule carrying the id="main" attribute. An example of a simple grammar definition is shown below:

      <rule id="main" scope="public">
        <one-of>
          <item>
            accounting
            <tag>out.selected_option="ACCOUNTING";</tag>
          </item>
          <item>
            credit management
            <tag>out.selected_option="CREDIT_MANAGEMENT";</tag>
          </item>
          <item>
            other
            <tag>out.selected_option="OTHER";</tag>
          </item>
        </one-of>
      </rule>
    
    Caution

    All changes made to grammars will be applied permanently only after saving the flow. Closing the flow without saving will result in the loss of all changes made to grammars.

    An existing grammar can be edited or deleted by clicking the Edit or Delete icon next to it in the list.

    DTMF grammars

    In the Grammars section, you can define both voice and DTMF grammars. A DTMF grammar can be used by a DTMF detector to determine sequences of legal and illegal DTMF tones. The grammar is defined in the same way as a voice grammar, but instead of spoken words and phrases, it contains DTMF tokens.

    An example of a simple DTMF grammar definition is shown below:

      <rule id="main" scope="public">
        <one-of>
          <item>
            1
            <tag>out.selected_option="ACCOUNTING";</tag>
          </item>
          <item>
            2
            <tag>out.selected_option="CREDIT_MANAGEMENT";</tag>
          </item>
          <item>
            3
            <tag>out.selected_option="OTHER";</tag>
          </item>
        </one-of>
      </rule>
    
    Note

    The mode of the grammar (speech or dtmf) is determined by the node attribute in which the grammar is used (grammar or grammar_dtmf). The mode is set in the grammar template based on that node attribute.

    Important

    A grammar can only have one mode at a time: either 'dtmf' or 'speech'.

    Template of the Grammar

    The grammar template is defined in the grammar_template.xml file, located in the .\EndPoint\Templates\ folder within the IVR Engine installation directory.

    The default template is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" xml:lang="{LANGUAGE_CODE}" mode="{INPUT_MODE}" root="main" tag-format="semantics/1.0">
      {INPUT_SRGS_FRAGMENT}
    </grammar>
    

    The LANGUAGE_CODE placeholder is substituted with the ISO language code of the selected speech recognizer language package. Similarly, the INPUT_SRGS_FRAGMENT placeholder is replaced with the grammar value defined in the IVR Designer. The INPUT_MODE placeholder is replaced with the SRGS mode of the grammar: voice for speech grammars, dtmf for DTMF grammars.
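The assembly step can be pictured as a simple string substitution. The sketch below is illustrative only: the real substitution happens inside the IVR Engine, and only the placeholder names mirror the template above.

```javascript
// Illustrative sketch of how the engine fills the grammar template.
// Only the placeholder names ({LANGUAGE_CODE}, {INPUT_MODE},
// {INPUT_SRGS_FRAGMENT}) come from the documented template; the
// function itself is not part of the product.
const GRAMMAR_TEMPLATE =
  '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" ' +
  'xml:lang="{LANGUAGE_CODE}" mode="{INPUT_MODE}" root="main" ' +
  'tag-format="semantics/1.0">\n' +
  '  {INPUT_SRGS_FRAGMENT}\n' +
  '</grammar>';

function buildGrammar(languageCode, inputMode, srgsFragment) {
  return GRAMMAR_TEMPLATE
    .replace('{LANGUAGE_CODE}', languageCode)
    .replace('{INPUT_MODE}', inputMode)
    .replace('{INPUT_SRGS_FRAGMENT}', srgsFragment);
}

// Example: an en-US voice grammar built from a minimal rule fragment.
const xml = buildGrammar(
  'en-US', 'voice',
  '<rule id="main" scope="public">accounting</rule>');
```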

    A fully assembled grammar looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" xml:lang="en-US" mode="voice" root="main" tag-format="semantics/1.0">
      <rule id="main" scope="public">
        <one-of>
          <item>
            accounting
            <tag>out.selected_option="ACCOUNTING";</tag>
          </item>
          <item>
            credit management
            <tag>out.selected_option="CREDIT_MANAGEMENT";</tag>
          </item>
          <item>
            other
            <tag>out.selected_option="OTHER";</tag>
          </item>
        </one-of>
      </rule>
    </grammar>
    

    Using speech recognition in the flow

    When a node with speech recognition (PromptSR, PromptJumpSR) is executed, the outcome of the speech recognition is stored in the srresult variable. The srcompletioncode variable contains the completion code of the speech recognition operation.

    The srcompletioncode variable can take the following numeric values, corresponding to the MRCPv2 Completion-Cause codes:

    srcompletioncode value   Cause-Code   Cause-Name
    0                        000          success
    1                        001          no-match
    2                        002          no-input-timeout
    3                        003          hotword-maxtime
    4                        004          grammar-load-failure
    5                        005          grammar-compilation-failure
    6                        006          recognizer-error
    7                        007          speech-too-early
    8                        008          success-maxtime
    9                        009          uri-failure
    10                       010          language-unsupported
    11                       011          cancelled
    12                       012          semantics-failure
    13                       013          partial-match
    14                       014          partial-match-maxtime
    15                       015          no-match-maxtime
    16                       016          grammar-definition-failure
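When branching on srcompletioncode in a flow, it can help to group the causes into broad outcomes. The mapping below mirrors the table above; the function name and the grouping itself are illustrative choices, not a product-defined classification.

```javascript
// Completion-Cause names from the table above, indexed by
// srcompletioncode value (0..16).
const CAUSE_NAMES = [
  'success', 'no-match', 'no-input-timeout', 'hotword-maxtime',
  'grammar-load-failure', 'grammar-compilation-failure', 'recognizer-error',
  'speech-too-early', 'success-maxtime', 'uri-failure',
  'language-unsupported', 'cancelled', 'semantics-failure',
  'partial-match', 'partial-match-maxtime', 'no-match-maxtime',
  'grammar-definition-failure',
];

// Groups the causes into three outcomes a flow typically distinguishes:
// a usable result, a caller-input problem worth re-prompting, or a
// grammar/engine failure. The grouping is illustrative only.
function classifyCompletionCode(code) {
  if (code === 0 || code === 8) return 'success';           // success, success-maxtime
  if ([1, 2, 7, 13, 14, 15].includes(code)) return 'retry'; // caller-input problems
  return 'error';                                           // grammar/engine failures
}
```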

    When speech recognition is unsuccessful, the srresult variable contains a null value. When the operation succeeds (srcompletioncode value 0, success), the srresult variable contains a JSON object representing the recognition result in NLSML format. The JSON object has the following schema:

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties": {
        "interpretations": {
          "type": "array",
          "description": "Array of interpretations of the recognized speech. Each interpretation contains the confidence level and recognized instances of the spoken words or phrases.",
          "items": {
            "type": "object",
            "properties": {
              "confidence": {
                "type": "integer",
                "description": "Confidence level of the interpretation, ranging from 0 to 100."
              },
              "instances": {
                "type": "object",
                "description": "Object representing the data of the instance.",
                "patternProperties": {
                  "^[a-zA-Z0-9_]+$": {
                    "type": "string"
                  }
                },
                "additionalProperties": false
              },
              "inputMode": {
                "type": "string",
                "enum": ["speech", "dtmf"],
                "description": "The modality of the input. Optional."
              },
              "inputText": {
                "type": "string",
                "description": "The recognized input text. Optional."
              }
            },
            "required": ["confidence", "instances", "inputMode", "inputText"],
            "additionalProperties": false
          }
        }
      },
      "required": ["interpretations"],
      "additionalProperties": false
    }
    

    A simple example of the srresult variable value is shown below:

    {
      "interpretations": [
        {
          "confidence": 90,
          "instances": {
            "selected_option": "ACCOUNTING"
          },
          "inputMode": "speech",
          "inputText":"accounting"
        }
      ]
    }
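A flow script will usually take the highest-confidence interpretation and read its instance values. Below is a hedged sketch that assumes srresult has already been parsed into the object shape described by the schema above; the helper name is hypothetical, not part of the product API.

```javascript
// Picks the interpretation with the highest confidence from an srresult
// object and returns one named instance value, or null when nothing
// usable was recognized. Helper name is illustrative only.
function getBestInstance(srresult, key) {
  if (!srresult || !Array.isArray(srresult.interpretations) ||
      srresult.interpretations.length === 0) {
    return null; // unsuccessful recognition: srresult is null or empty
  }
  const best = srresult.interpretations.reduce(
    (a, b) => (b.confidence > a.confidence ? b : a));
  return best.instances[key] !== undefined ? best.instances[key] : null;
}

// The example srresult value shown above:
const srresult = {
  interpretations: [
    {
      confidence: 90,
      instances: { selected_option: 'ACCOUNTING' },
      inputMode: 'speech',
      inputText: 'accounting',
    },
  ],
};

// getBestInstance(srresult, 'selected_option') → "ACCOUNTING"
```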
    

    PromptSR node

    The PromptSR node prompts the caller and collects input via speech or DTMF, based on a grammar.

    The PromptSR node has the following parameters:

    • interruptable - If set to true, the caller can interrupt the prompt by speaking or pressing DTMF keys. If set to false, the caller cannot interrupt the prompt.
    • grammar - The voice grammar that the speech engine will use to recognize the spoken word or phrase.
    • grammar_dtmf - Optional. The DTMF grammar that the speech engine will use to recognize the pressed DTMF keys.
    • confidencethreshold - The confidence level that the speech engine will use to determine if the spoken word or phrase matches the grammar.
    • maxresults - The maximum number of recognized words or phrases that the speech engine will return.
    • timeout - The time in seconds that the speech engine will wait for the caller to speak or press DTMF keys before timing out.

    The PromptSR node has the following output transitions:

    • next - The flow will proceed to this transition if the speech engine recognizes the spoken word or phrase or the pressed DTMF keys, or if the recognition completes with a valid completion cause (e.g., 001 no-match). The srcompletioncode and srresult variables are set when this transition is taken.
    • error - The flow will proceed to this transition if the speech recognition operation fails or if other errors occur.


    PromptJumpSR node

    The PromptJumpSR node is used to prompt the caller to select an option via speech or DTMF based on grammar. The flow will jump to the next node associated with the selected option.

    The PromptJumpSR node has the following parameters:

    • interruptable - If set to true, the caller can interrupt the prompt by speaking or pressing DTMF keys. If set to false, the caller cannot interrupt the prompt.
    • grammar - The voice grammar that the speech engine will use to recognize the spoken word or phrase.
    • grammar_dtmf - Optional. The DTMF grammar that the speech engine will use to recognize the pressed DTMF keys.
    • confidencethreshold - The confidence level that the speech engine will use to determine if the spoken word or phrase matches the grammar.
    • timeout - The time in seconds that the speech engine will wait for the caller to speak or press DTMF keys before timing out.
    • grammar_output - A variable which holds the name of the voice grammar output that needs to be evaluated to determine the next node.
    • grammar_dtmf_output - Optional. A variable which holds the name of the DTMF grammar output that needs to be evaluated to determine the next node. It has to be used in conjunction with the grammar_dtmf parameter.
    • [condition_next1 | condition_next2 | .. | condition_next12] - The value of the grammar output that will cause the flow to jump to the [next1 | next2 | .. | next12] transition.

    The PromptJumpSR node has the following output transitions:

    • [next1 | next2 | .. | next12] - The flow will proceed to this transition if the speech engine recognizes the spoken word or phrase or if DTMF keys are pressed, and the grammar output matches the value of [condition_next1 | condition_next2 | .. | condition_next12].
    • nextnocondition - The flow will proceed to this transition if the speech engine recognizes the spoken word or phrase or the pressed DTMF keys, but the grammar output does not match any of the specified conditions.
    • nextelse - The flow will proceed to this transition if the speech engine fails to recognize the spoken word or phrase or the pressed DTMF keys, or if a valid completion cause other than success is returned (e.g., 001 no-match).
    • error - The flow will proceed to this transition if the speech recognition operation fails or if other errors occur.
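The transition selection above can be pictured as a simple lookup: the value of the configured grammar output is compared against each condition_nextN in order. The sketch below is purely illustrative; the parameter and transition names follow the lists above, but the function is not part of the product.

```javascript
// Illustrative sketch of PromptJumpSR transition selection: compare the
// recognized grammar output value against condition_next1..condition_next12
// and return the name of the matching transition. Not product code.
function selectTransition(outputValue, conditions) {
  if (outputValue === null || outputValue === undefined) {
    return 'nextelse'; // nothing was recognized
  }
  for (let i = 0; i < conditions.length && i < 12; i++) {
    if (conditions[i] === outputValue) {
      return 'next' + (i + 1); // condition_next(i+1) matched
    }
  }
  return 'nextnocondition'; // recognized, but no condition matched
}

// Example: conditions taken from the sample grammar's output values.
const conditions = ['ACCOUNTING', 'CREDIT_MANAGEMENT', 'OTHER'];
```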


    Article updated 5/3/2024 2:34:18 PM (UTC)
    Copyright © Geomant