Speech Support in Bot Framework – from Web Chat to DirectLine to Cortana

The Bot Framework now supports speech as a method of interacting with the bot across Webchat, the DirectLine channel, and Cortana. In this article we’ll go over the new capabilities, speech recognition priming using LUIS, and a new NuGet package we’ve released which supports speech recognition and synthesis on the DirectLine channel.

Web Chat now includes speech capabilities

If you’ve ever created and registered a bot on the Microsoft Bot Framework, you may already be familiar with the Web Chat channel. This channel is automatically configured for your bot when it is registered, and enables users to interact with your bot via a web chat control using text. Developers can also easily embed the web chat control in websites by using the <iframe> HTML element and passing in the bot’s credentials.
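
For example, a minimal embed looks something like the following, where YOUR_BOT_HANDLE and YOUR_WEB_CHAT_SECRET are placeholders for the values from your bot’s Web Chat channel configuration page:

<iframe src="https://webchat.botframework.com/embed/YOUR_BOT_HANDLE?s=YOUR_WEB_CHAT_SECRET"
        style="width: 400px; height: 500px;"></iframe>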

Below is a screenshot of a web chat control connected to our sample trivia bot. Notice anything different?

There’s a little microphone icon in the bottom right-hand corner, which the user can use to initiate a conversation using speech, and the bot’s response can also include a spoken utterance. Users can now easily have spoken conversations with bots, in addition to typing in text or selecting from UI touch menus.

To give this a try, check out the WebChat repo on GitHub and the sample at samples/speech/index.html.

Updated Message Activity

For a bot to have speech-enabled interactions, there are several things it must be able to do:

1) Understand the user’s speech
2) Speak back to the user
3) Automatically listen when the bot has asked the user a question
4) Stop listening or speaking if the user begins interacting in another way (text, touch)

Within the Bot Framework, we’ve made it very simple to implement speech-enabled conversation for bots using two new fields in IMessageActivity – Speak and InputHint as shown below.

IMessageActivity {
    ...
    string Speak; // SSML or plain text
    enum InputHint; // acceptingInput, expectingInput, or ignoringInput
    ...
}

The Speak field accommodates both plain text and SSML (Speech Synthesis Markup Language), which the bot uses to specify how a client such as Web Chat should synthesize audio to speak the response back to the user. Here’s an example from the sample trivia bot connected to the Bot Framework Emulator:

In the ‘Details’ viewer in the emulator, we can view the JSON payload of a message. You can see the SSML passed into the speak property of the JSON response message and, within the SSML, the phrase to be spoken. Note that SSML also supports pre-recorded audio using the audio element, so you can even include sound effects or pre-recorded phrases for deep customization.
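
An abridged, illustrative payload for such a message (the trivia question shown here is made up) looks something like this:

{
  "type": "message",
  "text": "Which planet is closest to the sun?",
  "speak": "<speak version=\"1.0\" xml:lang=\"en-US\">Which planet is closest to the sun?</speak>",
  "inputHint": "expectingInput"
}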

You’ll also note the inputHint property in the response, which the bot can use to specify its current interaction mode, i.e. whether the bot is accepting, expecting, or ignoring a user’s input. In this example, the bot’s response is to ask a question and the inputHint is set to expectingInput, meaning the bot is awaiting input from the user. On many channels, this would open the microphone and enable the client’s input box, allowing a natural multi-turn conversation without the user needing to manually initiate speech recognition by clicking/tapping.

Now let’s take a look at the next response from the Bot after the user has answered the question:

The Bot informs us that the answer was incorrect, and the inputHint field here is set to ignoringInput. As the name suggests, in this message from the bot, it is ignoring user input (because it’s busy picking the next question to ask), and intends to send us subsequent messages. Depending on the client/channel, this may cause the client’s input box and microphone to be disabled.

Lastly, here’s an example of acceptingInput after the user says something the bot does not understand. The difference from expectingInput is that the bot is not asking a question but is passively ready for input in case the user responds. Speech-enabled clients would not automatically start listening but would allow the user to tap the microphone button or type to initiate a response.

As a developer, setting inputHint is optional; the Bot Builder SDK will automatically implement the logic for you if you’re using system-provided dialogs such as Prompt and Say. If you’re creating custom responses, you should set inputHint explicitly.

If you are building a bot on a speech-enabled channel such as Cortana, these are the only two properties you need to implement to construct messages which will be spoken by your bot. Told you we’ve made it simple!
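
For instance, here’s a minimal sketch of how a custom response could be constructed from inside a Bot Builder SDK v3 (C#) dialog; the question text and SSML are illustrative:

// Inside an async dialog method that has an IDialogContext named 'context'
var reply = context.MakeMessage();
reply.Text = "Which planet is closest to the sun?";
reply.Speak = "<speak version=\"1.0\" xml:lang=\"en-US\">Which planet is closest to the sun?</speak>";
reply.InputHint = InputHints.ExpectingInput; // we're asking a question, so clients should open the microphone
await context.PostAsync(reply);

For simpler cases where the spoken text is just a string, the Bot Builder SDK for .NET also provides a SayAsync helper on the dialog context that takes the display text and the speak text directly.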

The documentation includes an overview of how to add speech to messages.

Intent-based Speech Priming for natural language

Without context, speech can often be misinterpreted. Different people can hear the exact same thing and interpret and understand it completely differently. Bots can misinterpret in the same way, and this often leads to unpleasant user experiences. For example, in a chess scenario, a user might say:

“Move knight to A 7”

Without context for the user’s intent, the utterance might be recognized as:

“Move night 287”

We now support speech recognition priming, which allows you to provide context via your bot, to ensure that speech relevant to your scenario is recognized accurately. Many bot developers already use LUIS to extract the meaning behind the user’s text-based input. LUIS is able to do this since it’s trained using example utterances to capture what the user is likely to say as well as the context. Speech recognition priming uses the utterances and entity tags in your LUIS models to improve accuracy and relevance while converting audio to text.

In our chess scenario sample, we created an intent called MakeChessMove, and created two custom entities: ChessPiece and ChessCoordinate.

Then, we provide sample utterances that a user might say, tag the entities within them, and train and publish our model.
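
For illustration, labeled utterances for the MakeChessMove intent might look something like this (these particular examples are hypothetical):

  • “move knight to a 7” (knight → ChessPiece, a 7 → ChessCoordinate)
  • “move my rook to d 5” (rook → ChessPiece, d 5 → ChessCoordinate)
  • “take the pawn on e 4 with my bishop” (bishop, pawn → ChessPiece; e 4 → ChessCoordinate)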

How do we configure this model for speech? Good news, that’s the easy part! When you register a new bot using a LUIS model, behind the scenes the Bot Framework will automatically leverage the contents of the LUIS model to create a speech model. After you create a bot, to enable speech recognition priming, simply go to your bot’s settings page, where there is a new field for speech recognition options, as shown below:

From here you only need to check the relevant LUIS models you want to associate with your bot to enable speech recognition priming. If you associate your LUIS app with your bot in this way, speech priming is enabled in Cortana, the Bot Framework Emulator, the speech-enabled web chat control, and DirectLine via the Microsoft.Bot.Client NuGet package. This speech model is automatically updated any time you train and publish your LUIS model.

Things to note:

  • Speech recognition priming already supports LUIS built-in entities.
  • For custom entities, we use the tags associated with the entity definition in your LUIS app.
  • LUIS Phrase list features (not to be confused with a closed list entity) are currently not used in priming speech.

If you find that a specific spoken phrase isn’t being recognized correctly, please add it as an utterance in your LUIS model. Similarly, if an entity value isn’t being recognized correctly, make sure you have an example utterance for that entity value and that the appropriate words are tagged as the entity.

Cross-platform speech support in your app using the DirectLine channel

We’ve released a new NuGet package, Microsoft.Bot.Client, which allows you to embed your bot in your applications and also supports both speech recognition and speech synthesis. The library supports both UWP and C# Xamarin applications, allowing developers to include speech-enabled conversations with bots across different platforms (native iOS, Android, Windows).

NOTE: You will need to enable the DirectLine channel in your bot to use this package.

NOTE: Speech recognition is supported across platforms. However, intent based speech priming is currently only supported for Windows clients.

In our sample UWP application, to connect to the Trivia Bot we create a new instance of the client with the proper authorization to access the bot’s DirectLine channel:

_botClient = new Microsoft.Bot.Client.BotClient(
    BotConnection.DirectLineSecret,
    BotConnection.ApplicationName
){
    // Use the Cognitive Services Speech-to-Text API (with speech priming support) as the speech recognizer,
    // and the Text-to-Speech API as the speech synthesizer.
    // Alternate/custom speech recognizer & synthesizer implementations are supported as well.
    SpeechRecognizer = new CognitiveServicesSpeechRecognizer(BotConnection.BingSpeechKey),
    SpeechSynthesizer = new CognitiveServicesSpeechSynthesizer(BotConnection.BingSpeechKey, Microsoft.Bot.Client.SpeechSynthesis.CognitiveServices.VoiceNames.Jessa_EnUs)
};

This client can be used to register for events related to speech recognition, speech synthesis, updating UI state, starting and ending conversations, and so on.

// Attach to the callbacks the client provides for observing the state of the bot
// This will be called every time the bot sends down an activity
_botClient.ConversationUpdated += OnConversationUpdated;

// Speech-related events
_botClient.SpeechRecognitionStarted += OnSpeechRecognitionStarted;
_botClient.IntermediateSpeechRecognitionResultReceived += OnIntermediateSpeechRecognitionResultReceived;
_botClient.SpeechRecognitionEnded += OnSpeechRecognitionEnded;
_botClient.FinalSpeechRecognitionResultReceived += OnFinalSpeechRecognitionResultReceived;
_botClient.SpeechSynthesisEnded += OnSpeechSynthesisEnded;

// Kick off the conversation
_startConversationTask = _botClient.StartConversation();

Why would a developer want to use this NuGet package? Well, it leverages the flexibility of the DirectLine channel and allows you to develop a custom UI for your application.

What is the DirectLine Channel anyway?

The DirectLine channel allows the developer to connect to their bot from anywhere. This means that you can build a completely custom client and maintain complete control over the end-to-end experience. When you connect a bot to other channels, say Facebook, Skype, Cortana, Slack, etc., you are ‘locked in’, so to speak, to developing specifically to accommodate those clients.

In fact, the Bot Emulator, which we commonly use to test and debug bots, is a web chat control instance using the DirectLine channel.

Using DirectLine, not only can you define your own UI with your own features, but you can connect multiple applications to a registered Bot simultaneously. First let’s take a look at the custom UI sample we built for the Trivia Bot using the new Bot.Client NuGet package.

Using the NuGet package within our sample trivia app, we have full freedom to design a new UX for interacting with the bot. In the above screenshot, we added a UI meant to bring the feel of a game show, where the bot is the host and the user is standing behind a podium. We could also bind button clicks in the UI, making the whole application feel like a custom experience as opposed to a standard chat interface. For example, the user could switch to the “Geography” category by saying something like “switch to the geography category” OR by just clicking the “Geography” category button on the podium. The UI also keeps track of the user’s score as the game progresses.

Summary

You can now type, tap and also talk to interact with your bot. We’ve made it easy for you to add speech recognition capabilities to your own applications that can connect to your Bot, and the speech recognition is primed and specialized to your bot using the LUIS models you already have!

We’ve released a new NuGet package, Microsoft.Bot.Client, which enables voice communication for UWP and cross-platform (via Xamarin) applications over the Direct Line REST API, and we’ve also updated the Web Chat control to support speech interactions.

You can also check out our samples on GitHub.

If you missed it at Build 2017, check out the Bot conversations for apps presentation which showcases all of the concepts we discussed above.

Happy Making!

Khuram Shahid and Matthew Shim from the Bot Framework Team