Unlocking the Power of Spoken Language

The advantage of speech-enabled services is that people already know how to talk. A well-designed voice user interface (VUI) is intuitive — commands don't have to be taught, unlike the meaning of a button in a visual interface or the keys on a touchtone phone system. Still, sometimes we do have to let people know what they can say, either because they've asked for help or they're unsure how to proceed (especially if they're new users).

Here are some pointers on building a VUI that can carry on better conversations simply by leveraging the intuitive nature of spoken language.

Communicate what was understood

If a person asks a question, or asks how to perform a task or action, the VUI should communicate what the system (the "recognizer") understood (or parsed) about the question, so that people know they've been heard. This bolsters their trust in the speech technology. What we call implicit confirmations of the user's intent can be as simple as:

User: Who made the statue David?
VUI: David was created by Michelangelo.

Or be part of a more involved answer. For example:

User: How do I cancel an alarm?
VUI: If you have only one alarm and you want to cancel it, just say "Cancel my alarm." Otherwise, try saying things like "Cancel my 7 a.m. alarm" or "Cancel all my alarms."

Remember, due to the linear nature of an audio-only interface, people have no way of skipping ahead. They have to hear one word. Before. They hear. The next. Contrast this with how a screen might return a written response all at once, and can therefore get away with saying "Michelangelo, artist" (most likely accompanied by an image). The VUI, on the other hand, needs to signal to the user that what they're hearing is in fact what they asked for. Put another way, the response opens with what the user already said and presents the new information last, which is known as the end-focus principle.
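As a rough illustration of an implicit confirmation, here is a minimal Python sketch. The function name and its three parameters are hypothetical, invented for this example rather than taken from any speech platform; a real VUI would get these pieces from its language-understanding and answer components.

```python
def answer_with_implicit_confirmation(understood: str, link: str, new_info: str) -> str:
    """Build a spoken answer that restates what was understood first
    and ends on the new information (end focus).

    All three parameters are illustrative assumptions:
      understood -- the part of the question the recognizer parsed
      link       -- the phrase connecting it to the answer
      new_info   -- the fact the user actually asked for
    """
    return f"{understood} {link} {new_info}."


# The listener hears the confirmation ("David") before the answer ("Michelangelo").
print(answer_with_implicit_confirmation("David", "was created by", "Michelangelo"))
```

Because the restated part comes first, a listener who hears "David was created by" already knows the question was understood before the answer arrives.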

Offer examples that illustrate what people can say, and how

A VUI should also offer people a range of ways they can phrase their queries, or examples that illustrate levels of complexity, as in this exchange:

User: How do I set an alarm?
VUI: To set an alarm, try saying things like "Set an alarm for 6 a.m.," or "Wake me up in two hours," or "Set a 'medication' alarm for every day at 7 p.m."

Avoid stating the obvious

Uninformative messages make your voice persona (which is also your brand) sound hapless and lack depth. People don't appreciate any device underestimating their intelligence. Let's take the example of a person who asks, "How do I get the news?"

Consider the entirely unhelpful:

To get the news, say "Get the news."

And swapping words for variation doesn't add value, either:

To listen to the news, say "Hear the news."

The same person probably knows or has tried framing the query just that way before. People likely are trying to explore what else they can do. Instead, build in some intuitive instruction to help them infer the broader realm of possibilities:

If you're interested in recent headlines, you can say, for instance, "Tell me the latest news." Or try asking for a specific category, like technology or sports news.

Still, a VUI often needs to tell people what to say, without anyone explicitly asking for help. Here, too, you should avoid stating the obvious.

For example, instead of using a literal instruction like this one:

Editing your shopping list is easy. To add an item, just say "Add," followed by the item you'd like to add. To remove an item, just say "Remove," followed by the item you'd like to remove.

Consider being intuitively obvious instead:

By the way, to edit your list, you can say something like "Add toothpaste" or "Take off the ice cream."

Give users credit and save extra guidance for those who need it

A VUI shouldn't try to "teach" people what to say to protect them from veering off the so-called "happy path." Instruction is irrelevant for those who aren't having problems — which should be most people if you've designed an intuitive VUI. Instead, give instructions in fallback paths and in repair (error) prompts, as in the following example. This way, you optimize relevance for people who don't need help, but offer help when someone seems to be stuck.

Don't assume that everyone needs help knowing how to ask for what they want, as the VUI does here:

User: Start a metronome.
VUI: What tempo did you want to start with? You can say, for instance, "110 beats per minute." Or you can give me a tempo like "Allegro" or "Moderately fast." (We can always speed it up or slow it down later.)
User: 92 beats per minute.
VUI: Okay, 92 beats per minute. Here you go.

This is bad behavior on the part of the VUI: it asks a question and then keeps talking instead of handing the turn over to the user. The person must either wait for the lengthy initial message to finish or interrupt (if barge-in is enabled), which forces them to be a poor conversation participant as well.

Instead, consider the sequential, time-consuming nature of speech, and yield the speaking turn back to the user:

User: Start a metronome.
VUI: Sure, what tempo?
User: [No reply]
VUI: You can say, for instance, "110 beats per minute." Or you can give me a tempo like "Allegro" or "Moderately fast." (We can always speed it up or slow it down later.)
User: 92 beats per minute.
VUI: Okay, 92 beats per minute. Here you go.

Admittedly, offering suggestions right away seems like fewer steps. But in addition to actually taking longer, it presumes the user is a novice, which probably isn't the case.
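The turn-yielding pattern in the metronome dialog can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `INITIAL_PROMPT`, `HELP_PROMPT`, and `tempo_dialog` are hypothetical, `None` stands in for a no-input timeout, and a real system would also parse the tempo and handle recognition errors.

```python
from typing import Iterable, List, Optional

# Prompt wording taken from the sample dialog; the constant names are assumptions.
INITIAL_PROMPT = "Sure, what tempo?"
HELP_PROMPT = (
    'You can say, for instance, "110 beats per minute." '
    'Or you can give me a tempo like "Allegro" or "Moderately fast." '
    "(We can always speed it up or slow it down later.)"
)


def tempo_dialog(replies: Iterable[Optional[str]]) -> List[str]:
    """Return the prompts the VUI speaks for a sequence of user replies.

    A reply of None models silence (no input). The short question is
    always asked first; guidance is spoken only after a silent turn.
    This sketch repeats the same help prompt on every silent turn; a
    production design would likely vary it or eventually give up.
    """
    spoken = [INITIAL_PROMPT]
    for reply in replies:
        if reply is None:
            spoken.append(HELP_PROMPT)  # fallback path: user seems stuck
        else:
            spoken.append(f"Okay, {reply}. Here you go.")  # implicit confirmation
            break
    return spoken


# A user who knows what they want never hears the help prompt.
for line in tempo_dialog(["92 beats per minute"]):
    print("VUI:", line)
```

Passing `[None, "92 beats per minute"]` instead reproduces the second sample dialog, with the help prompt spoken only after the silent turn.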

Remember that people know what they want. Give them a chance to say it before jumping in to help.