This guide shares our key takeaways from helping developers create great storytelling and gaming experiences for Google Assistant on smart displays. We recommend reading this guide before designing your Action to create a delightful experience and avoid common mistakes. At the end of this guide, we also provide a checklist that condenses these recommendations to help you self-assess your design.
Set expectations in the introduction
To make it easy for the user to engage with your experience, your introduction should provide instructions for interacting with the Action and set clear expectations. For games, use the opening as a kind of tutorial to explain the object of the game and how it’s played. But remember, the user’s goal is to start playing quickly, so keep the introduction short (ideally under a minute). In the case of interactive stories, tell the user how long the experience is likely to be. This can be as simple as a labeled description in the GUI. Also, be sure to let them know what to expect when they’re asked for responses during the narrative.
A natural language interface opens many creative avenues for design, but it also means there’s a greater initial responsibility to articulate the goals of the interaction.
Find the balance between touch and voice interactions
Smart displays are designed to be used hands-free. In the majority of use cases, we assume that voice is the primary mode of interaction. In general, anything that a user can accomplish through touch should be able to be done through voice as well. Consider a child asking for a bedtime story from their bed, or someone playing a game while cooking with messy hands. Leveraging voice as an interface provides real value to our users.
However, users also appreciate fast interactions. Reading and tapping can sometimes be faster than listening and speaking. And, for some games, touch may be the primary interaction modality. If gameplay would be much easier for the user with tapping, guide them to use touch. Wherever possible, though, make sure the interactions are available through voice as well.
Keep the TTS (text-to-speech) brief
Text-to-speech or computer-generated voices have improved dramatically in the last few years, but they aren’t perfect. Through user testing, we’ve learned that users (especially kids) don’t like listening to long TTS messages. Of course, some content, like interactive stories, should not be reduced. However, for games, try to keep your script simple. Wherever possible, leverage the power of the visual medium and show, don’t tell. Or consider providing a skip button on the screen so that users can read and move forward without waiting for the TTS to finish. The TTS and the text on screen don’t always need to mirror each other. For example, the TTS may say "Great job! Let's move to the next question. What’s the name of the big red dog?" while the text on screen simply says, "What is the name of the big red dog?"
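To illustrate, here’s a minimal sketch in TypeScript of that kind of prompt, where the spoken audio and the on-screen text differ; the `Prompt` shape is hypothetical for illustration, not a specific Assistant API:

```typescript
// Hypothetical response shape for illustration only; the real webhook
// payload depends on the platform and library you use.
interface Prompt {
  speech: string;      // what the TTS reads aloud
  displayText: string; // what appears on screen (can be shorter)
}

// The spoken prompt adds encouragement; the screen shows only the question.
const nextQuestion: Prompt = {
  speech:
    "Great job! Let's move to the next question. " +
    "What's the name of the big red dog?",
  displayText: "What is the name of the big red dog?",
};
```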
Consider first time and returning users
Frequent users don't need to hear the same TTS instructions repeatedly. Optimize the experience for returning users. If it's a user's first visit, explain the full context. If they revisit your Action, acknowledge their return with a "Welcome back" message, and try to shorten (or taper) the TTS. If you notice the user has returned more than 3-4 times, get to the point as quickly as possible.
Here’s an example of tapering the instructions for different users (a code sketch follows the list):
- Instructions for a first time user: “Just say words you can make from the letters provided. Are you ready to begin?”
- For a returning user: “Make up words from the jumbled letters. Ready?”
- For a frequent user: “Are you ready to play?”
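Here’s a minimal sketch of that tapering in TypeScript, assuming you keep a per-user visit count somewhere; the `visitCount` parameter and the prompt wording are illustrative, and how you persist the count depends on your platform:

```typescript
// Picks an instruction prompt based on how many times the user has visited.
function instructionsFor(visitCount: number): string {
  if (visitCount <= 1) {
    // First-time user: full context.
    return "Just say words you can make from the letters provided. Are you ready to begin?";
  }
  if (visitCount <= 4) {
    // Returning user: acknowledge the return and shorten the instructions.
    return "Welcome back! Make up words from the jumbled letters. Ready?";
  }
  // Frequent user: get to the point.
  return "Welcome back! Are you ready to play?";
}
```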
Open the mic properly
The microphone needs to open after every direct question because, by asking a question, you're explicitly inviting the user to respond. Needing to say a wake word to open the mic is not intuitive to users in the middle of gameplay and could leave them confused, resulting in missed or repeated utterances and errors. Allow the user to respond as quickly as possible after a question has been asked by opening the mic immediately.
Any language which cues the user to respond should be at the end of the prompt, just before the mic opens. This approach prevents the user from attempting to respond while the mic is closed, which causes frustration and creates an error.
Do: “I have red, green, or blue. Which would you like?”
Don’t: “Which color would you like? I have red, green, or blue.”
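One way to keep this straight in fulfillment code is to build the prompt with the question last and flag the response as expecting a reply, so the mic opens the moment the question ends. The shape below is hypothetical, but most platforms have an equivalent flag:

```typescript
// Hypothetical response shape; the flag name varies by platform, but the
// idea is the same: a response that expects a reply should open the mic.
interface AskResponse {
  speech: string;
  expectUserResponse: boolean; // true opens the mic when the TTS finishes
}

// Do: options first, question last, mic open right after the question.
const colorPrompt: AskResponse = {
  speech: "I have red, green, or blue. Which would you like?",
  expectUserResponse: true,
};
```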
Emphasize questions
One main difference between written and spoken language is that written language is persistent: it remains on the page, where it can easily be reread if missed. Because conversation is ephemeral, it’s easy to miss when a question is being asked. Make the question clear so that users can understand and respond.
There are a few ways to do this: you could change the background music or the mood on the screen, or add a short sound (an earcon) just before the question is asked.
If you’re putting the questions on-screen, they should be visible while the TTS is playing. Sometimes the player may want to skip ahead by reading the question and using touch to move forward.
Prepare for “no match” errors and edge cases
Escalating error handling and context-specific prompting are recommended to give users multiple opportunities to re-engage when there’s a choice to be made or a question to answer. At each choice point of your experience, determine whether an answer from the user is required, or whether you can elegantly move users forward without hearing their choice.
In an escalating error strategy, say the initial question is "You can have red, green, or blue. Which color would you like?" If there's a No Match (where the user’s response isn’t understood), standard practice would be a rapid re-prompt: "Which color was that?" If there's another No Match, the next response would give a little more context: "Would you like red, green, or blue?" If they still don't respond with something the system can recognize, you might just move forward to keep them in the game or story: “Let’s go with red this time.” You could also direct them to use buttons on the screen to make their choice.
For situations where the user doesn’t respond to a question, the mic will close after a predetermined number of seconds, requiring the user to use touch input.
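Here’s a sketch of that escalation in TypeScript, keyed off a consecutive no-match counter you’d keep in conversation state; the shapes, counter handling, and fallback wording are illustrative rather than a specific API:

```typescript
// Returns the next reprompt for the color question, escalating with each
// consecutive no-match. After the third miss, pick a default and move on.
function colorNoMatchReprompt(noMatchCount: number): { speech: string; moveOn: boolean } {
  switch (noMatchCount) {
    case 1:
      return { speech: "Which color was that?", moveOn: false };
    case 2:
      return { speech: "Would you like red, green, or blue?", moveOn: false };
    default:
      return { speech: "Let's go with red this time.", moveOn: true };
  }
}

// No-input fallback: once the mic closes without an answer, leave the
// choices on screen as tappable options so the user can continue by touch.
const noInputFallback = {
  displayText: "Tap a color to continue.",
  suggestions: ["Red", "Green", "Blue"],
};
```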
Support strongly recommended intents
There are some commonly used intents that enhance the user experience. If your Action doesn’t support them, users might get frustrated. The following is a list of strongly recommended intents (a routing sketch follows the list):
“Exit / Quit”
Closes the Action.
“Repeat / Say that again”
Repeats immediately preceding content. This should be available at any point.
“Play again”
Allows the user to begin the Action experience again. This intent gives users an opportunity to re-engage with their favorite experiences.
“Help”
Provides more detailed instructions to users who may be lost. Depending on the type of Action, this may need to be context-specific. Default to returning users to where they left off in gameplay after a Help message plays.
“Pause, Resume”
Allows the user to pause or resume the experience. Provide a visual indication that the game has been paused, and provide both visual and voice options to resume.
“Skip”
Moves to the next decision point.
“Home / Menu”
Moves to the home or main menu of an Action. Having a visual affordance for this is a great idea. Without visual cues, it’s hard for users to know that they can navigate through voice even when it’s supported.
“Go back”
Moves to the previous page in an interactive story.
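Here’s an illustrative sketch of routing a few of these intents in fulfillment; the intent names, scene names, and helpers are hypothetical and not tied to any particular SDK:

```typescript
// Minimal per-conversation state needed to support "repeat", "help", and
// navigation intents. How you store this depends on your platform.
interface SessionState {
  lastPrompt: string; // updated every time a prompt is sent
  scene: string;      // current scene or decision point
}

// Maps a recognized intent to the next prompt and scene.
function handleIntent(intent: string, state: SessionState): { speech: string; scene: string } {
  switch (intent) {
    case "REPEAT": // should be available at any point
      return { speech: state.lastPrompt, scene: state.scene };
    case "PLAY_AGAIN":
      return { speech: "Let's start over!", scene: "intro" };
    case "HELP": // context-specific, then return users to where they left off
      return { speech: helpFor(state.scene), scene: state.scene };
    case "HOME":
      return { speech: "Back to the main menu.", scene: "main_menu" };
    case "GO_BACK":
      return { speech: "Going back.", scene: previousScene(state.scene) };
    default: // EXIT / QUIT
      return { speech: "Goodbye!", scene: "end" };
  }
}

// Hypothetical helpers; real implementations depend on your Action.
function helpFor(scene: string): string { return "Here's a quick refresher for " + scene + "."; }
function previousScene(scene: string): string { return scene; }
```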
Ensure legibility and readability
The smart display is a stationary device, and it can be used from a distance in many use cases. We recommend using bigger fonts to ensure legibility: at minimum, 32pt for primary text and 24pt for secondary text.
Also, using negative space properly reduces visual clutter. If there’s not enough space between elements, they become hard to read and demand additional effort. Leave some breathing room around each element. We recommend a 40px margin at the edges of the screen.
Provide visual feedback
When it takes significant time to execute users’ requests, provide them with proper visual feedback. Instead of using a simple spinner, try to be transparent about what’s happening. How much time remains to complete the task? How much of the content has been loaded? What’s happening in the system? Also, make sure your Action supports a pressed state for buttons to give users immediate touch feedback; this also helps prevent errors from double-tapping.
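For instance, if your visuals run as a web app on the display, a plain DOM sketch like the following shows a pressed state immediately and ignores double-taps while a request is in flight; the class name and handler are assumptions for illustration:

```typescript
// Gives a button an immediate pressed state and ignores double-taps
// while the first tap's request is still in flight.
function wireAnswerButton(button: HTMLButtonElement, onTap: () => Promise<void>): void {
  let busy = false;
  button.addEventListener("click", async () => {
    if (busy) return;                 // ignore double-taps
    busy = true;
    button.classList.add("pressed");  // immediate visual feedback
    button.disabled = true;
    try {
      await onTap();                  // e.g. send the user's choice to fulfillment
    } finally {
      button.classList.remove("pressed");
      button.disabled = false;
      busy = false;
    }
  });
}
```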
Reduce cognitive load
Your screen can help reduce cognitive load by showing compact information in an organized way. Keep the content on the screen clear, concise, and scannable, with the most important information first. Display prompts may need to be a condensed version of the spoken prompts, placed at the top or middle of the screen. Show any responses in the suggestion chips at the bottom of the screen. If you decide to use an icon button instead of a text button, the icon’s meaning should be unmistakable so that users can also invoke it by voice without hesitation.
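As a rough sketch, a single screen might pair a condensed display prompt with suggestion chips like this (a hypothetical shape, as in the earlier sketches):

```typescript
// The spoken prompt carries the full sentence; the screen shows a condensed
// version near the top, with the possible answers as chips at the bottom.
const questionScreen = {
  speech: "Here's your next one. Which planet is known as the red planet?",
  displayText: "Which planet is the red planet?",
  suggestions: ["Mars", "Venus", "Jupiter"],
};
```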
Checklist
This checklist condenses the information from the previous sections and provides an easy way to avoid common issues in design.
Set expectations
Find the balance between touch and voice interactions
Keep the TTS (text-to-speech) brief
Consider first time and returning users
Open the mic properly
Emphasize questions
Prepare for “no match” errors and edge cases
Support strongly recommended intents
Ensure legibility and readability
Design clear navigation
Provide visual feedback
Reduce cognitive load
Additional resources
- For writing specifications, you can use our Voice UI Template.
- For information about making your Action accessible, view our Accessibility Guidelines.
- For more best practices, visit our Conversation Design site.