Scale your design
Multimodal design
Anatomy of a response
Conversational components
Conversational components are combined to compose the content in the spoken prompts, display prompts, and chips.
Conversational components (prompts and chips) should be designed for every dialog turn.
Component | Description
Spoken prompt | The content your Action speaks to the user, via TTS or pre-recorded audio
Display prompt | The content your Action writes to the user, via printed text on the screen
Chips | Suggestions for how the user can continue or pivot the conversation

Visual components
Visual components include cards, carousels, and other visual assets.
Perfect for scanning and comparing options, visual components are useful if you're presenting detailed information—but they aren't required for every dialog turn.
Component | Description
Basic card | Use basic cards to display an image and text to users.
Browsing carousel | Browsing carousels are optimized for allowing users to select one of many items, when those items are content from the web.
Carousel | Carousels are optimized for allowing users to select one of many items, when those items are most easily differentiated by an image.
List | Lists are optimized for allowing users to select one of many items, when those items are most easily differentiated by their title.
Media response | Media responses are used to play and control the playback of audio content like music or other media.
Table | Tables are used to display static data to users in an easily scannable format.
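
The anatomy described above can be modeled as a small data structure. This is a minimal sketch, not an Actions on Google API: the `Response` and `VisualComponent` classes, their field names, and the example strings are all hypothetical. It only illustrates that the conversational components are filled in for every dialog turn, while the visual component is optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisualComponent:
    """Hypothetical stand-in for a card, carousel, list, media response, or table."""
    kind: str   # e.g. "basic_card", "list", "table"
    title: str


@dataclass
class Response:
    """One dialog turn: conversational components plus an optional visual."""
    spoken_prompt: str                               # what the Action says
    display_prompt: str                              # what the Action writes on screen
    chips: List[str] = field(default_factory=list)   # suggested next steps
    visual: Optional[VisualComponent] = None         # not required for every turn


# Example strings are invented for illustration only.
turn = Response(
    spoken_prompt="Do you want to browse talks or hear about something else?",
    display_prompt="Browse talks or something else?",
    chips=["Browse talks", "Something else"],
)
```

Keeping the visual component optional mirrors the guidance above: a turn with nothing worth scanning or comparing simply leaves `visual` unset.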
Group devices by the components used for the response



Go from spoken to multimodal
Here are a couple of examples from the Google I/O 18 Action.


Most of the time, you can simply re-use the same spoken prompt on devices like smart displays, since the need to convey the core of the conversation remains the same.
At this point in the conversation, there isn’t any content that would be appropriate in a visual component like a card or carousel, so none is included.
Be sure to add chips. At a minimum, these should include any options offered in the prompts so the user can quickly tap them to respond.

Since there isn’t any content that would be appropriate in a visual component, there’s no content that can be moved out of the spoken prompt. Therefore, it’s okay to re-use the original.
The display prompt should be a condensed version of the spoken prompt, optimized for scannability. Move any response options to the chips, but be sure to always include the question.
Re-use the same chips you just created.

Start with the original spoken prompt from the sample dialog.
Note that the spoken list is limited to 6 of the 17 topics in order to reduce cognitive load. The topics are randomized so that no single topic is favored over another.

Once again, it’s okay to re-use the same spoken prompt, since we can’t assume the user is looking at the screen.
Including a visual list of all the topics helps the user to browse and select. Note that the visual list of all 17 items (paginated) is shown in alphabetical order, which makes it easiest for users to find the topic they want.
Because the list already enumerates the topics that can be chosen, there is no need to include them as chips. Instead, include other options like “None of those” to offer the user a way out.
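
The two list treatments described above (a short randomized spoken list, and a complete alphabetical visual list) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the topic names are placeholders because the actual topics are not given here, and pagination of the visual list is left out.

```python
import random
from typing import List, Optional, Sequence


def pick_spoken_topics(topics: Sequence[str], limit: int = 6,
                       rng: Optional[random.Random] = None) -> List[str]:
    """Return up to `limit` topics in random order for the spoken prompt."""
    rng = rng or random.Random()
    return rng.sample(list(topics), min(limit, len(topics)))


def order_visual_topics(topics: Sequence[str]) -> List[str]:
    """Return every topic alphabetically (case-insensitive) for the on-screen list."""
    return sorted(topics, key=str.casefold)


# Placeholder names standing in for the 17 topics.
all_topics = [f"Topic {chr(ord('A') + i)}" for i in range(17)]

spoken = pick_spoken_topics(all_topics, rng=random.Random(0))  # seeded for repeatability
visual = order_visual_topics(all_topics)
chips = ["None of those"]  # the visual list already covers the topics themselves
```

Sampling without replacement keeps the spoken list free of duplicates, and drawing a fresh sample each turn is what avoids favoring any one topic.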

Here, we can assume that the user has equal access to the audio and the screen. Since the visual modality is better suited to lists, leverage this strength by directing the user to the screen to pick a topic. This allows us to shorten the spoken prompt to a simple list overview and question.
Only the question needs to be maintained in the display prompt.
Re-use the same chip you just created.
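
The shortening described above can be sketched as two prompt builders: one for when the user may not be looking at the screen, and one for when audio and screen are equally available. The function names and all prompt wording are hypothetical and are not taken from the Google I/O 18 Action.

```python
from typing import Dict, List

QUESTION = "Which topic are you interested in?"  # hypothetical wording


def join_for_speech(items: List[str]) -> str:
    """Join items the way they would be read aloud: 'A, B, and C'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def full_spoken_prompt(total: int, spoken_sample: List[str]) -> str:
    """Spoken prompt that carries the list itself, for when the screen may go unseen."""
    return f"There are {total} topics, including {join_for_speech(spoken_sample)}. {QUESTION}"


def screen_first_prompts(total: int) -> Dict[str, str]:
    """Prompts for equal access to audio and screen: the visual list does the listing."""
    return {
        "spoken_prompt": f"Here are {total} topics. {QUESTION}",  # overview + question
        "display_prompt": QUESTION,                               # question only
    }
```

In both cases the question is kept, since it is what tells the user how to respond.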
Relationship between prompts
In general, spoken prompts are optimized for and follow the conventions of spoken conversations. Display prompts are optimized for and follow the conventions of written conversations. Although slightly different, they should still convey the same core message.
Design prompts for both the ear and the eye. It’s easiest to start with the spoken prompt, imagining what you might say in a human-to-human conversation. Then, condense it to create the display prompt.

Relationship between components
Remember that all the components are meant to provide a single unified response.
It’s often easiest to start by writing prompts for a screenless experience, again imagining what you might say in a human-to-human conversation. Then, imagine how the conversation would change if one of the participants were holding a touchscreen. What details can now be omitted from the conversational components? Typically, the display prompt is significantly reduced, since the user can comprehend the information in the visual component just as easily as in the display prompt. Group the information so that the user doesn’t have to look back and forth repeatedly between the display prompt and the visual component.