How multimodal agentive interfaces work

Multimodal AI agents can process and synthesize information from diverse sensory signals, such as voice, text, and images, which gives them significant flexibility in both input and output modalities. This multimodality is key to adaptability: the interface can change based on individual needs and the context of use.

Examples of multimodal interaction

To foster an inclusive experience, the agent can tailor its support and communication methods to users with visual impairments, motor impairments, or those who are deaf or hard of hearing, as illustrated in the sketch after this list.

  • For users with visual impairments: the user can interact with the agent using voice commands and receive auditory descriptions or haptic feedback.

  • For users with motor impairments: the user can control the system using eye tracking or limited hand or body movements, and the agent provides visual outputs designed to support these alternative input methods.

  • For users who are deaf or hard of hearing: the agent provides visual support through graphics and videos, complemented by captions. Interaction can occur through text, with the agent delivering responses visually or textually, and for any audio content the agent makes captions readily available.
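As a concrete illustration, the following sketch shows how an agent might select output modalities from a user's accessibility profile. The profile fields, modality names, and selection logic are hypothetical, not part of any Google API; they simply mirror the scenarios listed above.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Hypothetical accessibility preferences supplied by the user or the OS."""
    visual_impairment: bool = False
    motor_impairment: bool = False
    deaf_or_hard_of_hearing: bool = False
    preferred_inputs: list[str] = field(default_factory=lambda: ["text"])

def select_output_modalities(profile: UserProfile) -> list[str]:
    """Choose output modalities that match the user's stated needs."""
    if profile.visual_impairment:
        # Favor speech and haptics over purely visual output.
        return ["audio_description", "haptic"]
    if profile.deaf_or_hard_of_hearing:
        # Favor graphics, video, and captions; never audio-only output.
        return ["graphics", "video", "captions"]
    # Default: mirror the user's preferred input channels.
    return profile.preferred_inputs

# Example: a user who is deaf or hard of hearing receives captioned, visual output.
profile = UserProfile(deaf_or_hard_of_hearing=True, preferred_inputs=["captions", "text"])
print(select_output_modalities(profile))  # ['graphics', 'video', 'captions']
```

In a real agent, this selection step would sit between the model's response generation and the rendering layer, so the same underlying answer can be delivered as speech, haptics, or captioned visuals.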

Benefits of adaptive multimodal agents

This adaptive nature means the agent can dynamically adjust to match the individual user's capabilities and preferences, effectively bridging accessibility gaps. By learning from user interactions, an agent can provide a truly personalized and evolving accessibility solution.

Google's models for building multimodal agentive interfaces

Google's Gemini and Gemma models offer different options for building multimodal agentive interfaces:

  • Gemini: a family of models designed with built-in multimodality at its core. It can simultaneously process and reason across different modalities such as text, images, audio, video, and code (see the first sketch after this list).

  • Gemma: a family of open-source, lightweight language models. While the initial Gemma models focus primarily on text, they can still contribute significantly to multimodal agentive interfaces, often by working in conjunction with other specialized models or through integration techniques (see the second sketch after this list).
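To make the Gemini option concrete, here is a minimal sketch of a single multimodal request, assuming the google-generativeai Python SDK and a Gemini model with vision support. The model name, API key handling, and image file are placeholders, not a prescribed setup.

```python
import os

import google.generativeai as genai
from PIL import Image

# Assumes an API key is available in the environment (placeholder setup).
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Model name is illustrative; any Gemini model with vision support works similarly.
model = genai.GenerativeModel("gemini-1.5-flash")

# A single request can mix modalities: an image plus a text instruction.
image = Image.open("photo_of_signage.jpg")  # placeholder image file
response = model.generate_content(
    [image, "Describe this sign in one short, listener-friendly sentence."]
)

print(response.text)
```

Because the model handles the image and the instruction in one call, the agent does not need a separate vision pipeline before it can reason about what the user is looking at.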
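A Gemma-based variant of the same interaction keeps Gemma in a text-only planning role and relies on a separate model to convert other modalities into text first. The sketch below assumes the Hugging Face transformers library and the gemma-2b-it instruction-tuned checkpoint; the caption string stands in for output from a hypothetical vision or speech-to-text model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Instruction-tuned Gemma checkpoint; downloading the weights may require
# accepting the model's license on Hugging Face.
model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Output from a separate (hypothetical) vision model, already converted to text.
image_caption = "A street sign reading 'Museum entrance, ramp access on the left'."

# Gemma reasons over the text description and plans the agent's response.
prompt = (
    f"An assistive agent received this image description: {image_caption}\n"
    "Write a one-sentence caption suitable for a user who is deaf or hard of hearing."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

This orchestration pattern is what the table below calls "potential for multimodal orchestration": Gemma supplies the reasoning and control while other models handle perception.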

The following table summarizes the two model families across features relevant to building a multimodal agentive interface.

| Feature | Gemini | Gemma |
| --- | --- | --- |
| Multimodality | Built-in, strong understanding across modalities | Primarily text-focused, enhanced integration potential |
| Agency | Advanced reasoning and planning across modalities | Strong text-based reasoning and planning, potential for multimodal orchestration |
| Complexity | Generally larger and more complex models | Lightweight and more efficient models |
| Customization | Fine-tuning options available | Highly customizable due to open-source nature |
| Access | Google Cloud (Vertex AI) | Downloadable weights, open-source frameworks |
| Best suited for | Highly integrated multimodal experiences | Text-centric control, orchestration, efficient deployments, customization |

What's next