Multimodal AI agents can process and synthesize information from diverse sensory signals, such as voice, text, and images, which gives them significant flexibility in both input and output modalities. This multimodality is key to adaptability: the interface can change based on individual needs and the context of use.
Examples of multimodal interaction
To foster an inclusive experience, the agent provides tailored support and communication methods for users with visual impairments, users with motor impairments, and users who are deaf or hard of hearing. The sketch after this list shows one way an agent might route its output to match these needs.
For users with visual impairments: the user can interact with the agent using voice commands and receive auditory descriptions or haptic feedback.
For users with motor impairments: the user can control the system using eye tracking or limited hand or body movements, and the agent provides visual outputs designed to support these alternative input methods.
For users who are deaf or hard of hearing: the agent provides visual support through graphics and videos complemented by captions, delivers its responses visually or textually, and makes captions readily available for any audio content.
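As one illustration of how the routing behind these adaptations might look, the following Python sketch chooses output modalities from a simple user profile. The `UserProfile` fields and the `speak` and `show_captions` helpers are hypothetical placeholders standing in for real text-to-speech and captioning components; they are not part of any specific Google API.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Hypothetical accessibility preferences; a real agent would load these
    from user settings or refine them from interaction history."""
    prefers_audio: bool = False     # e.g., users with visual impairments
    prefers_captions: bool = False  # e.g., users who are deaf or hard of hearing


def speak(text: str) -> None:
    """Placeholder for a text-to-speech call (e.g., a platform TTS engine)."""
    print(f"[audio] {text}")


def show_captions(text: str) -> None:
    """Placeholder for rendering on-screen captions or text responses."""
    print(f"[captions] {text}")


def deliver_response(text: str, profile: UserProfile) -> None:
    """Route one agent response to the output modalities the user relies on."""
    if profile.prefers_audio:
        speak(text)
    # Always provide a visual or text fallback when audio is not preferred,
    # and captions whenever the user has asked for them.
    if profile.prefers_captions or not profile.prefers_audio:
        show_captions(text)


if __name__ == "__main__":
    deliver_response(
        "Your package is scheduled to arrive tomorrow morning.",
        UserProfile(prefers_captions=True),
    )
```

In a full agent, the same profile would also shape input handling (voice commands, eye tracking, or text), not only the output side shown here.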
Benefits of adaptive multimodal agents
This adaptive nature means the agent can dynamically adjust to match the individual user's capabilities and preferences, effectively bridging accessibility gaps. By learning from user interactions, an agent can provide a truly personalized and evolving accessibility solution.
Google's models for building multimodal agentive interfaces
Google's Gemini and Gemma models offer different options for building multimodal agentive interfaces:
Gemini: a family of models designed with multimodality built in at its core, able to simultaneously process and reason across modalities such as text, images, audio, video, and code (see the first sketch after the table below).
Gemma: a family of lightweight, open-source language models. While the initial Gemma models primarily focus on text, they can still contribute significantly to multimodal agentive interfaces, often by working in conjunction with specialized models for other modalities or by acting as the orchestration layer (see the second sketch after the table below).
The following table summarizes these two models across features that are relevant to building a multimodal agentive interface.
| Feature | Gemini | Gemma |
| --- | --- | --- |
| Multimodality | Built-in, strong understanding across modalities | Primarily text-focused, enhanced integration potential |
| Agency | Advanced reasoning and planning across modalities | Strong text-based reasoning and planning, potential for multimodal orchestration |
| Complexity | Generally larger and more complex models | Lightweight and more efficient models |
| Customization | Fine-tuning options available | Highly customizable due to open-source nature |
| Access | Google Cloud (Vertex AI) | Downloadable weights, open-source frameworks |
| Best suited for | Highly integrated multimodal experiences | Text-centric control, orchestration, efficient deployments, customization |
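To make the Gemini row concrete, the sketch below sends an image together with a text prompt in a single request and asks for a description suited to a screen-reader user. It assumes Google's google-genai Python SDK and an API key available in the environment; the model ID and file name are illustrative, so treat this as a sketch rather than the definitive integration.

```python
# pip install google-genai pillow
from google import genai
from PIL import Image

# Assumes a Gemini API key is available in the environment
# (e.g., GEMINI_API_KEY); otherwise pass api_key=... explicitly.
client = genai.Client()

image = Image.open("photo.jpg")  # illustrative local image path

# One request mixes modalities: a text instruction plus an image.
response = client.models.generate_content(
    model="gemini-2.0-flash",  # example model ID; check the current model list
    contents=[
        "Describe this image in detail for a screen-reader user.",
        image,
    ],
)

print(response.text)
```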
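For the Gemma row, the sketch below runs an instruction-tuned Gemma checkpoint through the open-source Hugging Face Transformers library and uses it as the text-based reasoning step of a larger pipeline, where a separate speech-to-text or captioning model would supply the input text. The checkpoint name and generation settings are examples only.

```python
# pip install transformers torch
from transformers import pipeline

# Gemma weights are openly downloadable but license-gated on Hugging Face;
# accept the terms and authenticate (e.g., `huggingface-cli login`) first.
generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",  # example instruction-tuned checkpoint
)

# In a multimodal pipeline, a speech-to-text or image-captioning model would
# produce this text; Gemma handles the reasoning and composes the reply.
transcript = "Transcribed voice command: summarize my three meetings today."
result = generator(transcript, max_new_tokens=128)

print(result[0]["generated_text"])
```

Because the weights run locally, this text-centric orchestration layer can also be fine-tuned or deployed on more constrained hardware, which is where Gemma's lightweight design pays off.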