This page focuses on making sure all users can interact with and receive information from the agent effectively, regardless of their preferred modality.
Handle multimodal I/O with a focus on voice
This section explains how to deliver a seamless, equivalent user experience by integrating voice commands as input and providing auditory feedback or answers as output.
- Challenge: You need to seamlessly integrate voice commands as input and provide auditory feedback or answers as output.
- Insight and approach for input:
  - Capture the user's voice command using the device's microphone. After capturing the voice query, a recommended best practice is to show the recognized query back to the user, similar to live captions. This lets users validate that what the agent captured matches their spoken intent, because voice recognition can be imperfect and users benefit from seeing exactly what was sent to the agent.
  - Use a separate Speech-to-Text (STT) service or library to convert the captured audio into a text transcript. Examples include:
    - The Web Speech API, specifically SpeechRecognition, available in many modern browsers.
    - Cloud-based services like Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech to Text.
    - Other third-party or open-source STT libraries.
  - Send the resulting text transcript to the standard Gemini API for processing.
- Code example implementation
Capture the user's voice command
Clicking the listenButton to initiate voice input triggers the following event listener and recognition.start() call:
// Listener for the Microphone Button
listenButton.addEventListener('click', () => {
  if (assistantActive && !isLoading && recognition) {
    if (isListening) {
      recognition.stop(); // Allow manual stop if needed
    } else {
      try {
        // --- This starts the capture ---
        recognition.start();
        listenButton.disabled = true; // Disable while listening process starts
      } catch(e) {
        console.error("Error starting recognition:", e);
        addMessageToChat("Could not start listening. Please check microphone permissions.", "system");
      }
    }
  } // ... (rest of handler) ...
});

// And the onstart handler provides feedback that capture has begun:
recognition.onstart = () => {
  isListening = true;
  listenButton.textContent = '...'; // Indicate listening visually
  listenButton.classList.add('listening');
  console.log("Speech recognition started.");
};
Show the recognized query back to the user
The current code doesn't provide live captions as you speak. Instead, it waits for the final result and then displays the complete transcript in the text input field. This occurs within the onresult handler, as shown in the following example:
recognition.onresult = (event) => {
  const transcript = event.results[0][0].transcript;
  console.log("Transcript:", transcript);
  // --- This displays the final transcript in the input box ---
  userInput.value = transcript;
  // --- End Display ---
};
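If you do want live caption-style feedback while the user is speaking, one option is to enable interim results and update the input field as partial transcripts arrive. The following is a minimal sketch, reusing the recognition and userInput objects from the snippets above; it is an illustration, not part of the tutorial code.

// Sketch: live caption-style display using interim results (illustrative)
recognition.interimResults = true;

recognition.onresult = (event) => {
  let interimTranscript = '';
  let finalTranscript = '';
  // event.results accumulates both interim and final segments
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const segment = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      finalTranscript += segment;
    } else {
      interimTranscript += segment;
    }
  }
  // Show whatever has been recognized so far, caption-style
  userInput.value = finalTranscript || interimTranscript;
};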
Use the Web Speech API to convert the captured audio into a text transcript
The following snippet demonstrates how the SpeechRecognition object is initialized and set up within the initializeSpeechRecognition function. This setup lets the browser's underlying engine perform the conversion, culminating in the onresult event firing with the transcript:
function initializeSpeechRecognition() {
  const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (SpeechRecognition) {
    recognition = new SpeechRecognition(); // Creates the object
    recognition.continuous = false;
    recognition.lang = 'en-US';
    recognition.interimResults = false;
    recognition.maxAlternatives = 1;
    console.log("Speech Recognition supported.");

    // --- Assigning handlers where results (transcript) are received ---
    recognition.onresult = (event) => {
      // --- The 'transcript' is the result of the STT conversion ---
      const transcript = event.results[0][0].transcript;
      userInput.value = transcript;
    };

    // ... other handlers (onstart, onerror, onend) ...
  } // ... (rest of function) ...
}
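The approach list earlier also mentions cloud-based STT services. If you would rather transcribe on a backend than in the browser, a rough sketch using the Google Cloud Speech-to-Text Node.js client could look like the following; the function name, audio encoding, and the assumption that the frontend posts base64-encoded audio are illustrative, not part of the tutorial.

// Sketch: server-side transcription with Google Cloud Speech-to-Text (illustrative)
const speech = require('@google-cloud/speech');

const sttClient = new speech.SpeechClient();

async function transcribeAudio(audioBase64) {
  // audioBase64 is assumed to be base64-encoded audio uploaded by the frontend
  const [response] = await sttClient.recognize({
    config: {
      encoding: 'WEBM_OPUS',      // must match what the recorder actually produces
      sampleRateHertz: 48000,
      languageCode: 'en-US',
    },
    audio: { content: audioBase64 },
  });
  // Join the transcript pieces returned for each recognized segment
  return response.results
    .map((result) => result.alternatives[0].transcript)
    .join(' ');
}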
Send the text transcript to the Gemini API
In the current setup, after the transcript appears in the userInput field, the user manually clicks the Send button. This action triggers the existing sendButton event listener, which captures the text from the input field and sends it to the backend using sendMessageToBackend, as shown in the following snippet:
// Event listener for the Send button
sendButton.addEventListener('click', () => {
  // --- Gets the transcript (or typed text) from the input field ---
  const prompt = userInput.value.trim();
  if (prompt && assistantActive && !isLoading) {
    // --- Sends the text transcript to the backend ---
    sendMessageToBackend(prompt);
  }
});

// The sendMessageToBackend function then sends it using fetch:
async function sendMessageToBackend(prompt) {
  // ... (inside the function)
  const fetchResponse = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // --- The 'prompt' variable here contains the transcript ---
    body: JSON.stringify({ prompt: prompt, history: chatHistory }),
  });
  // ... (rest of function sends to backend, which calls Gemini)
}
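The backend that handles this POST request is not shown in this section. As a rough sketch only, assuming an Express server and the @google/genai SDK, a /chat handler that forwards the transcript to Gemini might look like the following; the route name and the prompt/history fields mirror the frontend code above, while the model name, port, and history handling are illustrative assumptions.

// Sketch of a possible /chat handler (illustrative, not the tutorial's actual backend)
import express from 'express';
import { GoogleGenAI } from '@google/genai';

const app = express();
app.use(express.json());

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

app.post('/chat', async (req, res) => {
  try {
    const { prompt, history } = req.body;
    // Seed a chat session with the prior turns, then send the new prompt (the transcript)
    const chat = ai.chats.create({ model: 'gemini-2.0-flash', history: history || [] });
    const result = await chat.sendMessage({ message: prompt });
    const responseText = result.text;
    // Return the reply plus the updated history for the frontend to store
    res.json({
      response: responseText,
      history: [
        ...(history || []),
        { role: 'user', parts: [{ text: prompt }] },
        { role: 'model', parts: [{ text: responseText }] },
      ],
    });
  } catch (err) {
    console.error('Gemini request failed:', err);
    res.status(500).json({ error: 'Failed to generate a response.' });
  }
});

app.listen(3000);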
Provide multimodal output with a focus on voice
This section details the process of converting textual responses from the AI agent into audible speech.
To manage the agent's output, follow this insight and approach:
- Receive the text response from the Gemini API.
- Use a Text-to-Speech (TTS) service or library to convert that text response into audible speech:
  - The Web Speech API (specifically SpeechSynthesis) available in many modern browsers.
  - Cloud-based services like Google Cloud Text-to-Speech, AWS Polly, and Azure Text to Speech.
  - Other third-party or open-source TTS libraries.
- Play the generated audio back to the user.
Receive the text response from the Gemini API
The following snippet from the sendMessageToBackend function shows how the frontend receives the text response from the backend server:
// assistant.js (inside sendMessageToBackend async function)
async function sendMessageToBackend(prompt) {
  // ... (fetch setup) ...
  try {
    const fetchResponse = await fetch('/chat', { /* ... */ });
    const data = await fetchResponse.json().catch(e => { /* ... */ });
    if (!fetchResponse.ok) { /* ... */ }
    if (data.error) { /* ... */ }

    // --- Receives the text response from the backend ---
    const assistantResponseText = data.response;
    // --- End Receiving ---

    // Display assistant response (which also triggers speech)
    addMessageToChat(assistantResponseText, 'assistant');

    // Update local chat history
    chatHistory = data.history;
  } // ... (catch / finally blocks) ...
}
Convert text to speech using the Web Speech API
To enable speech output, the Web Speech API is initialized once using the initializeSpeechSynthesis function, as follows:
// assistant.js
let synthesis = null;
// ...

function initializeSpeechSynthesis() {
  if ('speechSynthesis' in window) {
    // --- Gets reference to the Web Speech API ---
    synthesis = window.speechSynthesis;
    console.log("Speech Synthesis supported.");
  } // ... (else block) ...
}
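The speakText function in the next example leaves voice selection as an optional step. One way to pick a specific voice, sketched here as an illustration rather than tutorial code, is to read the browser's voice list once it becomes available:

// Sketch: optionally choose a preferred voice when the browser exposes its voice list (illustrative)
let preferredVoice = null;

if ('speechSynthesis' in window) {
  const pickPreferredVoice = () => {
    const voices = window.speechSynthesis.getVoices();
    // Prefer an en-US voice and fall back to whatever is available
    preferredVoice = voices.find((v) => v.lang === 'en-US') || voices[0] || null;
  };
  window.speechSynthesis.onvoiceschanged = pickPreferredVoice;
  pickPreferredVoice(); // some browsers expose the list immediately
}

// Inside speakText, before calling synthesis.speak(utterance):
// if (preferredVoice) utterance.voice = preferredVoice;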
Play the generated audio to the user
The speakText function, responsible for speaking assistant messages, is shown in the following example:
// assistant.js
function speakText(text) {
  // --- Checks if API is available and feature enabled ---
  if (synthesis && readAloudCheckbox?.checked && text) {
    synthesis.cancel(); // Stop previous speech

    // --- Creates the utterance object with the text ---
    const utterance = new SpeechSynthesisUtterance(text);

    // Optional configuration (voice, rate, pitch)
    utterance.rate = 1;
    utterance.pitch = 1;
    // ... (potential voice selection) ...

    // Error handling for the utterance itself
    utterance.onerror = (event) => { /* ... */ };

    // --- Calls speak() in the next step ---
    synthesis.speak(utterance);
  }
}
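As listed in the approach above, a cloud TTS service can stand in for the browser's SpeechSynthesis, for example when you want consistent voices across devices. The following is a minimal server-side sketch using the Google Cloud Text-to-Speech Node.js client; the function name and voice settings are illustrative assumptions, not tutorial code. The returned audio bytes would then be sent to the frontend and played, for example through an Audio element.

// Sketch: server-side synthesis with Google Cloud Text-to-Speech (illustrative)
const textToSpeech = require('@google-cloud/text-to-speech');

const ttsClient = new textToSpeech.TextToSpeechClient();

async function synthesizeSpeech(text) {
  const [response] = await ttsClient.synthesizeSpeech({
    input: { text },
    voice: { languageCode: 'en-US', ssmlGender: 'NEUTRAL' },
    audioConfig: { audioEncoding: 'MP3' },
  });
  // response.audioContent holds the MP3 bytes to return to the frontend
  return response.audioContent;
}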
Use emerging multimodal capabilities
The Gemini multimodal API streamlines voice input by letting you send audio data streams directly for processing. For output, textual responses can still be converted to speech using standard Text-to-Speech (TTS) libraries or services. Alternatively, the multimodal API can provide audio streams directly, removing the need for the Web Speech API or separate TTS libraries.
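For illustration, the following sketch sends a recorded audio clip directly to the Gemini API with the @google/genai SDK instead of running a separate STT step; the model name, MIME type, and the base64Audio variable are assumptions rather than code from this tutorial.

// Sketch: asking Gemini a question directly from recorded audio (illustrative)
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function askWithAudio(base64Audio) {
  // base64Audio is assumed to hold the recorded clip, e.g. from a MediaRecorder blob
  const response = await ai.models.generateContent({
    model: 'gemini-2.0-flash',
    contents: [
      { inlineData: { mimeType: 'audio/webm', data: base64Audio } },
      { text: 'Answer the question asked in this audio clip.' },
    ],
  });
  return response.text;
}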