This page focuses on making sure all users can interact with and receive information from the agent effectively, regardless of their preferred modality.
Handle multimodal I/O with a focus on voice
This section explains how to deliver a seamless, equivalent user experience by integrating voice commands as input and providing auditory feedback or answers as output.
- Challenge: You need to seamlessly integrate voice commands as input and provide auditory feedback or answers as output.
- Insight and approach for input:
  - Capture the user's voice command using the device's microphone. After capturing the voice query, a recommended best practice is to show the recognized query back to the user, similar to live captions. This lets users validate that what the agent captured matches their spoken intent, because voice recognition can be imperfect and users benefit from seeing exactly what was sent to the agent.
  - Use a separate Speech-to-Text (STT) service or library to convert the captured audio into a text transcript. Examples include:
    - The Web Speech API, specifically SpeechRecognition, available in many modern browsers.
    - Cloud-based services like Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech to Text.
    - Other third-party or open-source STT libraries.
  - Send the resulting text transcript to the standard Gemini API for processing.
- Code example implementation
Capture the user's voice command
Clicking the listenButton to initiate voice input triggers the following event listener and recognition.start() call:
// Listener for the Microphone Button
listenButton.addEventListener('click', () => {
  if (assistantActive && !isLoading && recognition) {
    if (isListening) {
      recognition.stop(); // Allow manual stop if needed
    } else {
      try {
        // --- This starts the capture ---
        recognition.start();
        listenButton.disabled = true; // Disable while listening process starts
      } catch(e) {
        console.error("Error starting recognition:", e);
        addMessageToChat("Could not start listening. Please check microphone permissions.", "system");
      }
    }
  } // ... (rest of handler) ...
});

// And the onstart handler provides feedback that capture has begun:
recognition.onstart = () => {
  isListening = true;
  listenButton.textContent = '...'; // Indicate listening visually
  listenButton.classList.add('listening');
  console.log("Speech recognition started.");
};
Show the recognized query back to the user
The current code doesn't provide live captions as you speak. Instead, it waits for the final result and then displays the complete transcript in the text input field. This occurs within the onresult handler, as shown in the following example:
recognition.onresult = (event) => {
  const transcript = event.results[0][0].transcript;
  console.log("Transcript:", transcript);
  // --- This displays the final transcript in the input box ---
  userInput.value = transcript;
  // --- End Display ---
};
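If you do want live caption-style feedback while the user is speaking, one option is to enable interim results and update the input field as partial transcripts arrive. The following is a minimal sketch, reusing the recognition and userInput objects from the snippets above; it is an illustration, not part of the tutorial code.

// Sketch: live caption-style display using interim results (illustrative)
recognition.interimResults = true;

recognition.onresult = (event) => {
  let interimTranscript = '';
  let finalTranscript = '';
  // event.results accumulates both interim and final segments
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const segment = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      finalTranscript += segment;
    } else {
      interimTranscript += segment;
    }
  }
  // Show whatever has been recognized so far, caption-style
  userInput.value = finalTranscript || interimTranscript;
};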
Use the Web Speech API to convert the captured audio into a text transcript
The following snippet demonstrates how the SpeechRecognition object is initialized and set up within the initializeSpeechRecognition function. This setup lets the browser's underlying engine perform the conversion, culminating in the onresult event firing with the transcript:
function initializeSpeechRecognition() {
  const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  if (SpeechRecognition) {
    recognition = new SpeechRecognition(); // Creates the object
    recognition.continuous = false;
    recognition.lang = 'en-US';
    recognition.interimResults = false;
    recognition.maxAlternatives = 1;
    console.log("Speech Recognition supported.");

    // --- Assigning handlers where results (transcript) are received ---
    recognition.onresult = (event) => {
      // --- The 'transcript' is the result of the STT conversion ---
      const transcript = event.results[0][0].transcript;
      userInput.value = transcript;
    };

    // ... other handlers (onstart, onerror, onend) ...
  } // ... (rest of function) ...
}
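The approach list earlier also mentions cloud-based STT services. If you would rather transcribe on a backend than in the browser, a rough sketch using the Google Cloud Speech-to-Text Node.js client could look like the following; the function name, audio encoding, and the assumption that the frontend posts base64-encoded audio are illustrative, not part of the tutorial.

// Sketch: server-side transcription with Google Cloud Speech-to-Text (illustrative)
const speech = require('@google-cloud/speech');

const sttClient = new speech.SpeechClient();

async function transcribeAudio(audioBase64) {
  // audioBase64 is assumed to be base64-encoded audio uploaded by the frontend
  const [response] = await sttClient.recognize({
    config: {
      encoding: 'WEBM_OPUS',      // must match what the recorder actually produces
      sampleRateHertz: 48000,
      languageCode: 'en-US',
    },
    audio: { content: audioBase64 },
  });
  // Join the transcript pieces returned for each recognized segment
  return response.results
    .map((result) => result.alternatives[0].transcript)
    .join(' ');
}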
Send the text transcript to the Gemini API
In the current setup, after the transcript appears in the userInput field, the user manually clicks the Send button. This action triggers the existing sendButton event listener, which captures the text from the input field and sends it to the backend using sendMessageToBackend, as shown in the following snippet:
// Event listener for the Send button
sendButton.addEventListener('click', () => {
  // --- Gets the transcript (or typed text) from the input field ---
  const prompt = userInput.value.trim();
  if (prompt && assistantActive && !isLoading) {
    // --- Sends the text transcript to the backend ---
    sendMessageToBackend(prompt);
  }
});

// The sendMessageToBackend function then sends it using fetch:
async function sendMessageToBackend(prompt) {
  // ... (inside the function)
  const fetchResponse = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    // --- The 'prompt' variable here contains the transcript ---
    body: JSON.stringify({ prompt: prompt, history: chatHistory }),
  });
  // ... (rest of function sends to backend, which calls Gemini)
}
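The backend that handles this POST request is not shown in this section. As a rough sketch only, assuming an Express server and the @google/genai SDK, a /chat handler that forwards the transcript to Gemini might look like the following; the route name and the prompt/history fields mirror the frontend code above, while the model name, port, and history handling are illustrative assumptions.

// Sketch of a possible /chat handler (illustrative, not the tutorial's actual backend)
import express from 'express';
import { GoogleGenAI } from '@google/genai';

const app = express();
app.use(express.json());

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

app.post('/chat', async (req, res) => {
  try {
    const { prompt, history } = req.body;
    // Seed a chat session with the prior turns, then send the new prompt (the transcript)
    const chat = ai.chats.create({ model: 'gemini-2.0-flash', history: history || [] });
    const result = await chat.sendMessage({ message: prompt });
    const responseText = result.text;
    // Return the reply plus the updated history for the frontend to store
    res.json({
      response: responseText,
      history: [
        ...(history || []),
        { role: 'user', parts: [{ text: prompt }] },
        { role: 'model', parts: [{ text: responseText }] },
      ],
    });
  } catch (err) {
    console.error('Gemini request failed:', err);
    res.status(500).json({ error: 'Failed to generate a response.' });
  }
});

app.listen(3000);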
Provide multimodal output with a focus on voice
This section details the process of converting textual responses from the AI agent into audible speech.
To manage the agent's output, follow this insight and approach:
- Receive the text response from the Gemini API.
- Use a Text-to-Speech (TTS) service or library to convert that text response into audible speech:
  - The Web Speech API (specifically SpeechSynthesis) available in many modern browsers.
  - Cloud-based services like Google Cloud Text-to-Speech, AWS Polly, and Azure Text to Speech.
  - Other third-party or open-source TTS libraries.
- Play the generated audio back to the user.
Receive the text response from the Gemini API
The following snippet from the sendMessageToBackend function shows how the frontend receives the text response from the backend server:
// assistant.js (inside sendMessageToBackend async function)
async function sendMessageToBackend(prompt) {
  // ... (fetch setup) ...
  try {
    const fetchResponse = await fetch('/chat', { /* ... */ });
    const data = await fetchResponse.json().catch(e => { /* ... */ });
    if (!fetchResponse.ok) { /* ... */ }
    if (data.error) { /* ... */ }

    // --- Receives the text response from the backend ---
    const assistantResponseText = data.response;
    // --- End Receiving ---

    // Display assistant response (which also triggers speech)
    addMessageToChat(assistantResponseText, 'assistant');

    // Update local chat history
    chatHistory = data.history;
  } // ... (catch / finally blocks) ...
}
Convert text to speech using the Web Speech API
To enable speech output, the Web Speech API is initialized once using the initializeSpeechSynthesis function, as follows:
// assistant.js
let synthesis = null;
// ...

function initializeSpeechSynthesis() {
  if ('speechSynthesis' in window) {
    // --- Gets reference to the Web Speech API ---
    synthesis = window.speechSynthesis;
    console.log("Speech Synthesis supported.");
  } // ... (else block) ...
}
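The speakText function in the next example leaves voice selection as an optional step. One way to pick a specific voice, sketched here as an illustration rather than tutorial code, is to read the browser's voice list once it becomes available:

// Sketch: optionally choose a preferred voice when the browser exposes its voice list (illustrative)
let preferredVoice = null;

if ('speechSynthesis' in window) {
  const pickPreferredVoice = () => {
    const voices = window.speechSynthesis.getVoices();
    // Prefer an en-US voice and fall back to whatever is available
    preferredVoice = voices.find((v) => v.lang === 'en-US') || voices[0] || null;
  };
  window.speechSynthesis.onvoiceschanged = pickPreferredVoice;
  pickPreferredVoice(); // some browsers expose the list immediately
}

// Inside speakText, before calling synthesis.speak(utterance):
// if (preferredVoice) utterance.voice = preferredVoice;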
Play the generated audio to the user
The speakText function, responsible for speaking assistant messages, is shown in the following example:
// assistant.js
function speakText(text) {
  // --- Checks if API is available and feature enabled ---
  if (synthesis && readAloudCheckbox?.checked && text) {
    synthesis.cancel(); // Stop previous speech

    // --- Creates the utterance object with the text ---
    const utterance = new SpeechSynthesisUtterance(text);

    // Optional configuration (voice, rate, pitch)
    utterance.rate = 1;
    utterance.pitch = 1;
    // ... (potential voice selection) ...

    // Error handling for the utterance itself
    utterance.onerror = (event) => { /* ... */ };

    // --- Calls speak() in the next step ---
    synthesis.speak(utterance);
  }
}
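As listed in the approach above, a cloud TTS service can stand in for the browser's SpeechSynthesis, for example when you want consistent voices across devices. The following is a minimal server-side sketch using the Google Cloud Text-to-Speech Node.js client; the function name and voice settings are illustrative assumptions, not tutorial code. The returned audio bytes would then be sent to the frontend and played, for example through an Audio element.

// Sketch: server-side synthesis with Google Cloud Text-to-Speech (illustrative)
const textToSpeech = require('@google-cloud/text-to-speech');

const ttsClient = new textToSpeech.TextToSpeechClient();

async function synthesizeSpeech(text) {
  const [response] = await ttsClient.synthesizeSpeech({
    input: { text },
    voice: { languageCode: 'en-US', ssmlGender: 'NEUTRAL' },
    audioConfig: { audioEncoding: 'MP3' },
  });
  // response.audioContent holds the MP3 bytes to return to the frontend
  return response.audioContent;
}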
Use emerging multimodal capabilities
The Gemini multimodal API streamlines voice input by letting you send audio data streams directly for processing. For output, textual responses can still be converted to speech using standard Text-to-Speech (TTS) libraries or services. Alternatively, the multimodal API can provide audio streams directly, removing the need for the Web Speech API or separate TTS libraries.
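For illustration, the following sketch sends a recorded audio clip directly to the Gemini API with the @google/genai SDK instead of running a separate STT step; the model name, MIME type, and the base64Audio variable are assumptions rather than code from this tutorial.

// Sketch: asking Gemini a question directly from recorded audio (illustrative)
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function askWithAudio(base64Audio) {
  // base64Audio is assumed to hold the recorded clip, e.g. from a MediaRecorder blob
  const response = await ai.models.generateContent({
    model: 'gemini-2.0-flash',
    contents: [
      { inlineData: { mimeType: 'audio/webm', data: base64Audio } },
      { text: 'Answer the question asked in this audio clip.' },
    ],
  });
  return response.text;
}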