Case study: Implement high-leverage features

The implementations on this page demonstrate how to integrate specific high-impact capabilities for system control, information seeking, and content transformation into an adaptive agent.

Handle user preferences

Handling user preferences effectively is key to creating adaptive agents:

  • Challenge: the system needs to adapt to individual user settings and potentially changing preferences.
  • Insight into approaches:
    • Profiles: a straightforward method is to create predefined user profiles, each with an associated configuration of settings, so users can switch between configurations. A minimal sketch of this approach appears after this list.
    • Dynamic prompting: a more flexible approach uses the agent itself to determine appropriate settings. This approach involves constructing a prompt that provides Gemini with the following:
      • A description of all relevant application settings.
      • The possible values for each setting.
      • Examples of scenarios where different setting values might be beneficial.
      • The user's current query or request, which might imply a need for a settings adjustment or support. Gemini can then analyze this information and output the suggested setting values or configuration adjustments based on the user's explicit or implicit need.
  • Code example: the rest of this section walks through an implementation of the dynamic prompting approach.
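
For the profile approach, a minimal sketch might look like the following. The profile names and the settings they contain are illustrative; they mirror the AppSettings fields used in the dynamic prompting example later in this section.

// Minimal sketch of the profile-based approach (profile names are hypothetical).
// Each profile is a predefined configuration of the same settings the agent manages.
type Profile = { darkMode: boolean; fontSizeFactor: number };

const PROFILES: Record<string, Profile> = {
    default:        { darkMode: false, fontSizeFactor: 1.0 },
    lightSensitive: { darkMode: true,  fontSizeFactor: 1.0 },
    lowVision:      { darkMode: true,  fontSizeFactor: 1.5 },
};

// Switching preferences is simply selecting a different predefined configuration.
function applyProfile(name: keyof typeof PROFILES): Profile {
    return { ...PROFILES[name] };
}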

Managing application settings, particularly accessibility options, can be challenging. This example shows how Gemini simplifies the process by using function calling to interpret natural language requests, both explicit and implicit (such as "my eyes hurt"). By analyzing user needs against the current settings and descriptive context, Gemini can propose configuration changes, significantly improving usability and accessibility: users manage settings conversationally without having to navigate complex menus.

Define setting context for the LLM

The following code example defines the application settings provided to Gemini. The SETTING_DEFINITIONS array details each setting's purpose and offers example phrases, including implicit user needs, enabling the LLM to accurately adjust settings based on user requests.

// --- Interfaces and Types ---
interface AppSettings {
    darkMode: boolean;
    fontSizeFactor: number;
    // ... other settings ...
}
interface SettingDefinition {
    key: keyof AppSettings;
    description: string;
    examples: string[];
}

// --- App Settings Definitions ---
// Provide detailed context about each setting, including its purpose,
// possible values, and crucially, examples of explicit AND implicit user requests.
const SETTING_DEFINITIONS: SettingDefinition[] = [
    {
        key: "darkMode",
        description: "Display mode. Boolean: 'true' for dark background... Useful for light sensitivity...",
        examples: ["Enable dark mode", "Switch to light mode", "My eyes hurt from the bright screen", "I have photophobia"]
    },
    {
        key: "fontSizeFactor",
        description: "Text size multiplier. Number between 0.8... and 2.0... default 1.0...",
        examples: ["Make text bigger", "Increase font size", "Shrink the text", "I find it hard to read this"]
    },
    // ... other setting definitions ...
];

Process requests to update settings

The following code example shows Gemini performing contextual analysis of user requests. The update_app_settings tool creates a detailed prompt, incorporating the user's request, current settings, and comprehensive setting definitions. Using this combined context, Gemini interprets both explicit and implicit needs and generates a complete, updated settings configuration as a JSON object.

/**
 * Tool: Calculates new settings using a transactional LLM call.
 * Demonstrates analyzing user need against current state and definitions.
 */
async function update_app_settings(
    { currentSettings, userRequest }: { currentSettings: AppSettings, userRequest: string },
    model: GenerativeModel,
    settingDefinitions: SettingDefinition[]
): Promise<AppSettings> {
    /* ... logging args ... */

    // Prepare detailed descriptions including current values for the prompt
    const descriptions = settingDefinitions.map(def =>
        `Setting: ${def.key}\n` +
        `Current value: ${JSON.stringify(currentSettings[def.key])}\n` +
        `Description: ${def.description}\n` +
        `Example requests: ${def.examples.join("; ")}`
    ).join('\n\n');

    // **Insight:** The prompt combines all relevant context for analysis:
    // the current settings state, the setting definitions, and the user's specific request.
    const transactionalPrompt = `Analyze the user request based on the current application settings and their descriptions provided below. Determine which settings need to change and calculate their new values based on the request (explicit or implicit)...

Return ONLY a single JSON object representing the *complete set* of application settings with the updated values...

Current Settings Object:
${JSON.stringify(currentSettings, null, 2)}

Setting Descriptions:
---
${descriptions}
---

User Request: "${userRequest}"

Required JSON Output Format (Complete AppSettings Object):
{ /* ... AppSettings structure ... */ }`;

    // Request Gemini to analyze and output the new settings JSON
    const request: GenerateContentRequest = {
        contents: [{ role: "user", parts: [{ text: transactionalPrompt }] }],
        generationConfig: {
            responseMimeType: "application/json",
            responseSchema: { /* ... AppSettings schema ... */ }
        }
    };

    try {
        /* ... make call, parse, validate ... */
        const result = await model.generateContent(request);
        const proposedSettings = JSON.parse(result.response.text()) as AppSettings;
        // ... (validation) ...
        console.log("Proposed Settings (validated):", JSON.stringify(proposedSettings));
        return proposedSettings; // <-- Gemini's suggested settings
    } catch (error) { /* ... error handling ... */ return currentSettings; }
}
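
The orchestration in the next section passes two function declarations, getCurrentSettingsTool and updateSettingsTool, to the model; they are not shown elsewhere in this case study. The following is a minimal sketch of what they might look like, assuming the @google/generative-ai SDK used in this example (the parameter schema is abbreviated and illustrative):

import { FunctionDeclaration, SchemaType } from "@google/generative-ai";

// Declares a no-argument tool; the application answers it with the current AppSettings object.
const getCurrentSettingsTool: FunctionDeclaration = {
    name: "get_current_app_settings",
    description: "Returns the current application settings as a JSON object.",
};

// Declares the settings-update tool; the application fulfills it by calling update_app_settings() above.
const updateSettingsTool: FunctionDeclaration = {
    name: "update_app_settings",
    description: "Calculates the complete, updated settings object from the current settings and the user's original request (explicit or implicit).",
    parameters: {
        type: SchemaType.OBJECT,
        properties: {
            currentSettings: {
                type: SchemaType.OBJECT,
                description: "The current AppSettings object.",
                properties: {}, // abbreviated; mirror the AppSettings structure here
            },
            userRequest: {
                type: SchemaType.STRING,
                description: "The user's original request text.",
            },
        },
        required: ["currentSettings", "userRequest"],
    },
};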

System instruction and example interaction flow

The following code example outlines the orchestration of a two-step LLM workflow. The SYSTEM_INSTRUCTION explicitly guides the LLM to first call get_current_app_settings and then update_app_settings. When a user query is sent, the LLM follows these instructions, triggering this sequence of tool interactions to fulfill the request by dynamically fetching and analyzing the app's state.

// --- System Instruction Definition ---
// **Insight:** Define the workflow for the main LLM when settings changes are needed.
const SYSTEM_INSTRUCTION = `You are an AI assistant helping users manage application settings.

WHEN THE USER EXPRESSES A DESIRE TO CHANGE SETTINGS (...):

1.  **Get current state:** first, call the 'get_current_app_settings'
    function.
2.  **Calculate new state:** next, call the 'update_app_settings' function. Pass
    the 'currentSettings' object and the user's *original request* text.
3.  **Confirm/inform:** after the 'update_app_settings' function returns...,
    inform the user clearly about the changes.

For general chat, respond conversationally.`;

// --- Inside main function ---

    // Initialize model with system instruction and tools...
    const model = genAI.getGenerativeModel({
        model: "gemini-2.0-flash-exp",
        systemInstruction: SYSTEM_INSTRUCTION,
        tools: [{ functionDeclarations: [getCurrentSettingsTool, updateSettingsTool] }],
    });
    const chat: ChatSession = model.startChat({ history: [] });
    /* ... get user query ... */
    const userQuery = "My eyes are sensitive to light and I need a bigger font size.";
    console.log(`\nUser: ${userQuery}`);

    // Send the user's request to the chat session
    let result = await chat.sendMessage(userQuery);

    // **Insight:** The LLM follows the system instruction, orchestrating the
    // two-step function call process automatically based on the user query.
    // The application code handles executing the functions when called.

    // --- Example Interaction Flow & Output ---
    /*
    (User query sent above)

    Assistant requested Function Call 1: get_current_app_settings
    -> Tool executed, returns: {"darkMode":false,"fontSizeFactor":1.0,...}
    -> Result sent back to LLM.

    Assistant requested Function Call 2: update_app_settings
    -> Tool executed (receives current settings & user query).
    -> Tool makes internal call to Gemini using the setting definitions context (defined in the previous sections).
    -> Tool returns proposed settings: {"darkMode":true,"fontSizeFactor":1.2,...}
    -> Result sent back to LLM.

    Assistant (Final Response): OK. I've increased the font size to 1.2 and enabled dark mode.
    */

    // (The actual implementation uses a loop to handle function calls; a sketch follows below.)
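
    // A minimal sketch of that loop. It assumes the helpers defined earlier in this
    // case study, plus two application-side values not shown here: `appSettings`
    // (the current in-memory settings) and `transactionalModel` (a plain model
    // instance, without tools, used for the nested settings call).
    let response = result.response;
    while (response.functionCalls()?.length) {
        const toolParts = [];
        for (const call of response.functionCalls()!) {
            let output: object;
            if (call.name === "get_current_app_settings") {
                output = appSettings;
            } else if (call.name === "update_app_settings") {
                output = await update_app_settings(
                    call.args as { currentSettings: AppSettings, userRequest: string },
                    transactionalModel,
                    SETTING_DEFINITIONS
                );
            } else {
                output = { error: `Unknown function: ${call.name}` };
            }
            toolParts.push({ functionResponse: { name: call.name, response: output } });
        }
        // Send the tool results back so the LLM can continue the workflow.
        result = await chat.sendMessage(toolParts);
        response = result.response;
    }
    console.log(`Assistant: ${response.text()}`);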

Mitigate live video processing lag

To address the challenge of latency in live video analysis, the following approach uses a storyboarding technique to enable AI agents to provide accurate, real-time responses to user queries.

  • Challenge: in live or streaming video, processing delays can cause AI agents to analyze older frames. This can lead to inaccurate or outdated responses to user queries about the current action.
  • Insight and approach: to boost real-time accuracy, create a timestamped "storyboard" centered on the user's query. Capture the latest available frame plus frames from a window around that point (for example, two to three seconds before and after), each with a relative timestamp. Sending this focused storyboard alongside the query provides the AI model with immediate visual context, greatly improving its ability to answer questions about the current visual state, even with underlying processing latency. This strategy relies on fast pre-processing to enhance responsiveness; a minimal sketch of the frame-selection step follows this list.
  • Example: in a live cooking show, if a chef adds an ingredient at 55 seconds and a user immediately asks "What spice was that?", a two-second processing lag might mean the agent has only processed up to 53 seconds. By sending a storyboard with frames from 52 to 56 seconds with the query, the agent can analyze that specific window, correctly identify the spice added around 55 seconds, and provide an accurate, timely response.
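
The following is a minimal sketch of the frame-selection step referenced above. It assumes the application keeps a rolling buffer of captured frames; the buffer shape, window size, and part layout are illustrative:

// A captured frame: a timestamp relative to the start of the stream (in seconds)
// and its JPEG bytes encoded as base64.
interface CapturedFrame { timeSec: number; jpegBase64: string; }

// Build a query-centered storyboard from a rolling frame buffer: the latest frame
// plus frames within +/- windowSec of it, each labeled with its relative timestamp.
function buildStoryboardParts(
    frameBuffer: CapturedFrame[],
    windowSec = 2.5
): Array<{ text: string } | { inlineData: { mimeType: string, data: string } }> {
    if (frameBuffer.length === 0) return [];
    const latest = frameBuffer[frameBuffer.length - 1].timeSec;
    const selected = frameBuffer.filter(f => Math.abs(f.timeSec - latest) <= windowSec);
    // Interleave a timestamp label with each frame so the model can reason about
    // when each frame occurred relative to the user's question.
    return selected.flatMap(f => [
        { text: `Frame at t=${f.timeSec.toFixed(1)}s (latest frame is t=${latest.toFixed(1)}s):` },
        { inlineData: { mimeType: "image/jpeg", data: f.jpegBase64 } },
    ]);
}

// Usage sketch: send the storyboard parts together with the user's query.
// const parts = [...buildStoryboardParts(frameBuffer), { text: "What spice was that?" }];
// const result = await model.generateContent(parts);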

Use timestamped storyboards for generating contextual audio descriptions

To create meaningful and accurately timed audio descriptions for video content, this approach uses timestamped storyboards to provide AI agents with a comprehensive understanding of both visual and audio changes over time.

  • Challenge: transforming visual video content into meaningful audio descriptions requires a holistic understanding of how the visuals evolve over time and relate to the accompanying audio track, such as dialogue, sound effects, and music. Analyzing single frames in isolation is insufficient.
  • Insight and approach: create a "storyboard" by capturing video frames at regular, frequent intervals (for example, every 0.5 seconds, or at scene changes and key moments), and add a relative timestamp to each captured frame. Providing this sequence of timestamped frames to the AI agent allows it to analyze visual changes over time and understand their relationship to the corresponding moments in the audio track, so it can generate contextually relevant and accurately timed audio descriptions. A sketch of this approach follows this list.
  • Example: consider a scene where a character walks across a room from 20 seconds to 23 seconds, briefly pauses silently looking at a painting from 24 seconds to 26 seconds, and then dialogue starts at 27 seconds. By analyzing both the audio track and timestamped storyboard frames—for example, from 20.0 seconds to 27.0 seconds—the agent can generate descriptions like: "At 21 seconds, John walks towards the fireplace. At 25 seconds, he stops and looks closely at the painting overhead." These descriptions accurately reflect the visual action and fit appropriately within pauses in the audio.
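
The following sketch shows one way this could be wired up, assuming the @google/generative-ai SDK, frames already sampled at a fixed interval, and a transcript of the audio track obtained elsewhere; the names and prompt wording are illustrative:

import { GenerativeModel } from "@google/generative-ai";

// A frame sampled at a regular interval (for example, every 0.5 seconds),
// carrying its relative timestamp.
interface TimedFrame { timeSec: number; jpegBase64: string; }

// Ask the model for short, timed descriptions that fit into pauses in the audio,
// given the timestamped storyboard frames and the transcript of the audio track.
async function describeSegment(
    model: GenerativeModel,
    frames: TimedFrame[],
    transcript: string
): Promise<string> {
    const prompt =
        "You are generating audio descriptions for a video segment.\n" +
        "Each image below is labeled with its relative timestamp.\n" +
        "Using the visual changes over time and the transcript, write short, timed\n" +
        "descriptions that fit into pauses in the audio.\n\n" +
        `Transcript:\n${transcript}`;

    const parts = [
        { text: prompt },
        ...frames.flatMap(f => [
            { text: `t=${f.timeSec.toFixed(1)}s:` },
            { inlineData: { mimeType: "image/jpeg", data: f.jpegBase64 } },
        ]),
    ];
    const result = await model.generateContent(parts);
    return result.response.text();
}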

The following storyboard was generated to answer the user query "What kind of flower is the purple one?". Using video descriptions and transcripts, Gemini identified that the flower appears around second 7, which allowed sampling of the relevant video frames for the storyboard. The sampled frames are then passed to Gemini as a single image, a format that enables Gemini to interpret motion and actions. For example, for "What is happening with the flower?", Gemini would likely respond, "It is blooming around second 7."

Storyboard showing multiple video frames of a flower