This section describes effective prompt engineering techniques and strategies for managing AI agent limitations to achieve accurate and reliable behavior.
Engineer effective prompts
Achieving optimal behavior and accuracy from your AI agent relies on well-designed prompts. Here are some key strategies for effective prompt engineering:
- Provide relevant context: always include necessary context directly within the prompt. For example:
  - For video analysis, this might mean frames, transcripts, or metadata.
  - For a navigation app, include the user's location and preferences.
- Keep prompts concise: define specific, focused tasks for each prompt. Break down complex user goals into smaller, manageable sub-tasks to prevent the agent from losing track or performing poorly.
- Use multimodality: utilize Gemini's ability to understand prompts referencing diverse data types, like images and text together, to improve its grasp of user intent.
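For example, a single focused sub-task prompt that bundles its own context might look like the following sketch (the videoDescriptions and userQuery values are illustrative assumptions, not part of a specific API):
// Hypothetical sketch: one concise sub-task with its context inlined in the prompt.
const videoDescriptions = `0:05 CEO Sundar Pichai takes selfie near seating.
0:08 Person walks onto main stage.`;
const userQuery = "When does Sundar Pichai appear?";
const focusedPrompt = `Using ONLY the timestamped descriptions below, list the
timestamps (MM:SS) relevant to the user query.
Descriptions:
${videoDescriptions}
User query: "${userQuery}"`;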
Apply key insights
The following use case demonstrates these insights through a video assistant scenario:
- Context in prompts: use time-stamped text descriptions and storyboard images as specific context to improve accuracy.
- Conciseness and task decomposition: instead of a single, complex prompt for video analysis, use a sequential, two-step function calling process to answer specific visual queries.
- Multimodality: combine image and text input.
Configure the model with tools and multi-step instructions
This Gemini model setup for the video assistant defines two tools:
- find_relevant_timestamps for text analysis
- answer_user_query for multimodal analysis
Crucially, the systemInstruction explicitly tells the model how to handle visual queries using a sequential, two-step process, effectively splitting a more complex task into two concise steps:
- Call find_relevant_timestamps.
- Use its results to call answer_user_query.
This setup prepares the model to orchestrate the necessary actions based on user intent.
// Verify you have the necessary imports from your Gemini client library
// For example, if using @google/generative-ai:
// import { GoogleGenerativeAI, HarmCategory, HarmBlockThreshold } from "@google/generative-ai";
// --- Tool definitions (declarations only) ---
// In JavaScript, you might not have direct type declarations like FunctionDeclaration
// unless defined elsewhere. We'll define them as plain objects.
const findRelevantTimestampsTool = {
name: "find_relevant_timestamps",
description: "Analyzes video text descriptions to find timestamps relevant to a user query.",
parameters: {
type: "OBJECT",
properties: {
userQuery: { type: "STRING" }
},
required: ["userQuery"]
}
};
const answerUserQueryTool = {
name: "answer_user_query",
description: "Answers a user query using specific timestamps by analyzing corresponding visual context. Call this AFTER find_relevant_timestamps.",
parameters: {
type: "OBJECT",
properties: {
userQuery: { type: "STRING" },
relevantTimestamps: {
type: "ARRAY",
items: { type: "STRING" }
}
},
required: ["userQuery", "relevantTimestamps"]
}
};
// --- System instruction definition ---
const SYSTEM_INSTRUCTION = `You are a helpful video assistant.
WHEN THE USER ASKS A QUESTION ABOUT SPECIFIC DETAILS OR EVENTS IN THE VIDEO
(...):
1. **Identify Relevant Times:** First, call the 'find_relevant_timestamps'
tool.
2. **Analyze Visuals:** Next, call the 'answer_user_query' tool. Pass it the
original user query AND the 'relevantTimestamps' array returned by the first
tool.
3. **Respond:** Formulate a final response to the user based *only* on the
answer provided by the 'answer_user_query' tool.
If the user asks a general question... answer directly.`;
// --- Model initialization ---
const tools = [
{ functionDeclarations: [findRelevantTimestampsTool, answerUserQueryTool] }
];
// Assuming 'genAI' is an initialized instance of GoogleGenerativeAI
// For example: const genAI = new GoogleGenerativeAI(API_KEY);
const model = genAI.getGenerativeModel({
model: "gemini-2.0-flash-exp", // Or your preferred model
// Insight: Provide clear instructions and available tools upfront.
systemInstruction: SYSTEM_INSTRUCTION,
tools: tools,
});
// --- Start the chat session ---
const chat = model.startChat({ history: [] });
Provide text context for timestamp analysis
The find_relevant_timestamps tool exemplifies context in prompts by explicitly
including VIDEO_DESCRIPTIONS within its prompt. This focused approach allows
Gemini to efficiently identify relevant timestamps based solely on the provided
text metadata, demonstrating the first concise step of the conciseness and task
decomposition insight.
const VIDEO_DESCRIPTIONS = `
0:00 Shoreline Amphitheatre exterior, crowd.
0:03 Google I/O 2024 logo & montage.
0:05 CEO Sundar Pichai takes selfie near seating.
0:07 Lamp post with Google I/O banners.
0:08 Person walks onto main stage.
0:09 Speaker on stage addressing audience...`;
/**
* Tool 1: Finds relevant timestamps using a transactional LLM call.
* Demonstrates providing specific text context within the prompt.
*
* @param {object} params - The parameters for the function.
* @param {string} params.userQuery - The user's query.
* @param {object} model - The GenerativeModel instance from the Gemini API.
* @returns {Promise<{ relevantTimestamps: string[] }>} A promise that resolves to an object
* containing an array of relevant timestamps.
*/
async function find_relevant_timestamps({ userQuery }, model) {
console.log(`\n--- TOOL CALL: find_relevant_timestamps ---`);
console.log(`Finding timestamps relevant to: "${userQuery}"`);
const transactionalPrompt = `Analyze the following video descriptions and
identify the timestamps (in MM:SS format) that are most relevant to answering
the user query. Return ONLY a JSON array of strings containing the relevant
timestamps. If no specific timestamps seem relevant, return an empty array.
Video descriptions:
---
${VIDEO_DESCRIPTIONS}
---
User query: "${userQuery}"
Response format: JSON array of strings`;
const request = {
contents: [{ role: "user", parts: [{ text: transactionalPrompt }] }],
generationConfig: {
responseMimeType: "application/json",
// Define the responseSchema directly in JavaScript
responseSchema: {
type: "OBJECT",
properties: {
timestamps: {
type: "ARRAY",
items: {
type: "STRING"
}
}
},
required: ["timestamps"]
}
}
};
try {
const result = await model.generateContent(request);
// Assuming TimestampResponse would be something like { timestamps: string[] }
const parsed = JSON.parse(result.response.text());
return { relevantTimestamps: parsed.timestamps || [] };
} catch (error) {
console.error("Error in find_relevant_timestamps:", error);
return { relevantTimestamps: [] };
}
}
Provide multimodal context for answering
The answer_user_query tool exemplifies the multimodal context insight by
combining visual information from imagePart with a text prompt. This approach
also demonstrates the task decomposition insight. By first using text analysis
to identify relevant timestamps, the subsequent multimodal step becomes more
efficient and reliable than analyzing the entire video at once. This delivers
better results.
// Assuming loadImageBase64 is a function that converts an image path to a Base64Parts object
// and VIDEO_DESCRIPTIONS is defined globally or imported.
// import { GenerativeModel } from '@google/generative-ai'; // If using the client library
/**
* Tool 2: Answers query using timestamps and image context using a multimodal LLM call.
* Demonstrates providing multimodal context (text + image) in the prompt.
*
* @param {object} params - The parameters for the function.
* @param {string} params.userQuery - The user's original query.
* @param {string[]} params.relevantTimestamps - An array of relevant timestamps.
* @param {GenerativeModel} model - The GenerativeModel instance from the Gemini API.
* @returns {Promise<{ answerText: string }>} A promise that resolves to an object
* containing the answer text.
*/
async function answer_user_query({ userQuery, relevantTimestamps }, model) {
console.log(`\n--- TOOL CALL: answer_user_query ---`);
// Placeholder for actual logic that might use relevantTimestamps to fetch specific frames
// For this example, we're simulating a single storyboard image.
console.log("Simulating fetching relevant video frames (loading storyboard)...");
// Verify STORYBOARD_IMAGE_PATH is defined and loadImageBase64 is implemented
// Example: const STORYBOARD_IMAGE_PATH = 'path/to/your/storyboard.jpg';
// Example loadImageBase64 implementation (simplified):
// async function loadImageBase64(imagePath) {
// const fs = require('fs').promises; // Node.js example for file reading
// const mime = require('mime-types'); // For determining mime type
// const imageData = await fs.readFile(imagePath, { encoding: 'base64' });
// return {
// inlineData: {
// data: imageData,
// mimeType: mime.lookup(imagePath) || 'image/jpeg'
// }
// };
// }
const imagePart = await loadImageBase64(STORYBOARD_IMAGE_PATH); // <-- Visual context
const transactionalPrompt = `Based on the provided image (representing relevant
video frames), the timestamped video descriptions and the original user query,
please answer the query.
User Query: "${userQuery}"
** Video Descriptions **
${VIDEO_DESCRIPTIONS}
`; // <-- Relevant text context is included directly in the prompt above
const request = {
contents: [{
role: "user",
parts: [
{ text: transactionalPrompt },
imagePart // This needs to be in the format expected by the Gemini API (e.g., inlineData)
]
}],
};
try {
console.log("Making multimodal call to Gemini...");
const result = await model.generateContent(request);
return { answerText: result.response.text() };
} catch (error) {
console.error("Error in answer_user_query:", error);
/* ... handle specific error cases ... */
return { answerText: "Error analyzing visual context." };
}
}
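The loadImageBase64 helper used above is only sketched in comments. A self-contained Node.js version of such a helper might look like the following; this is an assumption about your environment, and the extension-to-MIME-type mapping and JPEG fallback are illustrative choices:
// Hypothetical Node.js helper: reads a local image and wraps it in the
// inlineData part format used elsewhere in this example.
import { readFile } from "node:fs/promises";
import { extname } from "node:path";
const MIME_TYPES = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".webp": "image/webp"
};
async function loadImageBase64(imagePath) {
  // Encode the file as Base64 so it can be sent inline with the prompt.
  const data = await readFile(imagePath, { encoding: "base64" });
  return {
    inlineData: {
      data,
      mimeType: MIME_TYPES[extname(imagePath).toLowerCase()] || "image/jpeg"
    }
  };
}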
Trigger the multi-step flow with a user query
To initiate the multi-step flow with a user query, use the following code example:
// --- Inside main function's try block ---
// --- Simulate user query ---
const userQuery = "What is Sundar Pichai wearing?";
console.log(`\nUser: ${userQuery}`);
// **Insight:** Send the user query. The model uses session history,
// system instructions, and tool definitions to determine the multi-step plan.
let result = await chat.sendMessage(userQuery);
// --- Expected interaction flow (conceptual) ---
// Based on the system instruction and the nature of the query,
// the following sequence is expected, managed by the application's function
// calling loop (a minimal sketch of such a loop appears after this block):
//
// 1. Gemini receives the query.
// 2. Gemini decides Tool 1 ('find_relevant_timestamps') is needed based on
//    instructions.
// -> Issues Function Call 1
// 3. App executes Tool 1 (makes internal text analysis call, potentially using context like video descriptions).
// 4. App sends Tool 1 Result (e.g., { relevantTimestamps: ["0:05", "0:24"] }) back to Gemini.
// 5. Gemini receives Tool 1 result. Instructions say Tool 2 is next.
// 6. Gemini decides Tool 2 ('answer_user_query') is needed.
// -> Issues Function Call 2 (with query + timestamps)
// 7. App executes Tool 2 (loads image context for timestamps, makes internal multimodal call).
// 8. App sends Tool 2 Result (e.g., { answerText: "He wears X at 0:05 and Y at 0:24." }) back to Gemini.
// 9. Gemini receives Tool 2 result. Instructions say to formulate the final response now.
// 10. Gemini generates the final text response for the user.
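A minimal sketch of such a function calling loop, assuming the chat session created during model setup and the find_relevant_timestamps and answer_user_query implementations defined earlier, might look like this:
// Hypothetical sketch of the application-side function calling loop.
const localTools = { find_relevant_timestamps, answer_user_query };
async function resolveFunctionCalls(initialResponse, chat, model) {
  let response = initialResponse;
  let functionCall = response.candidates?.[0]?.content?.parts?.[0]?.functionCall;
  // Keep executing requested tools until the model returns plain text.
  while (functionCall) {
    const { name, args } = functionCall;
    console.log(`Assistant requested Function Call: ${name}`);
    // Run the matching local tool, then send its result back to the model.
    const toolResult = await localTools[name](args, model);
    const next = await chat.sendMessage([
      { functionResponse: { name, response: toolResult } }
    ]);
    response = next.response;
    functionCall = response.candidates?.[0]?.content?.parts?.[0]?.functionCall;
  }
  return response.text();
}
// Usage with the result from chat.sendMessage(userQuery) above:
// const finalAnswer = await resolveFunctionCalls(result.response, chat, model);
// console.log(`Assistant (Final Answer): ${finalAnswer}`);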
The numbered comments show the conceptual expected flow; the following is example console output from this interaction:
User: What is Sundar Pichai wearing?
Assistant requested Function Call 1: find_relevant_timestamps
--- TOOL CALL: find_relevant_timestamps ---
Finding timestamps relevant to: "What is Sundar Pichai wearing?"
--- TOOL RESULT: Found timestamps: ---
Sending Function Response [find_relevant_timestamps] back to model...
Assistant requested Function Call 2: answer_user_query
--- TOOL CALL: answer_user_query ---
Answering query "What is Sundar Pichai wearing?" using timestamps:
Simulating fetching relevant video frames (loading storyboard)...
Loaded image...
--- TOOL RESULT: Generated Answer: "At 0:05 Sundar Pichai is wearing a grey
shirt or jacket. At 0:24, when he walks onto the stage, he is wearing a grey
shirt, a dark blue jacket and dark blue jeans." ---
Sending Function Response [answer_user_query] back to model...
Assistant (Final Answer): At 0:05 Sundar Pichai is wearing a grey shirt or
jacket. At 0:24, when he walks onto the stage, he is wearing a grey shirt, a
dark blue jacket and dark blue jeans.
Explicitly handle agent limitations to prevent hallucination
To effectively handle agent limitations and prevent AI systems from generating false information, consider these crucial aspects:
- Challenge: AI systems often don't know their own limitations and may hallucinate, producing plausible but incorrect answers when they can't fulfill a request.
- Insight and approach: to prevent hallucination, design the agent to explicitly recognize when a user's query is outside its capabilities or knowledge. It must then clearly and politely communicate this limitation instead of fabricating a response.
- Example: if an agent can't send emails, have it say so directly rather than pretending it can. Similarly, if a question is outside the provided context, instruct the agent to state that it can only answer questions related to the specific content.
- Code example implementation: this involves using Gemini's chat sessions, function calling, and strong system instructions. The agent can determine whether a query matches a defined capability or hits a limitation by analyzing the type of response generated by the model, for example, a function call request versus a direct text answer.
Configure the model with system instruction and tools
Before initiating the chat, the model is configured with specific components that define its behavior and capabilities:
- Static context: provides background data like location or user profiles to inform the model's responses.
- System instruction: outlines the model's limitations, guiding it to explain them directly instead of calling tools for unfeasible requests. For valid capabilities, it directs the model to refine the request and call the appropriate tool, or to respond directly with an explanation if no tool is available. General chat is handled conversationally.
- Tools: define the agent's executable functions, such as find_route or get_location_info. Crucially, actions outside the model's limitations have no corresponding tools.
// Verify you have the GoogleGenerativeAI library available.
// If using Node.js, you might need:
// const { GoogleGenerativeAI, HarmCategory, HarmBlockThreshold } = require('@google/generative-ai');
// If in a browser, verify the script tag is loaded:
// <script src="https://unpkg.com/@google/generative-ai@latest/dist/index.js"></script>
// Then GoogleGenerativeAI, HarmCategory, and HarmBlockThreshold would be globally
// available or destructured from a global object.
const STATIC_CONTEXT = {
// Current location based on context provided: Las Vegas, NV
currentLocation: "Near the Fountains of Bellagio, Las Vegas",
userProfile: "Prefers clear, direct instructions. Sometimes makes typos.",
// Define Capabilities using Tool Descriptions later
// Define Limitations Explicitly Here:
agentLimitations: [
"Cannot send emails, messages, or share content directly with contacts.",
"Cannot make phone calls or video calls.",
"Cannot book reservations or purchase tickets.",
"Cannot access real-time external data beyond basic map/location info (e.g., no live flight status or detailed external web browsing).",
"Cannot interact with device hardware or other applications."
]
};
// --- System Instruction ---
// Incorporates capabilities implicitly using tool descriptions,
// but explicitly lists limitations and handling instructions.
const SYSTEM_INSTRUCTION = `You are an AI assistant integrated into a navigation
and local information application. Maintain awareness of the ongoing
conversation history. Your primary goal is to be helpful and accurate WITHIN
your defined capabilities.
**YOUR CAPABILITIES:**
- Provide directions using the 'find_route' tool.
- Get information about specific locations using the 'get_location_info' tool.
- Engage in general conversation related to navigation or local points of
interest.
**YOUR LIMITATIONS:**
${STATIC_CONTEXT.agentLimitations.map(l => `- ${l}`).join('\n')}
**INSTRUCTIONS:**
1. **Analyze request:** Understand the user's intent using the conversation
history and provided context (location, profile).
2. **Check capabilities vs limitations:** Determine if the request matches a
capability (requiring a tool) or a limitation.
3. **If request matches a capability, for example, route, location information:**
a. **Refine query internally:** Before calling a tool, mentally refine the
user's request to be clear and actionable (correct typos like
'Ruth'->'route', resolve 'there'->'Bellagio', use context like 'from here').
b. **Call tool:** Invoke the appropriate tool ('find_route' or
'get_location_info') with the necessary, refined arguments. Include your
reasoning for the refinement if helpful.
c. **Respond after tool:** Once
the tool provides information, formulate a natural language response to the
user summarizing the tool's result.
4. **If request matches a limitation (e.g., email route, make call):**
a. **DO NOT CALL ANY TOOL.**
b. **Respond directly:** Clearly state that you cannot
perform the requested action, referencing the specific limitation, for example, "I
cannot send emails.". Offer an alternative if possible within your
capabilities, for example, "I can display the route information for you to
share.".
5. **For general chat:** Respond conversationally without calling tools.
**User profile:**
${STATIC_CONTEXT.userProfile}
**Current location context:**
Your current location is assumed to be: ${STATIC_CONTEXT.currentLocation}. Use
this for requests like 'from here'. `;
// --- Tool definitions ---
const findRouteTool = {
name: "find_route",
description: `Calculates and finds a suitable route between an origin and a
destination. Refine the user's natural language query into specific origin and
destination arguments, considering context and user profile.`,
parameters: {
type: "OBJECT",
description: "Origin and Destination for route finding.",
properties: {
origin: { type: "STRING", description: "The starting point of the route (e.g., 'current location', address, place name)." },
destination: { type: "STRING", description: "The ending point of the route (e.g., address, place name)." },
reasoning: { type: "STRING", description: "Brief explanation if the query was refined (e.g., typo correction, context used)." }
},
required: ["origin", "destination", "reasoning"]
}
};
const getLocationInfoTool = {
name: "get_location_info",
description: "Retrieves details about a specific point of interest (e.g.,
hours, description).",
parameters: {
type: "OBJECT",
description: "Specifies the location.",
properties: {
locationName: { type: "STRING", description: "The name of the place the user is asking about (e.g., 'Bellagio Conservatory', 'Eiffel Tower Restaurant')." },
reasoning: { type: "STRING", description: "Brief explanation if the query was refined." }
},
required: ["locationName", "reasoning"]
}
};
// --- Model Initialization ---
const tools = [{ functionDeclarations: [findRouteTool, getLocationInfoTool] }];
const apiKey = "YOUR_API_KEY"; // Replace with your actual API key
// Assuming GoogleGenerativeAI, HarmCategory, and HarmBlockThreshold are available in scope.
// If not, you'll need to import or require them as shown in the comments at the top.
const genAI = new GoogleGenerativeAI(apiKey);
const model = genAI.getGenerativeModel({
// Using a model that supports function calling well
model: "gemini-2.0-flash-exp",
systemInstruction: SYSTEM_INSTRUCTION,
tools: tools,
// Optional: Add safety settings if needed
safetySettings: [
{ category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE },
// Add others as needed
]
});
Start the chat and send user queries
To initiate and manage the conversation with the model, we use the following steps:
- Chat session: model.startChat() initiates the session.
- Sending messages: chat.sendMessage() sends the user query. History is managed automatically.
- Two turns: our simulation involves two queries. The first asks for directions and is set to activate find_route, and the second asks to email the route and produces a direct text response about the limitation.
async function runChat() {
console.log("Starting chat session...");
const chat = model.startChat({
history: [] // Start with empty history
});
// --- Turn 1: Requesting a route (Capability) ---
const query1 = "Show me the way to the Bellagio Conservatory from here";
console.log(`\nUser: ${query1}`);
let result1 = await chat.sendMessage(query1);
await handlePotentialFunctionCall(result1.response, chat); // Handle response (likely function call)
// --- Turn 2: Requesting an action that hits a limitation ---
const query2 = "Ok great, now email that route to my work email.";
console.log(`\nUser: ${query2}`);
let result2 = await chat.sendMessage(query2);
await handlePotentialFunctionCall(result2.response, chat); // Handle response (likely direct text)
console.log("\nChat finished.");
}
Process model responses as function calls or direct text
The handlePotentialFunctionCall helper function is the core logic for
interpreting the model's response. It checks whether the model did one of the
following:
- Requested a function call, thereby indicating an action within its capabilities.
- Provided a direct text response, either to engage in a general chat with the user or to explain limitations based on system instructions.
// Helper function to process responses
async function handlePotentialFunctionCall(response, chat) {
// Check if the model response includes a function call request
const functionCall = response.candidates?.[0]?.content?.parts?.[0]?.functionCall;
if (functionCall) {
const { name, args } = functionCall;
console.log(`\nAssistant requested Function Call: ${name}`);
// Call the appropriate local function based on the name
let apiResponse;
if (name === "find_route") {
apiResponse = await find_route(args); // Removed 'as any' cast
} else if (name === "get_location_info") {
apiResponse = await get_location_info(args); // Removed 'as any' cast
} else {
console.warn(`Unknown function call requested: ${name}`);
// Handle unknown function calls if necessary
apiResponse = { error: `Unknown function: ${name}` };
}
// Send the function response back to the model
const result = await chat.sendMessage([{ functionResponse: { name, response: apiResponse } }]);
// Log the final natural language response from the model
console.log(`\nAssistant (after function call): ${result.response.text()}`);
} else {
// No function call - Model responded directly
// This happens for general chat OR when a limitation is hit
console.log(`\nAssistant (Direct Response): ${response.text()}`);
// The text itself should explain the limitation based on system instructions
}
}
Define tools
These are the local functions executed when the model requests a function call. In a real application, they interact with actual APIs or services like a mapping service.
// --- Tool Implementations ---
async function find_route(args) {
console.log(`\n--- TOOL CALL: find_route ---`);
console.log(` Origin: ${args.origin}`);
console.log(` Destination: ${args.destination}`);
console.log(` Reasoning: ${args.reasoning}`);
// Simulate route finding
return { routeInfo: `Okay, calculating route from ${args.origin} to
${args.destination}. Route found: [Details placeholder].` };
}
async function get_location_info(args) {
console.log(`\n--- TOOL CALL: get_location_info ---`);
console.log(` Location Name: ${args.locationName}`);
console.log(` Reasoning: ${args.reasoning}`);
// Simulate fetching info
return { details: `Details for ${args.locationName}: Open daily 9 AM - 11 PM.
Features seasonal floral displays. [More details placeholder].` };
}
View conceptual console output
This section illustrates the expected console output, clearly showing the difference between handling a capability using a function call and handling a limitation using a direct response.
// This is a simulation of console output, not executable code for the full
// interaction. The actual interaction would involve API calls to Gemini and
// your defined tools.
console.log("Starting chat session...\n");
console.log("User: Show me the way to the Bellagio Conservatory from here\n");
console.log("Assistant requested Function Call 1: find_route\n");
console.log("--- TOOL CALL: find_route ---");
console.log('Calculating route based on origin: "Near the Fountains of Bellagio, Las Vegas", destination: "Bellagio Conservatory". Reasoning: "Used current location context for origin. Destination identified from user query."');
console.log("--- TOOL RESULT: Route info generated: \"{ routeInfo: 'Okay, calculating route from Near the Fountains of Bellagio, Las Vegas to Bellagio Conservatory. Route found: [Details placeholder].' }\"\n");
console.log("Sending Function Response [find_route] back to model...\n");
console.log("Assistant (Final Answer): Okay, calculating route from Near the Fountains of Bellagio, Las Vegas to Bellagio Conservatory. Route found: [Details placeholder].\n");
console.log("User: Ok great, now email that route to my work email.\n");
console.log("Assistant (Direct Response): I cannot send emails, as that's one of my limitations. However, I have displayed the route information for you.\n");
console.log("Chat finished.");