Prompting with images and text using the Gemini API for accessibility

1. Introduction

Gemini is a multimodal model, capable of accepting multiple types of input data, such as text, images and audio, and returning output across the same variety of types.

In this codelab, you will learn how to use the Google Generative AI SDK to access the Gemini Pro Vision model, send it multimodal payloads of text and image data, and receive a text-only response.

You will achieve this by first calling the API with a text-only prompt and then attaching image payloads. Along the way, you will explore a variety of prompt types that can be used with images and learn how to effectively "think" about prompting with image payloads in mind.

Let's get started.

Prerequisites

  • A basic understanding of JavaScript.
  • Command-line access to a computer with NodeJS and NPM installed.

What you'll learn

  • How to access the Gemini API.
  • How to send text and receive text from the Gemini API.
  • How to send text and images to the API.

2. Accessing the Gemini API

We'll begin by installing the Google Generative AI SDK for NodeJS via npm:

npm install @google/generative-ai
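
If you're working in a fresh, empty directory, you can optionally create a package.json first so that the dependency is recorded:

npm init -y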

Next, let's create a basic NodeJS script to execute our API calls.

Create a file called script.js and add the following code. It imports the library, sets up a generation configuration and safety settings, and makes a first text-only request to the Gemini Pro model:

const { GoogleGenerativeAI } = require("@google/generative-ai");

const generationConfig = {
  temperature: 0.7,       // Controls randomness; lower values are more deterministic
  candidateCount: 1,      // Number of response candidates to generate
  topK: 40,               // Sample from the 40 most likely tokens
  topP: 0.95,             // Nucleus sampling threshold
  maxOutputTokens: 1024,  // Upper limit on the length of the response
};

const safetySettings = [
  {
    // Relax the dangerous-content filter for this codelab
    category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
    threshold: 'BLOCK_NONE'
  },
];

const genAI = new GoogleGenerativeAI(process.env.API_KEY);

const model = genAI.getGenerativeModel({
  model: "gemini-pro",
});

model.generateContent({
  generationConfig,
  safetySettings,
  contents: [
    {
      role: "user",
      parts: [
        { text: 'On what planet do humans live? ' }
      ]
    },
  ],
}).then(result => {
  console.log(JSON.stringify(result, null, 2));
});

After saving the file, try executing the following command to test the API:

node script.js

If you already have the API_KEY environment variable set, the script should execute successfully. If not, you can set it via the following command:

export API_KEY=<YOUR API KEY GOES HERE>
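
Note that the export command above is for macOS and Linux shells. On Windows, the PowerShell equivalent would be something like:

$env:API_KEY="<YOUR API KEY GOES HERE>"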

When the script executes successfully, the result should look something like:

{
  "response": {
    "candidates": [
      {
        "content": {
          "parts": [
            {
              "text": "Humans currently only live on one planet: Earth. There are no known human colonies or permanent settlements on any other planets in our solar system or beyond."
            }
          ],
          "role": "model"
        },
        "finishReason": "STOP",
        "index": 0,
        "safetyRatings": [
          {
            "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
            "probability": "NEGLIGIBLE"
          },
          {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "probability": "NEGLIGIBLE"
          },
          {
            "category": "HARM_CATEGORY_HARASSMENT",
            "probability": "NEGLIGIBLE"
          },
          {
            "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
            "probability": "NEGLIGIBLE"
          }
        ]
      }
    ],
    "promptFeedback": {
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  }
}

We successfully accessed the API, but we only prompted the model with text and received a text response. This is for two reasons. First, we're using the Gemini Pro model; if we want to attach images to our prompt payload, we need to access the Gemini Pro Vision model via the API instead. Secondly, we are not supplying an image payload.

Let's try accessing the Gemini Pro Vision model and see what happens.

Update the model like so:

const model = genAI.getGenerativeModel({
  model: "gemini-pro-vision",
});

And in the text prompt, try changing it to this:

contents: [
    {
      role: "user",
      parts: [
        { text: 'Can you see an image attached to this message?' },
      ]
    }
]

If you re-run the script via the command line:

node script.js

You should receive an error:

node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

Error: [400 Bad Request] Add an image to use models/gemini-pro-vision, or switch your model to a text model.

What happened? Now that we've swapped to the Gemini Pro Vision model, the API expects an image to be attached with the API call. However, we only sent text. Hence the error.

In the next step, we'll attach an image and ask the LLM questions about it. To do this, we'll need to load the image into JavaScript and encode it as Base64.

3. Seeing is believing

First we'll need to access a local image. I recommend finding a picture of a cat and using it. Here's a photo I took:

[Image: a photo of a gray and white cat sleeping in the grass]

Next, we'll need the join function from the ‘path' package and NodeJS's built-in ‘fs' module for reading files. Let's add these to the top of the script:

const { join } = require('path');
const fs = require('fs');

And before we engage with the Gemini API, add this code as well:

const base64Image = Buffer.from(fs.readFileSync(join(__dirname, 'cat.png'))).toString("base64");

This assumes that we've placed the cat.png picture next to the script.js file in our file system.

The above code loads the picture into memory as a Base64-encoded string in a variable called base64Image. We can now include this in the prompt, like so:

  contents: [
    {
      role: "user",
      parts: [
        { text: 'Can you see an image attached to this message?' },
        {
          inlineData: {
            mimeType: 'image/png',
            data: base64Image
          }
        },
      ]
    },
  ],

If we re-run the script on the command line, the output should look something like this:

{
  "response": {
    "candidates": [
      {
        "content": {
          "parts": [
            {
              "text": " Yes, I can see a cat sleeping in the grass."
            }
          ],
          "role": "model"
        },
...

Success! The Gemini Pro Vision model saw the attached photo of the cat and described it.

4. Making pictures more accessible

Now that our script is communicating successfully with the Gemini API and we can send images and text as prompts, let's try leveraging this new capability to help users. We're going to adapt our script to read a local HTML file, parse it for <IMG> tags and add ‘alt' text attributes to any images that are missing them.

We'll start by creating a simple HTML page. Let's name it input.html:

<html>
  <body>
    <img src="cat.png" />

    <img src="cat.png" alt="It's a cat." />

    <img src="cat.png" alt="It's a dog." />
  </body>
</html>

We'll need to pull in an additional package to make parsing the HTML simpler. Include ‘cheerio' near the top of your script, alongside the requires we added earlier:

const cheerio = require("cheerio");

Next, we'll reorganize our script to execute a ‘main' function and read the input.html file before parsing it using Cheerio. Cheerio gives us a simple query selector function for the file, similar to old school jQuery syntax.

We can then query the HTML file using $("img") and loop through the images in the file:

async function main() {
  const html = fs.readFileSync("input.html", "utf8");

  const $ = cheerio.load(html);

  $("img").each(function (i, element) {
    const src = $(element).attr("src");
    const originalAlt = $(element).attr("alt");

    if (!fs.existsSync(src)) {
      console.log("Image doesn't exist:", src);
      return;
    }

    const base64Image = Buffer.from(fs.readFileSync(join(__dirname, src))).toString("base64");

    if (originalAlt !== undefined) {
      console.log("Alt text is already defined:", originalAlt);
      return;
    }

    // TODO: Query the Gemini API with the image and ask for a description
    // to update the missing alt text
  });
}

main();

For each image, we check whether it exists locally and encode it to Base64 if it does. On top of that, we check whether the image tag already has an ‘alt' text attribute. If it does, we skip the image.

Finally, we need to fill in the logic to query the Gemini API.

Based on the code earlier, what would that look like?

Try it yourself before continuing to the next step.

5. Getting ‘alt' text

How did you go? Hopefully you added in some code that looked something like this:

    const geminiSummary = await model
      .generateContent({
        generationConfig,
        safetySettings,
        contents: [
          {
            role: "user",
            parts: [
              { text: "What is in this picture?" },
              {
                inlineData: {
                  mimeType: "image/png",
                  data: base64Image,
                },
              },
            ],
          },
        ],
      })
      .then((result) => {
        return extractText(result);
      });

    $(element).attr("alt", geminiSummary.trim());

    console.log("Updated alt text for", src, "image:", geminiSummary.trim());

But what is the function ‘extractText' doing? Well, the response from the Gemini API is a nested object. To simplify extracting just the LLM's response itself, we can write the following helper function:

function extractText(obj) {
  try {
    return obj.response.candidates[0].content.parts[0].text;
  } catch(e) {
    return undefined;
  }
}

It's not elegant but it gets the job done for the purpose of this codelab. If you run the script on the command line again:

node script.js

You should see output like this:

Alt text is already defined: It's a cat.
Alt text is already defined: It's a dog.
Updated alt text for cat.png image: This is a picture of a gray and white cat sleeping in the grass.

At this point, our script should be looping through the images successfully and querying the Gemini API for new alt text attributes where necessary. However, pulling this asynchronous code together so that we can update the HTML and write the final output back to disk as an ‘output.html' file will require some additional changes.

Let's rearrange the script to include an array of promises before we begin the loop. We'll create a promise for each image as we loop through them, add it to the array, and then wait for all of the promises to resolve before finally writing out the HTML.

It should look like something like this:

const cheerio = require("cheerio");
const fs = require("fs");
const { GoogleGenerativeAI } = require("@google/generative-ai");
const { join } = require("path");

const generationConfig = {
  temperature: 0.7,
  candidateCount: 1,
  topK: 40,
  topP: 0.95,
  maxOutputTokens: 1024,
};

const safetySettings = [
  {
    category: "HARM_CATEGORY_DANGEROUS_CONTENT",
    threshold: "BLOCK_NONE",
  },
];

const genAI = new GoogleGenerativeAI(process.env.API_KEY);

const model = genAI.getGenerativeModel({
  model: "gemini-pro-vision",
});

async function main() {
  const html = fs.readFileSync("input.html", "utf8");
  const $ = cheerio.load(html);

  const promises = [];

  $("img").each(function (i, element) {
    // elements.push(element);

    promises.push(
      new Promise(async (resolve, reject) => {
        const src = $(element).attr("src");
        const originalAlt = $(element).attr("alt");

        if (!fs.existsSync(src)) {
          console.log("Image doesn't exist:", src);
          resolve(false);
          return;
        }

        const base64Image = Buffer.from(fs.readFileSync(join(__dirname, src))).toString("base64");

        if (originalAlt !== undefined) {
          console.log("Alt text is already defined:", originalAlt);
          resolve(false);
          return;
        }

        const geminiRead = await model
          .generateContent({
            generationConfig,
            safetySettings,
            contents: [
              {
                role: "user",
                parts: [
                  { text: "What is in this picture? " },
                  {
                    inlineData: {
                      mimeType: "image/png",
                      data: base64Image,
                    },
                  },
                ],
              },
            ],
          })
          .then((result) => {
            return extractText(result);
          });

        $(element).attr("alt", geminiRead.trim());

        console.log("Updated alt text for", src, "image:", geminiRead.trim());
        resolve(true);
      })
    );
  });

  Promise.all(promises).then((values) => {
    const updatedHTML = $.html();
    fs.writeFileSync("output.html", updatedHTML);
  });
}

main();

function extractText(obj) {
  try {
    return obj.response.candidates[0].content.parts[0].text;
  } catch (e) {
    return undefined;
  }
}

If you re-run the script, you should now have an output.html file in your directory with the final HTML! Does your output.html look like this?

<html>
  <body>
    <img src="cat.png" alt="This is a picture of a gray and white cat sleeping in the grass.">

    <img src="cat.png" alt="It's a cat.">

    <img src="cat.png" alt="It's a dog.">
  </body>
</html>

But wait a moment. Do you see a problem? Did you notice that our original input.html file had a surprising alt attribute on one of the images of the cat?

Perhaps you did not notice. It was incredibly subtle ;-)

That's right. One of the alt text attributes referred to the cat as a dog – outrageous.

So we skipped that alt text because we assumed it was accurate and didn't need updating. If only there were a way to verify whether an alt text is a good enough description of an image or not... of course, this is trivial for the Gemini API.

Let's rework the script to verify the existing alt text for each image and if the Gemini model doesn't feel it's a "good enough" description, then and only then will we replace it with Gemini's description.

6. Verify first

To make this easier, we're going to write another helper function: ‘askBoolean'.

For this function, we want to make it easy to ask Gemini a yes or no question about an image and return the response as a boolean return value.

What might that look like?

async function askBoolean(question, base64Image) {
  return await model
    .generateContent({
      generationConfig,
      safetySettings,
      contents: [
        {
          role: "user",
          parts: [
            { text: question + "\nAnswer me with either 'yes' or 'no'." },
            {
              inlineData: {
                mimeType: "image/png",
                data: base64Image,
              },
            },
          ],
        },
      ],
    })
    .then((result) => {
      return extractText(result, "text").toLowerCase().includes("yes");
    });
}

Again, this is a very simplified approach, but it highlights an important part of working with an LLM: sometimes we must make it clear to the LLM what format we want the response in. In this case, we're asking that the answer be, very specifically, either ‘yes' or ‘no'.

Funnily enough, the LLM might still disobey us, and we almost certainly cannot assume that it will only respond with ‘yes' or ‘no'. In many cases, the LLM will elaborate on why it has concluded ‘yes' or ‘no' and include that reasoning in its response.

For this reason, after extracting the text using our other helper function ‘extractText', we simply convert the whole response to lower case and return true if the word ‘yes' appears anywhere in it. Otherwise we return false.

It's a simplistic solution but it will do the job for this codelab.
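
If you wanted to make this check slightly more robust, one possible tweak (purely illustrative, and not part of the codelab's reference script) is to look only at the start of the reply, so that an answer like "No, although yes, it could be better" isn't misread as a positive:

// A slightly stricter check: only treat the reply as positive if it
// actually begins with "yes", ignoring leading whitespace and case.
// This is a sketch of an alternative, not part of the reference script.
function startsWithYes(replyText) {
  if (replyText === undefined) {
    return false;
  }
  return /^\s*yes\b/i.test(replyText);
}

You could then call startsWithYes(extractText(result)) inside askBoolean instead of using includes.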

How would we use this back in our script though?

Try to rework the script yourself to leverage this function before reading the final script in the next step.

7. Beauty is in the eye of the Gemini

Your final script should look something like this:

const cheerio = require("cheerio");
const fs = require("fs");
const { GoogleGenerativeAI } = require("@google/generative-ai");
const { join } = require("path");

const generationConfig = {
  temperature: 0.7,
  candidateCount: 1,
  topK: 40,
  topP: 0.95,
  maxOutputTokens: 1024,
};

const safetySettings = [
  {
    category: "HARM_CATEGORY_DANGEROUS_CONTENT",
    threshold: "BLOCK_NONE",
  },
];

const genAI = new GoogleGenerativeAI(process.env.API_KEY);

const model = genAI.getGenerativeModel({
  model: "gemini-pro-vision",
});

async function main() {
  const html = fs.readFileSync("input.html", "utf8");
  const $ = cheerio.load(html);

  const promises = [];

  $("img").each(function (i, element) {
    promises.push(
      new Promise(async (resolve, reject) => {
        const src = $(element).attr("src");
        const originalAlt = $(element).attr("alt");

        if (!fs.existsSync(src)) {
          console.log("Image doesn't exist:", src);
          resolve(false);
          return;
        }

        const base64Image = Buffer.from(fs.readFileSync(join(__dirname, src))).toString("base64");

        let answer = false;

        if (originalAlt !== undefined) {
          answer = await askBoolean(
            "Does this description describe the image well? " + originalAlt,
            base64Image
          );

          if (answer) {
            console.log("Alt text is already defined:", originalAlt);

            resolve(false);
            return;
          } else {
            console.log(
              "Alt text already exists but it's not good enough:",
              originalAlt
            );
          }
        }

        const geminiRead = await model
          .generateContent({
            generationConfig,
            safetySettings,
            contents: [
              {
                role: "user",
                parts: [
                  { text: "What is in this picture? " },
                  {
                    inlineData: {
                      mimeType: "image/png",
                      data: base64Image,
                    },
                  },
                ],
              },
            ],
          })
          .then((result) => {
            return extractText(result);
          });

        $(element).attr("alt", geminiRead.trim());

        console.log("Updated alt text for", src, "image:", geminiRead.trim());
        resolve(true);
      })
    );
  });

  Promise.all(promises).then((values) => {
    const updatedHTML = $.html();
    fs.writeFileSync("output.html", updatedHTML);
  });
}

main();

function extractText(obj) {
  try {
    return obj.response.candidates[0].content.parts[0].text;
  } catch (e) {
    return undefined;
  }
}

async function askBoolean(question, base64Image) {
  return await model
    .generateContent({
      generationConfig,
      safetySettings,
      contents: [
        {
          role: "user",
          parts: [
            { text: question + "\nAnswer me with either 'yes' or 'no'." },
            {
              inlineData: {
                mimeType: "image/png",
                data: base64Image,
              },
            },
          ],
        },
      ],
    })
    .then((result) => {
      return extractText(result, "text").toLowerCase().includes("yes");
    });
}

And now, re-run the script one last time. The output.html file should now look like this:

<html>
  <body>
    <img src="cat.png" alt="This is a picture of a gray and white cat sleeping in the grass.">

    <img src="cat.png" alt="It's a cat.">

    <img src="cat.png" alt="This is a picture of a gray and white cat sleeping in the grass.">
  </body>
</html>

The incorrect dog alt text is gone, replaced with Gemini's interpretation of the image.

Meanwhile the accurate existing alt text of "It's a cat." remains unchanged. Perfect.

8. Reflection

Congratulations on completing the codelab! You now have a solid understanding of how to query the Gemini API using multimodal prompts combining text and images with the NodeJS SDK.

Furthermore, you've worked through a practical real-world scenario: improving the accessibility of a webpage by querying its HTML with selectors, extracting the existing markup and asking the Gemini model for its opinion on it.

Where could this script go from here? An obvious next step would be to handle non-local image file references: if the src attribute doesn't match a local file on the system but does look like a URL, we could attempt to fetch the image resource from the internet and proceed from there.
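
As a rough sketch of what that could look like (assuming Node 18 or later, where the global fetch API is available; the helper name below is just illustrative):

// Hypothetical helper: load an image either from disk or, when the src
// looks like a URL, by fetching it over the network. Assumes Node 18+.
async function loadImageAsBase64(src) {
  if (fs.existsSync(src)) {
    return fs.readFileSync(join(__dirname, src)).toString("base64");
  }

  if (src.startsWith("http://") || src.startsWith("https://")) {
    const response = await fetch(src);
    const arrayBuffer = await response.arrayBuffer();
    return Buffer.from(arrayBuffer).toString("base64");
  }

  return undefined; // Neither a local file nor a URL we can fetch
}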

For now though, we have explored a good exercise with just that humble cat.