Attention: This MediaPipe Solutions Preview is an early release.

LLM Inference guide for Web

The LLM Inference API lets you run large language models (LLMs) completely in the browser for web applications. You can use it to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your web apps.

You can see this task in action with the MediaPipe Studio demo. For more information about the capabilities, models, and configuration options of this task, see the Overview.

Code example

The example application for the LLM Inference API provides a basic implementation of this task in JavaScript for your reference. You can use this sample app to get started building your own text generation app.

You can access the LLM Inference API example app on GitHub.

Setup

This section describes key steps for setting up your development environment and code projects specifically to use the LLM Inference API. For general information on setting up your development environment for using MediaPipe Tasks, including platform version requirements, see the Setup guide for Web.

Browser compatibility

The LLM Inference API requires a web browser with WebGPU compatibility. For a full list of compatible browsers, see GPU browser compatibility.
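
Because the task depends on WebGPU, it can help to check for it before downloading a multi-gigabyte model. The following is a minimal sketch, assuming that the presence of the standard navigator.gpu entry point is a good enough first test. The helper name supportsWebGpu is hypothetical and is not part of the MediaPipe API.

```javascript
// Hypothetical helper (not part of MediaPipe): returns true when the given
// navigator-like object exposes the standard WebGPU entry point `gpu`.
function supportsWebGpu(nav) {
  return Boolean(nav) && typeof nav === 'object' && 'gpu' in nav && Boolean(nav.gpu);
}

// In a browser, pass the global navigator. Outside a browser it may not exist.
const nav = typeof navigator !== 'undefined' ? navigator : undefined;
if (!supportsWebGpu(nav)) {
  console.warn('WebGPU was not detected; the LLM Inference API needs a WebGPU-capable browser.');
}
```

This check only confirms that the API surface exists. A browser can expose navigator.gpu and still fail to provide a usable GPU adapter, so treat a positive result as a hint rather than a guarantee.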

JavaScript packages

LLM Inference API code is available through the @mediapipe/tasks-genai package. You can find and download these libraries from links provided in the platform Setup guide.

Install the required packages for local staging:

npm install @mediapipe/tasks-genai

To deploy to a server, use a content delivery network (CDN) service like jsDelivr to add code directly to your HTML page:

<head>
  <script src="https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/genai_bundle.cjs"
    crossorigin="anonymous"></script>
</head>

Model

The MediaPipe LLM Inference API requires a trained model that is compatible with this task. For web applications, the model must be GPU-compatible.

For more information on available trained models for LLM Inference API, see the task overview Models section.

Download a model

Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory:

Gemma 2B: Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
Phi-2: 2.7 billion parameter Transformer model, best suited for question-answer, chat, and code formats.
Falcon-RW-1B: 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.
StableLM-3B: 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets.

We recommend using Gemma 2B, which is available on Kaggle Models and comes in a format that is already compatible with the LLM Inference API. If you use another LLM, you will need to convert the model to a MediaPipe-friendly format. For more information on Gemma 2B, see the Gemma site. For more information on the other available models, see the task overview Models section.

Convert model to MediaPipe format

If you are using an external LLM (Phi-2, Falcon, or StableLM) or a non-Kaggle version of Gemma, use our conversion scripts to format the model to be compatible with MediaPipe.

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
  input_ckpt=INPUT_CKPT,
  ckpt_format=CKPT_FORMAT,
  model_type=MODEL_TYPE,
  backend=BACKEND,
  output_dir=OUTPUT_DIR,
  combine_file_only=False,
  vocab_model_file=VOCAB_MODEL_FILE,
  output_tflite_file=OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

Parameter	Description	Accepted Values
`input_ckpt`	The path to the `model.safetensors` or `pytorch.bin` file. Note that safetensors models are sometimes sharded into multiple files, e.g. `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`. You can specify a file pattern, like `model*.safetensors`.	PATH
`ckpt_format`	The model file format.	{"safetensors", "pytorch"}
`model_type`	The LLM being converted.	{"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}
`backend`	The processor (delegate) used to run the model.	{"cpu", "gpu"}
`output_dir`	The path to the output directory that hosts the per-layer weight files.	PATH
`output_tflite_file`	The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file.	PATH
`vocab_model_file`	The path to the directory that stores the `tokenizer.json` and `tokenizer_config.json` files. For Gemma, point to the single `tokenizer.model` file.	PATH

Add model to project directory

Store the model within your project directory:

<dev-project-root>/assets/gemma-2b-it-gpu-int4.bin

Specify the path of the model with the baseOptions object modelAssetPath parameter:

baseOptions: { modelAssetPath: `/assets/gemma-2b-it-gpu-int4.bin`}

Create the task

Use one of the LLM Inference API createFrom...() functions to prepare the task for running inferences. You can use the createFromModelPath() function with a relative or absolute path to the trained model file. The code example uses the createFromOptions() function. For more information on the available configuration options, see Configuration options.

The following code demonstrates how to build and configure this task:

// Import the task classes from the package installed in the Setup section.
import {FilesetResolver, LlmInference} from '@mediapipe/tasks-genai';

const genai = await FilesetResolver.forGenAiTasks(
    // path/to/wasm/root
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@latest/wasm"
);
const llmInference = await LlmInference.createFromOptions(genai, {
    baseOptions: {
        modelAssetPath: '/assets/gemma-2b-it-gpu-int4.bin'
    },
    maxTokens: 1000,
    topK: 40,
    temperature: 0.8,
    randomSeed: 101
});

Configuration options

This task has the following configuration options for Web and JavaScript applications:

Option Name	Description	Value Range	Default Value
`maxTokens`	The maximum number of tokens (input tokens + output tokens) the model handles.	Integer	512
`topK`	The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting `topK`, you must also set a value for `randomSeed`.	Integer	1
`temperature`	The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting `temperature`, you must also set a value for `randomSeed`.	Float	1.0
`randomSeed`	The random seed used during text generation.	Integer	undefined
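
The table states that topK and temperature both require a randomSeed. One way to catch a missing seed early is to build the options object through a small validating helper, sketched below. The helper name buildLlmOptions and its error messages are hypothetical. Options that are not passed are simply left out, on the assumption that the task then applies the defaults listed above.

```javascript
// Hypothetical helper (not part of MediaPipe): assembles an options object
// for createFromOptions() and enforces the rules from the table above.
function buildLlmOptions(modelAssetPath, overrides = {}) {
  // Documented rule: topK and temperature both require a randomSeed.
  const needsSeed = 'topK' in overrides || 'temperature' in overrides;
  if (needsSeed && overrides.randomSeed === undefined) {
    throw new Error('Set randomSeed when setting topK or temperature.');
  }
  // maxTokens, topK, and randomSeed are listed as integers.
  for (const key of ['maxTokens', 'topK', 'randomSeed']) {
    if (key in overrides && !Number.isInteger(overrides[key])) {
      throw new TypeError(`${key} must be an integer.`);
    }
  }
  // Unset options are omitted so the task can fall back to its defaults.
  return { baseOptions: { modelAssetPath }, ...overrides };
}

const options = buildLlmOptions('/assets/gemma-2b-it-gpu-int4.bin', {
  maxTokens: 1000,
  topK: 40,
  temperature: 0.8,
  randomSeed: 101,
});
// In a page: const llmInference = await LlmInference.createFromOptions(genai, options);
```

Validating up front keeps configuration mistakes out of the model-loading path, where an error is slower to surface.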

Prepare data

The LLM Inference API accepts text (string) data. The task handles the data input preprocessing, including tokenization and tensor preprocessing.

All preprocessing is handled within the generateResponse() function. There is no need for additional preprocessing of the input text.

const inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday.";

Run the task

The LLM Inference API uses the generateResponse() function to trigger inferences. For text generation, this means returning the generated response text for the input prompt.

The following code demonstrates how to execute the processing with the task model.

const response = await llmInference.generateResponse(inputPrompt);
document.getElementById('output').textContent = response;

To stream the response, use the following:

llmInference.generateResponse(
  inputPrompt,
  (partialResult, done) => {
    document.getElementById('output').textContent += partialResult;
  }
);
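
The streaming callback above appends each chunk directly to the page. A variation is to buffer the chunks in memory and hand the full text so far to a render function, which also makes it easy to react when done becomes true. The sketch below assumes the (partialResult, done) callback shape shown above. The helper createStreamListener and the fakeLlm stand-in are hypothetical and exist only so the example runs outside a browser.

```javascript
// Hypothetical helper (not part of MediaPipe): builds a listener for
// generateResponse() that accumulates chunks and passes the text so far,
// plus the done flag, to a render callback.
function createStreamListener(render) {
  let text = '';
  return (partialResult, done) => {
    text += partialResult;
    render(text, done);
  };
}

// Stand-in for llmInference (an assumption for illustration only): it emits
// three chunks and sets done to true on the last one.
const fakeLlm = {
  generateResponse(prompt, listener) {
    const chunks = ['Hi ', 'Brett', '!'];
    chunks.forEach((chunk, i) => listener(chunk, i === chunks.length - 1));
  },
};

let latest = '';
let finished = false;
fakeLlm.generateResponse('Say hi to Brett.', createStreamListener((text, done) => {
  latest = text; // in a page: document.getElementById('output').textContent = text;
  finished = done;
}));
console.log(latest, finished);
```

With the real task, pass the listener as the second argument to llmInference.generateResponse() in place of fakeLlm.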

Handle and display results

The LLM Inference API returns a string, which includes the generated response text.

Here's a draft you can use:

Subject: Lunch on Saturday Reminder

Hi Brett,

Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.

Looking forward to it!

Best,
[Your Name]