Attention: This MediaPipe Solutions Preview is an early release.

LLM Inference guide

The LLM Inference API lets you run large language models (LLMs) completely on-device, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your apps and products.

The task supports Gemma 2B, a part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It also supports the following external models: Phi-2, Falcon-RW-1B and StableLM-3B.

Get Started

Start using this task by following one of these implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, with code examples that use an available model and the recommended configuration options:

Web:
- Guide
- Code example
Android:
- Guide
- Code example
iOS:
- Guide
- Code example

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

The LLM Inference API contains the following key features:

Text-to-text generation - Generate text based on an input text prompt.
LLM selection - Apply multiple models to tailor the app for your specific use cases. You can also retrain and apply customized weights to the model.

Task inputs	Task outputs
The LLM Inference API accepts the following inputs: Text prompt (e.g., a question, an email subject, a document to be summarized)	The LLM Inference API outputs the following results: Generated text based on the input prompt (e.g., an answer to the question, an email draft, a summary of the document)

Configuration options

This task has the following configuration options:

Option Name	Description	Value Range	Default Value
`modelPath`	The path to where the model is stored within the project directory.	PATH	N/A
`maxTokens`	The maximum number of tokens (input tokens + output tokens) the model handles.	Integer	512
`topK`	The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting `topK`, you must also set a value for `randomSeed`.	Integer	40
`temperature`	The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting `temperature`, you must also set a value for `randomSeed`.	Float	0.8
`randomSeed`	The random seed used during text generation.	Integer	0
`resultListener`	Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method.	N/A	N/A
`errorListener`	Sets an optional error listener.	N/A	N/A
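
To make the interaction between `topK`, `temperature`, and `randomSeed` more concrete, the following is a minimal Python sketch of top-k sampling with temperature. It illustrates the general technique only and is not the LLM Inference API's implementation; the function name `sample_next_token` and the toy scores are assumptions made for this example.

```python
import math
import random


def sample_next_token(logits, top_k=40, temperature=0.8, rng=None):
    """Pick one token id from `logits` using top-k sampling with temperature.

    Hypothetical helper for illustration; not part of MediaPipe.
    """
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if rng is None:
        rng = random.Random(0)  # mirrors the documented default seed of 0

    # Keep only the top_k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]

    # A non-positive temperature is treated as greedy decoding in this sketch.
    if temperature <= 0:
        return candidates[0]

    # Softmax over the temperature-scaled scores (shifted for stability).
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

In this sketch a lower temperature concentrates the weights on the highest-scoring candidates, a smaller `top_k` shrinks the candidate set, and reusing the same seed reproduces the same sequence of draws.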

Models

The LLM Inference API contains built-in support for several text-to-text large language models that are optimized to run on browsers and mobile devices. These lightweight models can be downloaded to run inferences completely on-device.

Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory.

Gemma 2B

Gemma 2B is a part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The model contains 2B parameters and open weights. This model is well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.

Download Gemma 2B

The Gemma 2B models come in four variants:

gemma-2b-it-cpu-int4: Gemma 4-bit model with CPU compatibility.
gemma-2b-it-cpu-int8: Gemma 8-bit model with CPU compatibility.
gemma-2b-it-gpu-int4: Gemma 4-bit model with GPU compatibility.
gemma-2b-it-gpu-int8: Gemma 8-bit model with GPU compatibility.
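
The four variant names above follow a regular pattern of backend and quantization width. As a small sketch, a hypothetical helper (`gemma_variant` is not part of MediaPipe) could build the name from those two choices:

```python
def gemma_variant(backend, bits):
    """Return the Gemma 2B variant name for a backend and bit width.

    Hypothetical helper; it only reproduces the four names listed above.
    """
    if backend not in ("cpu", "gpu"):
        raise ValueError("backend must be 'cpu' or 'gpu'")
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    return f"gemma-2b-it-{backend}-int{bits}"
```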

You can also tune the model and add new weights before adding it to the app. For more information on tuning and customizing Gemma 2B, see Tuning Gemma 2B. Gemma 2B models downloaded from Kaggle Models are already in the appropriate format to use with MediaPipe.

If you download Gemma 2B from Hugging Face, you must convert the model to a MediaPipe-friendly format. The LLM Inference API requires the following files to be downloaded and converted:

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
tokenizer.json
tokenizer_config.json

Falcon 1B

Falcon-RW-1B is a 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.

Download Falcon 1B

The LLM Inference API requires the following files to be downloaded and stored locally:

tokenizer.json
tokenizer_config.json
pytorch_model.bin

After downloading the Falcon model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.

StableLM 3B

StableLM-3B is a 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets for 4 epochs.

Download StableLM 3B

The LLM Inference API requires the following files to be downloaded and stored locally:

tokenizer.json
tokenizer_config.json
model.safetensors

After downloading the StableLM model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.

Phi-2

Phi-2 is a 2.7 billion parameter Transformer model. It was trained using various NLP synthetic texts and filtered websites. The model is best suited for prompts using the Question-Answer, chat, and code formats.

Download Phi-2

The LLM Inference API requires the following files to be downloaded and stored locally:

tokenizer.json
tokenizer_config.json
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors

After downloading the Phi-2 model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.
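
Before starting a conversion, it can help to confirm that every required file listed in the sections above is present in the download directory. The following is a minimal sketch; the `REQUIRED_FILES` mapping and the `missing_files` helper are assumptions made for this example, with the file names copied from the lists above.

```python
from pathlib import Path

# Required files per model, copied from the download sections above.
# The dictionary keys are illustrative labels, not MediaPipe identifiers.
REQUIRED_FILES = {
    "Gemma 2B (Hugging Face)": [
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
        "tokenizer.json",
        "tokenizer_config.json",
    ],
    "Falcon 1B": ["tokenizer.json", "tokenizer_config.json", "pytorch_model.bin"],
    "StableLM 3B": ["tokenizer.json", "tokenizer_config.json", "model.safetensors"],
    "Phi-2": [
        "tokenizer.json",
        "tokenizer_config.json",
        "model-00001-of-00002.safetensors",
        "model-00002-of-00002.safetensors",
    ],
}


def missing_files(model_name, model_dir):
    """Return the required file names that are not found in `model_dir`."""
    present = {p.name for p in Path(model_dir).iterdir() if p.is_file()}
    return [name for name in REQUIRED_FILES[model_name] if name not in present]
```

An empty list from `missing_files` means the directory contains every file named above for that model.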

Convert model to MediaPipe format

If you are using an external LLM (Phi-2, Falcon, or StableLM) or a non-Kaggle version of Gemma, use our conversion scripts to format the model to be compatible with MediaPipe.

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

# Import the converter that ships with the MediaPipe PyPI package.
from mediapipe.tasks.python.genai import converter

# Replace each upper-case placeholder with a value from the table below.
config = converter.ConversionConfig(
  input_ckpt=INPUT_CKPT,
  ckpt_format=CKPT_FORMAT,
  model_type=MODEL_TYPE,
  backend=BACKEND,
  output_dir=OUTPUT_DIR,
  combine_file_only=False,
  vocab_model_file=VOCAB_MODEL_FILE,
  output_tflite_file=OUTPUT_TFLITE_FILE,
)

# Run the conversion and write the output files.
converter.convert_checkpoint(config)

Parameter	Description	Accepted Values
`input_ckpt`	The path to the `model.safetensors` or `pytorch_model.bin` file. Safetensors checkpoints are sometimes sharded into multiple files, e.g. `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`. In that case you can specify a file pattern, like `model*.safetensors`.	PATH
`ckpt_format`	The model file format.	{"safetensors", "pytorch"}
`model_type`	The LLM being converted.	{"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}
`backend`	The processor (delegate) used to run the model.	{"cpu", "gpu"}
`output_dir`	The path to the output directory that hosts the per-layer weight files.	PATH
`output_tflite_file`	The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file.	PATH
`vocab_model_file`	The path to the directory that stores the `tokenizer.json` and `tokenizer_config.json` files. For Gemma, point to the single `tokenizer.model` file.	PATH
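
As a worked example of filling in these parameters, the sketch below assembles the keyword arguments for `converter.ConversionConfig` from a model directory. The helper names `build_conversion_kwargs` and `convert`, the directory layout, and the output file naming are assumptions made for illustration; only the parameter names and accepted values come from the table above.

```python
def build_conversion_kwargs(model_dir, ckpt_name, model_type, backend, output_dir):
    """Assemble keyword arguments for converter.ConversionConfig.

    Hypothetical helper. `ckpt_name` may be a single file name or a
    pattern such as "model*.safetensors" for sharded checkpoints.
    """
    # Derive ckpt_format from the checkpoint file extension.
    if ckpt_name.endswith(".safetensors"):
        ckpt_format = "safetensors"
    elif ckpt_name.endswith(".bin"):
        ckpt_format = "pytorch"
    else:
        raise ValueError(f"unrecognized checkpoint file: {ckpt_name}")
    if backend not in ("cpu", "gpu"):
        raise ValueError("backend must be 'cpu' or 'gpu'")

    # Gemma points at the single tokenizer.model file; the other models
    # point at the directory holding the tokenizer JSON files.
    if model_type == "GEMMA_2B":
        vocab_model_file = f"{model_dir}/tokenizer.model"
    else:
        vocab_model_file = model_dir

    return {
        "input_ckpt": f"{model_dir}/{ckpt_name}",
        "ckpt_format": ckpt_format,
        "model_type": model_type,
        "backend": backend,
        "output_dir": output_dir,
        "combine_file_only": False,
        "vocab_model_file": vocab_model_file,
        # Assumed naming, following the "model_cpu.bin" example above.
        "output_tflite_file": f"{output_dir}/model_{backend}.bin",
    }


def convert(kwargs):
    """Run the conversion; requires the MediaPipe PyPI package."""
    # Imported lazily so the helper above works without MediaPipe installed.
    from mediapipe.tasks.python.genai import converter

    config = converter.ConversionConfig(**kwargs)
    converter.convert_checkpoint(config)
```

For instance, `build_conversion_kwargs("phi-2", "model*.safetensors", "PHI_2", "gpu", "out")` produces a configuration for a sharded Phi-2 checkpoint, which could then be passed to `convert`.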