LLM Inference guide for Android

The LLM Inference API lets you run large language models (LLMs) completely on-device for Android applications, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your Android apps.

The task supports Gemma 2B, part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It also supports the following external models: Phi-2, Falcon-RW-1B, and StableLM-3B.

You can see this task in action with the MediaPipe Studio demo. For more information about the capabilities, models, and configuration options of this task, see the Overview.

Code example

This guide refers to an example of a basic text generation app for Android. You can use the app as a starting point for your own Android app, or refer to it when modifying an existing app. The example code is hosted on GitHub.

Download the code

The following instructions show you how to create a local copy of the example code using the git command line tool.

To download the example code:

  1. Clone the git repository using the following command:
    git clone https://github.com/googlesamples/mediapipe
    
  2. Optionally, configure your git instance to use sparse checkout, so you have only the files for the LLM Inference API example app:
    cd mediapipe
    git sparse-checkout init --cone
    git sparse-checkout set examples/llm_inference/android
    

After creating a local version of the example code, you can import the project into Android Studio and run the app. For instructions, see the Setup Guide for Android.

Setup

This section describes key steps for setting up your development environment and code projects specifically to use the LLM Inference API. For general information on setting up your development environment for using MediaPipe tasks, including platform version requirements, see the Setup guide for Android.

Dependencies

The LLM Inference API uses the com.google.mediapipe:tasks-genai library. Add this dependency to the build.gradle file of your Android app:

dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}
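
If your app's build script uses the Kotlin DSL (build.gradle.kts) instead of Groovy, the equivalent dependency declaration is:

dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.11")
}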

Model

The MediaPipe LLM Inference API requires a trained text-to-text language model that is compatible with this task. After downloading a model, install the required dependencies and push the model to the Android device. If you are using a model other than Gemma, you will have to convert the model to a format compatible with MediaPipe.

For more information on available trained models for LLM Inference API, see the task overview Models section.

Download a model

Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory:

  • Gemma 2B: Part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.
  • Phi-2: 2.7 billion parameter Transformer model, best suited for Question-Answer, chat, and code formats.
  • Falcon-RW-1B: 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.
  • StableLM-3B: 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets.

We recommend using Gemma 2B, which is available on Kaggle Models and comes in a format that is already compatible with the LLM Inference API. If you use another LLM, you will need to convert the model to a MediaPipe-friendly format. For more information on Gemma 2B, see the Gemma site. For more information on the other available models, see the task overview Models section.

Convert model to MediaPipe format

If you are using an external LLM (Phi-2, Falcon, or StableLM) or a non-Kaggle version of Gemma, use our conversion scripts to format the model to be compatible with MediaPipe.

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install and import the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
  input_ckpt=INPUT_CKPT,
  ckpt_format=CKPT_FORMAT,
  model_type=MODEL_TYPE,
  backend=BACKEND,
  output_dir=OUTPUT_DIR,
  combine_file_only=False,
  vocab_model_file=VOCAB_MODEL_FILE,
  output_tflite_file=OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

The conversion script accepts the following parameters:

  • input_ckpt: The path to the model.safetensors or pytorch.bin file. Note that the model safetensors files are sometimes sharded into multiple files, for example model-00001-of-00003.safetensors, model-00002-of-00003.safetensors. You can specify a file pattern, like model*.safetensors. Accepted values: PATH.
  • ckpt_format: The model file format. Accepted values: {"safetensors", "pytorch"}.
  • model_type: The LLM being converted. Accepted values: {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}.
  • backend: The processor (delegate) used to run the model. Accepted values: {"cpu", "gpu"}.
  • output_dir: The path to the output directory that hosts the per-layer weight files. Accepted values: PATH.
  • output_tflite_file: The path to the output file, for example "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file. Accepted values: PATH.
  • vocab_model_file: The path to the directory that stores the tokenizer.json and tokenizer_config.json files. For Gemma, point to the single tokenizer.model file. Accepted values: PATH.
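
As an illustration, a filled-in configuration for converting a Phi-2 safetensors checkpoint to run on GPU might look like the following. The local paths are hypothetical placeholders; substitute the locations where you downloaded the checkpoint and tokenizer files.

from mediapipe.tasks.python.genai import converter

# Hypothetical paths; adjust to wherever the Phi-2 checkpoint and tokenizer live.
config = converter.ConversionConfig(
  input_ckpt="/content/phi-2/model*.safetensors",     # pattern matching sharded safetensors files
  ckpt_format="safetensors",
  model_type="PHI_2",
  backend="gpu",
  output_dir="/content/phi-2/intermediate/",          # per-layer weight files are written here
  combine_file_only=False,
  vocab_model_file="/content/phi-2/",                 # directory with tokenizer.json and tokenizer_config.json
  output_tflite_file="/content/phi-2/model_gpu.bin",
)

converter.convert_checkpoint(config)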

Push model to the device

Push the content of the output_path folder to the Android device.

$ adb shell rm -r /data/local/tmp/llm/ # Remove any previously loaded models
$ adb shell mkdir -p /data/local/tmp/llm/
$ adb push output_path /data/local/tmp/llm/model_version.bin
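
To confirm the model file is in place, you can list the directory on the device:

$ adb shell ls /data/local/tmp/llm/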

Create the task

The MediaPipe LLM Inference API uses the createFromOptions() function to set up the task. The createFromOptions() function accepts values for the configuration options. For more information on configuration options, see Configuration options.

The following code initializes the task using basic configuration options:

// Set the configuration options for the LLM Inference task
val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/.../")
        .setMaxTokens(1000)
        .setTopK(40)
        .setTemperature(0.8f)
        .setRandomSeed(101)
        .build()

// Create an instance of the LLM Inference task
llmInference = LlmInference.createFromOptions(context, options)

Configuration options

Use the following configuration options to set up an Android app:

  • modelPath: The path to where the model is stored within the project directory. Value range: PATH. Default value: N/A.
  • maxTokens: The maximum number of tokens (input tokens + output tokens) the model handles. Value range: Integer. Default value: 512.
  • topK: The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting topK, you must also set a value for randomSeed. Value range: Integer. Default value: 40.
  • temperature: The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting temperature, you must also set a value for randomSeed. Value range: Float. Default value: 0.8.
  • randomSeed: The random seed used during text generation. Value range: Integer. Default value: 0.
  • resultListener: Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method. Value range: N/A. Default value: N/A.
  • errorListener: Sets an optional error listener. Value range: N/A. Default value: N/A.

Prepare data

The LLM Inference API accepts the following inputs:

  • prompt (string): A question or prompt.
val inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday."

Run the task

Use the generateResponse() method to generate a text response to the input text provided in the previous section (inputPrompt). This produces a single generated response.

val result = llmInference.generateResponse(inputPrompt)
logger.atInfo().log("result: $result")

To stream the response as it is generated, set a resultListener when building the options and use the generateResponseAsync() method.

val options = LlmInference.LlmInferenceOptions.builder()
  ...
  .setResultListener { partialResult, done ->
    logger.atInfo().log("partial result: $partialResult")
  }
  .build()

// Create the task from the options that carry the result listener, then generate asynchronously
llmInference = LlmInference.createFromOptions(context, options)
llmInference.generateResponseAsync(inputPrompt)

Handle and display results

The LLM Inference API returns a LlmInferenceResult, which includes the generated response text. For example, the prompt from the previous section might produce the following response:

Here's a draft you can use:

Subject: Lunch on Saturday Reminder

Hi Brett,

Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.

Looking forward to it!

Best,
[Your Name]
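
When you stream results with generateResponseAsync(), you typically forward each partial result to your UI. The following is a minimal sketch, assuming an Activity with a TextView named responseView and a model pushed to the device path used earlier; the view name is illustrative, and because the listener may be invoked off the main thread, UI updates are posted with runOnUiThread.

// Sketch: append streamed partial results to a TextView (responseView is assumed to exist in your layout)
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/llm/model_version.bin")
    .setResultListener { partialResult, done ->
        // The listener may run off the main thread (assumption), so post UI updates to it
        runOnUiThread {
            responseView.append(partialResult)
            if (done) {
                responseView.append("\n")
            }
        }
    }
    .build()

val llmInference = LlmInference.createFromOptions(this, options)
llmInference.generateResponseAsync(inputPrompt)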