On-Device Image Segmentation in C++ with LiteRT

1. Before you begin

Typing out code is a great way to build muscle memory and deepen your understanding of the material. While copy-pasting can be a time-saver, investing in this practice can lead to greater efficiency and stronger coding skills in the long run.

In this codelab, you'll learn how to build a C++ image segmentation binary that runs directly on an Android device using Google's high-performance on-device runtime, LiteRT. Instead of using Kotlin or Android Studio, this codelab focuses on building a C++ binary. You will cross-compile it with CMake or Bazel and deploy it using ADB. The same LiteRT C++ API works on any platform (Android, Linux, embedded), making this a useful foundation for performance-critical applications, robotics, and edge systems.

You will walk through the full pipeline:

  • Setting up the build environment (CMake + Android NDK or Bazel).
  • Linking the LiteRT C++ SDK — either from a prebuilt release or from source.
  • Using OpenGL ES compute shaders for GPU-accelerated image pre- and post-processing.
  • Running the selfie_multiclass segmentation model with the LiteRT C++ API.
  • Accelerating inference on CPU, GPU (OpenCL), and NPU (Qualcomm / MediaTek).
  • Postprocessing the raw model output into a color-blended segmentation image.
  • Deploying to a physical Android device with ADB and retrieving the result.

In the end, you'll produce something similar to the following image — a static image processed through the full pipeline, with each of the 6 segmentation classes overlaid in a distinct color:

Segmentation output: a person with semi-transparent colored masks over hair, skin, background, and clothes

Prerequisites

This codelab is designed for developers comfortable with C++ who want to gain experience running machine learning models on Android at the C++ layer. You should be familiar with:

  • C++ fundamentals (pointers, vectors, includes).
  • Basic Android/ADB concepts (adb push, adb shell).
  • Using a terminal and shell scripts on Linux or macOS.

What you'll learn

  • How to cross-compile a C++ binary for Android arm64-v8a with CMake + NDK or Bazel.
  • How to use the LiteRT C++ API (Environment, CompiledModel, TensorBuffer) for efficient on-device inference.
  • How OpenGL ES 3.1 compute shaders accelerate pre- and post-processing entirely on the GPU.
  • How to configure LiteRT for CPU, GPU (OpenCL), and NPU (Qualcomm HTP, MediaTek APU, Google Tensor) acceleration.
  • The difference between synchronous (Run) and asynchronous (RunAsync) inference.
  • How to deploy and run a C++ binary on Android using ADB.

What you'll need

  • A Linux or macOS machine (Windows users should use WSL2).
  • Android NDK r25c or later (download).
  • For CMake path: CMake ≥ 3.22 (sudo apt-get install cmake).
  • For Bazel path: Bazel installed, plus the full LiteRT samples repository.
  • ADB in your PATH (Android Platform Tools).
  • A physical Android device — best tested on Galaxy S24/S25 or Pixel.

2. Image Segmentation

Image segmentation is a computer vision task that assigns a class label to every pixel in an image. Unlike object detection, which draws a bounding box, segmentation produces a precise, pixel-perfect understanding of where each object begins and ends.

This codelab uses the selfie_multiclass_256x256 model, which classifies each pixel into one of 6 classes:

Class index   Segment
-----------   -------------------------------------
0             Background
1             Hair
2             Body skin
3             Face skin
4             Clothes
5             Accessories (glasses, jewelry, etc.)

The model outputs a float tensor of shape [1, 256, 256, 6]. For each of the 256×256 pixels, there are 6 confidence scores — one per class. The class with the highest score wins that pixel (argmax).
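The argmax step can be written in a few lines of plain C++. The sketch below is a reference implementation of that idea for the tensor layout described above; the function name and signature are illustrative, not part of the sample:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collapse a [height, width, num_classes] interleaved score tensor into a
// per-pixel class-index map by taking the argmax over the class dimension.
std::vector<int> ArgmaxPerPixel(const std::vector<float>& scores,
                                int height, int width, int num_classes) {
  std::vector<int> class_map(static_cast<size_t>(height) * width);
  for (int p = 0; p < height * width; ++p) {
    const float* s = scores.data() + static_cast<size_t>(p) * num_classes;
    int best = 0;
    for (int c = 1; c < num_classes; ++c) {
      if (s[c] > s[best]) best = c;  // highest score wins the pixel
    }
    class_map[p] = best;  // e.g. 1 == Hair for selfie_multiclass
  }
  return class_map;
}
```

For the real model, height, width, and num_classes would be 256, 256, and 6; in the codelab this argmax runs on the GPU inside the blend shader rather than on the CPU.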

LiteRT: Performance at the Edge

LiteRT is Google's next-generation, high-performance runtime for TFLite models. Its C++ API gives you direct, low-overhead access to hardware accelerators with a consistent interface across all three:

  • CPU — universally compatible; ~128 ms inference on a mid-range device.
  • GPU (OpenCL) — ~1 ms inference; ~17–43 ms end-to-end depending on the buffer strategy.
  • NPU — ~9–28 ms end-to-end on Qualcomm Snapdragon, MediaTek Dimensity 9400, and Google Tensor devices, depending on AOT versus JIT compilation.

The key abstraction is CompiledModel: the model is pre-compiled and optimized for the target hardware at load time, reducing inference to a Run() call on pre-allocated buffers.

3. Get set up

Clone the repository

git clone https://github.com/google-ai-edge/litert-samples.git

All resources for this codelab are in:

litert-samples/compiled_model_api/image_segmentation/c++_segmentation/

This directory has two sub-projects, each a complete build of the same sample:

Directory              Build system          LiteRT dependency
---------              ------------          -----------------
use_prebuilt_litert/   CMake + Android NDK   Prebuilt litert_cc_sdk.zip + libLiteRt.so
build_from_source/     Bazel                 Compiles LiteRT from source

Choose one path and follow it. The code is identical between the two directories — only the build system and dependency strategy differ. If you want the fastest setup, choose use_prebuilt_litert/. If you need to modify LiteRT itself or work within an existing Bazel monorepo, use build_from_source/.

A note on file paths

All file paths in this tutorial use Linux/macOS format. Windows users should use WSL2.

Directory overview

Both sub-projects share the same source layout:

<variant>/
├── main_cpu.cc              # CPU inference entry point
├── main_gpu.cc              # GPU (OpenCL) inference entry point
├── main_npu.cc              # NPU (Qualcomm / MediaTek) entry point
├── image_processor.h/.cc    # OpenGL ES preprocessing and postprocessing
├── image_utils.h/.cc        # STB-based image load / save utilities
├── timing_utils.h/.cc       # Profiling helpers
├── shaders/                 # GLSL ES 3.1 compute shaders
│   ├── preprocess_compute.glsl
│   ├── resize_compute.glsl
│   ├── mask_blend_compute.glsl
│   ├── deinterleave_masks.glsl
│   └── passthrough_shader.vert
├── models/
│   ├── selfie_multiclass_256x256.tflite        (CPU / GPU / NPU JIT)
│   ├── selfie_multiclass_256x256_SM8650.tflite (Qualcomm S24 AOT)
│   └── selfie_multiclass_256x256_SM8750.tflite (Qualcomm S25 AOT)
└── test_images/
    └── image.jpeg

Additionally:

  • use_prebuilt_litert/ adds CMakeLists.txt, build_prebuilt.sh, deploy_and_run_on_android.sh, and third_party/stb/.
  • build_from_source/ adds a Bazel BUILD file and uses deploy_and_run_on_android.sh pointing at bazel-bin/.

Terminal showing the use_prebuilt_litert directory tree

4. Understand the project structure

Three entry points, one pipeline

main_cpu.cc, main_gpu.cc, and main_npu.cc each contain a main() function that drives the full segmentation pipeline. The pipeline is identical across all three; only the LiteRT accelerator configuration and buffer strategy differ:

File          Accelerator   Buffer strategy
----          -----------   ---------------
main_cpu.cc   kCpu          CPU memory
main_gpu.cc   kGpu | kCpu   CPU memory with OpenCL backend
main_npu.cc   kNpu | kCpu   CPU memory with CPU fallback

All three share the same ImageProcessor (OpenGL ES compute shaders for preprocessing and postprocessing) and ImageUtils (STB image I/O) utilities.

The full pipeline

Every entry point follows the same six-step structure:

Load → GPU upload → Preprocess (shader) → Inference (LiteRT) → Postprocess (shader) → Save

  1. Load — ImageUtils::LoadImage() decodes the JPEG into CPU memory using the STB image library.
  2. Upload — processor.CreateOpenGLTexture() uploads the raw pixels to a GPU texture (OpenGL RGBA8).
  3. Preprocess — processor.PreprocessInputForSegmentation() runs a GLSL compute shader that resizes the texture to 256×256 and normalizes pixel values from [0, 1] to [-1, 1]. The result lands in a GPU SSBO.
  4. Infer — SSBO data is written into a LiteRT TensorBuffer and compiled_model.Run() (or RunAsync()) executes the model.
  5. Postprocess — The model's 6-channel float output is deinterleaved into 6 single-channel mask SSBOs, which are then color-blended back onto the original image.
  6. Save — ImageUtils::SaveImage() writes the final RGBA image as PNG.

5. Core LiteRT C++ APIs

Before building, familiarize yourself with the three key LiteRT C++ types used across all entry points. All live in the litert:: namespace.

litert::Environment

The Environment is the root context for all LiteRT operations. Create it once and pass it to CompiledModel::Create. For NPU usage, configure it with the vendor plugin library directory.

// For CPU or GPU - no extra options needed
LITERT_ASSIGN_OR_ABORT(auto env, litert::Environment::Create({}));

// For NPU: point at the vendor dispatch library directory on the device
std::vector<litert::Environment::Option> opts;
opts.push_back({litert::Environment::OptionTag::DispatchLibraryDir,
                "/data/local/tmp/cpp_segmentation_android/npu/"});
LITERT_ASSIGN_OR_ABORT(auto env,
    litert::Environment::Create(std::move(opts)));

litert::CompiledModel

CompiledModel loads and pre-compiles your TFLite model for the requested hardware at construction time. Inference then reduces to filling buffers and calling Run().

// CPU
LITERT_ASSIGN_OR_ABORT(auto model,
    litert::CompiledModel::Create(env, model_path,
                                  litert::HwAccelerators::kCpu));

// GPU (pass an Options object with GpuOptions configured)
LITERT_ASSIGN_OR_ABORT(auto model,
    litert::CompiledModel::Create(env, model_path, gpu_options));

// NPU (pass an Options object with kNpu | kCpu and vendor options)
LITERT_ASSIGN_OR_ABORT(auto model,
    litert::CompiledModel::Create(env, model_path, npu_options));

litert::TensorBuffer

Tensor buffers hold input/output data. Always create them from the CompiledModel so they are correctly sized and aligned for the target hardware.

LITERT_ASSIGN_OR_ABORT(auto input_buffers,
                       compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_ABORT(auto output_buffers,
                       compiled_model.CreateOutputBuffers());

// Write preprocessed float data, run, read results
LITERT_ABORT_IF_ERROR(
    input_buffers[0].Write(absl::MakeConstSpan(preprocessed_data)));
LITERT_ABORT_IF_ERROR(
    compiled_model.Run(input_buffers, output_buffers));
LITERT_ABORT_IF_ERROR(
    output_buffers[0].Read(absl::MakeSpan(output_data)));

Error-handling macros

Macro                                Behavior
-----                                --------
LITERT_ASSIGN_OR_ABORT(var, expr)    Assigns, or calls abort() on failure
LITERT_ABORT_IF_ERROR(expr)          Calls abort() if the expression returns an error
LITERT_ASSIGN_OR_RETURN(var, expr)   Assigns, or propagates the error to the caller
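These macros follow the familiar StatusOr/Expected "unwrap or fail" pattern. The sketch below is a simplified, self-contained illustration of the abort-on-failure variant — it is not the real LiteRT implementation (which operates on litert::Expected), and all names here are stand-ins:

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <optional>
#include <utility>

// Toy stand-in for litert::Expected<T>: an empty optional signals an error.
template <typename T>
using MyExpected = std::optional<T>;

// Simplified sketch of the ASSIGN_OR_ABORT pattern: evaluate expr once,
// abort with a message on failure, otherwise unwrap the value into `var`.
#define MY_ASSIGN_OR_ABORT(var, expr)                    \
  auto var##_result = (expr);                            \
  if (!var##_result.has_value()) {                       \
    std::fprintf(stderr, "FATAL: %s failed\n", #expr);   \
    std::abort();                                        \
  }                                                      \
  auto var = std::move(*var##_result)

MyExpected<int> ParsePositive(int x) {
  if (x <= 0) return std::nullopt;
  return x;
}
```

With these pieces, `MY_ASSIGN_OR_ABORT(value, ParsePositive(42));` leaves `value` holding 42, while a failing expression prints the stringified expression and aborts — the same fail-fast behavior the LiteRT macros give you in the sample binaries.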

6. Build — Option A: Prebuilt LiteRT C++ SDK (CMake)

This is the recommended path if you don't need to modify LiteRT itself. The build script handles downloading the SDK headers, copying your .so, fetching STB, and invoking CMake + NDK in a single command.

Step 1 — Get libLiteRt.so from Maven

LiteRT ships its runtime as a shared library inside an Android AAR on Google Maven. Download it and extract the arm64-v8a .so:

# Download the AAR
wget -O litert.aar \
    "https://dl.google.com/dl/android/maven2/com/google/ai/edge/litert/litert/2.1.3/litert-2.1.3.aar"

# Extract the runtime library
unzip litert.aar "jni/arm64-v8a/libLiteRt.so" -d extracted/

For GPU support, also extract the OpenCL/GL accelerator:

unzip litert.aar "jni/arm64-v8a/libLiteRtClGlAccelerator.so" -d extracted/

Terminal showing wget downloading the LiteRT AAR and unzip extracting libLiteRt.so

Step 2 — Run build_prebuilt.sh

cd litert-samples/compiled_model_api/image_segmentation/c++_segmentation/use_prebuilt_litert/

bash build_prebuilt.sh \
    --litert_version=2.1.3 \
    --ndk_path=/path/to/android-ndk \
    --litert_so=extracted/jni/arm64-v8a/libLiteRt.so

The script will:

  1. Download litert_cc_sdk.zip (SDK headers + cmake files) from the LiteRT GitHub release — skipped on subsequent runs if already present.
  2. Copy libLiteRt.so into litert_cc_sdk/.
  3. Download STB image headers into third_party/stb/ — skipped if present.
  4. Configure and build with CMake using the Android NDK toolchain for arm64-v8a at android-26.

On success you will see three binaries in build/:

build/cpp_segmentation_cpu
build/cpp_segmentation_gpu
build/cpp_segmentation_npu

Terminal showing build_prebuilt.sh output completing with three binaries listed in build/

What CMakeLists.txt does

Open CMakeLists.txt. It requires C++20, pulls in the LiteRT SDK via add_subdirectory, links OpenGL ES 3 (GLESv3) and EGL, then uses a helper macro to create each binary from its main_*.cc source:

macro(add_segmentation_target target_name main_source)
  add_executable(${target_name} ${main_source})
  target_link_libraries(${target_name}
    PRIVATE
      image_processor image_utils timing_utils litert_cc_api
      absl::log absl::check EGL GLESv3 android log
  )
endmacro()

add_segmentation_target(cpp_segmentation_cpu main_cpu.cc)
add_segmentation_target(cpp_segmentation_gpu main_gpu.cc)
add_segmentation_target(cpp_segmentation_npu main_npu.cc)

7. Build — Option B: Build with Bazel (From Source)

Choose this path if you prefer Bazel as your build system, if you need to compile the LiteRT runtime from source, or if you work within an existing Bazel workspace.

Prerequisites

In addition to the NDK and ADB listed in the "Before you begin" section, you will need:

  • Bazel installed and on your PATH.
  • A full clone of the LiteRT samples source repository.

Step 1 — Configure the LiteRT samples workspace

All commands run from the root of the LiteRT samples repository:

cd /path/to/litert-samples
./configure

When prompted:

  • Accept defaults for Python and Python lib path.
  • Answer N to ROCm and CUDA support.
  • Select clang (tested with 18.1.3) as the compiler.
  • Accept default optimization flags.
  • Answer Y to configure the WORKSPACE for Android builds.
  • Set minimum Android NDK level to at least 26.
  • Provide the path to your Android SDK.
  • Set Android SDK API level to default (36) and build tools to 36.0.0.

Terminal showing ./configure prompts and answers for the LiteRT samples workspace

Step 2 — Build the CPU and GPU targets

# CPU
bazel build \
  //compiled_model_api/image_segmentation/c++_segmentation/build_from_source:cpp_segmentation_cpu \
  --config=android_arm64

# GPU
bazel build \
  //compiled_model_api/image_segmentation/c++_segmentation/build_from_source:cpp_segmentation_gpu \
  --config=android_arm64

Step 3 — Build the NPU target

Qualcomm HTP

  1. Download QAIRT SDK v2.41 or later and extract it.
  2. Ensure the extracted SDK contents are inside a subdirectory named latest/:
    /path/to/qairt_sdk/
      └── latest/
          ├── include/
          ├── lib/
          └── ...
    
  3. Build, passing the parent path ending with /:
    bazel build \
      //compiled_model_api/image_segmentation/c++_segmentation/build_from_source:cpp_segmentation_npu \
      --config=android_arm64 \
      --nocheck_visibility \
      --action_env LITERT_QAIRT_SDK=/path/to/qairt_sdk/
    

The --nocheck_visibility flag is required because some upstream LiteRT targets have restricted visibility defaults.

MediaTek APU

No extra SDK is required. The NeuroPilot runtime is a system library on Dimensity 9400 devices.

bazel build \
  //compiled_model_api/image_segmentation/c++_segmentation/build_from_source:cpp_segmentation_npu_mtk \
  --config=android_arm64 \
  --nocheck_visibility

Terminal showing bazel build completing for cpp_segmentation_cpu and cpp_segmentation_gpu

The BUILD file

Open build_from_source/BUILD. It defines four cc_binary targets — one per accelerator plus a dedicated MediaTek NPU target — each depending on the shared image_processor, image_utils, and timing_utils library targets:

cc_binary(
    name = "cpp_segmentation_cpu",
    srcs = ["main_cpu.cc"],
    deps = [
        ":image_processor",
        ":image_utils",
        ":timing_utils",
        "@litert_archive//litert/cc:litert_api_with_dynamic_runtime",
        "@com_google_absl//absl/time",
        "@com_google_absl//absl/types:span",
    ] + gles_deps() + gl_native_deps(),
    ...
)

The GPU target adds libLiteRtClGlAccelerator.so as a data dependency so Bazel includes it in the runfiles. The NPU targets add vendor dispatch and compiler plugin .so files as data dependencies.

8. GPU-Accelerated Preprocessing with Compute Shaders

All three entry points use the same OpenGL ES compute shader pipeline for preprocessing. Understanding it is key to understanding why the GPU path is so much faster than the CPU path.

Set up a headless EGL context

ImageProcessor::InitializeGL() creates a headless EGL context — an OpenGL context with no window or display attached. This is standard practice for off-screen GPU compute on Android. It then compiles the five GLSL compute shader programs from disk:

processor.InitializeGL(
    "shaders/passthrough_shader.vert",
    "shaders/mask_blend_compute.glsl",
    "shaders/resize_compute.glsl",
    "shaders/preprocess_compute.glsl",
    "shaders/deinterleave_masks.glsl");

Upload the input image to the GPU

The JPEG is decoded into CPU memory by ImageUtils::LoadImage() (via the STB library), then uploaded to a GPU texture:

auto img_data_cpu = ImageUtils::LoadImage(
    input_file, width_orig, height_orig, channels_file, /*desired=*/3);

GLuint tex_id_orig = processor.CreateOpenGLTexture(
    img_data_cpu, width_orig, height_orig, loaded_channels);

ImageUtils::FreeImageData(img_data_cpu);  // CPU copy no longer needed

From this point the original image lives in GPU memory as an OpenGL texture.

The preprocess compute shader

shaders/preprocess_compute.glsl dispatches 8×8 thread groups across the 256×256 output grid. Each thread handles one output pixel: it samples the input texture using bilinear filtering (hardware resize for free), converts the [0, 1] RGB value to [-1, 1], and writes to the output SSBO:

vec2 uv = vec2(float(pos.x) / float(out_width - 1),
               float(pos.y) / float(out_height - 1));
vec4 color_0_1 = texture(inputTexture, uv);
vec3 color_neg1_1 = (color_0_1.rgb * 2.0) - 1.0;

int base = (pos.y * out_width + pos.x) * num_channels;
preprocessed_output.data[base + 0] = color_neg1_1.r;
preprocessed_output.data[base + 1] = color_neg1_1.g;
preprocessed_output.data[base + 2] = color_neg1_1.b;

For the standard (non-zero-copy) path, this SSBO is then read back to CPU and written into the LiteRT tensor:

std::vector<float> preprocessed(256 * 256 * num_channels);
processor.ReadBufferData(preprocessed_buffer_id, 0,
                         preprocessed.size() * sizeof(float),
                         preprocessed.data());
LITERT_ABORT_IF_ERROR(
    input_buffers[0].Write(absl::MakeConstSpan(preprocessed)));

9. CPU Inference

Open main_cpu.cc. The LiteRT setup is three lines:

// Create the root environment
LITERT_ASSIGN_OR_ABORT(auto env, litert::Environment::Create({}));

// Compile the model for the CPU
LITERT_ASSIGN_OR_ABORT(auto compiled_model,
    litert::CompiledModel::Create(
        env, model_path, litert::HwAccelerators::kCpu));

// Allocate input and output tensor buffers
LITERT_ASSIGN_OR_ABORT(auto input_buffers,
                       compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_ABORT(auto output_buffers,
                       compiled_model.CreateOutputBuffers());

After preprocessing, inference is a single synchronous call:

LITERT_ABORT_IF_ERROR(compiled_model.Run(input_buffers, output_buffers));

Run() blocks until inference is complete. The selfie_multiclass_256x256.tflite floating-point model runs on the ARM Cortex cores and typically takes ~116–128 ms on a mid-range device.

Binary usage:

cpp_segmentation_cpu <model_path> <input_image> <output_image>

10. GPU Inference (OpenCL)

Open main_gpu.cc. The GPU path introduces two concepts not present in the CPU path: litert::Options for configuring the GPU accelerator (with the OpenCL backend), and asynchronous execution.

Configure GPU options

litert::Options CreateGpuOptions() {
  LITERT_ASSIGN_OR_ABORT(litert::Options options, litert::Options::Create());
  LITERT_ASSIGN_OR_ABORT(auto& gpu_options, options.GetGpuOptions());

  LITERT_ABORT_IF_ERROR(
      gpu_options.SetBackend(litert::GpuOptions::Backend::kOpenCl));

  // Allow CPU fallback for any ops not supported by the GPU delegate
  options.SetHardwareAccelerators(litert::HwAccelerators::kGpu |
                                  litert::HwAccelerators::kCpu);
  return options;
}

Asynchronous inference

The GPU path uses RunAsync() instead of Run(). This submits work to the GPU command queue and returns immediately. You then synchronize before reading results:

bool async = false;
LITERT_ABORT_IF_ERROR(
    compiled_model.RunAsync(0, input_buffers, output_buffers, async));

if (output_buffers[0].HasEvent()) {
  LITERT_ASSIGN_OR_ABORT(auto event, output_buffers[0].GetEvent());
  event.Wait();
}

This non-blocking design lets you overlap CPU work with GPU execution in a real-time pipeline.
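The overlap pattern itself is independent of LiteRT. A generic sketch using std::async as a stand-in for the GPU command queue (all functions here are toy placeholders, not sample code):

```cpp
#include <cassert>
#include <future>

// Placeholder for work submitted to the accelerator.
int HeavyInference() { return 7; }

// Placeholder for CPU work that can proceed while inference runs.
int PrepareNextFrame() { return 42; }

// RunAsync()-style overlap: kick off the "GPU" work, do CPU work in the
// meantime, then synchronize — analogous to event.Wait() before reading
// the output buffers.
int OverlappedPipeline() {
  auto gpu = std::async(std::launch::async, HeavyInference);  // like RunAsync()
  int cpu_result = PrepareNextFrame();  // overlapped CPU work
  int gpu_result = gpu.get();           // like event.Wait()
  return cpu_result + gpu_result;
}
```

In a real camera pipeline, the overlapped CPU work would be decoding or preprocessing the next frame while the current frame is still on the GPU.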

Binary usage:

cpp_segmentation_gpu <model_path> <input_image> <output_image>

11. Postprocess — Deinterleave and Blend

After Run() or RunAsync() completes, output_buffers[0] holds a flat float array of shape [256 × 256 × 6] in interleaved order. The 6 class scores for pixel (row, col) are at indexes (row * 256 + col) * 6 through (row * 256 + col) * 6 + 5.

Deinterleave into 6 mask SSBOs

A CPU helper splits the interleaved array into 6 single-channel float arrays and uploads each to its own GPU SSBO:

std::vector<float> data(256 * 256 * 6);
output_buffers[0].Read(absl::MakeSpan(data));

std::vector<GLuint> mask_ids(6);
for (int i = 0; i < 6; ++i)
  mask_ids[i] = processor.CreateOpenGLBuffer(nullptr, 256 * 256 * sizeof(float));

processor.DeinterleaveMasksCpu(data.data(), 256, 256, mask_ids);
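A minimal sketch of the deinterleave step itself (the sample's DeinterleaveMasksCpu additionally uploads each plane to its own SSBO; this version just returns the planes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split an interleaved [num_pixels * num_classes] score array into
// num_classes single-channel planes of num_pixels floats each.
std::vector<std::vector<float>> DeinterleaveMasks(
    const std::vector<float>& interleaved, int num_pixels, int num_classes) {
  std::vector<std::vector<float>> planes(
      num_classes, std::vector<float>(num_pixels));
  for (int p = 0; p < num_pixels; ++p) {
    for (int c = 0; c < num_classes; ++c) {
      planes[c][p] = interleaved[static_cast<size_t>(p) * num_classes + c];
    }
  }
  return planes;
}
```

For the segmentation output, num_pixels is 256 × 256 and num_classes is 6, so each plane is one 65,536-float mask.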

Color-blend masks onto the original image

processor.ApplyColoredMasks() runs the mask_blend_compute.glsl shader. For each output pixel it finds the class with the highest score (argmax across the 6 mask SSBOs) and alpha-composites the corresponding color over the original image pixel. The six colors are defined in each entry point:

std::vector<RGBAColor> mask_colors = {
    {1.0f, 0.0f, 0.0f, 0.1f},  // red     - background
    {0.0f, 1.0f, 0.0f, 0.1f},  // green   - hair
    {0.0f, 0.0f, 1.0f, 0.1f},  // blue    - body skin
    {1.0f, 1.0f, 0.0f, 0.1f},  // yellow  - face skin
    {1.0f, 0.0f, 1.0f, 0.1f},  // magenta - clothes
    {0.0f, 1.0f, 1.0f, 0.1f},  // cyan    - accessories
};

The alpha of 0.1f keeps the tint subtle so the original image remains visible.

Save the output

The final blended RGBA float SSBO is read back, clamped to [0, 1], converted to unsigned char, and saved as PNG:

for (size_t i = 0; i < float_data.size(); ++i)
  uchar_data[i] = static_cast<unsigned char>(
      std::max(0.0f, std::min(1.0f, float_data[i])) * 255.0f);
ImageUtils::SaveImage(output_file, width, height, 4, uchar_data.data());

12. Deploy and Run on Device

Connect your Android device using USB and verify ADB connectivity:

adb devices

Terminal showing adb devices output with one connected device

Use deploy_and_run_on_android.sh

Each variant has its own deploy script. The CMake variant points at the build/ directory; the Bazel variant points at bazel-bin/. Both scripts:

  1. Create /data/local/tmp/cpp_segmentation_android/ on the device.
  2. Push the binary, GLSL shaders, model, test image, and runtime .so files.
  3. Run inference using adb shell.
  4. Pull output_segmented.png back to your machine.

CMake variant (use_prebuilt_litert/)

# CPU
./deploy_and_run_on_android.sh --accelerator=cpu --phone=s25 build/

# GPU
./deploy_and_run_on_android.sh --accelerator=gpu --phone=s25 build/

Bazel variant (build_from_source/)

Run these commands from the LiteRT samples repo root:

# CPU
./compiled_model_api/image_segmentation/c++_segmentation/build_from_source/deploy_and_run_on_android.sh \
    --accelerator=cpu --phone=s25 bazel-bin/

# GPU
./compiled_model_api/image_segmentation/c++_segmentation/build_from_source/deploy_and_run_on_android.sh \
    --accelerator=gpu --phone=s25 bazel-bin/

The --phone flag controls which device-specific model and vendor libraries are used. Supported values: s24 (Snapdragon 8 Gen 3), s25 (Snapdragon 8 Elite), dim9400 (MediaTek Dimensity 9400), pixel8 (Tensor G3), pixel9 (Tensor G4), pixel10 (Tensor G5), and pixel11 (Tensor G6).

Inference Timing

After inference, PrintTiming() prints a full profiling breakdown:

Load image:    X ms
Preprocess:    X ms
Inference:     X ms
Postprocess:   X ms
E2E:           X ms
Save image:    X ms
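The sample's timing_utils is a thin wrapper over std::chrono. A minimal equivalent of the per-phase measurement (class name hypothetical, not from the sample):

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Minimal wall-clock timer in the spirit of timing_utils: measures the
// elapsed time between construction and Stop(), in milliseconds.
class ScopedTimerMs {
 public:
  ScopedTimerMs() : start_(std::chrono::steady_clock::now()) {}
  double Stop() const {
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start_).count();
  }

 private:
  std::chrono::steady_clock::time_point start_;
};
```

Wrapping each phase — load, preprocess, Run(), postprocess, save — in its own timer and printing the results produces a breakdown like the one above. steady_clock is the right choice here because it is monotonic and unaffected by wall-clock adjustments.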

Reference performance on Samsung S25 Ultra (Snapdragon 8 Elite):

Accelerator    Execution type   Inference     E2E
-----------    --------------   ---------     ---
CPU            Sync             ~116–128 ms   ~157 ms
GPU (OpenCL)   Async            ~0.95 ms      ~35–43 ms

13. Advanced (Optional): NPU Inference

For maximum performance, LiteRT supports NPU acceleration using vendor-specific plugin libraries. The NPU path can achieve end-to-end latency as low as 9 ms.

Supported devices and modes

Chip                      Device example   Mode      E2E
----                      --------------   ----      ---
Qualcomm SM8650           Galaxy S24       AOT       ~17 ms
Qualcomm SM8750           Galaxy S25       AOT       ~17 ms
Qualcomm (any)                             JIT       ~28 ms
MediaTek Dimensity 9400                    JIT       ~9 ms
Google Tensor G3-G6       Pixel 8-11       AOT/JIT   Varies

AOT (Ahead-of-Time) uses a device-specific pre-compiled model (e.g., selfie_multiclass_256x256_SM8650.tflite). These are the fastest option but are chip-specific.

JIT (Just-in-Time) uses the standard selfie_multiclass_256x256.tflite and compiles to the NPU at runtime — slower first run, chip-independent.
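Picking between the AOT and JIT model files can be as simple as a lookup keyed on the target chip. The filenames below come from the models/ directory shown earlier; the selection function and the SoC-string key are hypothetical illustrations, not sample code:

```cpp
#include <cassert>
#include <map>
#include <string>

// Map a SoC identifier to its AOT-compiled model, falling back to the
// chip-independent model (JIT-compiled on the NPU at runtime) when no
// precompiled variant exists for this chip.
std::string SelectModelForSoc(const std::string& soc) {
  static const std::map<std::string, std::string> kAotModels = {
      {"SM8650", "models/selfie_multiclass_256x256_SM8650.tflite"},
      {"SM8750", "models/selfie_multiclass_256x256_SM8750.tflite"},
  };
  auto it = kAotModels.find(soc);
  if (it != kAotModels.end()) return it->second;  // AOT: fastest, chip-specific
  return "models/selfie_multiclass_256x256.tflite";  // JIT: chip-independent
}
```

In the codelab the same decision is made by the deploy script via the --phone and --jit flags rather than in C++.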

Extra prerequisites

Qualcomm HTP:

  • QAIRT SDK v2.41+ (provides libQnnHtp.so, stub or skel .so files).
  • libLiteRtDispatch_Qualcomm.so from the LiteRT NPU runtime libraries release on GitHub.

MediaTek APU:

  • libLiteRtDispatch_MediaTek.so from the LiteRT NPU runtime libraries release.
  • NeuroPilot runtime (already a system library on Dimensity 9400 devices — nothing to push).

Google Tensor:

  • libLiteRtDispatch_GoogleTensor.so from the LiteRT NPU runtime libraries release.

NPU environment and options

main_npu.cc points the Environment at the vendor dispatch library directory on the device, then sets vendor-specific performance options:

// Configure LiteRT to find the dispatch library
std::vector<litert::Environment::Option> env_opts;
env_opts.push_back({litert::Environment::OptionTag::DispatchLibraryDir,
                    kQualcommDispatchDir});
LITERT_ASSIGN_OR_ABORT(auto env,
    litert::Environment::Create(std::move(env_opts)));

// Target NPU with CPU fallback
LITERT_ASSIGN_OR_ABORT(litert::Options options, litert::Options::Create());
options.SetHardwareAccelerators(litert::HwAccelerators::kNpu |
                                litert::HwAccelerators::kCpu);

// Qualcomm: burst performance mode
auto& qnn_opts = options.GetQualcommOptions();
qnn_opts.SetLogLevel(litert::qualcomm::QualcommOptions::LogLevel::kOff);
qnn_opts.SetHtpPerformanceMode(
    litert::qualcomm::QualcommOptions::HtpPerformanceMode::kBurst);

LITERT_ASSIGN_OR_ABORT(auto model,
    litert::CompiledModel::Create(env, model_path, options));

For MediaTek, replace the GetQualcommOptions() block:

// MediaTek: fast single-answer mode + low-latency hint
auto& mtk_opts = options.GetMediatekOptions();
mtk_opts.SetPerformanceMode(
    kLiteRtMediatekNeuronAdapterPerformanceModeNeuronPreferFastSingleAnswer);
mtk_opts.SetOptimizationHint(
    kLiteRtMediatekNeuronAdapterOptimizationHintLowLatency);
mtk_opts.SetNeronSDKVersionType(
    kLiteRtMediatekOptionsNeronSDKVersionTypeVersion8);

Deploy for NPU

CMake variant — Qualcomm S25 (AOT)

./deploy_and_run_on_android.sh \
    --accelerator=npu --phone=s25 \
    --host_npu_lib=/path/to/qairt/lib \
    --host_npu_dispatch_lib=/path/to/dir/with/libLiteRtDispatch_Qualcomm.so \
    build/

CMake variant — MediaTek Dimensity 9400 (JIT)

./deploy_and_run_on_android.sh \
    --accelerator=npu --phone=dim9400 --jit \
    --host_npu_dispatch_lib=/path/to/dir/with/libLiteRtDispatch_MediaTek.so \
    build/

Bazel variant — Qualcomm S25 (AOT)

./compiled_model_api/image_segmentation/c++_segmentation/build_from_source/deploy_and_run_on_android.sh \
    --accelerator=npu --phone=s25 bazel-bin/

Bazel variant — MediaTek Dimensity 9400 (JIT)

./compiled_model_api/image_segmentation/c++_segmentation/build_from_source/deploy_and_run_on_android.sh \
    --accelerator=npu --phone=dim9400 --jit bazel-bin/

Bazel variant — Google Tensor Pixel 9 (JIT)

./compiled_model_api/image_segmentation/c++_segmentation/build_from_source/deploy_and_run_on_android.sh \
    --accelerator=npu --phone=pixel9 --jit bazel-bin/

For the Bazel variant, QAIRT SDK libraries are picked up automatically from the bazel-bin runfiles tree when LITERT_QAIRT_SDK is set at build time. The CMake variant requires the --host_npu_lib flag to point at your extracted QAIRT SDK.

14. Congratulations!

You have successfully built and run a C++ image segmentation pipeline on Android using LiteRT. You've learned how to:

  • Cross-compile a C++ binary for Android arm64-v8a with CMake + NDK or Bazel.
  • Use the LiteRT C++ API (Environment, CompiledModel, TensorBuffer) for efficient on-device inference.
  • Preprocess image data on the GPU with OpenGL ES 3.1 compute shaders.
  • Run synchronous CPU inference, and asynchronous GPU (OpenCL) inference.
  • Configure NPU acceleration for Qualcomm, MediaTek, and Google Tensor devices.
  • Deploy and run a C++ binary on Android using ADB.

Next Steps

  • Swap in a different TFLite model (e.g., depth estimation or pose detection).
  • Integrate the C++ pipeline into an Android NDK app using JNI.
  • Profile memory usage with the Android GPU Inspector alongside timing output.
  • Explore model quantization to further reduce NPU inference latency.

Learn More