Attention: This MediaPipe Solutions Preview is an early release.

Image embedding task guide

The MediaPipe Image Embedder task lets you create a numeric representation of an image, which is useful for various ML-based image tasks. This functionality is frequently used to compare the similarity of two images using mathematical comparison techniques such as cosine similarity. The task uses a machine learning (ML) model to process image data, either as static data or as a continuous stream, and outputs a numeric representation of the image data as a list of high-dimensional feature vectors, also known as embedding vectors, in either floating-point or quantized form.
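
As a rough illustration of the comparison mentioned above, the following Python sketch computes the cosine similarity of two embedding vectors using only the standard library. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not the MediaPipe API.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional embeddings; real embeddings have far more dimensions.
image_a = [0.1, 0.3, 0.5, 0.1]
image_b = [0.2, 0.6, 1.0, 0.2]   # same direction as image_a, twice the length
image_c = [0.5, -0.4, 0.1, 0.3]  # points in a different direction

print(cosine_similarity(image_a, image_b))
print(cosine_similarity(image_a, image_c))
```

Values close to 1 indicate vectors that point in nearly the same direction, values near 0 indicate unrelated vectors, and values near -1 indicate opposite directions. The length of a vector does not affect the result.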

Get Started

Start using this task by following one of these implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, using a recommended model, and provide code examples with the recommended configuration options:

Android - Code example - Guide
Python - Code example - Guide
Web - Code example - Guide

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

Input image processing - Processing includes image rotation, resizing, normalization, and color space conversion.
Region of interest - Performs embedding on a region of the image instead of the whole image.
Embedding similarity computation - Provides a built-in utility function to compute the cosine similarity between two feature vectors.
Quantization - Supports scalar quantization for the feature vectors.

Task inputs	Task outputs
Input can be one of the following data types: still images, decoded video frames, or a live video feed.	Image Embedder outputs a list of embeddings, each consisting of: Embedding: the feature vector itself, either in floating-point form or scalar-quantized; Head index: the index for the head that produced this embedding; Head name (optional): the name of the head that produced this embedding.
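
To make the shape of the output concrete, here is a minimal Python sketch of a container with the three output fields described above, plus a lookup by head name. The class `Embedding`, the helper `find_by_head_name`, and the head name `"feature"` are hypothetical and exist only for illustration; they are not the types that MediaPipe returns.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Embedding:
    """Hypothetical container mirroring the documented output fields."""
    values: Union[List[float], bytes]  # float vector, or scalar-quantized bytes
    head_index: int                    # index of the head that produced it
    head_name: Optional[str] = None    # optional name of that head


def find_by_head_name(embeddings, name):
    """Return the first embedding produced by the head called `name`, or None."""
    for embedding in embeddings:
        if embedding.head_name == name:
            return embedding
    return None


# A made-up result list with one named head and one unnamed head.
result = [
    Embedding(values=[0.6, 0.8], head_index=0, head_name="feature"),
    Embedding(values=[1.0, 0.0], head_index=1),
]
selected = find_by_head_name(result, "feature")
```

Because the head name is optional, code that consumes the output can fall back to the head index when no name is present.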

Configuration options

This task has the following configuration options:

Option Name	Description	Value Range	Default Value
`running_mode`	Sets the running mode for the task. There are three modes: IMAGE: The mode for single image inputs. VIDEO: The mode for decoded frames of a video. LIVE_STREAM: The mode for a livestream of input data, such as from a camera. In this mode, a result listener must be set (see `result_callback`) to receive results asynchronously.	{`IMAGE, VIDEO, LIVE_STREAM`}	`IMAGE`
`l2_normalize`	Whether to normalize the returned feature vector with L2 norm. Use this option only if the model does not already contain a native L2_NORMALIZATION TFLite Op. In most cases, the model already contains this op, so L2 normalization is achieved through TFLite inference with no need for this option.	`Boolean`	`False`
`quantize`	Whether the returned embedding should be quantized to bytes via scalar quantization. Embeddings are implicitly assumed to be unit-norm and therefore any dimension is guaranteed to have a value in [-1.0, 1.0]. Use the l2_normalize option if this is not the case.	`Boolean`	`False`
`result_callback`	Sets the result listener to receive the embedding results asynchronously when the Image Embedder is in the live stream mode. Can only be used when the running mode is set to `LIVE_STREAM`.	N/A	Not set
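
The relationship between `l2_normalize` and `quantize` can be sketched in plain Python. The example below first scales a vector to unit L2 norm, which keeps every component within [-1.0, 1.0], and then maps each component to a signed byte. The function names and the quantization scheme (multiply by 128, round, clamp to [-128, 127]) are assumptions made for illustration; the exact scheme used by the task is not stated here.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm so every component lies in [-1.0, 1.0]."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def scalar_quantize(unit_vector):
    """Map each component from [-1.0, 1.0] to a signed 8-bit integer.

    Assumed scheme: multiply by 128, round, then clamp to [-128, 127].
    """
    return [max(-128, min(127, round(x * 128))) for x in unit_vector]


raw = [3.0, 4.0]                   # made-up, unnormalized feature vector
unit = l2_normalize(raw)           # unit-norm version of the same direction
quantized = scalar_quantize(unit)  # one small integer per dimension
print(unit, quantized)
```

Quantizing `raw` directly would clamp both components to 127 and lose the direction of the vector, which is why quantization assumes unit-norm input.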

Models

The Image Embedder requires an image embedding model to be downloaded and stored in your project directory. Start with the default, recommended model for your target platform when you start developing with this task. The other available models typically make trade-offs between performance, accuracy, resolution, and resource requirements, and in some cases, include additional features.

MobileNetV3 model

This model family uses a MobileNet V3 architecture and was trained using ImageNet data. This model uses a multiplier of 0.75 for the depth (number of features) in the convolutional layers to tune the accuracy-latency trade-off. In addition, MobileNet V3 comes in two different sizes, small and large, to adapt the network to low or high resource use cases.
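
To show what a depth multiplier of 0.75 means in practice, the sketch below scales a list of layer widths and rounds each one to a multiple of 8. The rounding rule follows a convention common in MobileNet reference implementations; whether these particular models apply it is an assumption, and the base channel counts are made up for illustration.

```python
def make_divisible(value, divisor=8):
    """Round a channel count to a nearby multiple of `divisor`.

    The result is never more than 10% below `value`.
    """
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scaled_channels(base_channels, multiplier=0.75):
    """Apply a depth multiplier to each layer width, then round it."""
    return [make_divisible(c * multiplier) for c in base_channels]


# Hypothetical base layer widths, not taken from the actual model.
print(scaled_channels([16, 24, 40, 64]))
```

Fewer channels per layer means fewer multiply-accumulate operations, which lowers latency at some cost in accuracy.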

Model name	Input shape	Quantization type	Versions
MobileNet-V3 (small)	224 x 224	None (float32)	Latest
MobileNet-V3 (large)	224 x 224	None (float32)	Latest

Task benchmarks

Here are the task benchmarks for the whole pipeline, based on the pre-trained models above. The latency result is the average latency on a Pixel 6 using CPU / GPU.

Model Name	CPU Latency	GPU Latency
MobileNet-V3 (small)	3.94ms	7.83ms
MobileNet-V3 (large)	9.75ms	9.08ms