Image segmentation guide

The MediaPipe Image Segmenter task lets you divide images into regions based on predefined categories. You can use this functionality to identify specific objects or textures, and then apply visual effects such as background blurring. This task includes several models specifically trained for segmenting people and their features within image data, including:

  • Person and background
  • Person's hair only
  • Person's hair, face, skin, clothing, and accessories

This task operates on image data with a machine learning (ML) model, processing either single images or a continuous video stream. It outputs a list of segmented regions representing objects or areas in the image, depending on the model you choose.


Get started

Start using this task by following one of the implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, including a recommended model and a code example with recommended configuration options.
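As a rough orientation before the platform guides, the following is a minimal sketch of IMAGE-mode setup with the MediaPipe Python Tasks API. The model file name and input image are assumptions; substitute a model downloaded from the Models section below.

```python
# Minimal IMAGE-mode sketch with the MediaPipe Python Tasks API.
# 'selfie_segmenter.tflite' and 'portrait.jpg' are assumed local files.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.ImageSegmenterOptions(
    base_options=python.BaseOptions(model_asset_path='selfie_segmenter.tflite'),
    running_mode=vision.RunningMode.IMAGE,
    output_category_mask=True)

with vision.ImageSegmenter.create_from_options(options) as segmenter:
    mp_image = mp.Image.create_from_file('portrait.jpg')
    result = segmenter.segment(mp_image)
    # The category mask is a uint8 image the same size as the input.
    print(result.category_mask.numpy_view().shape)
```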

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

  • Input image processing - Processing includes image rotation, resizing, normalization, and color space conversion.
Task inputs

Input can be one of the following data types:
  • Still images
  • Decoded video frames
  • Live video feed

Task outputs

Image Segmenter outputs segmented image data, which can include one or both of the following, depending on the configuration options you set:
  • CATEGORY_MASK: a list containing a segmented mask as a uint8 format image. Each pixel value indicates if it is part of a specific segment category supported by the model.
  • CONFIDENCE_MASK: a list of channels containing a segmented mask with pixel values in float32 format. Each pixel value indicates the level of confidence that it is part of a specific category supported by the model.
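As a hedged illustration of the two output types, the snippet below continues the sketch above, assuming the segmenter was created with both output_category_mask=True and output_confidence_masks=True:

```python
# Reading both mask types from a segmentation result. Assumes `segmenter`
# and `mp_image` from the earlier sketch, with both output options enabled.
import numpy as np

result = segmenter.segment(mp_image)

# CATEGORY_MASK: a single uint8 image; each pixel holds the winning
# category index.
category_mask = result.category_mask.numpy_view()

# CONFIDENCE_MASK: one float32 image per category, with values in [0, 1].
confidence_masks = [m.numpy_view() for m in result.confidence_masks]

print(category_mask.dtype)        # uint8
print(confidence_masks[0].dtype)  # float32
```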

Configuration options

This task has the following configuration options:

running_mode
  Sets the running mode for the task. There are three modes:
    • IMAGE: The mode for single image inputs.
    • VIDEO: The mode for decoded frames of a video.
    • LIVE_STREAM: The mode for a live stream of input data, such as from a camera. In this mode, resultListener must be called to set up a listener to receive results asynchronously.
  Value range: {IMAGE, VIDEO, LIVE_STREAM}. Default value: IMAGE.

output_category_mask
  If set to True, the output includes a segmentation mask as a uint8 image, where each pixel value indicates the winning category value.
  Value range: {True, False}. Default value: False.

output_confidence_masks
  If set to True, the output includes a segmentation mask as a float value image, where each float value represents the confidence score map of the category.
  Value range: {True, False}. Default value: True.

display_names_locale
  Sets the language of labels to use for display names provided in the metadata of the task's model, if available. You can add localized labels to the metadata of a custom model using the TensorFlow Lite Metadata Writer API.
  Value range: locale code. Default value: en.

result_callback
  Sets the result listener to receive the segmentation results asynchronously when the image segmenter is in live stream mode. Can only be used when the running mode is set to LIVE_STREAM.
  Value range: N/A. Default value: N/A.
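To show how running_mode and result_callback fit together, here is a hedged LIVE_STREAM sketch in the Python Tasks API; the model path and frame source are placeholder assumptions:

```python
# Hedged LIVE_STREAM sketch: results are delivered asynchronously through
# result_callback. 'selfie_segmenter.tflite' and 'frame.jpg' are assumptions.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

def on_result(result, output_image: mp.Image, timestamp_ms: int):
    # Called on a worker thread once per processed frame.
    print('mask at', timestamp_ms, result.category_mask.numpy_view().shape)

options = vision.ImageSegmenterOptions(
    base_options=python.BaseOptions(model_asset_path='selfie_segmenter.tflite'),
    running_mode=vision.RunningMode.LIVE_STREAM,
    output_category_mask=True,
    result_callback=on_result)

with vision.ImageSegmenter.create_from_options(options) as segmenter:
    frame = mp.Image.create_from_file('frame.jpg')  # stand-in for a camera frame
    # segment_async() returns immediately; timestamps must be monotonically
    # increasing milliseconds.
    segmenter.segment_async(frame, 0)
```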

Models

The Image Segmenter can be used with more than one ML model. Most of the following segmentation models are built and trained to perform segmentation with images of people. However, the DeepLab-v3 model is built as a general-purpose image segmenter. Select the model that best fits your application.

Selfie segmentation model

This model can segment the portrait of a person and can be used to replace or modify the background in an image. The model outputs two categories: background at index 0 and person at index 1. The model has versions with different input shapes, including a square version and a landscape version, which may be more efficient for applications where the input is always that shape, such as video calls.

Model name | Input shape | Quantization type | Model Card | Versions
SelfieSegmenter (square) | 256 x 256 | float 16 | info | Latest
SelfieSegmenter (landscape) | 144 x 256 | float 16 | info | Latest
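As one example of background modification with this model, the hedged sketch below blurs everything the category mask labels as background. It assumes `result` comes from a segment() call as in the earlier sketches, and uses OpenCV and NumPy; the file names are assumptions.

```python
# Background blur with the SelfieSegmenter category mask
# (index 0 = background, index 1 = person). File names are assumptions.
import cv2
import numpy as np

frame = cv2.imread('portrait.jpg')
blurred = cv2.GaussianBlur(frame, (55, 55), 0)

category_mask = result.category_mask.numpy_view()  # same size as the input
person = (category_mask == 1)

# Keep person pixels; take everything else from the blurred frame.
output = np.where(person[..., None], frame, blurred)
cv2.imwrite('portrait_blurred.jpg', output)
```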

Hair segmentation model

This model takes an image of a person, locates the hair on their head, and outputs an image segmentation map for their hair. You can use this model for recoloring hair or applying other hair effects. The model outputs the following segmentation categories:

0 - background
1 - hair

Model name | Input shape | Quantization type | Model Card | Versions
HairSegmenter | 512 x 512 | None (float32) | info | Latest
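For hair recoloring, a confidence mask blends more smoothly at the hair boundary than a hard category mask. The sketch below is a hedged example: it assumes a HairSegmenter result obtained with output_confidence_masks=True, and the tint color and file names are arbitrary choices.

```python
# Hair tint using the HairSegmenter confidence mask (category 1 = hair).
# Assumes `result` from a segment() call with output_confidence_masks=True.
import cv2
import numpy as np

frame = cv2.imread('portrait.jpg').astype(np.float32)
hair_conf = result.confidence_masks[1].numpy_view()  # float32 in [0, 1]

tint = np.array([180.0, 105.0, 255.0])  # BGR pink, chosen arbitrarily
alpha = hair_conf[..., None] * 0.6      # cap the tint strength
output = frame * (1 - alpha) + tint * alpha
cv2.imwrite('hair_tinted.jpg', output.astype(np.uint8))
```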

Multi-class selfie segmentation model

This model takes an image of a person, locates areas for different features such as hair, skin, and clothing, and outputs an image segmentation map for these items. You can use this model for applying various effects to people in images or video. The model outputs the following segmentation categories:

0 - background
1 - hair
2 - body-skin
3 - face-skin
4 - clothes
5 - others (accessories)

Model name | Input shape | Quantization type | Model Card | Versions
SelfieMulticlass (256 x 256) | 256 x 256 | None (float32) | info | Latest
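Because every pixel in the category mask carries one of the six indices above, isolating a single category is a simple comparison. The sketch below is illustrative only; `frame` and `result` are assumed to come from the earlier sketches.

```python
# Isolate one SelfieMulticlass category by index (4 = clothes) and
# desaturate everything else. Assumes `frame` (BGR uint8) and `result`.
import numpy as np

CLOTHES = 4
category_mask = result.category_mask.numpy_view()  # uint8 index per pixel
clothes = (category_mask == CLOTHES)

gray = frame.mean(axis=-1, keepdims=True).repeat(3, axis=-1).astype(frame.dtype)
output = np.where(clothes[..., None], frame, gray)
```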

DeepLab-v3 model

This model identifies segments for a number of categories, including background, person, cat, dog, and potted plant. The model uses atrous spatial pyramid pooling to capture longer-range information. For more information, see DeepLab-v3.

Model name | Input shape | Quantization type | Versions
DeepLab-V3 | 257 x 257 | None (float32) | Latest

Task benchmarks

Here are the task benchmarks for the whole pipeline, based on the pre-trained models above. The latency figures are the average latency on a Pixel 6 using the CPU and the GPU, respectively.

Model name | CPU latency | GPU latency
SelfieSegmenter (square) | 33.46 ms | 35.15 ms
SelfieSegmenter (landscape) | 34.19 ms | 33.55 ms
HairSegmenter | 57.90 ms | 52.14 ms
SelfieMulticlass (256 x 256) | 217.76 ms | 71.24 ms
DeepLab-V3 | 123.93 ms | 103.30 ms