Attention: This MediaPipe Solutions Preview is an early release.

Object detection task guide


The MediaPipe Object Detector task lets you detect the presence and location of multiple classes of objects within images or videos. For example, an object detector can locate dogs within an image. This task operates on image data with a machine learning (ML) model, accepting static data or a continuous video stream as input and outputting a list of detection results. Each detection result represents an object that appears within the image or video.


Get Started

Start using this task by following the implementation guide for the platform you are working on. These platform-specific guides walk you through a basic implementation of this task, including a recommended model and a code example with recommended configuration options.

Task details

This section describes the capabilities, inputs, and outputs of this task.

Features

  • Input image processing - Processing includes image rotation, resizing, normalization, and color space conversion.
  • Label map locale - Set the language used for display names.
  • Score threshold - Filter results based on prediction scores.
  • Top-k detection - Filter the number of detection results.
  • Label allowlist and denylist - Specify the categories detected.
Task inputs

The Object Detector API accepts an input of one of the following data types:

  • Still images
  • Decoded video frames
  • Live video feed

Task outputs

The Object Detector API outputs the following results for detected objects:

  • Category of object
  • Probability score
  • Bounding box coordinates
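
For illustration, here is a minimal sketch of this input/output flow in IMAGE mode using the MediaPipe Tasks Python API. The model and image file names are placeholders, and the sketch assumes the mediapipe package is installed:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Placeholder file names: substitute your own model and image.
options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="efficientdet_lite0.tflite"))
detector = vision.ObjectDetector.create_from_options(options)

# Run detection on a still image.
image = mp.Image.create_from_file("dogs.jpg")
result = detector.detect(image)

# Each detection result carries a category, a probability score,
# and a bounding box in pixel coordinates.
for detection in result.detections:
    top_category = detection.categories[0]
    box = detection.bounding_box
    print(f"{top_category.category_name} ({top_category.score:.2f}): "
          f"x={box.origin_x}, y={box.origin_y}, w={box.width}, h={box.height}")
```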

Configuration options

This task has the following configuration options:

| Option Name | Description | Value Range | Default Value |
| --- | --- | --- | --- |
| running_mode | Sets the running mode for the task. Object Detector has three modes:<br><br>IMAGE: The mode for detecting objects on single image inputs.<br><br>VIDEO: The mode for detecting objects on the decoded frames of a video.<br><br>LIVE_STREAM: The mode for detecting objects on a live stream of input data, such as from a camera. In this mode, resultListener must be called to set up a listener to receive the detection results asynchronously. | {IMAGE, VIDEO, LIVE_STREAM} | IMAGE |
| display_names | Sets the language of labels to use for display names provided in the metadata of the task's model, if available. Default is en for English. You can add localized labels to the metadata of a custom model using the TensorFlow Lite Metadata Writer API. | Locale code | en |
| max_results | Sets the optional maximum number of top-scored detection results to return. | Any positive numbers | -1 (all results are returned) |
| score_threshold | Sets the prediction score threshold that overrides the one provided in the model metadata (if any). Results below this value are rejected. | Any float | Not set |
| category_allowlist | Sets the optional list of allowed category names. If non-empty, detection results whose category name is not in this set will be filtered out. Duplicate or unknown category names are ignored. This option is mutually exclusive with category_denylist; using both results in an error. | Any strings | Not set |
| category_denylist | Sets the optional list of category names that are not allowed. If non-empty, detection results whose category name is in this set will be filtered out. Duplicate or unknown category names are ignored. This option is mutually exclusive with category_allowlist; using both results in an error. | Any strings | Not set |
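
As a sketch of how these options might map onto code, the following uses the MediaPipe Tasks Python API, where the result listener option is named result_callback. The model path, category names, and threshold values are placeholders chosen for illustration:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

def print_result(result, output_image, timestamp_ms):
    # In LIVE_STREAM mode, detection results arrive asynchronously here.
    print(f"{timestamp_ms} ms: {len(result.detections)} objects detected")

options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="model.tflite"),  # placeholder
    running_mode=vision.RunningMode.LIVE_STREAM,
    max_results=5,                      # keep at most the 5 top-scored results
    score_threshold=0.5,               # reject predictions scoring below 0.5
    category_allowlist=["dog", "cat"],  # do not combine with category_denylist
    result_callback=print_result)       # required in LIVE_STREAM mode

with vision.ObjectDetector.create_from_options(options) as detector:
    # Feed camera frames with detector.detect_async(mp_image, timestamp_ms);
    # in VIDEO mode you would call detector.detect_for_video(...) instead.
    pass
```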

Models

The Object Detector API requires an object detection model to be downloaded and stored in your project directory. If you do not already have a model, start with the default, recommended model. The other models presented in this section make trade-offs between latency and accuracy.

EfficientDet-Lite0 model (recommended)

The EfficientDet-Lite0 model uses an EfficientNet-Lite0 backbone with a 320x320 input size and BiFPN feature network. The model was trained with the COCO dataset, a large-scale object detection dataset that contains 1.5 million object instances and 80 object labels. See the full list of supported labels. EfficientDet-Lite0 is available as an int8 model or float model. This model is recommended because it strikes a balance between latency and accuracy. It is both accurate and lightweight enough for many use cases.


EfficientDet-Lite2 model

The EfficientDet-Lite2 model uses an EfficientNet-Lite2 backbone with a 448x448 input size and BiFPN feature network. The model was trained with the COCO dataset, a large-scale object detection dataset that contains 1.5 million object instances and 80 object labels. See the full list of supported labels. EfficientDet-Lite2 is available as an int8 model or float model. This model is generally more accurate than EfficientDet-Lite0, but is also slower and more memory intensive. It is appropriate for use cases where accuracy is a greater priority than speed and size.


SSD MobileNetV2 model

The SSD MobileNetV2 model uses a MobileNetV2 backbone with a 256x256 input size and SSD feature network. The model was trained with the COCO dataset, a large-scale object detection dataset that contains 1.5 million object instances and 80 object labels. See the full list of supported labels. SSD MobileNetV2 is available as an int8 model or float model. This model is faster and lighter than EfficientDet-Lite0, but is also generally less accurate. This model is appropriate for use cases that require a fast, lightweight model that sacrifices some accuracy.


Model benchmarks

The following table shows benchmarks for the preceding pre-trained models. Latency was measured on a Pixel 6, using a single thread for the CPU measurements.

| Model Name | Model size | mAP (mean average precision) | CPU Latency | GPU Latency |
| --- | --- | --- | --- | --- |
| EfficientDet-Lite0 float model | 14 MB | 24.38% | 51.11 ms | 33.01 ms |
| EfficientDet-Lite0 int8 model | 4.4 MB | 24.08% | 27.8 ms | - |
| EfficientDet-Lite2 float model | 21.7 MB | 31.70% | 158.51 ms | 56.62 ms |
| EfficientDet-Lite2 int8 model | 7.2 MB | 31.21% | 82.45 ms | - |
| SSD MobileNetV2 float model | 11 MB | 21.20% | 31.26 ms | 24.55 ms |
| SSD MobileNetV2 int8 model | 3.3 MB | 20.10% | 16.72 ms | - |

Model requirements and metadata

This section describes the requirements for custom models if you decide to build a model to use with this task. Custom models must be in TensorFlow Lite format and must include metadata describing the operating parameters of the model.

Design requirements

| Input | Shape | Description |
| --- | --- | --- |
| Input image | Float32 tensor of shape [1, height, width, 3] | The normalized input image. |

| Output | Shape | Description |
| --- | --- | --- |
| detection_boxes | Float32 tensor of shape [1, num_boxes, 4] | Box location of each detected object. |
| detection_classes | Float32 tensor of shape [1, num_boxes] | Indices of the class names for each detected object. |
| detection_scores | Float32 tensor of shape [1, num_boxes] | Prediction scores for each detected object. |
| num_boxes | Float32 tensor of size 1 | The number of detected boxes. |
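
One way to check whether a candidate custom model matches these tensor shapes is to inspect it with the TensorFlow Lite interpreter, as in this sketch; the model path is a placeholder and the tensorflow package is assumed to be installed:

```python
import tensorflow as tf

# Placeholder path to the candidate custom model.
interpreter = tf.lite.Interpreter(model_path="custom_detector.tflite")
interpreter.allocate_tensors()

# Expect one Float32 input of shape [1, height, width, 3].
for detail in interpreter.get_input_details():
    print("input:", detail["name"], detail["shape"], detail["dtype"])

# Expect four Float32 outputs: boxes, classes, scores, and num_boxes.
for detail in interpreter.get_output_details():
    print("output:", detail["name"], detail["shape"], detail["dtype"])
```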

Metadata requirements

| Parameter | Description |
| --- | --- |
| input_norm_mean | The mean value used in the input tensor normalization. |
| input_norm_std | The standard deviation used in the input tensor normalization. |
| label_file_paths | The paths to the category tensor label files. If the model does not have any label files, pass an empty list. |
| score_calibration_md | Information on the score calibration operation in the classification tensor. This parameter is not required if the model does not use score calibration. |
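
As a rough illustration, the TensorFlow Lite Metadata Writer API can attach the first three of these parameters to an object detection model. This sketch assumes the tflite-support package; the paths and normalization values are placeholders, not values prescribed by this guide:

```python
from tflite_support.metadata_writers import object_detector
from tflite_support.metadata_writers import writer_utils

# Placeholder paths and normalization values for illustration only.
MODEL_PATH = "custom_detector.tflite"
LABEL_FILE = "labels.txt"
SAVE_PATH = "custom_detector_with_metadata.tflite"

writer = object_detector.MetadataWriter.create_for_inference(
    writer_utils.load_file(MODEL_PATH),
    input_norm_mean=[127.5],      # mean used to normalize input pixels
    input_norm_std=[127.5],       # std deviation used to normalize input pixels
    label_file_paths=[LABEL_FILE])

# Write the model with embedded metadata back to disk.
writer_utils.save_file(writer.populate(), SAVE_PATH)
```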