Language detection guide

The MediaPipe Language Detector task lets you identify the language of a piece of text. This task operates on text data with a machine learning (ML) model and outputs a list of predictions, where each prediction consists of an ISO 639-1 language code and a probability.

Get Started

Start using this task by following the implementation guide for your target platform. Each platform-specific guide walks you through a basic implementation of this task, including a recommended model and a code example with recommended configuration options.

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

  • Score threshold - Filter results based on prediction scores
  • Label allowlist and denylist - Specify the categories detected
Task inputs

Language Detector accepts the following input data type:

  • String

Task outputs

Language Detector outputs a list of predictions, each containing:

  • Language code: An ISO 639-1 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) language or locale code as a string, for example "en" for English, "uz" for Uzbek, or "ja-Latn" for Japanese (romaji).
  • Probability: The confidence score for this prediction, expressed as a floating-point probability between zero and one.
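
To make the output shape concrete, here is a minimal sketch of how a list of predictions could be represented and how the most likely language could be picked from it. The `LanguagePrediction` type, the `top_prediction` helper, and the sample values are hypothetical and exist only for illustration; they are not part of the MediaPipe API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LanguagePrediction:
    """One entry in the output list (hypothetical type, for illustration only)."""
    language_code: str   # e.g. "en", "uz", "ja-Latn"
    probability: float   # confidence between 0.0 and 1.0


def top_prediction(
    predictions: List[LanguagePrediction],
) -> Optional[LanguagePrediction]:
    """Return the most probable prediction, or None if the list is empty."""
    return max(predictions, key=lambda p: p.probability, default=None)


# Made-up values, not real model output.
example_output = [
    LanguagePrediction("ja-Latn", 0.05),
    LanguagePrediction("en", 0.92),
    LanguagePrediction("uz", 0.03),
]

best = top_prediction(example_output)
if best is not None:
    print(f"{best.language_code} ({best.probability:.2f})")  # prints "en (0.92)"
```

Handling the empty list explicitly matters because filtering options such as a score threshold can leave no predictions at all.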

Configuration options

This task has the following configuration options:

  • max_results - Sets the optional maximum number of top-scored language predictions to return. If this value is less than zero, all available results are returned. Value range: any positive number. Default value: -1.
  • score_threshold - Sets the prediction score threshold that overrides the one provided in the model metadata (if any). Results below this value are rejected. Value range: any float. Default value: not set.
  • category_allowlist - Sets the optional list of allowed language codes. If non-empty, language predictions whose language code is not in this set are filtered out. This option is mutually exclusive with category_denylist, and using both results in an error. Value range: any strings. Default value: not set.
  • category_denylist - Sets the optional list of language codes that are not allowed. If non-empty, language predictions whose language code is in this set are filtered out. This option is mutually exclusive with category_allowlist, and using both results in an error. Value range: any strings. Default value: not set.
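
The way these options interact can be sketched in plain Python. The example below is an illustration of the semantics described above, not MediaPipe's implementation: the `filter_predictions` function and the `Prediction` alias are hypothetical, and only the parameter names mirror the option names. The guide does not say what a `max_results` of zero means, so this sketch assumes that only positive values act as a cap.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical alias: (language_code, probability)
Prediction = Tuple[str, float]


def filter_predictions(
    predictions: Iterable[Prediction],
    max_results: int = -1,
    score_threshold: Optional[float] = None,
    category_allowlist: Optional[Iterable[str]] = None,
    category_denylist: Optional[Iterable[str]] = None,
) -> List[Prediction]:
    """Apply the documented option semantics to a list of predictions."""
    allow = set(category_allowlist or ())
    deny = set(category_denylist or ())
    # The two lists are mutually exclusive; using both is an error.
    if allow and deny:
        raise ValueError(
            "category_allowlist and category_denylist are mutually exclusive"
        )

    kept = []
    for code, probability in predictions:
        # Results below the threshold are rejected.
        if score_threshold is not None and probability < score_threshold:
            continue
        # A non-empty allowlist keeps only the listed codes.
        if allow and code not in allow:
            continue
        # A non-empty denylist drops the listed codes.
        if deny and code in deny:
            continue
        kept.append((code, probability))

    # Top-scored predictions first.
    kept.sort(key=lambda item: item[1], reverse=True)
    # Assumption: only a positive max_results caps the list; negative means all.
    if max_results > 0:
        kept = kept[:max_results]
    return kept


# Made-up values, not real model output.
sample = [("fr", 0.3), ("en", 0.6), ("de", 0.1)]
print(filter_predictions(sample, max_results=2))
print(filter_predictions(sample, score_threshold=0.2, category_denylist=["en"]))
```

Filtering before truncating means `max_results` counts only the predictions that survive the threshold and the allowlist or denylist.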

Models

We offer a default, recommended model when you start developing with this task.

This model is built to be lightweight (315 KB) and uses an embedding-based neural network classification architecture. The model identifies each language by its ISO 639-1 language code and can identify 110 languages. For the full list of supported languages, see the label file, which lists languages by their ISO 639-1 code.

Model name | Input shape | Quantization type | Model card | Versions
Language Detector | string UTF-8 | none (float32) | info | Latest

Task benchmarks

Here are the task benchmarks for the whole pipeline, based on the pre-trained model above. The latency result is the average latency on a Pixel 6 using CPU / GPU.

Model name | CPU latency | GPU latency
Language Detector | 0.31 ms | -
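
If you want a comparable average-latency figure for your own environment, a simple timing harness along the following lines may help. This is a generic sketch, not the method used to produce the numbers above: the `average_latency_ms` helper is hypothetical, and the callable you pass in stands in for whatever single detection call you want to measure.

```python
import time
from typing import Callable


def average_latency_ms(
    fn: Callable[[], object], runs: int = 100, warmup: int = 10
) -> float:
    """Return the mean wall-clock time of one call to fn, in milliseconds."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    # Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / runs


# Placeholder workload; replace the lambda with a single detection call.
latency = average_latency_ms(lambda: "some text".lower(), runs=1000)
print(f"average latency: {latency:.4f} ms")
```

Results measured this way depend on the device, the runtime, and the input length, so they are not directly comparable to the Pixel 6 figure.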