Attention: This MediaPipe Solutions Preview is an early release.

Text embedding guide for C++


The MediaPipe Text Embedder task lets you embed text into high-dimensional feature vectors representing its semantic meaning. These instructions show you how to use the Text Embedder with the C++ language.

For more information about the capabilities, models, and configuration options of this task, see the Overview.

Setup

This section describes key steps for setting up your development environment and code projects specifically to use Text Embedder. In google3, to use the MediaPipe Text Embedder task API, make your C++ targets depend on //third_party/mediapipe/tasks/cc/text/text_embedder in your BUILD file.
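
As a rough illustration of the dependency described above, a BUILD target might look like the following config fragment. The target name and source file name are hypothetical placeholders, and the choice of cc_library is an assumption; only the dependency label comes from this guide.

```starlark
# Hypothetical target and file names; only the dependency label is taken
# from this guide.
cc_library(
    name = "my_text_embedder_client",
    srcs = ["my_text_embedder_client.cc"],
    deps = [
        "//third_party/mediapipe/tasks/cc/text/text_embedder",
    ],
)
```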

The MediaPipe Text Embedder task requires a trained model that is compatible with this task. You can download the recommended MobileBERT-embedding model and store it within your project directory. Specify the path of the model using the model_asset_path parameter in the base options. For more information on available trained models for Text Embedder, see the Models section.

Create the task

The MediaPipe Text Embedder task uses the TextEmbedder::Create function to set up the task. This function accepts a TextEmbedderOptions object that holds the values of the configuration options. For more information on configuration options, see Configuration options.

The following code demonstrates how to build and configure this task.

#include "third_party/mediapipe/tasks/cc/text/text_embedder/text_embedder.h"

using ::mediapipe::tasks::text::text_embedder::TextEmbedder;
using ::mediapipe::tasks::text::text_embedder::TextEmbedderOptions;

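// Create a text embedder instance. `model_path` is the path to the model
// file described in the Setup section.
auto options = std::make_unique<TextEmbedderOptions>();
options->base_options.model_asset_path = model_path;
// Optional: return scalar-quantized embeddings instead of floating-point ones.
options->embedder_options.quantize = true;

// ASSIGN_OR_RETURN returns early on error, so the enclosing function must
// return a status type.
ASSIGN_OR_RETURN(
  std::unique_ptr<TextEmbedder> text_embedder,
  TextEmbedder::Create(std::move(options)));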

Configuration options

This task has the following configuration options for C++ applications:

| Option Name | Description | Value Range | Default Value |
| --- | --- | --- | --- |
| embedder_options.l2_normalize | Whether to normalize the returned feature vector with L2 norm. Use this option only if the model does not already contain a native L2_NORMALIZATION TFLite Op. Most models already contain this op, so L2 normalization is achieved through TFLite inference and this option is not needed. | Boolean | False |
| embedder_options.quantize | Whether the returned embedding should be quantized to bytes via scalar quantization. Embeddings are implicitly assumed to be unit-norm, so every dimension is guaranteed to have a value in [-1.0, 1.0]. Use the embedder_options.l2_normalize option if this is not the case. | Boolean | False |
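
To make the two options above more concrete, here is a minimal standalone sketch of what L2 normalization and scalar quantization to bytes can look like. It is not the task's internal code: the function names L2Normalize and QuantizeToBytes are hypothetical, and the scale factor of 128 with clamping to [-128, 127] is an assumption about the quantization scheme, chosen only because it maps [-1.0, 1.0] onto the signed byte range.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: scales `v` to unit L2 norm.
// A zero vector is returned unchanged to avoid dividing by zero.
std::vector<float> L2Normalize(const std::vector<float>& v) {
  double sum_sq = 0.0;
  for (float x : v) sum_sq += static_cast<double>(x) * x;
  const double norm = std::sqrt(sum_sq);
  if (norm == 0.0) return v;
  std::vector<float> out;
  out.reserve(v.size());
  for (float x : v) out.push_back(static_cast<float>(x / norm));
  return out;
}

// Hypothetical helper: maps each value, expected to lie in [-1.0, 1.0],
// to a signed byte. The factor 128 and the clamp range are assumptions.
std::vector<std::int8_t> QuantizeToBytes(const std::vector<float>& v) {
  std::vector<std::int8_t> out;
  out.reserve(v.size());
  for (float x : v) {
    const float scaled = std::round(x * 128.0f);
    const float clamped = std::max(-128.0f, std::min(127.0f, scaled));
    out.push_back(static_cast<std::int8_t>(clamped));
  }
  return out;
}
```

Under these assumptions, a value outside [-1.0, 1.0] is clamped to the nearest end of the byte range, which is why the quantize option expects unit-norm embeddings.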

Prepare data

Text Embedder works with text (std::string) data. The task handles the data input preprocessing, including tokenization and tensor preprocessing.

All preprocessing is handled within the Embed function. There is no need for additional preprocessing of the input text beforehand.

std::string input_text = "The input text that will be embedded.";

Run the task

The Text Embedder uses the Embed function to trigger inferences. For text embedding, this means returning the embedding vectors for the input text.

The following code demonstrates how to execute the processing with the task model.

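using ::mediapipe::tasks::text::text_embedder::TextEmbedderResult;

// ASSERT_OK_AND_ASSIGN is a test macro. Outside of tests, use
// ASSIGN_OR_RETURN as in the "Create the task" section.
ASSERT_OK_AND_ASSIGN(TextEmbedderResult embedding_result,
                     text_embedder->Embed(input_text));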

Handle and display results

The Text Embedder outputs a TextEmbedderResult that contains a list of embeddings (either floating-point or scalar-quantized) for the input text.

The following shows an example of the output data from this task:

TextEmbedderResult:
  Embedding #0 (sole embedding head):
    float_embedding: {0.2345f, 0.1234f, ..., 0.6789f}
    head_index: 0

You can compare the semantic similarity of two embeddings using the TextEmbedder::CosineSimilarity function. See the following code for an example.

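// Compute cosine similarity. `other_embedding_result` is the
// TextEmbedderResult returned by calling Embed on a second input text.
ASSERT_OK_AND_ASSIGN(
  double similarity,
  TextEmbedder::CosineSimilarity(embedding_result.embeddings[0],
                                 other_embedding_result.embeddings[0]));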
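
For readers who want to see what the similarity value represents, the following standalone sketch computes cosine similarity on two plain float vectors. It is an illustration of the general formula, not the implementation behind TextEmbedder::CosineSimilarity; the function name CosineSimilaritySketch is hypothetical, and the sketch assumes equal-length, non-zero floating-point vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns dot(a, b) / (|a| * |b|).
// Assumes `a` and `b` have the same length and are not all zeros.
double CosineSimilaritySketch(const std::vector<float>& a,
                              const std::vector<float>& b) {
  assert(a.size() == b.size());
  double dot = 0.0;
  double norm_a = 0.0;
  double norm_b = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    dot += static_cast<double>(a[i]) * b[i];
    norm_a += static_cast<double>(a[i]) * a[i];
    norm_b += static_cast<double>(b[i]) * b[i];
  }
  assert(norm_a > 0.0 && norm_b > 0.0);
  return dot / std::sqrt(norm_a * norm_b);
}
```

The result lies in [-1.0, 1.0]: values near 1.0 indicate vectors pointing in the same direction, values near 0.0 indicate unrelated directions, and values near -1.0 indicate opposite directions.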