Image generation guide

The MediaPipe Image Generator task lets you generate images based on a text prompt. This task uses a text-to-image model to generate images using diffusion techniques.

The task accepts a text prompt as input, along with an optional condition image that the model can augment and use as a reference for generation. For more on conditioned text-to-image generation, see On-device diffusion plugins for conditioned text-to-image generation.

Image Generator can also generate images based on specific concepts provided to the model during training or retraining. For more information, see Customization with LoRA.

Get Started

Start using this task by following one of these implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, with code examples that use a default model and the recommended configuration options:

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

You can use the Image Generator to implement the following:

  1. Text-to-image generation - Generate images with a text prompt.
  2. Image generation with condition images - Generate images with a text prompt and a reference image. Image Generator uses condition images in ways similar to ControlNet.
  3. Image generation with LoRA weights - Generate images of specific people, objects, and styles with a text prompt using customized model weights.

Task inputs

The Image Generator accepts the following inputs:
  • Text prompt
  • Seed
  • Number of generative iterations
  • Optional: condition image

Task outputs

The Image Generator outputs the following results:
  • Generated image based on the inputs.
  • Optional: Iterative snapshots of the generated image.
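
To make the roles of the seed, the iteration count, and the optional snapshots more concrete, here is a minimal toy sketch. It is not a diffusion model and does not use the MediaPipe API; the function name toy_generate, the four-value "image", and the fixed target are all hypothetical stand-ins chosen for illustration.

```python
import random


def toy_generate(prompt, iterations, seed, keep_snapshots=False):
    """Toy stand-in for iterative generation (NOT a real diffusion model)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    # Seeding from the prompt and seed makes the run reproducible:
    # the same inputs always yield the same starting noise.
    rng = random.Random(f"{seed}:{prompt}")
    image = [rng.random() for _ in range(4)]  # start from pure noise
    target = [0.5] * 4  # hypothetical "clean" result the loop moves toward
    snapshots = []
    for _ in range(iterations):
        # Each iteration removes half of the remaining noise.
        image = [px + (t - px) * 0.5 for px, t in zip(image, target)]
        if keep_snapshots:
            snapshots.append(list(image))  # intermediate result after this step
    return image, snapshots


final_image, steps = toy_generate("a teapot", iterations=3, seed=7, keep_snapshots=True)
```

In this sketch, repeating a call with the same prompt, seed, and iteration count reproduces the same result, while the snapshot list holds one intermediate image per iteration.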

Configuration options

This task has the following configuration options:

Option Name | Description | Value Range
imageGeneratorModelDirectory | The image generator model directory storing the model weights. | PATH
loraWeightsFilePath | Sets the path to the LoRA weights file. Optional and only applicable if the model was customized with LoRA. | PATH
errorListener | Sets an optional error listener. | N/A

The task also supports plugin models, which let users include a condition image in the task input. The foundation model can augment the condition image and use it as a reference for generation. Condition images can be face landmarks, edge outlines, or depth estimates, which the model uses as additional context when generating images.

When adding a plugin model to the foundation model, also configure the plugin options. The Face landmark plugin uses faceConditionOptions, the Canny edge plugin uses edgeConditionOptions, and the Depth plugin uses depthConditionOptions.

Canny edge options

Configure the following options in edgeConditionOptions.

Option Name | Description | Value Range | Default Value
threshold1 | First threshold for the hysteresis procedure. | Float | 100
threshold2 | Second threshold for the hysteresis procedure. | Float | 200
apertureSize | Aperture size for the Sobel operator. Typical range is 3 to 7. | Integer | 3
l2Gradient | Whether the L2 norm is used to calculate the image gradient magnitude, instead of the default L1 norm. | BOOLEAN | False
EdgePluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

For more information on how these configuration options work, see Canny edge detector.
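
As a rough illustration of what the two thresholds and the l2Gradient flag control, the sketch below implements only the hysteresis step and the two gradient norms in plain Python. It is a simplified teaching example, not the Image Generator's implementation: it skips the Sobel filtering and non-maximum suppression stages, and the function names are hypothetical. It assumes, as in common Canny implementations, that the smaller threshold is used for edge linking and the larger one for strong edges.

```python
import math


def gradient_magnitude(gx, gy, l2_gradient=False):
    """L1 norm |gx| + |gy| by default; L2 norm sqrt(gx^2 + gy^2) if requested."""
    return math.hypot(gx, gy) if l2_gradient else abs(gx) + abs(gy)


def hysteresis(magnitudes, threshold1=100.0, threshold2=200.0):
    """Mark edge pixels in a 2D grid of gradient magnitudes."""
    low, high = sorted((threshold1, threshold2))
    rows, cols = len(magnitudes), len(magnitudes[0])
    edges = [[False] * cols for _ in range(rows)]
    # Pixels above the high threshold are strong edges and seed the search.
    stack = [(r, c) for r in range(rows) for c in range(cols)
             if magnitudes[r][c] > high]
    for r, c in stack:
        edges[r][c] = True
    # Weak pixels (above the low threshold) are kept only if they are
    # connected to a strong edge through their 8-neighborhood.
    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc] and magnitudes[nr][nc] > low):
                    edges[nr][nc] = True
                    stack.append((nr, nc))
    return edges


row = hysteresis([[250, 150, 50, 150]])
```

In this example the second pixel is kept because it touches a strong edge, while the last pixel has the same magnitude but is dropped because nothing links it to one.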

Face landmark options

Configure the following options in faceConditionOptions.

Option Name | Description | Value Range | Default Value
minFaceDetectionConfidence | The minimum confidence score for the face detection to be considered successful. | Float [0.0,1.0] | 0.5
minFacePresenceConfidence | The minimum confidence score for face presence in the face landmark detection. | Float [0.0,1.0] | 0.5
faceModelBaseOptions | The BaseOptions object that sets the path for the model that creates the condition image. | BaseOptions object | N/A
FacePluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

For more information on how these configuration options work, see the Face Landmarker task.
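
The value ranges and defaults for the two confidence options can be checked before they are passed to the task. The sketch below is a hypothetical helper, not part of the MediaPipe API; the class name FaceConditionSettings and its snake_case field names are assumptions that simply mirror the two documented options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceConditionSettings:
    """Hypothetical container mirroring the two documented confidence options."""
    min_face_detection_confidence: float = 0.5  # documented default
    min_face_presence_confidence: float = 0.5   # documented default

    def __post_init__(self):
        # Both options are documented as floats in the range [0.0, 1.0].
        for name in ("min_face_detection_confidence",
                     "min_face_presence_confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")


settings = FaceConditionSettings(min_face_detection_confidence=0.7)
```

Validating early like this surfaces an out-of-range value as a clear error in your own code rather than at task creation time.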

Depth options

Configure the following options in depthConditionOptions.

Option Name | Description | Value Range | Default Value
depthModelBaseOptions | The BaseOptions object that sets the path for the model that creates the condition image. | BaseOptions object | N/A
depthPluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

Models

The Image Generator requires a foundation model, which is a text-to-image AI model that uses diffusion techniques to generate new images. The foundation models listed in this section are lightweight models optimized to run on high-end smartphones.

Plugin models are optional and complement the foundation models, enabling users to provide an additional condition image along with a text prompt for more specific image generation. Customizing the foundation models with LoRA weights is another option: it teaches the foundation model about a specific concept, such as an object, person, or style, so that the model can inject that concept into generated images.

Foundation models

The foundation models are latent text-to-image diffusion models that generate images from a text prompt. The Image Generator requires that the foundation model match the runwayml/stable-diffusion-v1-5 EMA-only model format, based on the following model:

The following foundation models are also compatible with the Image Generator:

After downloading a foundation model, use the image_generator_converter to convert the model into the appropriate on-device format for the Image Generator.

Install the necessary dependencies:

$ pip install torch typing_extensions numpy Pillow requests pytorch_lightning absl-py

Run the convert.py script:

$ python3 convert.py --ckpt_path <ckpt_path> --output_path <output_path>
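
If you convert several checkpoints, the command above can be assembled from a script. The following is a minimal sketch: the helper name build_convert_command and the example paths model.ckpt and converted_model are placeholders, and only the --ckpt_path and --output_path flags shown above are assumed to exist.

```python
import shlex


def build_convert_command(ckpt_path, output_path, script="convert.py"):
    """Return the argument list for the conversion command shown above."""
    if not ckpt_path or not output_path:
        raise ValueError("both ckpt_path and output_path are required")
    return ["python3", script,
            "--ckpt_path", ckpt_path,
            "--output_path", output_path]


# Placeholder paths; replace them with your own checkpoint and output directory.
command = build_convert_command("model.ckpt", "converted_model")
print(shlex.join(command))
```

Passing the returned list to subprocess.run(command, check=True) would execute the conversion. Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces.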

Plugin models

The plugin models in this section are developed by Google and must be used in combination with a foundation model. Plugin models enable Image Generator to accept a condition image along with a text prompt as input, which lets you control the structure of generated images. The plugin models provide capabilities similar to ControlNet, with a novel architecture specifically for on-device diffusion.

The plugin models must be specified in the base options and may require you to download additional model files. Each plugin has unique requirements for the condition image, which can be generated by the Image Generator.

Canny Edge plugin

The Canny Edge plugin accepts a condition image that outlines the intended edges of the generated image. The foundation model uses the edges implied by the condition image, and generates a new image based on the text prompt. The Image Generator has built-in capabilities to create this condition image, so you only need to download the plugin model.

Download Canny Edge plugin

The Canny Edge plugin contains the following configuration options:

Option Name | Description | Value Range | Default Value
threshold1 | First threshold for the hysteresis procedure. | Float | 100
threshold2 | Second threshold for the hysteresis procedure. | Float | 200
apertureSize | Aperture size for the Sobel operator. Typical range is 3 to 7. | Integer | 3
l2Gradient | Whether the L2 norm is used to calculate the image gradient magnitude, instead of the default L1 norm. | BOOLEAN | False
EdgePluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

For more information on how these configuration options work, see Canny edge detector.

Face Landmark plugin

The Face Landmark plugin accepts the output from the MediaPipe Face Landmarker as the condition image. The Face Landmarker provides a detailed face mesh of a single face, which maps the presence and location of facial features. The foundation model uses the facial mapping implied by the condition image, and generates a new face over the mesh.

Download Face landmark plugin

The Face Landmark plugin also requires the Face Landmarker model bundle to create the condition image. This is the same model bundle used by the Face Landmarker task.

Download Face landmark model bundle

The Face Landmark plugin contains the following configuration options:

Option Name | Description | Value Range | Default Value
minFaceDetectionConfidence | The minimum confidence score for the face detection to be considered successful. | Float [0.0,1.0] | 0.5
minFacePresenceConfidence | The minimum confidence score for face presence in the face landmark detection. | Float [0.0,1.0] | 0.5
faceModelBaseOptions | The BaseOptions object that sets the path for the model that creates the condition image. | BaseOptions object | N/A
FacePluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

For more information on how these configuration options work, see the Face Landmarker task.

Depth plugin

The Depth plugin accepts a condition image that specifies the monocular depth of an object. The foundation model uses the condition image to infer the size and depth of the object to be generated, and generates a new image based on the text prompt.

Download Depth plugin

The Depth plugin also requires a Depth estimation model to create the condition image.

Download Depth estimation model

The Depth plugin contains the following configuration options:

Option Name | Description | Value Range | Default Value
depthModelBaseOptions | The BaseOptions object that sets the path for the model that creates the condition image. | BaseOptions object | N/A
depthPluginModelBaseOptions | The BaseOptions object that sets the path for the plugin model. | BaseOptions object | N/A

Customization with LoRA

Customizing a model with LoRA can enable the Image Generator to generate images based on specific concepts, which are identified by unique tokens during training. With the new LoRA weights after training, the model is able to generate images of the new concept when the token is specified in the text prompt.

Creating LoRA weights requires training a foundation model on images of a specific object, person, or style, which enables the model to recognize the new concept and apply it when generating images. If you are creating LoRA weights to generate images of specific people and faces, use this solution only on your own face or the faces of people who have given you permission to do so.

Below is the output from a customized model trained on images of teapots from the DreamBooth dataset, using the token "monadikos teapot":

Prompt: a monadikos teapot beside a mirror

The customized model received the token in the prompt and injected a teapot that it learned to depict from the LoRA weights, placing it in the image beside a mirror as requested in the prompt.
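
For readers unfamiliar with what LoRA weights are, the sketch below shows the general low-rank adaptation idea in plain Python: a weight matrix is adjusted by adding the product of two much smaller matrices. This guide does not describe how the Image Generator applies LoRA weights internally, so treat this as background on the technique only; the function names, the tiny matrices, and the scale parameter are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)  # low-rank update with the same shape as weight
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1, 0], [0, 1]]  # 2x2 base weights
lora_b = [[1], [2]]      # 2x1 factor (rank 1)
lora_a = [[3, 4]]        # 1x2 factor (rank 1)
adapted = apply_lora(base, lora_b, lora_a)
```

Because only the two small factors are trained and stored, the LoRA weights file can be much smaller than the foundation model it customizes.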

LoRA with Vertex AI

For more information, see the customization guide, which uses Model Garden on Vertex AI to customize a model by applying LoRA weights to a foundation model.