Adversarial Testing for Generative AI

Adversarial testing is a method for systematically evaluating an ML model with the intent of learning how it behaves when provided with malicious or inadvertently harmful input. This guide describes an example adversarial testing workflow for generative AI.

What is adversarial testing?

Testing is a critical part of building robust and safe AI applications. Adversarial testing involves proactively trying to "break" an application by providing it with data most likely to elicit problematic output. Adversarial queries are likely to cause a model to fail in an unsafe manner (i.e., safety policy violations), and might cause errors that are easy for humans to identify, but difficult for machines to recognize.

Queries may be “adversarial” in different ways. Explicitly adversarial queries may contain policy-violating language or express policy-violating points of view, or may probe or attempt to “trick” the model into saying something unsafe, harmful, or offensive. Implicitly adversarial queries may seem innocuous but can contain sensitive topics that are contentious, culturally sensitive, or potentially harmful. These might include information on demographics, health, finance, or religion.

Adversarial testing can help teams improve models and products by exposing current failures and guiding mitigation pathways, such as fine-tuning, model safeguards, or filters. It can also inform product launch decisions by measuring risks that may remain unmitigated, such as the likelihood that the model will output policy-violating content.

Adversarial testing is an emerging best practice for responsible AI. This guide provides an example workflow for adversarial testing of generative models and systems.

Adversarial testing example workflow

Adversarial testing follows a workflow that is similar to standard model evaluation.

Identify and define inputs

The first step in the adversarial testing workflow is determining inputs to learn how a system behaves when intentionally and systematically attacked. Thoughtful inputs can directly influence the efficacy of the testing workflow. The following inputs can help define the scope and objectives of an adversarial test:

  • Product policy and failure modes
  • Use cases
  • Diversity requirements

Product policy and failure modes

Generative AI products should define safety policies that describe product behavior and model outputs that are not allowed (i.e., are considered "unsafe"). The policy should enumerate failure modes that would be considered policy violations. This list of failure modes should be used as the basis for adversarial testing. Some example failure modes might include content that contains profane language, or financial, legal, or medical advice.
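To make the failure-mode list concrete for test planning, one minimal sketch is to capture it as a simple mapping from policy to failure modes. The policy names and failure modes below are illustrative placeholders, not a canonical taxonomy:

```python
# Hypothetical safety policy taxonomy used as the basis for adversarial testing.
# Policy names and failure modes are illustrative examples only.
SAFETY_POLICIES = {
    "hate_speech": ["slurs", "dehumanizing language", "calls for exclusion"],
    "dangerous_content": ["self-harm instructions", "weapon construction"],
    "regulated_advice": ["financial advice", "legal advice", "medical advice"],
    "profanity": ["profane or obscene language"],
}

def all_failure_modes(policies):
    """Flatten the taxonomy into (policy, failure_mode) pairs for test planning."""
    return [(policy, mode) for policy, modes in policies.items() for mode in modes]
```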

Use cases

Another important input to adversarial testing is the use case(s) the generative model or product seeks to serve, so that the test data represents the ways users will interact with the product in the real world. Every generative product has slightly different use cases, but some common ones include fact finding, summarization, and code generation for language models; or, for image models, generation of backgrounds by geography or terrain, or of a particular art or clothing style.

Diversity requirements

Adversarial test datasets should be sufficiently diverse and representative with respect to all target failure modes and use cases. Measuring diversity of test datasets helps identify potential biases and ensures that models are tested extensively with a diverse user population in mind.

Three ways of thinking about diversity include:

  • Lexical diversity: ensure that queries have a range of different lengths (e.g., word count), use a broad vocabulary range, do not contain duplicates, and represent different query formulations (e.g., wh-questions, direct and indirect requests).
  • Semantic diversity: ensure that queries cover a broad range of different topics per policy (e.g., diabetes for health) including sensitive and identity based characteristics (e.g., gender, ethnicity), across different use cases and global contexts.
  • Policy and use case diversity: ensure that queries cover all policy violations (e.g., hate speech) and use cases (e.g., expert advice).
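As an illustration, some of the lexical diversity checks above can be approximated with basic string processing. This is a rough sketch that assumes queries are plain strings and uses naive whitespace tokenization; a real analysis would likely use a proper tokenizer and language-aware normalization:

```python
def lexical_diversity_report(queries):
    """Rough lexical diversity metrics for a set of adversarial queries.

    Uses naive whitespace tokenization and lowercasing; intended only as a
    quick check for duplicates, length spread, and vocabulary breadth.
    """
    lengths = [len(q.split()) for q in queries]
    tokens = [t.lower() for q in queries for t in q.split()]
    unique_queries = len(set(q.strip().lower() for q in queries))
    return {
        "num_queries": len(queries),
        "duplicate_rate": 1 - unique_queries / len(queries),
        "mean_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "type_token_ratio": len(set(tokens)) / len(tokens),  # vocabulary breadth
    }

# Example with a toy query set.
print(lexical_diversity_report([
    "What should I invest my savings in?",
    "Is it safe to skip my prescribed medication?",
    "Write a joke about my coworker's accent.",
]))
```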

Find or create test dataset(s)

Test datasets for adversarial testing are constructed differently from standard model evaluation test sets. In standard model evaluations, test datasets are typically designed to accurately reflect the distribution of data that the model will encounter in production. For adversarial tests, test data is selected to elicit problematic output from the model by probing the model's behavior on out-of-distribution examples and edge cases that are relevant to safety policies. A high-quality adversarial test set should cover all safety policy dimensions, and maximize coverage of the use cases that the model is intended to support. It should be diverse lexically (e.g., including queries of various lengths and languages) and semantically (e.g., covering different topics and demographics).

Investigate existing test datasets for coverage of safety policies, failure modes, and use cases for text generation and text-to-image models. Teams can use existing datasets to establish a baseline of their products' performance, and then do deeper analyses on specific failure modes their products struggle with.

If existing test datasets are insufficient, teams can generate new data to target specific failure modes and use cases. One way to create new datasets is to start by manually creating a small dataset of queries (i.e., dozens of examples per category), and then expand on this "seed" dataset using data synthesis tools.
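One hedged sketch of that expansion step is below. The `generate_fn` callable is a placeholder for whatever data synthesis tool or text-generation model the team uses; it is assumed to take a prompt string and return generated text:

```python
def expand_seed_queries(seed_queries, generate_fn, variants_per_seed=5):
    """Expand a small seed set into a larger adversarial set via paraphrasing.

    `generate_fn` is a placeholder for the team's own model or API wrapper;
    it takes a prompt string and returns a generated string.
    """
    expanded = []
    for seed in seed_queries:
        prompt = (
            f"Rewrite the following query {variants_per_seed} different ways, "
            "varying tone, length, and phrasing while keeping the intent, "
            f"one rewrite per line:\n{seed}"
        )
        variants = generate_fn(prompt).splitlines()
        expanded.extend(v.strip() for v in variants if v.strip())

    # Deduplicate case-insensitively while preserving order.
    seen, unique = set(), []
    for query in expanded:
        key = query.lower()
        if key not in seen:
            seen.add(key)
            unique.append(query)
    return unique
```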

Seed datasets should contain examples that are as similar as possible to what the system would encounter in production, and created with the goal of eliciting a policy violation. Highly toxic language is likely to be detected by safety features, so consider creative phrasing and implicitly adversarial inputs.

You may use direct or indirect references to sensitive attributes (e.g., age, gender, race, religion) in your test dataset. Keep in mind that the usage of these terms may vary between cultures. Vary tone, sentence structure, length, word choice, and meaning. Examples where multiple labels (e.g., hate speech vs. obscenity) can apply may create noise and duplication, and might not be handled properly by evaluation or training systems.

Adversarial test sets should be analyzed to understand their composition in terms of lexical and semantic diversity, coverage across policy violations and use cases, and overall quality in terms of uniqueness, adversariality, and noise.
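As a rough sketch of such a composition analysis, assuming each test example has already been tagged with a target policy and use case (the "policy" and "use_case" field names are hypothetical; adapt them to your own dataset schema):

```python
from collections import Counter

def coverage_matrix(test_set):
    """Count queries per (policy, use_case) cell to expose coverage gaps."""
    return Counter((ex["policy"], ex["use_case"]) for ex in test_set)

def report_gaps(matrix, policies, use_cases, min_count=20):
    """Print cells that fall below a minimum number of queries."""
    for policy in policies:
        for use_case in use_cases:
            n = matrix.get((policy, use_case), 0)
            if n < min_count:
                print(f"Under-covered: {policy} x {use_case} ({n} queries)")
```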

Generate model outputs

The next step is to generate model outputs based on the test dataset. The results will inform product teams how their models might perform when exposed to malicious users or inadvertently harmful inputs. Identifying these system behaviors and response patterns provides baseline measurements of harms that can be mitigated in future model development.
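A minimal sketch of this step is shown below, assuming a `query_model` function that wraps whatever model or endpoint is under test; the function name and the output record fields are placeholders:

```python
import json

def run_adversarial_test(test_queries, query_model,
                         output_path="adversarial_outputs.jsonl"):
    """Send each adversarial query to the model and log the response.

    `query_model` is a placeholder wrapper around the system under test;
    it takes a query string and returns a response string.
    """
    with open(output_path, "w", encoding="utf-8") as f:
        for query in test_queries:
            try:
                response = query_model(query)
            except Exception as e:  # keep going if a single query fails
                response = f"<error: {e}>"
            f.write(json.dumps({"query": query, "response": response}) + "\n")
```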

Annotate outputs

Once outputs from adversarial testing are generated, annotate them to categorize them into failure modes and/or harms. These labels can help provide safety signals for text and image content. Moreover, the signals can help measure and mitigate harms across models and products.

Safety classifiers can be used to automatically annotate model outputs (or inputs) for policy violations. Accuracy may be low for signals that try to detect constructs that are not strictly defined, such as hate speech. For those signals, it is critical to use human raters to check and correct classifier-generated labels for which scores are "uncertain."
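A sketch of that triage logic, assuming a classifier that returns a violation score between 0 and 1 (the `classify_fn` callable and the thresholds are assumptions, not a specific classifier API):

```python
def annotate_outputs(records, classify_fn, low=0.3, high=0.7):
    """Label outputs automatically and route uncertain cases to human review.

    `classify_fn` stands in for any safety classifier that maps text to a
    violation score in [0, 1]; the thresholds are illustrative only.
    """
    auto_labeled, needs_human_review = [], []
    for record in records:
        score = classify_fn(record["response"])
        record = {**record, "violation_score": score}
        if score >= high:
            auto_labeled.append({**record, "label": "violation"})
        elif score <= low:
            auto_labeled.append({**record, "label": "non_violation"})
        else:
            needs_human_review.append(record)  # "uncertain" band goes to raters
    return auto_labeled, needs_human_review
```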

In addition to automatic annotation, you can also leverage human raters to annotate a sample of your data. It is important to note that annotating model outputs as part of adversarial testing necessarily involves looking at troubling and potentially harmful text or images, similar to manual content moderation. Additionally, human raters may annotate the same content differently based on their personal background, knowledge or beliefs. It can be helpful to develop guidelines or templates for raters, keeping in mind that the diversity of your rater pool could influence the annotation results.

Report and mitigate

The final step is to summarize test results in a report. Compute metrics and report results to provide safety rates, visualizations, and examples of problematic failures. These results can guide model improvements and inform model safeguards, such as filters or blocklists. Reports are also important for communication with stakeholders and decision makers.
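For instance, a per-policy safety rate could be summarized with a sketch like the one below, assuming each annotated record carries the hypothetical "policy" and "label" fields used earlier:

```python
from collections import defaultdict

def safety_rates(annotated_records):
    """Compute the fraction of non-violating responses per policy."""
    totals = defaultdict(int)
    violations = defaultdict(int)
    for record in annotated_records:
        policy = record["policy"]
        totals[policy] += 1
        if record["label"] == "violation":
            violations[policy] += 1
    return {policy: 1 - violations[policy] / totals[policy] for policy in totals}
```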

Additional Resources

Google's AI Red Team: the ethical hackers making AI safer

Red Teaming Language Models with Language Models

Product Fairness testing for Machine Learning developers (video)

Product Fairness Testing for Developers (Codelab)