Machine Learning Glossary: Language Evaluation

Stay organized with collections Save and categorize content based on your preferences.

This page contains Language Evaluation glossary terms. For all glossary terms, click here.




Any of a wide range of neural network architecture mechanisms that aggregate information from a set of inputs in a data-dependent manner. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.


bag of words


A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

  • the dog jumps
  • jumps the dog
  • dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indices corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

  • A 1 to indicate the presence of a word.
  • A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
  • Some other value, such as the logarithm of the count of the number of times a word appears in the bag.

BERT (Bidirectional Encoder Representations from Transformers)


A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics:

BERT's variants include:

See Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing for an overview of BERT.



An N-gram in which N=2.



A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word(s) representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words "What", "is", and "the". In contrast, a bidirectional language model could also gain context from "with" and "you", which might help the model generate better predictions.

bidirectional language model


A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.

BLEU (Bilingual Evaluation Understudy)


A score between 0.0 and 1.0, inclusive, indicating the quality of a translation between two human languages (for example, between English and Russian). A BLEU score of 1.0 indicates a perfect translation; a BLEU score of 0.0 indicates a terrible translation.


causal language model


Synonym for unidirectional language model.

See bidirectional language model to contrast different directional approaches in language modeling.

crash blossom


A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.




In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder.

In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.



A common approach to self-supervised learning in which:

  1. Noise is artificially added to the dataset.
  2. The model tries to remove the noise.

Denoising enables learning from unlabeled examples. The original dataset serves as the target or label and the noisy data as the input.

Some masked language models use denoising as follows:

  1. Noise is artificially added to an unlabeled sentence by masking some of the tokens.
  2. The model tries to predict the original tokens.




A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English sentence in either of the following two ways:

  • As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the vector represents a separate English word; the value in a cell represents the number of times that word appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1) representing the number of times that word appeared in the sentence.
  • As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value between 0 and 1. This is an embedding.

In TensorFlow, embeddings are trained by backpropagating loss just like any other parameter in a neural network.

embedding space


The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks.

The dot product of two embeddings is a measure of their similarity.



In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder's output as the input to a classification or regression network.

In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

Refer to Transformer for the definition of an encoder in the Transformer architecture.


GPT (Generative Pre-trained Transformer)


A family of Transformer-based large language models developed by OpenAI.

GPT variants can apply to multiple modalities, including:

  • image generation (for example, ImageGPT)
  • text-to-image generation (for example, DALL-E).


LaMDA (Language Model for Dialogue Applications)


A Transformer-based large language model developed by Google trained on a large dialogue dataset that can generate realistic conversational responses.

LaMDA: our breakthrough conversation technology provides an overview.

language model


A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.

large language model


An informal term with no strict definition that usually means a language model that has a high number of parameters. Some large language models contain over 100 billion parameters.


masked language model


A language model that predicts the probability of candidate tokens to fill in blanks in a sequence. For instance, a masked language model can calculate probabilities for candidate word(s) to replace the underline in the following sentence:

The ____ in the hat came back.

The literature typically uses the string "MASK" instead of an underline. For example:

The "MASK" in the hat came back.

Most modern masked language models are bidirectional.



A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:

  • Improve/learn hand-engineered features (such as an initializer or an optimizer).
  • Be more data-efficient and compute-efficient.
  • Improve generalization.

Meta-learning is related to few-shot learning.



A high-level data category. For example, numbers, text, images, video, and audio are five different modalities.

model parallelism


A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.

See also data parallelism.

multi-head self-attention


An extension of self-attention that applies the self-attention mechanism multiple times for each position in the input sequence.

Transformers introduced multi-head self-attention.

multimodal model


A model whose inputs and/or outputs include more than one modality. For example, consider a model that takes both an image and a text caption (two modalities) as features, and outputs a score indicating how appropriate the text caption is for the image. So, this model's inputs are multimodal and the output is unimodal.


natural language understanding


Determining a user's intentions based on what the user typed or said. For example, a search engine uses natural language understanding to determine what the user is searching for based on what the user typed or said.



An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.

N Name(s) for this kind of N-gram Examples
2 bigram or 2-gram to go, go to, eat lunch, eat dinner
3 trigram or 3-gram ate too much, three blind mice, the bell tolls
4 4-gram walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind. An NLU model based on trigrams would likely predict that the user will next type mice.

Contrast N-grams with bag of words, which are unordered sets of words.



Abbreviation for natural language understanding.




A form of model parallelism in which a model's processing is divided into consecutive stages and each stage is executed on a different device. While a stage is processing one batch, the preceding stage can work on the next batch.

See also staged training.


self-attention (also called self-attention layer)


A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.

The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as “query”, “key”, and “value”.

A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word's final representation incorporates the representations of other words.

For example, consider the following sentence:

The animal didn't cross the street because it was too tired.

The following illustration (from Transformer: A Novel Neural Network Architecture for Language Understanding) shows a self-attention layer's attention pattern for the pronoun it, with the darkness of each line indicating how much each word contributes to the representation:

The following sentence appears twice: 'The animal didn't cross the
          street because it was too tired.'  Lines connect the word 'it' in
          one sentence to five tokens ('The', 'animal', 'street', 'it', and
          the period) in the other sentence.  The line between 'it' and
          'animal' is strongest.

The self-attention layer highlights words that are relevant to "it". In this case, the attention layer has learned to highlight words that it might refer to, assigning the highest weight to animal.

For a sequence of n tokens, self-attention transforms a sequence of embeddings n separate times, once at each position in the sequence.

Refer also to attention and multi-head self-attention.

sentiment analysis


Using statistical or machine learning algorithms to determine a group's overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding, an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

sequence-to-sequence task


A task that converts an input sequence of tokens to an output sequence of tokens. For example, two popular kinds of sequence-to-sequence tasks are:

  • Translators:
    • Sample input sequence: "I love you."
    • Sample output sequence: "Je t'aime."
  • Question answering:
    • Sample input sequence: "Do I need my car in New York City?"
    • Sample output sequence: "No. Please keep your car at home."

staged training


A tactic of training a model in a sequence of discrete stages. The goal can be either to speed up the training process, or to achieve better model quality.

An illustration of the progressive stacking approach is shown below:

  • Stage 1 contains 3 hidden layers, stage 2 contains 6 hidden layers, and stage 3 contains 12 hidden layers.
  • Stage 2 begins training with the weights learned in the 3 hidden layers of Stage 1. Stage 3 begins training with the weights learned in the 6 hidden layers of Stage 2.

Three stages, which are labeled 'Stage 1', 'Stage 2', and 'Stage 3'.
          Each stage contains a different number of layers: Stage 1 contains
          3 layers, Stage 2 contains 6 layers, and Stage 3 contains 12 layers.
          The 3 layers from Stage 1 become the first 3 layers of Stage 2.
          Similarly, the 6 layers from Stage 2 become the first 6 layers of
          Stage 3.

See also pipelining.




In a language model, the atomic unit that the model is training on and making predictions on. A token is typically one of the following:

  • a word—for example, the phrase "dogs like cats" consists of three word tokens: "dogs", "like", and "cats".
  • a character—for example, the phrase "bike fish" consists of nine character tokens. (Note that the blank space counts as one of the tokens.)
  • subwords—in which a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. For example, a language model that uses subwords as tokens might view the word "dogs" as two tokens (the root word "dog" and the plural suffix "s"). That same language model might view the single word "taller" as two subwords (the root word "tall" and the suffix "er").

In domains outside of language models, tokens can represent other kinds of atomic units. For example, in computer vision, a token might be a subset of an image.



A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.

A Transformer can include any of the following:

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.



An N-gram in which N=3.




A system that only evaluates the text that precedes a target section of text. In contrast, a bidirectional system evaluates both the text that precedes and follows a target section of text. See bidirectional for more details.

unidirectional language model


A language model that bases its probabilities only on the tokens appearing before, not after, the target token(s). Contrast with bidirectional language model.


word embedding


Representing each word in a word set within an embedding; that is, representing each word as a vector of floating-point values between 0.0 and 1.0. Words with similar meanings have more-similar representations than words with different meanings. For example, carrots, celery, and cucumbers would all have relatively similar representations, which would be very different from the representations of airplane, sunglasses, and toothpaste.