This page contains Image Models glossary terms.

## A

## augmented reality

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

## B

## bounding box

In an image, the (*x*, *y*) coordinates of a rectangle around an area of
interest, such as a dog in a photograph.

## C

## convolution

In mathematics, casually speaking, a mixture of two functions. In machine
learning, a convolution mixes the convolutional filter and the input matrix
in order to train **weights**.

The term "convolution" in machine learning is often a shorthand way of
referring to either **convolutional operation**
or **convolutional layer**.

Without convolutions, a machine learning algorithm would have to learn
a separate weight for every cell in a large **tensor**. For example,
a machine learning algorithm training on 2K x 2K images would be forced to
find 4M separate weights. Thanks to convolutions, a machine learning
algorithm only has to find weights for every cell in the
**convolutional filter**, dramatically reducing
the memory needed to train the model. When the convolutional filter is
applied, it is simply replicated across cells such that each is multiplied
by the filter.
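The arithmetic behind that comparison can be sketched as follows. This is only an illustration: it assumes "2K x 2K" means 2,000 x 2,000 single-channel pixels, and the 3x3 filter size is a hypothetical choice, not something stated above.

```python
# Assumption: "2K x 2K" is read as 2,000 x 2,000 pixels with one channel.
image_height, image_width = 2000, 2000

# Without convolutions: one separate weight for every cell of the input.
weights_without_convolution = image_height * image_width

# With convolutions: one weight per cell of the filter.
# The 3x3 filter size is a hypothetical choice for illustration.
filter_height, filter_width = 3, 3
weights_with_convolution = filter_height * filter_width

print(weights_without_convolution)  # 4000000
print(weights_with_convolution)     # 9
```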

## convolutional filter

One of the two actors in a
**convolutional operation**. (The other actor
is a slice of an input matrix.) A convolutional filter is a matrix having
the same **rank** as the input matrix, but a smaller shape.
For example, given a 28x28 input matrix, the filter could be any 2D matrix
smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are
typically set to a constant pattern of ones and zeroes. In machine learning,
convolutional filters are typically seeded with random numbers and then the
network **trains** the ideal values.

## convolutional layer

A layer of a **deep neural network** in which a
**convolutional filter** passes along an input
matrix. For example, consider a 3x3
**convolutional filter** applied to a 5x5 input matrix.

The resulting convolutional layer consists of 9 convolutional operations. Each convolutional operation works on a different 3x3 slice of the input matrix, and the resulting 3x3 matrix consists of the results of the 9 convolutional operations.
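As a rough sketch of how a 3x3 filter passing along a 5x5 input matrix yields a 3x3 result, the pure-Python code below slides the filter one cell at a time with no padding. The matrix and filter values are made up for illustration, and the function names `convolve_slice` and `convolutional_layer` are hypothetical.

```python
def convolve_slice(matrix_slice, conv_filter):
    """Multiply a slice by the filter element-wise, then sum the products."""
    return sum(
        s * f
        for slice_row, filter_row in zip(matrix_slice, conv_filter)
        for s, f in zip(slice_row, filter_row)
    )


def convolutional_layer(input_matrix, conv_filter):
    """Apply the filter to every slice (stride of 1, no padding)."""
    fh, fw = len(conv_filter), len(conv_filter[0])
    out_h = len(input_matrix) - fh + 1
    out_w = len(input_matrix[0]) - fw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            matrix_slice = [r[left:left + fw] for r in input_matrix[top:top + fh]]
            row.append(convolve_slice(matrix_slice, conv_filter))
        output.append(row)
    return output


# Illustrative values (assumed, not taken from the original figures).
input_matrix = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
]
conv_filter = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

feature_map = convolutional_layer(input_matrix, conv_filter)
print(feature_map)  # [[28, 32, 36], [48, 52, 56], [68, 72, 76]]
```

The 9 entries of `feature_map` correspond to the 9 convolutional operations, one per 3x3 slice.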

## convolutional neural network

A **neural network** in which at least one layer is a
**convolutional layer**. A typical convolutional
neural network consists of some combination of the following layers:

- convolutional layers
- pooling layers
- dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

## convolutional operation

The following two-step mathematical operation:

- Element-wise multiplication of the **convolutional filter** and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
- Summation of all the values in the resulting product matrix.

For example, consider a 5x5 input matrix and a 2x2 convolutional filter. Each convolutional operation involves a single 2x2 slice of the input matrix, such as the 2x2 slice at the top-left of the input matrix. The convolutional operation multiplies each cell of that slice by the corresponding cell of the filter and then sums the four products.

A **convolutional layer** consists of a
series of convolutional operations, each acting on a different slice
of the input matrix.
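The two steps can be traced on a single slice with a few lines of Python. The input and filter values here are illustrative assumptions, chosen only to make the arithmetic easy to follow.

```python
# Illustrative values (assumed for this sketch).
input_matrix = [
    [128, 97, 53, 201, 198],
    [35, 22, 25, 200, 195],
    [37, 24, 28, 197, 182],
    [33, 28, 92, 195, 179],
    [31, 40, 100, 192, 177],
]
conv_filter = [
    [1, 0],
    [0, 1],
]

# The 2x2 slice at the top-left of the input matrix.
top_left_slice = [row[0:2] for row in input_matrix[0:2]]

# Step 1: element-wise multiplication of the filter and the slice.
product_matrix = [
    [s * f for s, f in zip(slice_row, filter_row)]
    for slice_row, filter_row in zip(top_left_slice, conv_filter)
]

# Step 2: summation of all the values in the product matrix.
result = sum(sum(row) for row in product_matrix)

print(product_matrix)  # [[128, 0], [0, 22]]
print(result)          # 150
```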

## D

## data augmentation

Artificially boosting the range and number of
**training** examples
by transforming existing
**examples** to create additional examples. For example,
suppose images are one of your
**features**, but your dataset doesn't
contain enough image examples for the model to learn useful associations.
Ideally, you'd add enough
**labeled** images to your dataset to
enable your model to train properly. If that's not possible, data augmentation
can rotate, stretch, and reflect each image to produce many variants of the
original picture, possibly yielding enough labeled data to enable excellent
training.
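A minimal sketch of the rotate, stretch, and reflect transformations, assuming an image is represented as a list of pixel rows. The helper names and the tiny 2x2 "image" are hypothetical. A real pipeline would typically rely on an image library instead.

```python
def rotate_90_clockwise(image):
    """Rotate a 2D image (list of pixel rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def reflect_horizontally(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def stretch_horizontally(image, factor=2):
    """Widen the image by repeating each pixel `factor` times."""
    return [[pixel for pixel in row for _ in range(factor)] for row in image]


def augment(image, label):
    """Return the original plus three variants, all sharing the same label."""
    variants = [
        image,
        rotate_90_clockwise(image),
        reflect_horizontally(image),
        stretch_horizontally(image),
    ]
    return [(variant, label) for variant in variants]


# Hypothetical 2x2 grayscale image and label.
image = [[1, 2], [3, 4]]
augmented_examples = augment(image, "cat")
print(len(augmented_examples))  # 4
```

Each variant keeps the original label, which is what turns one labeled example into several.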

## depthwise separable convolutional neural network (sepCNN)

A **convolutional neural network**
architecture based on
Inception,
but where Inception modules are replaced with depthwise separable
convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.
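One way to see the efficiency gain is to count weights. The sketch below compares a standard convolution with its depthwise separable factorization, ignoring bias terms. The kernel size and channel counts are hypothetical values chosen for illustration.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (biases ignored)."""
    # Depthwise step: one n x n x 1 filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: out_channels filters of shape 1 x 1 x in_channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_weights(3, 64, 128)
separable = separable_conv_weights(3, 64, 128)
print(standard)   # 73728
print(separable)  # 8768
```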

## downsampling

Overloaded term that can mean either of the following:

- Reducing the amount of information in a **feature** in order to **train** a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
- Training on a disproportionately low percentage of over-represented **class** examples in order to improve model training on under-represented classes. For example, in a **class-imbalanced dataset**, models tend to learn a lot about the **majority class** and not enough about the **minority class**. Downsampling helps balance the amount of training on the majority and minority classes.
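Both meanings can be sketched in a few lines of Python. The function names, the block-averaging strategy, and the example data are all assumptions made for illustration. Other ways of lowering resolution or rebalancing classes are equally valid.

```python
import random


def downsample_image(image, factor):
    """Lower the resolution by averaging non-overlapping factor x factor blocks."""
    output = []
    for top in range(0, len(image) - factor + 1, factor):
        row = []
        for left in range(0, len(image[0]) - factor + 1, factor):
            block = [
                image[r][c]
                for r in range(top, top + factor)
                for c in range(left, left + factor)
            ]
            row.append(sum(block) / len(block))
        output.append(row)
    return output


def downsample_majority(examples, majority_label, keep_fraction, seed=0):
    """Keep only a random fraction of the majority-class examples."""
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]
    kept = rng.sample(majority, round(len(majority) * keep_fraction))
    return kept + minority


# Meaning 1: a hypothetical 4x4 image reduced to 2x2.
image = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [10, 10, 20, 20],
    [10, 10, 20, 20],
]
small_image = downsample_image(image, 2)
print(small_image)  # [[2.0, 6.0], [10.0, 20.0]]

# Meaning 2: a hypothetical dataset of 100 negatives and 10 positives.
examples = [(i, "negative") for i in range(100)] + [(i, "positive") for i in range(10)]
balanced = downsample_majority(examples, "negative", keep_fraction=0.2)
print(len(balanced))  # 30
```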

## I

## image recognition

A process that classifies object(s), pattern(s), or concept(s) in an image.
Image recognition is also known as **image classification**.

For more information, see ML Practicum: Image Classification.

## intersection over union (IoU)

The intersection of two sets divided by their union. In machine-learning
image-detection tasks, IoU is used to measure the accuracy of the model’s
predicted **bounding box** with respect to the
**ground-truth** bounding box. In this case, the IoU for the
two boxes is the ratio between the overlapping area and the total area, and
its value ranges from 0 (no overlap of predicted bounding box and ground-truth
bounding box) to 1 (predicted bounding box and ground-truth bounding box have
the exact same coordinates).

For example, in the image below:

- The predicted bounding box (the coordinates delimiting where the model predicts the night table in the painting is located) is outlined in purple.
- The ground-truth bounding box (the coordinates delimiting where the night table in the painting is actually located) is outlined in green.

Here, the intersection of the bounding boxes for prediction and ground truth (below left) is 1, and the union of the bounding boxes for prediction and ground truth (below right) is 7, so the IoU is \(\frac{1}{7}\).
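The same ratio can be computed directly from box coordinates. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples. That representation and the example coordinates are assumptions, chosen so that the intersection is 1 and the union is 7.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at 0 when the boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: two 2x2 squares overlapping in a 1x1 region.
predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
iou = intersection_over_union(predicted, ground_truth)
print(iou)  # 0.14285714285714285
```

Identical boxes give an IoU of 1, and boxes that do not overlap give 0, matching the range described above.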

## K

## keypoints

The coordinates of particular features in an image. For example, for an
**image recognition** model that distinguishes
flower species, keypoints might be the center of each petal, the stem,
the stamen, and so on.

## L

## landmarks

Synonym for **keypoints**.

## M

## MNIST

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits.

## P

## pooling

Reducing a matrix (or matrices) created by an earlier
**convolutional layer** to a smaller matrix.
Pooling usually involves taking either the maximum or average value
across the pooled area.

A pooling operation, just like a convolutional operation, divides the
matrix into slices and then slides the pooling operation by
**strides**. For example, suppose the pooling operation
divides a 3x3 matrix into 2x2 slices with a 1x1 stride.
Four pooling operations then take place, one per slice. If each pooling
operation picks the maximum value of the four in its slice, the result
is a 2x2 matrix of maxima.
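A minimal max-pooling sketch for this case, using 2x2 slices with a 1x1 stride on a 3x3 matrix. The matrix values and the function name `max_pool` are illustrative assumptions.

```python
def max_pool(matrix, pool_size, stride):
    """Take the maximum of each pool_size x pool_size slice of the matrix."""
    output = []
    for top in range(0, len(matrix) - pool_size + 1, stride):
        row = []
        for left in range(0, len(matrix[0]) - pool_size + 1, stride):
            window = [
                matrix[r][c]
                for r in range(top, top + pool_size)
                for c in range(left, left + pool_size)
            ]
            row.append(max(window))
        output.append(row)
    return output


# Illustrative 3x3 matrix (assumed values).
matrix = [
    [5, 3, 1],
    [8, 2, 5],
    [9, 4, 3],
]
pooled = max_pool(matrix, pool_size=2, stride=1)
print(pooled)  # [[8, 5], [9, 5]]
```

Swapping `max(window)` for `sum(window) / len(window)` would give average pooling instead.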

Pooling helps enforce
**translational invariance** in the input matrix.

Pooling for vision applications is known more formally as **spatial pooling**.
Time-series applications usually refer to pooling as **temporal pooling**.
Less formally, pooling is often called **subsampling** or **downsampling**.

## R

## rotational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.

See also **translational invariance** and
**size invariance**.

## S

## size invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also **translational invariance** and
**rotational invariance**.

## spatial pooling

See **pooling**.

## stride

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, with a (1,1) stride during a convolutional operation, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.
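To make the traversal order concrete, the sketch below lists the top-left corner of each input slice for a given two-dimensional stride. The function name `slice_origins` and the 5x5 input with 3x3 slices are assumptions for illustration.

```python
def slice_origins(input_height, input_width, slice_height, slice_width, stride):
    """List the (row, column) of the top-left cell of each input slice."""
    stride_rows, stride_cols = stride
    return [
        (top, left)
        for top in range(0, input_height - slice_height + 1, stride_rows)
        for left in range(0, input_width - slice_width + 1, stride_cols)
    ]


# Hypothetical 5x5 input with 3x3 slices.
origins_stride_1 = slice_origins(5, 5, 3, 3, stride=(1, 1))
origins_stride_2 = slice_origins(5, 5, 3, 3, stride=(2, 2))

print(origins_stride_1[:4])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
print(origins_stride_2)      # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

With a (1,1) stride the fourth slice wraps back to the left edge one row down. A (2,2) stride skips every other position and so visits fewer slices.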

## subsampling

See **pooling**.

## T

## translational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also **size invariance** and
**rotational invariance**.