Machine Learning Glossary: Image Models

This page contains Image Models glossary terms.

A

augmented reality

#image

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

B

bounding box

#image

In an image, the (x, y) coordinates of a rectangle around an area of interest, such as the dog in the image below.

Photograph of a dog sitting on a sofa. A green bounding box
          with top-left coordinates of (275, 1271) and bottom-right
          coordinates of (2954, 2761) circumscribes the dog's body
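
As a minimal sketch, a bounding box can be stored as its two corner coordinates, from which the width and height follow by subtraction. The `BoundingBox` class and its field names are hypothetical and chosen only for illustration; the coordinates are the ones given for the dog above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned rectangle given by its top-left and bottom-right corners.

    Hypothetical helper type; the field names are illustrative only.
    """
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    @property
    def width(self) -> int:
        return self.x_max - self.x_min

    @property
    def height(self) -> int:
        return self.y_max - self.y_min


# Corner coordinates of the box around the dog.
dog_box = BoundingBox(x_min=275, y_min=1271, x_max=2954, y_max=2761)
print(dog_box.width, dog_box.height)  # 2679 1490
```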

C

convolution

#image

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

convolutional filter

#image

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

convolutional layer

#image

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

A 3x3 matrix with the following values: [[0,1,0], [1,0,1], [0,1,0]]

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

An animation showing two matrices. The first matrix is the 5x5
          matrix: [[128,97,53,201,198], [35,22,25,200,195],
          [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]].
          The second matrix is the 3x3 matrix:
          [[181,303,618], [115,338,605], [189,351,660]].
          The second matrix is calculated by applying the convolutional
          filter [[0, 1, 0], [1, 0, 1], [0, 1, 0]] across
          different 3x3 subsets of the 5x5 matrix.
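
The sliding of the filter over the input can be sketched in a few lines of plain Python. This is only an illustration, assuming a (1,1) stride and no padding; the function name `convolve_2d` is hypothetical and not taken from any library.

```python
def convolve_2d(input_matrix, conv_filter):
    """Apply conv_filter to every slice of input_matrix.

    Assumes a (1,1) stride and no padding, so an R x C input and a
    K x L filter produce an (R-K+1) x (C-L+1) output.
    """
    f_rows, f_cols = len(conv_filter), len(conv_filter[0])
    out_rows = len(input_matrix) - f_rows + 1
    out_cols = len(input_matrix[0]) - f_cols + 1
    output = []
    for top in range(out_rows):
        output_row = []
        for left in range(out_cols):
            # One convolutional operation: multiply element-wise, then sum.
            total = 0
            for i in range(f_rows):
                for j in range(f_cols):
                    total += conv_filter[i][j] * input_matrix[top + i][left + j]
            output_row.append(total)
        output.append(output_row)
    return output


input_matrix = [
    [128, 97, 53, 201, 198],
    [35, 22, 25, 200, 195],
    [37, 24, 28, 197, 182],
    [33, 28, 92, 195, 179],
    [31, 40, 100, 192, 177],
]
conv_filter = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

feature_map = convolve_2d(input_matrix, conv_filter)
for row in feature_map:
    print(row)
```

Each of the 9 output cells comes from one convolutional operation on a different 3x3 slice.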

convolutional neural network

#image

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

  • convolutional layers
  • pooling layers
  • dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

convolutional operation

#image

The following two-step mathematical operation:

  1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
  2. Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

The 5x5 matrix: [[128,97,53,201,198], [35,22,25,200,195],
          [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]].

Now imagine the following 2x2 convolutional filter:

The 2x2 matrix: [[1, 0], [0, 1]]

Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:

Applying the convolutional filter [[1, 0], [0, 1]] to the top-left
          2x2 section of the input matrix, which is [[128,97], [35,22]].
          The convolutional filter leaves the 128 and 22 intact, but zeroes
          out the 97 and 35. Consequently, the convolution operation yields
          the value 150 (128+22).

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
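
The two steps can be written out directly. The following is a minimal sketch in plain Python; the function name `convolutional_operation` is hypothetical, and the values are the 2x2 filter and top-left slice from the example above.

```python
def convolutional_operation(conv_filter, input_slice):
    """Multiply the filter and the slice element-wise, then sum the products.

    Assumes conv_filter and input_slice have the same shape.
    """
    # Step 1: element-wise multiplication.
    products = [
        [f * x for f, x in zip(filter_row, slice_row)]
        for filter_row, slice_row in zip(conv_filter, input_slice)
    ]
    # Step 2: summation of all values in the product matrix.
    return sum(sum(row) for row in products)


conv_filter = [[1, 0], [0, 1]]
top_left_slice = [[128, 97], [35, 22]]

result = convolutional_operation(conv_filter, top_left_slice)
print(result)  # 150
```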

D

data augmentation

#image

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
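
As a rough sketch of the idea, the following plain-Python code produces rotated and reflected variants of a tiny image stored as a nested list. The function names are hypothetical, stretching is omitted for brevity, and a real pipeline would normally rely on an image library.

```python
def reflect_horizontally(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original image plus its rotated and reflected variants."""
    variants = [image]
    rotated = image
    for _ in range(3):  # 90, 180, and 270 degrees
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    reflected = [reflect_horizontally(v) for v in variants]
    return variants + reflected


image = [[1, 2],
         [3, 4]]
augmented = augment(image)
print(len(augmented))  # 8
```

Every variant keeps the label of the original example, which is what makes the extra examples usable as labeled training data.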

depthwise separable convolutional neural network (sepCNN)

#image

A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).
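
One way to see the efficiency gain is to count weights. The sketch below compares a standard convolution with its depthwise separable factorization; the function names and the channel counts (64 in, 128 out) are hypothetical, and bias terms are ignored.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a standard convolution: one n x n x in_channels filter per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a depthwise convolution followed by a pointwise convolution."""
    # Depthwise: one n x n x 1 filter per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise: one 1 x 1 x in_channels filter per output channel.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


print(standard_conv_weights(3, 64, 128))   # 73728
print(separable_conv_weights(3, 64, 128))  # 8768
```

Under these assumed sizes, the separable version needs roughly one-eighth as many weights.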

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.

downsampling

#image

Overloaded term that can mean either of the following:

  • Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
  • Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.
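
The second meaning can be sketched as follows, assuming each example is a `(features, label)` pair. The function name `downsample_majority`, the labels, and the 10% keep fraction are all hypothetical choices made for illustration.

```python
import random


def downsample_majority(examples, majority_label, keep_fraction, seed=0):
    """Keep every minority example but only a fraction of the majority class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    majority = [ex for ex in examples if ex[1] == majority_label]
    minority = [ex for ex in examples if ex[1] != majority_label]
    keep_count = round(len(majority) * keep_fraction)
    return minority + rng.sample(majority, keep_count)


# A class-imbalanced dataset: 1,000 negative examples and 10 positive examples.
examples = [(i, "negative") for i in range(1000)] + [(i, "positive") for i in range(10)]

downsampled = downsample_majority(examples, "negative", keep_fraction=0.1)
negatives = sum(1 for _, label in downsampled if label == "negative")
positives = sum(1 for _, label in downsampled if label == "positive")
print(negatives, positives)  # 100 10
```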

I

image recognition

#image

A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image classification.

For more information, see ML Practicum: Image Classification.

intersection over union (IoU)

#image

The intersection of two sets divided by their union. In machine-learning image-detection tasks, IoU is used to measure the accuracy of the model’s predicted bounding box with respect to the ground-truth bounding box. In this case, the IoU for the two boxes is the ratio between the overlapping area and the total area, and its value ranges from 0 (no overlap of predicted bounding box and ground-truth bounding box) to 1 (predicted bounding box and ground-truth bounding box have the exact same coordinates).

For example, in the image below:

  • The predicted bounding box (the coordinates delimiting where the model predicts the night table in the painting is located) is outlined in purple.
  • The ground-truth bounding box (the coordinates delimiting where the night table in the painting is actually located) is outlined in green.

The Van Gogh painting 'Vincent's Bedroom in Arles', with two different
          bounding boxes around the night table beside the bed. The ground-truth
          bounding box (in green) perfectly circumscribes the night table. The
          predicted bounding box (in purple) is offset 50% down and to the right
          of the ground-truth bounding box; it encloses the bottom-right quarter
          of the night table, but misses the rest of the table.

Here, the intersection of the bounding boxes for prediction and ground truth (below left) is 1, and the union of the bounding boxes for prediction and ground truth (below right) is 7, so the IoU is \(\frac{1}{7}\).

Same image as above, but with each bounding box divided into four
          quadrants. There are seven quadrants total, as the bottom-right
          quadrant of the ground-truth bounding box and the top-left
          quadrant of the predicted bounding box overlap each other. This
          overlapping section (highlighted in green) represents the
          intersection, and has an area of 1.

Same image as above, but with each bounding box divided into four
          quadrants. There are seven quadrants total, as the bottom-right
          quadrant of the ground-truth bounding box and the top-left
          quadrant of the predicted bounding box overlap each other.
          The entire interior enclosed by both bounding boxes
          (highlighted in green) represents the union, and has
          an area of 7.
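
The calculation can be sketched for axis-aligned boxes as follows. This is an illustrative sketch, assuming each box is an `(x_min, y_min, x_max, y_max)` tuple of integers with positive area; the function name is hypothetical. The two boxes reproduce the layout described above in quadrant-sized units, with the predicted box offset by one unit down and to the right.

```python
from fractions import Fraction


def intersection_over_union(box_a, box_b):
    """Return the IoU of two (x_min, y_min, x_max, y_max) boxes as an exact fraction."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap; zero if the boxes do not overlap.
    overlap_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    overlap_h = max(0, min(ay1, by1) - max(ay0, by0))
    intersection = overlap_w * overlap_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection  # count the overlap only once
    return Fraction(intersection, union)


ground_truth = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
print(intersection_over_union(ground_truth, predicted))  # 1/7
```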

K

keypoints

#image

The coordinates of particular features in an image. For example, for an image recognition model that distinguishes flower species, keypoints might be the center of each petal, the stem, the stamen, and so on.

L

landmarks

#image

Synonym for keypoints.

M

MNIST

#image

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 training images and 10,000 test images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits.

P

pooling

#image

Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:

The 3x3 matrix [[5,3,1], [8,2,5], [9,4,3]].

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides the pooling operation across the matrix by strides. For example, suppose the pooling operation divides the matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:

The input matrix is 3x3 with the values: [[5,3,1], [8,2,5], [9,4,3]].
          The top-left 2x2 submatrix of the input matrix is [[5,3], [8,2]], so
          the top-left pooling operation yields the value 8 (which is the
          maximum of 5, 3, 8, and 2). The top-right 2x2 submatrix of the input
          matrix is [[3,1], [2,5]], so the top-right pooling operation yields
          the value 5. The bottom-left 2x2 submatrix of the input matrix is
          [[8,2], [9,4]], so the bottom-left pooling operation yields the value
          9.  The bottom-right 2x2 submatrix of the input matrix is
          [[2,5], [4,3]], so the bottom-right pooling operation yields the value
          5.  In summary, the pooling operation yields the 2x2 matrix
          [[8,5], [9,5]].
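
The same max pooling example can be sketched in plain Python. The function name `max_pool_2d` and its parameters are hypothetical; the defaults match the 2x2 slices and 1x1 stride used above.

```python
def max_pool_2d(matrix, pool_size=2, stride=1):
    """Take the maximum of each pool_size x pool_size slice of matrix."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for top in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for left in range(0, cols - pool_size + 1, stride):
            # Collect the values in the current slice and keep the largest.
            window = [
                matrix[top + i][left + j]
                for i in range(pool_size)
                for j in range(pool_size)
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


pooled = max_pool_2d([[5, 3, 1], [8, 2, 5], [9, 4, 3]])
print(pooled)  # [[8, 5], [9, 5]]
```

Replacing `max` with an average would give average pooling instead.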

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.

R

rotational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.

See also translational invariance and size invariance.

S

size invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also translational invariance and rotational invariance.

spatial pooling

#image

See pooling.

stride

#image

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, the following animation demonstrates a (1,1) stride during a convolutional operation. Therefore, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

An input 5x5 matrix and a 3x3 convolutional filter. Because the
     stride is (1,1), a convolutional filter will be applied 9 times. The first
     convolutional slice evaluates the top-left 3x3 submatrix of the input
     matrix. The second slice evaluates the top-middle 3x3
     submatrix. The third convolutional slice evaluates the top-right 3x3
     submatrix.  The fourth slice evaluates the middle-left 3x3 submatrix.
     The fifth slice evaluates the middle 3x3 submatrix. The sixth slice
     evaluates the middle-right 3x3 submatrix. The seventh slice evaluates
     the bottom-left 3x3 submatrix.  The eighth slice evaluates the
     bottom-middle 3x3 submatrix. The ninth slice evaluates the bottom-right 3x3
     submatrix.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.
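
To make the effect of the stride concrete, the following sketch lists the top-left position of every input slice in the order it is visited. The function name `slice_origins` is hypothetical, and no padding is assumed.

```python
def slice_origins(input_shape, filter_shape, stride):
    """Return the (row, column) of the top-left cell of each input slice."""
    in_rows, in_cols = input_shape
    f_rows, f_cols = filter_shape
    stride_rows, stride_cols = stride
    return [
        (row, col)
        # Move down by stride_rows only after finishing a full row of slices.
        for row in range(0, in_rows - f_rows + 1, stride_rows)
        # Within a row, each slice starts stride_cols cells to the right.
        for col in range(0, in_cols - f_cols + 1, stride_cols)
    ]


origins = slice_origins((5, 5), (3, 3), (1, 1))
print(len(origins))  # 9
print(origins[:4])   # [(0, 0), (0, 1), (0, 2), (1, 0)]

# A larger stride skips positions, so fewer slices are evaluated.
print(len(slice_origins((5, 5), (3, 3), (2, 2))))  # 4
```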

subsampling

#image

See pooling.

T

translational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance.