Machine Learning Glossary: Clustering

This page contains Clustering glossary terms. For all glossary terms, click here.

A

agglomerative clustering

#clustering

C

centroid

#clustering

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

centroid-based clustering

#clustering

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

clustering

#clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:

A two-dimensional graph in which the x-axis is labeled tree width,
and the y-axis is labeled tree height. The graph contains two
centroids and several dozen data points. The data points are
categorized based on their proximity. That is, the data points
closest to one centroid are categorized as cluster 1, while those
closest to the other centroid are categorized as cluster 2.

A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated as follows:

Dozens of data points are arranged in concentric circles, almost
like holes around the center of a dart board. The innermost ring
of data points is categorized as cluster 1, the middle ring
is categorized as cluster 2, and the outermost ring as
cluster 3.

D

divisive clustering

#clustering

See hierarchical clustering.

H

hierarchical clustering

#clustering

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

K

k-means

#clustering

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

Iteratively determines the best k center points (known as centroids).
Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

A Cartesian plot with several dozen data points.

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

The same Cartesian plot as in the previous illustration, except
with three centroids added.
The previous data points are clustered into three distinct groups,
with each group representing the data points closest to a particular
centroid.

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations shows k-means for examples with only two features (height and width). Note that k-means can group examples across many features.

k-median

#clustering

A clustering algorithm closely related to k-means. The practical difference between the two is as follows:

In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:

$$ {\text{Euclidean distance}} = {\sqrt {(2-5)^2 + (2--2)^2}} = 5 $$

k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:

$$ {\text{Manhattan distance}} = \lvert 2-5 \rvert + \lvert 2--2 \rvert = 7 $$

S

similarity measure

#clustering

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

sketching

#clustering

In unsupervised machine learning, a category of algorithms that perform a preliminary similarity analysis on examples. Sketching algorithms use a locality-sensitive hash function to identify points that are likely to be similar, and then group them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

T

time series analysis

#clustering

A subfield of machine learning and statistics that analyzes temporal data. Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

U

unsupervised machine learning

#clustering

#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning.

Click the icon for additional notes.

Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.