Machine learning datasets can have millions of examples, but not all clustering algorithms scale efficiently. Many clustering algorithms compute the similarity between all pairs of examples, which means their runtime increases as the square of the number of examples \(n\), denoted as \(O(n^2)\) in complexity notation. \(O(n^2)\) algorithms are not practical for datasets with millions of examples.
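
To see where the quadratic cost comes from, here is a minimal sketch. The function name `pairwise_distances` is illustrative, not from any library; the nested loop makes the \(n(n-1)/2\) comparisons explicit:

```python
import numpy as np

def pairwise_distances(points):
    """Compute every pairwise Euclidean distance with a nested loop.

    For n points this performs n * (n - 1) / 2 distance computations,
    which is why all-pairs similarity methods scale as O(n^2).
    """
    n = len(points)
    distances = {}
    for i in range(n):
        for j in range(i + 1, n):
            distances[(i, j)] = float(np.linalg.norm(points[i] - points[j]))
    return distances

points = np.random.default_rng(0).normal(size=(100, 2))
d = pairwise_distances(points)
print(len(d))  # 100 * 99 / 2 = 4950 pairs
```

Doubling \(n\) to 200 points quadruples the pair count to 19,900, which is the scaling that makes these methods impractical at millions of examples.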

The **k-means algorithm** has a
complexity of \(O(n)\), meaning that the algorithm scales linearly with \(n\).
This algorithm will be the focus of this course.

## Types of clustering

For an exhaustive list of different approaches to clustering, see *A Comprehensive Survey of Clustering Algorithms* by Xu, D. & Tian, Y., *Annals of Data Science* (2015) 2: 165. Each approach is best suited to a particular data distribution. This course briefly discusses four common approaches.

### Centroid-based clustering

The **centroid** of a cluster is the
arithmetic mean of all the points in the
cluster. **Centroid-based clustering** organizes the data into non-hierarchical
clusters. Centroid-based clustering algorithms are efficient but sensitive to
initial conditions and outliers. Of these algorithms, k-means is the most
widely used. It requires users to define the number of centroids, *k*, and
works well with clusters of roughly equal size.
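
As a rough sketch of how this looks in practice, the following uses scikit-learn's `KMeans` on synthetic data (the dataset and parameter values here are illustrative assumptions, not part of this course's exercises):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate three roughly equal-size, well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# The user must choose k up front; here we know the data has 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# One centroid (arithmetic mean of its points) per cluster.
print(kmeans.cluster_centers_.shape)  # (3, 2)
```

Note that `n_clusters` must be supplied by the user; picking a poor *k* or unlucky initial centroids can produce a poor clustering, which is the sensitivity to initial conditions mentioned above.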

### Density-based clustering

Density-based clustering connects contiguous areas of high example density into clusters. This allows for the discovery of any number of clusters of any shape. Outliers are not assigned to clusters. These algorithms have difficulty with clusters of varying density and with high-dimensional data.
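
A brief sketch with DBSCAN, a common density-based algorithm, on two interleaved half-moons, a non-convex shape that centroid-based methods handle poorly (the `eps` and `min_samples` values are assumptions tuned for this toy dataset):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moon clusters of arbitrary (non-convex) shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples define what counts as a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labeled -1 are outliers, left unassigned to any cluster.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)
```

Unlike k-means, no cluster count is specified; the number of clusters emerges from the density structure of the data.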

### Distribution-based clustering

This clustering approach assumes data is composed of probabilistic
distributions, such as
**Gaussian distributions**. In
Figure 3, the distribution-based algorithm clusters data into three Gaussian
distributions. As distance from the distribution's center increases, the
probability that a point belongs to the distribution decreases. The bands show
that decrease in probability. When you're not comfortable assuming a particular
underlying distribution of the data, you should use a different algorithm.
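
A minimal sketch of this approach with scikit-learn's `GaussianMixture`, assuming (as the model requires) that the data comes from Gaussian blobs; the dataset is synthetic and illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Data drawn from three Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Fit a mixture of three Gaussians; each component has its own
# mean and covariance.
gmm = GaussianMixture(n_components=3, random_state=7).fit(X)

# predict_proba gives soft assignments: the probability that each
# point belongs to each Gaussian, which falls off with distance
# from the component's center.
probs = gmm.predict_proba(X[:1])
print(probs.shape)  # (1, 3)
```

The soft probabilities are the code-level counterpart of the probability bands described above: each row sums to 1 across the three components.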

### Hierarchical clustering

**Hierarchical clustering** creates a tree of clusters. Hierarchical clustering,
not surprisingly, is well suited to hierarchical data, such as taxonomies. See
*Comparison of 61 Sequenced Escherichia coli Genomes*
by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example.
Any number of clusters can be chosen by cutting the tree at the right level.
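
Tree-cutting can be sketched with SciPy's hierarchical clustering routines; the synthetic data and Ward linkage choice here are illustrative assumptions:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Build the full cluster tree: Ward linkage merges the two closest
# clusters at each step until one cluster remains.
tree = linkage(X, method="ward")

# Cutting the same tree at different levels yields different
# numbers of clusters.
labels_3 = fcluster(tree, t=3, criterion="maxclust")
labels_5 = fcluster(tree, t=5, criterion="maxclust")
print(len(set(labels_3)), len(set(labels_5)))  # 3 5
```

Because the tree is built once, choosing a different number of clusters only requires a new cut, not a full re-run of the algorithm.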