k-means clustering #
\(k\)-means clustering is an unsupervised learning technique whereby \(n\) data points are partitioned into \(k\) clusters, each point being assigned to the cluster whose mean is nearest. This induces a Voronoi partitioning of the space.
Problem description #
Given \(n\) observations \(x_1,\ldots,x_n\) and a positive integer \(k\), partition the observations into \(k\) disjoint sets \(\mathbf{S}=\{S_1,\ldots,S_k\}\) such that the total within-cluster sum of squares (variance) is minimal, i.e. find \[ \mathrm{argmin}_{\mathbf{S}}\sum_{i=1}^k\sum_{x\in S_i}\left\|x-\mu_i\right\|^2 \] where \(\mu_i\) is the mean of all the points in \(S_i\).
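As a concrete reading of this objective, here is a minimal Python/NumPy sketch (the function name and array layout are my own choices): given points `X` of shape \((n, d)\) and a labelling where `labels[j] == i` encodes \(x_j\in S_i\), it evaluates the double sum above.

```python
import numpy as np

def within_cluster_ss(X, labels, k):
    """Total within-cluster sum of squares for a given partition."""
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]      # the points in S_i
        if len(cluster) == 0:         # empty clusters contribute nothing
            continue
        mu = cluster.mean(axis=0)     # the cluster mean mu_i
        total += ((cluster - mu) ** 2).sum()
    return total
```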
Algorithm #
This optimisation problem is NP-hard in general, so in practice one uses iterative heuristics such as Lloyd's algorithm, the “naive” \(k\)-means described next.
Naive k-means #
- Initialise a set of “means” \(m_1,\ldots,m_k\). This can be done randomly or with some heuristic.
- At each “assignment step”, assign each observation to the cluster with the nearest mean; more precisely, define a partition \(S_1,\ldots,S_k\) where a point is assigned to \(S_i\) if \(m_i\) is its nearest mean (breaking ties arbitrarily).
- At each “update step”, recalculate the means as the centroids of the new clusters, and alternate the two steps until the assignments no longer change, i.e. the algorithm has converged. A sketch follows this list.
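Putting the steps together, here is a minimal sketch of naive \(k\)-means in Python/NumPy; the initialisation (sampling \(k\) distinct observations) and the empty-cluster handling are illustrative choices, not part of the algorithm as stated:

```python
import numpy as np

def naive_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm on points X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialise the means with k distinct observations chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest mean for every point.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster;
        # an empty cluster keeps its previous mean.
        new_means = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):  # assignments have stabilised
            break
        means = new_means
    return labels, means
```

For example, `labels, means = naive_kmeans(X, k=3)` partitions the rows of `X` into three clusters.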
Drawbacks #
- \(k\) must be chosen as an input parameter and finding the “right” value for a specific problem is nontrivial.
- The algorithm may converge to a local minimum of the objective, which can give unintuitive clusters (illustrated in the sketch below).
- Because the objective is squared Euclidean distance to the mean, it also implicitly assumes “spherical”, separable clusters of roughly equal size.
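To make the local-minimum drawback concrete, here is a hypothetical experiment using scikit-learn (the synthetic data and seeds are arbitrary). Each run uses a single random initialisation, so the final within-cluster sum of squares (`inertia_`) can differ between runs, i.e. different runs may converge to different local minima:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in the plane.
X = np.concatenate([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])

# One random initialisation per run; a bad start can leave two means
# in the same blob, a local minimum with a higher objective value.
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1,
                random_state=seed).fit(X)
    print(seed, km.inertia_)
```

A standard mitigation is to run several restarts and keep the lowest-objective result, which is what scikit-learn's `n_init` parameter controls.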