The land of valleys in K-means

Why is the K-means algorithm guaranteed to decrease the value of the optimization function in each repetition?

Arghavan Moradi
3 min read · Dec 15, 2020

K-means clustering is one of the common unsupervised learning algorithms in ML. Since the data are not labeled in unsupervised learning, we don’t know the exact number of clusters in our data. Thus, predefining K is one of the important steps in K-means.

Suppose that we have “n” data points. In this method, we assume that we can divide the whole data set into “K” clusters C_1, C_2, …, C_K in a way that:

$$C_1 \cup C_2 \cup \dots \cup C_K = \{1, 2, \dots, n\}, \qquad C_k \cap C_{k'} = \emptyset \ \text{ for all } k \neq k'$$

It means the union of all clusters equals the whole set of data points, and there is no intersection between data points in two different clusters.

How does the algorithm work?

  1. Choose the number of clusters, K
  2. Randomly assign each observation to one of the K clusters
  3. Calculate the centroid of each cluster. (The centroid is a vector of size P, where P is the number of features of each data point. For example, if each data point is a movie and each movie has three features: genre, year, and rating, then P = 3.)
  4. Calculate the Euclidean distance of each observation (data point) to the centroid of each of the K clusters and assign the data point to the cluster with the closest centroid.
  5. Repeat steps 3 and 4 until the cluster assignments stop changing; each repetition decreases the optimization function.
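The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function and variable names here are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means on an (n, p) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly assign each observation to one of the k clusters
    labels = rng.integers(0, k, size=len(X))
    # Step 3: centroid of each cluster = mean of its members (a vector of size p)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    for _ in range(n_iter):
        # Step 4: Euclidean distance of every point to every centroid,
        # then reassign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 5: stop once the assignments no longer change
        labels = new_labels
        # recompute centroids; keep the old one if a cluster went empty
        centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
    return labels, centroids
```

On two well-separated groups of points, the loop typically converges in a handful of repetitions.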

What is the optimization (loss) function in K-means clustering?

One of the important points for a good clustering is to keep the “within-cluster variation (variance)” as small as possible and the “between-cluster variation” as large as possible. It means we want all data points within a cluster to have close values for all P features, while data points in different clusters have more different values for those features.

To define the variation, we can use the squared Euclidean distance and use it as the optimization function, where |C_k| is the number of observations (data points) in cluster C_k:

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{P} (x_{ij} - x_{i'j})^2$$
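As a concrete check of this definition, the within-cluster variation can be computed directly from the formula. This is a small sketch; the toy points and the helper name are made up:

```python
import numpy as np

def within_cluster_variation(X, labels, k):
    """Sum over clusters of the average pairwise squared Euclidean distance:
    W(C_k) = (1/|C_k|) * sum over pairs i, i' in C_k of ||x_i - x_i'||^2."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        # all ordered pairs of points in the cluster (i = i' contributes zero)
        diffs = members[:, None, :] - members[None, :, :]
        total += (diffs ** 2).sum() / len(members)
    return total

# two made-up clusters of two points each
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labels = np.array([0, 0, 1, 1])
print(within_cluster_variation(X, labels, 2))  # 8.0
```

Each cluster contributes 4 (two ordered pairs at squared distance 4, divided by the cluster size 2), so the total is 8.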

Why is the K-means algorithm guaranteed to decrease the value of the optimization function in each repetition?

The result of K-means depends heavily on step 2 of the algorithm: with a different random start, we can reach a different output. Why? Because the optimization function for K-means is not convex. Each repetition is still guaranteed to decrease (or at least not increase) the objective: in step 3, recomputing each centroid as the mean of its cluster minimizes the sum of squared deviations within that cluster, and in step 4, reassigning each point to its closest centroid can only lower the total distance. But since the function is not convex, this descent can stop in a different “local minimum” for each random start, not necessarily the global one. This is what we call the “land of valleys”.

photo by the author: Land of valleys
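This guarantee is easy to check numerically. The sketch below (made-up blob data; the names are my own) records the objective after every repetition of the centroid and reassignment steps, starting from one random assignment; the recorded sequence never increases:

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up data: three loose blobs in the plane
X = np.vstack([rng.normal(c, 1.0, size=(30, 2)) for c in ((0, 0), (6, 0), (0, 6))])
k = 3

labels = rng.integers(0, k, size=len(X))  # random start (step 2)
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

history = []
for _ in range(20):
    # reassign every point to its nearest centroid (step 4)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # recompute centroids (step 3); keep the old one if a cluster went empty
    centroids = np.array(
        [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
         for j in range(k)])
    history.append(((X - centroids[labels]) ** 2).sum())

# each repetition can only decrease (or hold) the objective
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```

Rerunning this with different seeds would show the other half of the story: the final value of `history` can differ from start to start, one valley per run.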

In the end, since K-means tries to minimize the distance to the centroids, we can rewrite the loss function in terms of the centroids, using the identity:

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{P} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{P} (x_{ij} - \bar{x}_{kj})^2$$

where x̄_kj is the mean of feature j in cluster C_k, i.e. the j-th component of the centroid. So minimizing the pairwise within-cluster variation is the same as minimizing the squared distances of the points to their cluster centroids.
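This equivalence between the pairwise form and the centroid form of the loss can be verified numerically. A quick sketch on made-up random points (nothing here comes from the article itself):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(15, 3))  # a made-up cluster: 15 points, 3 features

# left side: average pairwise squared distance over all ordered pairs
pairwise = ((C[:, None, :] - C[None, :, :]) ** 2).sum() / len(C)

# right side: twice the squared deviations from the cluster centroid
centroid_form = 2 * ((C - C.mean(axis=0)) ** 2).sum()

assert np.isclose(pairwise, centroid_form)
```

The two quantities agree for any cluster, which is why step 3 (recomputing centroids) makes progress on the same objective that the pairwise definition describes.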

