A Clustering Algorithm: k-means

Rezana Dowra
5 min read · Jun 24, 2024


Let's talk about what clustering is. Think back to a time your teacher split the students in your class into groups. This may bring back traumatic memories, and I apologise, but your teacher was actually performing a form of clustering based on specific criteria: which students read or wrote well, which would perform better at sports, and so on. That is clustering, the grouping of similar instances into a single cluster based on specific criteria.

Clustering is similar to classification in that each instance gets assigned to a group. Unlike classification, however, clustering is an unsupervised task, and we will see how this works with a practical example.

The figure below is a visualisation of the popular Iris dataset, representing three different Iris flower species. The dataset contains sepal widths and lengths, petal widths and lengths, and the corresponding species as a label. Because it is a labelled dataset, classification algorithms such as logistic regression, SVMs or random forest classifiers are well suited to it.

Iris dataset labelled for Classification

We will use this dataset because the labels let us verify our results afterwards. We hide the labels and analyse what we see. Since we cannot use classification algorithms on unlabelled data, a clustering algorithm is the better fit. In the unlabelled data we notice only two distinct sub-clusters at first glance, because a single view does not take all features into consideration at once. When the clustering algorithm combines the features, it is able to identify the three clusters fairly well.

K-Means

K-means is an unsupervised learning algorithm used to group unlabelled data into clusters based on the features in the input dataset. The algorithm works iteratively to assign each data point to one of K groups based on feature similarity.

A centroid is the point at the centre of a cluster. The algorithm requires the practitioner to provide the value of K, and then iteratively assigns each data point to a centroid using a distance-based calculation, the squared Euclidean distance.

The centroids are then recomputed: each is updated to the mean of all data points assigned to its cluster. The algorithm alternates between assigning points to centroids and recomputing centroid values until a stopping criterion is met, namely that no data point changes its cluster. The algorithm is guaranteed to converge, though only to a local optimum.
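As a toy illustration of the assignment step, here is a small Python snippet with made-up numbers (the point and centroid values are invented for the example):

```python
import numpy as np

# One assignment step: squared Euclidean distance from a point to each centroid.
point = np.array([5.1, 3.5, 1.4, 0.2])       # a single (made-up) measurement
centroids = np.array([[5.0, 3.4, 1.5, 0.2],  # two made-up centroid positions
                      [6.5, 3.0, 5.5, 2.0]])

sq_dists = ((centroids - point) ** 2).sum(axis=1)
print(sq_dists)           # [ 0.03 22.26]
print(sq_dists.argmin())  # 0 -> the point joins the cluster of centroid 0
```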

How to choose K?

One method is the elbow method, used to determine a reasonable number of clusters. It involves running k-means with different values of K and recording the inertia (the within-cluster sum of squared distances) for each value. The goal is to identify where adding more clusters no longer meaningfully improves the clustering; an elbow shape appears when you plot this data, making it easier to select a K value.
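A minimal sketch of the elbow method on the Iris dataset, using scikit-learn and matplotlib (the range of K values tried here is an illustrative choice):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit k-means for several K values and record the inertia of each fit.
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("inertia (within-cluster sum of squared distances)")
plt.title("Elbow method on the Iris dataset")
plt.show()
```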

The silhouette method is another technique to determine the optimal number of clusters (K) in clustering algorithms like K-Means. It provides a measure of how similar an object is to its own cluster compared to other clusters. The optimal number of clusters is the one that maximises the average silhouette score, indicating well-defined and well-separated clusters.
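A similar sketch for the silhouette method, using scikit-learn's silhouette_score (the range of K is again illustrative; the score is only defined for K of at least 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Print the average silhouette score for each candidate K; pick the maximum.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```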

Pseudo code

The pseudocode is outlined below. Here we can see clearly how the training is performed.

1. Initialise K cluster centroids
2. Repeat until convergence:
a. Assign each data point to the nearest cluster centroid (based on squared Euclidean distance)
b. Update each cluster centroid to be the mean of all data points assigned to the centroid.
3. Convergence criteria:
a. Stop when the cluster assignments no longer change (or change minimally) between iterations
b. Or specify a maximum number of iterations.

Details:

- Let `X` be the dataset with m data points and n features
- Let `K` be the number of clusters
- Let `centroids` be an array of size `K` that stores the centroids

Initialisation:
- Randomly initialise `centroids` within the range of `X`

Main loop (until convergence):
- Repeat:
  - Assign each data point in `X` to the nearest centroid
  - Update each centroid to be the mean of the points assigned to it

Convergence check:
- Check whether the assignments of data points to clusters remain unchanged from the previous iteration

End the algorithm when convergence is met.
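Putting the pseudocode together, a compact NumPy sketch of the algorithm might look like the following. This is an illustrative implementation, not the repository code shown below; initialising centroids by sampling K data points is one common choice among several.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance,
        # computed for all points at once via broadcasting.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Convergence check: stop when no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # guard against an empty cluster
                centroids[j] = members.mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    from sklearn.datasets import load_iris
    centroids, labels = kmeans(load_iris().data, k=3)
    print(centroids)
```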

Implementation

The link below points to a repository with the k-means algorithm; it contains two important classes. One is a native implementation of the k-means algorithm, and the other runs k-means on the Iris dataset using the sklearn library. In the kmeans class we can view the implementation of selecting K with the elbow method, as well as the visualisation of the data.
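For comparison, a minimal sklearn version might look like the sketch below. This is not the repository's code, and the exact centroid values depend on preprocessing and initialisation, so they will not match the output shown further down exactly:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # the three centroids found
print(km.labels_[:10])      # cluster index assigned to the first ten flowers
```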

The visualisation below shows the clustering output of k-means with K=3. You can see the clustering performed fairly well. Since this is unsupervised, the output does not contain the species names; the algorithm simply assigns each cluster a number.

Clustering after running k-means

The three centroids found:

[[0.64988426 0.42361111 0.7740113  2.05625   ]
 [0.19611111 0.595      0.07830508 0.246     ]
 [0.44818376 0.30769231 0.55867014 1.325     ]]

Applications in the real world

This algorithm can be used for a variety of applications including:

  • Customer Segmentation: You can cluster customers based on their purchases. You can also use these clusters to build recommender systems.
  • Anomaly Detection: Any instance that has a low affinity to all clusters is likely to be an anomaly.
  • Image Segmentation: By clustering pixels according to their colour, then replacing each colour with the cluster mean, we can start to separate objects in an image, which can feed object detection systems (see the sketch after this list).
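As an example of the image segmentation idea, here is a sketch using Pillow and scikit-learn; `photo.jpg` is a placeholder filename and the choice of five clusters is arbitrary:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# "photo.jpg" is a placeholder path; convert ensures three RGB channels.
img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)  # one row per pixel: R, G, B

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(pixels)
# Replace every pixel with the mean colour of its cluster.
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)

Image.fromarray((segmented * 255).astype(np.uint8)).save("segmented.jpg")
```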

In conclusion, k-means is at the heart of helping us group similar instances together. It is a simple yet powerful algorithm that can serve as a starting point from which deeper processing can lead to some great discoveries.

I hope you learned something.

-R
