Clustering

Rezana Dowra
16 min read · Oct 13, 2022


The topics of concern are health care analytics and data mining: health care applications and health care data at the intersection of data science and big data analytics, and understanding the algorithms used to process big data.

This article forms part of a series of articles under the topic Big Data for Health Informatics Course
You can visit the above link to understand this topic in the context of the full course, although this article on clustering can be understood independently of it.

Note: This article requires knowledge of machine learning concepts.

Introduction

In classification we group data by labels that we know upfront. In clustering we want to discover those labels from raw data.

Outcomes of this article:

  1. Define clustering.
  2. Describe classical clustering algorithms such as K-means and Gaussian mixture models.
  3. Describe scalable clustering algorithms for handling big data sets.
  4. Preview big data health care applications for clustering.

Health care applications

There are many applications and use cases for clustering in healthcare.

  1. Patient stratification is about grouping patients into clusters such that patients in the same cluster share common characteristics.
  2. Disease Hierarchy Discovery is about learning a hierarchical structure about diseases and how they relate to each other.
  3. Phenotyping is about converting raw data into meaningful clinical concepts or phenotypes.

What is clustering?

Clustering is the grouping of data points into sets such that points in the same set share similar characteristics with their nearby neighbours.

Example of how to cluster matrix horizontally

Let's say, for example, we have a matrix of patients and diseases. If we want to apply clustering to the patients, this means we want to learn a function f that partitions the data into a set of patient groups, each with its respective diseases. Once we have those patient groups we can use them to support applications such as phenotyping and patient stratification.

Clustering Algorithms

We will discuss two types of clustering algorithms: classical and scalable. For classical clustering we will discuss (1) K-means, (2) hierarchical clustering and (3) Gaussian mixture models. For scalable clustering we will discuss (1) mini-batch K-means and (2) DBSCAN.

Classic Clustering algorithms

K-means

The input is a set of data points (x1, x2, …, xn) and the parameter k indicating the number of clusters we wish to create.

K-means input, output and objective

The output is k clusters (S1, S2, …, Sk), each containing a subset of the data points; the union of these clusters is the full set of data points.

The objective of K-means clustering is to minimise the sum of squared distances between every data point x and the centre mu i of the cluster Si it is assigned to. Minimising this objective optimises the assignment of each data point to one of the k clusters.
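In standard notation, assuming the usual squared Euclidean distance, this objective can be written as:

\arg\min_{S_1,\dots,S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}

where \mu_i is the mean of the points in cluster S_i.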

The K-means algorithm

  1. First specify k
  2. Initialise K centres
  3. Assign each data point x to the closest centre
  4. Once we have all the assignments, update each of the k centres by averaging all the points assigned to it and moving the centre to that new mean.
  5. Check whether the algorithm has converged: either the centres no longer change, or the maximum number of iterations has been reached.
  6. Output the clusters.
K-means algorithm

Visual illustration of K-means with k = 2. In the example below, (a) two random centres are initialised, (b) the points are assigned to the nearest centre, (c) through (e) the centres are updated and points reassigned, until convergence is reached in (f).

K-means visual example
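To make the steps concrete, here is a minimal NumPy sketch of the K-means loop described above. The random initialisation and the simple convergence check are illustrative assumptions, not the only options.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: initialise k centres by picking k random data points
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each data point to the closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each centre to the mean of its assigned points
        # (an empty cluster keeps its old centre)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 5: converged when the centres stop moving
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Step 6: output the clusters
    return labels, centres
```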

Hierarchical Clustering

The goal is to learn a hierarchy over n data points. The algorithm produces a hierarchy such that similar data points are grouped into a tree structure.

There are two popular ways to approach this

Agglomerative and Divisive methods for hierarchical clustering
  1. Bottom up approach called agglomerative. Each data point is considered as its own cluster. Then iteratively group these smaller clusters into bigger clusters until only one cluster remains. This method is more efficient than the next and is therefore more popular in practice.
  2. Top down approach called divisive. This is where you group all data into one cluster, then iteratively divide the data into smaller clusters and continue until each data point is a single cluster.
Agglomerative Clustering Algorithm

Agglomerative clustering starts with calculating the distance between every pair of data points to create a distance matrix. This will be an n x n matrix.

Then we initialise each data point as its own cluster, so there are n clusters. Thereafter we merge the two closest clusters and update the distance matrix, which shrinks to size (n-1) x (n-1). The merging of the two closest clusters and recomputing of the distance matrix is repeated until only one cluster remains.
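As a sketch, the bottom-up procedure can be run with SciPy's hierarchical clustering utilities; the small patient matrix below is hypothetical and the "average" linkage is just one reasonable choice of cluster-to-cluster distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical patient feature matrix: 6 patients, 2 features each
X = np.array([[1.0, 2.0], [1.2, 1.9],
              [5.0, 6.1], [5.2, 5.9],
              [9.0, 9.5], [8.8, 9.7]])

# Agglomerative clustering: start with every point as its own cluster,
# then repeatedly merge the two closest clusters into a tree (dendrogram).
Z = linkage(X, method="average")   # (n-1, 4) merge history

# Cut the tree to recover a flat assignment, e.g. 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # cluster id per patient
```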

We have looked at K-means and hierarchical clustering methods. These are hard clustering methods, since every data point belongs to exactly one cluster.

Gaussian Mixture Model

This is a soft clustering method. What does that mean? Each data point can belong to multiple clusters with varying degrees of membership.

The Gaussian distribution is the most popular continuous probability distribution and is also known as the normal distribution. The shape of Gaussian is a bell curve.

The Gaussian distribution has two parameters, mu and sigma. Mu is the centre of the bell curve, that is, the mean, and sigma is the variance, which captures the spread of the data.

What is Gaussian?

When the variance is small, the spread is small, which means most of the probability mass is concentrated around the mean. If the variance is large, the probability mass is more spread out, which means events far from the mean can still happen with appreciable probability.

The Gaussian mixture model represents the probability of a data point x as a weighted sum over k Gaussian distributions. Think of having multiple Gaussian distributions like the one in the graph above.

Each term in the sum is a Gaussian density with the mean and variance of cluster k, multiplied by a mixing coefficient. The mixing coefficient is written as pi k in the image below and tells us how big cluster k is.

Mu and sigma should now be more familiar to you. Mu k in the image below is the mean of cluster k, that is, the centre of cluster k, and sigma k is the variance of cluster k.

GMM equation
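In the usual notation, the mixture density described above is:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1

where \mathcal{N}(x \mid \mu_k, \Sigma_k) is the Gaussian density of cluster k.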

The goal of the GMM is to figure out the parameters, that is pi k, mu k and sigma k given all the data points X.

GMM Expectation Maximisation (EM)

To compute the GMM we use a popular strategy called expectation maximisation or EM Algorithm.

First, initialise all the parameters: pi k, mu k and sigma k. Thereafter perform the Expectation step by assigning each data point x n a score gamma nk for each cluster k. That is, data point x n receives k scores, one per cluster, and each score tells us how likely it is that x n was generated from cluster k.

Once we have all the assignments, we perform the Maximisation step by updating the model parameters pi k, mu k and sigma k using the scores learned in the previous step.

Thereafter we check whether the convergence criterion is met, for example the likelihood is no longer changing.

Once convergence is met we return all the model parameters for Gaussian mixture model.

Expectation Maximisation Algorithm

GMM steps in more details:

GMM initialisation: in order to run the EM algorithm for a Gaussian mixture model we have to initialise the mixing coefficient (pi), the mean (mu) and the variance (sigma) for each cluster. This first step is important to get right.

Initialisation

A bad initialisation can result in numerous iterations until convergence. A good initialisation can result in a quicker convergence, making it a more efficient algorithm.

In the above example we can see two clusters (blue and orange) being initialised. This is an example of a bad initialisation, since no data points exist near these starting points. The algorithm will have to go through many iterations to eventually converge and correct this.

There are some tips that can give you a good starting point when initialising. The mu k can be set to the centres produced by running K-means. The covariance matrix sigma k can then be computed using all the data points from the corresponding cluster, and the mixing coefficient pi k can be computed as the size of cluster k divided by the total population n.

This will be a much better initialisation compared to a random initialisation such as the one shown with the circles.

GMM E step: in this step we assign each data point x n a score gamma nk for each cluster k.
The way this value is calculated is shown in the image below. The equation shows that gamma nk has a numerator of two terms: one is the mixing coefficient for cluster k and the other is the probability of x n under the Gaussian of cluster k. The denominator normalises this score to [0, 1].

E step in algorithm
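Written out, the score described above is the standard EM responsibility:

\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}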

In the above example we see the output of creating a score for each of the orange and blue gaussians. The formula provides a score for each cluster.

In the example we see one point at the bottom whose score is [0.5, 0.5], meaning this point could belong to the blue Gaussian with 50% probability, and likewise for the orange.
The other score vector, at the top of the diagram, has expectation scores of [0.8, 0.2] for the blue and orange distributions respectively.

The E step creates this score for each data point.

GMM M Step: This step includes updating the model parameters pi k, mu k and sigma k using all the assignment scores we have learned in the previous step. This is done for each cluster k.

First we define an auxiliary sum N k, which equals the sum of gamma nk over all data points n for a specific cluster k. Intuitively, N k is the size of cluster k.

Thereafter we calculate mu k, which is the weighted sum of the data points in cluster k divided by N k, the size of the cluster. Intuitively, mu k is the centre of cluster k.

The covariance sigma k is calculated as a weighted sum over the data points: for each data point x n we take (x n minus mu k) multiplied by its own transpose, weight it by gamma nk, sum over all points and divide by N k, the size of the cluster. Intuitively, sigma k is the covariance of all data points in cluster k.

The mixing coefficient pi k is proportional to the size of the cluster N k. In other words pi k is the prior probability for data points that belong to cluster k.

M step of the algorithm
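For reference, the standard M-step updates described above are:

N_k = \sum_{n=1}^{N} \gamma_{nk}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}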

GMM Visualisation: the images below show a series of updates to the Gaussian models (red and blue). We start by initialising the two models in the first image at the top left.

The next image, to the right of the initialisation, shows the assignment scores being set per data point. The colour symbolises which cluster has the higher score in the score vector.

Thereafter we update the Gaussian mixture model parameters based on the assignment scores. In the third picture we can see the red and blue ellipses elongating to fit the data; in this step we update mu k (the means/cluster centres), sigma k and pi k.

Visualisation for GMM

Thereafter we repeat the E step and M step until convergence.
In the image above the final graph shows L = 20, meaning it took 20 iterations to converge.
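A minimal sketch of fitting a GMM with EM in practice, using scikit-learn's GaussianMixture on hypothetical data; note that init_params="kmeans" matches the K-means initialisation tip discussed above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data drawn from two groups
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.5, size=(100, 2))])

# Fit a 2-component GMM with EM; the means are initialised with K-means
gmm = GaussianMixture(n_components=2, init_params="kmeans",
                      random_state=0).fit(X)

print(gmm.weights_)              # mixing coefficients pi_k
print(gmm.means_)                # cluster centres mu_k
print(gmm.covariances_)          # covariance matrices sigma_k
print(gmm.predict_proba(X[:3]))  # soft assignment scores gamma_nk
```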

K Means vs Gaussian Mixture Model

K means VS GMM

Clustering: in K-means every data point belongs to one cluster and one cluster only. In a Gaussian mixture model every data point belongs to multiple clusters with varying weights.

Parameters: K-means only has the cluster centres as parameters. A Gaussian mixture model has the cluster centres, the mixing coefficients and the covariances.

Algorithms: both K-means and Gaussian mixture models are fit with an EM-style algorithm, in which we iteratively update the cluster assignments and the model parameters.

Scalable Clustering algorithms

So far we have covered classical clustering algorithms. Next we will discuss scalable clustering algorithms, namely mini-batch K-means and DBSCAN.

The one problem with K-means is that it needs to assign all the data points to cluster centres at each iteration. When we have big data this can be costly.

Mini-batch K-means

This is the K-means algorithm, but instead of using the whole dataset we work with smaller batches of data.

Instead of performing assignments over the entire data set in every iteration, mini-batch K-means does the same work on a small batch at a time.

At a high level we start by initialising the k centres, c. This can be done by randomly selecting these centres.

Then we update the centres. We do this by iterating t times, and in each iteration we:

  • sample b data points to form a mini batch, let us call it M
  • assign the data points (M) to the closest cluster c.
  • update the centres based on the assignments in M.
Mini Batch K Means pseudo algorithm

In the above image we can see the pseudo algorithm for mini-batch K means.

B. Assign points in M to the closest centre in c. For each point x in M we want to determine the closest centre c to x. We then cache this result in a hash-map. This is done so that we can later retrieve this centre.

Step C sub steps

C. Update the centres based on the assignments in M. (1) In this step we go through all the data points x in the mini-batch and retrieve the corresponding centre cached in the previous step. (2) Then we increment that centre's count by one.
(3) Then we set the step size eta as the inverse of the centre count. (4) Finally we update the centre as c = (1 - eta) * old_centre + eta * x, as shown below. Intuitively, we move the old centre towards the data point by a step of eta.

As the centre count increases, the step size becomes smaller, so the centre updates become smaller over time. As a result, the more points a centre has accumulated, the more stable it becomes.
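A sketch of mini-batch K-means in practice, using scikit-learn's MiniBatchKMeans on hypothetical data; batch_size plays the role of b above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))   # hypothetical large data set

# Each iteration samples a mini-batch, assigns its points to the
# nearest centres and nudges those centres with a decaying step size.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
mbk.fit(X)

print(mbk.cluster_centers_.shape)   # (5, 10): k centres of dimension d
print(mbk.predict(X[:10]))          # cluster ids for the first 10 points
```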

Big O:

t = number of iterations
b = batch size
K = number of clusters
d = dimension of data points

O(t*b*K*d) is the time complexity of this algorithm. The most expensive step is assigning data points to the closest centre: since we have b data points in each batch and K centres, we have to compute K * b distance comparisons, and each comparison is over d dimensions.

DB Scan

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

The main intuition of DBSCAN is to define clusters as areas of high density separated by areas of low density. As a result, clusters can take on any shape.

DBScan Key Concepts

Before we explain the algorithm we need to introduce some key concepts.

First, how do we measure density? The density of a point p is defined as the number of points within a distance epsilon of p. For example, say a point p has 3 other points within a distance of epsilon = 1; this point p's density is then 3.

For the same epsilon, every point in the data set has a density value. To interpret p's density we need to define what counts as a dense region: a data point p is dense if the density of p ≥ some threshold, the minimum number of data points.

Core: these are points in the dense region.

Border: points that are within an epsilon radius of a core point but are not core points themselves.

Noise: all the remaining points, that is, points that are neither core nor border points.

DBSCAN Algorithm

Assume we have defined epsilon and the minimum data point threshold, and have identified the core points.

  1. For each core point c, create edges to the points within epsilon distance of c.
    Example: c1 -> [edge 1 = n3, edge 2 = n50, edge 3 = n20, …]
  2. Define N, the set of nodes in the graph (which is all the data points in the data set).
  3. If there are no core points left in the set N, we are done.
  4. If there are still core points in the set N, pick a random core point c.
    4.1. Find the connected component X reachable from c: we already have the edges for c, and we are looking for all the nodes that make up X.
    4.2. Now that we have formed a cluster X with core c, remove all the nodes in X from N to get the remaining population: N = N \ X (the set difference).
  5. Go back to step 3 and continue the iteration.
DB Scan Algorithm
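A minimal sketch of DBSCAN in practice, using scikit-learn on hypothetical data; eps is the epsilon radius and min_samples is the minimum data point threshold.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two hypothetical dense blobs plus some scattered noise points
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(4.0, 0.3, size=(50, 2)),
               rng.uniform(-2.0, 6.0, size=(10, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Labels are cluster ids; points labelled -1 are treated as noise
print(set(db.labels_))
```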

Clustering Evaluation Metrics

So far we have looked at clustering algorithms; next we will introduce a set of evaluation metrics for clustering. These include the Rand Index, Mutual Information and the Silhouette Coefficient.

Rand Index and Mutual Information both require ground truth clustering results. Silhouette Coefficient does not require ground truths.

Rand Index

Also referred to as RI.

n will be used to denote the number of data points, X will be the clustering assignments and Y the ground truth. For example, we can look at the nodes that belong to cluster X1 and the nodes that belong to ground-truth group Y1.

Thereafter we compute the term a: the number of pairs of points that are placed in the same cluster in both X and Y (for example, pairs that fall in both X1 and Y1). Then we compute the term b: the number of pairs that are placed in different clusters in both X and Y.

To calculate RI we add a to b and divide by the number of pairs. The number of pairs is n(n-1)/2

RI is between [0, 1] where 0 is a bad score and 1 is a perfect clustering.

Rand Index
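A small sketch of computing the Rand Index with scikit-learn; rand_score is available in recent versions, and the label vectors below are hypothetical.

```python
from sklearn.metrics import rand_score, adjusted_rand_score

y_true = [0, 0, 0, 1, 1, 2]   # hypothetical ground-truth groups
y_pred = [0, 0, 1, 1, 1, 2]   # hypothetical clustering assignment

print(rand_score(y_true, y_pred))           # plain RI in [0, 1]
print(adjusted_rand_score(y_true, y_pred))  # chance-corrected variant
```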

Mutual Information

This is a concept from information theory. MI measures how much one random variable tells us about another.

Mutual Information Calculations

In our case the two random variables are clustering assignment x {x1, x2, … xk} and the ground truth assignments y {y1, y2, … yk}

Entropy is used to measure the uncertainty of a variable. The entropy of x, written H(x), is defined as the negative sum over x of the probability p(x) times the log of p(x). Entropy is non-negative: values close to 0 mean the variable is nearly deterministic, while larger values mean the variable is more random.

The MI of x and y is the sum over both x and y of the joint probability p(x, y) times the log of p (x, y) divided by the marginal probability p (x) times marginal probability p(y) .

The range of MI is between 0 and the entropy of x, that is [0, H(x)]. When MI is closer to 0 it means x and y are independent, and when MI is closer to the entropy of x it means that x is essentially determined by y.

If we use MI to evaluate clustering, the higher the value, the better the clustering assignment.

Sometimes the normalised value of MI is used (as seen in the image above).
To calculate it we take the MI of x and y divided by the square root of the entropy of x multiplied by the entropy of y.
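In standard notation, the quantities described above are:

H(X) = -\sum_{x} p(x)\,\log p(x)

\mathrm{MI}(X, Y) = \sum_{x}\sum_{y} p(x, y)\,\log \frac{p(x, y)}{p(x)\,p(y)}

\mathrm{NMI}(X, Y) = \frac{\mathrm{MI}(X, Y)}{\sqrt{H(X)\,H(Y)}}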

Summary of RI and MI

Pros and Cons

In many instances we may not know the ground truths, therefore both these metrics pose a limitation.

Silhouette Coefficient

This is another popular evaluation metric where we do not require any ground truth information.

Silhouette Coefficient

The crux of this metric is that we want to know how close a point x is to its own cluster versus how close it is to, say, the next nearest cluster.
If x is much closer to its own cluster, the silhouette score will be high and we can infer that x is right where it belongs.

We do this by defining x as the data point, c as the cluster containing x and c` as the nearest cluster to c.

We then calculate two values a and b. a is defined as the average distance of x to all the data points in cluster c. b is the average distance of x to all the data points in cluster c`.

Finally we can calculate the silhouette coefficient as b -a divided by the max of b and a. As shown in the image above.

The intuition is that if the assignment of x is correct, the difference between b and a should be large. This indicates a good clustering assignment.

If x, on the other hand, is assigned to a cluster that lies very close to another cluster, then the difference between b and a will be small and the silhouette coefficient will be small. This indicates a bad clustering assignment.
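A small sketch of computing the silhouette coefficient with scikit-learn; the blob data and the choice of K-means are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean of (b - a) / max(a, b) over all points; closer to 1 is better
print(silhouette_score(X, labels))
```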

Pros and cons

The silhouette coefficient is bounded, but one disadvantage is that it favours convex, roughly spherical clusters, so it can be misleading for algorithms such as DBSCAN that produce clusters of arbitrary shape.

Quick Reminder: full course summary can be found on Big Data for Health Informatics Course

Hope you learned something.

-R
