Graph Analysis
Topics of concern is health care analytics and data mining. Health care applications and health care data intersected with data science and big data analytics. Understanding algorithms for processing big data.
This article forms a part of a series of articles under the topic Big Data for Health Informatics Course
You can visit the above link to understand this topic in context of the full course. This article on graph analysis can be understood independent of the full course.
Note: This article requires knowledge of machine learning concepts.
Introduction
Graph analysis is commonly used in search engines and social networks. The content of the article will cover graph analysis in the context of search engine tasks.
We will cover some core algorithms used by search engines. Just like search engines can find clusters of web pages that are related to one another, it can also find a cluster of patients or conditions that are related to one another. We will discuss similarity graph and spectral clustering.
Agenda
Today we will cover two important graph based algorithms.
- PageRank: given a large directed graph, this algorithms ranks all the nodes on this graph based on their importance.
- Spectral Clustering: a clustering algorithm based on page partitioning.
PageRank
PageRank is an algorithm founded by googles co-founder to rank web pages. The traditional way to rank the importance of a web pages is based on the content.
However content based analysis is very susceptible to spend (meaning it can be expensive to search using this way). These co-founders found a very smart way to rank pages based on the link structures between the pages instead of using the content in the page.
For example given the small directed graph below we can rank these four nodes based on the link structure instead of the content inside those webpages.
The intuition behind page rank is is more high quality page links to you, then you are considered highly ranked.
PageRank operates on a directed network. For example given a directed graph like below we can use the PageRank algorithm to figure out what nodes are important and what nodes are not important.
The red nodes are considered important/ higher ranked because there are many incoming edges to them. Smaller nodes are considered less important because they do not have many nodes connected to them.
Next lets see how we can formulate this intuition into a mathematical algorithm.
We start by representing the graph as an adjacency matrix. Every row corresponds to a source node and every column is a destination node. We will label this matrix as A. Once we have A, we can further normalise each row to make them sum to 1.
Every non zero element will contain the probability from jumping from the source page to the destination.
PageRank can be described as a recursion function:
A: is the adjacency matrix above.
q: value corresponding to all the page ranks which is the sum of browsing and teleporting
c’s: total is 100, between browsing and teleporting.
browsing: c is the weight between 1 and 0 assigned to the browsing and A is the adjacency matrix. ( note the remainder will be assigned to teleporting c)
teleporting: N is the number of node in the entire graph, and e is the all one vector.
MapReduce PageRank
To implement PageRank using MapReduce we have to partition the computation into map phase and reduce phase.
In the map phase we will distribute the existing PageRank from the source to the destination following the outgoing links. The pseudo code for the map function in the image below.
The input is a key value pair. The key is a page x and the associated value is the current PageRank value for this page and a list of its outgoing links. The output is another key value pair, where key is the page and value is the partial sum of the PageRank.
We will first emit the page x with a value 0 to make sure all pages are emitted. Then we follow the out going links. Foreach out going link we will emit the corresponding page, and a portion of that existing PageRanks value to the corresponding page (q/m). Where m is the total number of outgoing links from page x.
Let us use the above example and implement the map function. We will use the numbers as ranks. Page 1 has three outgoing links but first the algorithm emits (x , 0) that is (q1 0), as well as (q1 , q2/3), (q1, q3/3), (q1, q4/3). The image below shows what this will look like. The same is done for q2, q3, and q4.
The reduce function takes in the key value pair, key is the page and value is the list of partial sum of the PageRank for x. The output is another key value pair; the page x and the corresponding PageRank.
The algorithm is simple, we neutralise the page rank to be 0. We sum up the partial value from the value list for each page qx. This will give us the page rank from the browsing part.
We then renormalise this to take into account the teleporting parts. Finally we emit this page x and its updated page rank.
The Hadoop system will perform the shuffling operation by grouping all the partial sum for each page together to get this list of partial sum and create the input key pairs.
To complete the final PageRank we need to iteratively run this many times.
Spectral Clustering
In traditional clustering algorithms given the input data as a matrix we want to learn a function f that partitions the data into different classifications.
In a traditional clustering setting this function f is directly applied to the input data matrix. While in spectral clustering this function is more involved.
First let us start with discussing how do we construct this function in the setting of spectral clustering.
The first step of spectral clustering is to construct a graph. The input to the spectral clustering in our example are patient vectors. We want to connect all these patients together.
Based on their similarity we want to construct a similarity graph. In the above image every node is a patient and every edge is a similarity between two nodes (two patients). Once we have this graph representation we can store this efficiently using a matrix.
Similar to what we described in page ranks we could use the same adjacency matrix representation
The second step of spectral clustering is to find the top k eigen vectors of this graph.
For example in the above image: W represents the top k eigen vectors. The middle block is an eigen matrix with the diagonal as the eigen value.
In the third step we want to group these patients into K groups using the eigen value.
This is the high level algorithm for spectral clustering.
Depending on how you implement each step. There are many different variations of spectral clustering.
The next discussion includes how do we actually create the similarity graph.
Similarity Graph
The similarity graph is built on the local relationships between patients. Similar patients will have a stronger relationship.
There are various ways of building such graphs:
- Epsilon neighbourhood
- KNN — K nearest neighbours
- Fully connected graphs (assigning different weights to the addresses)
E Neighbourhood Graph
Epsilson nieghbourhood graph. We will connect patients within epsilon distance.
We measure the distance between two patients and those within the epsilon distance will form an edge, and those that are not will ultimately not be close enough to develop an edge.
K Nearest Neighbour Graph
This algorithm finds clusters based on the distances of the nodes to each other. It then associates lowest distances of a node to a cluster.
In the example below, we say we want to find k=2 that is find the closest two nearest neighbors for each node.
Fully Connected Graph
Another way to construct the similarity graph is to construct a fully connected graph but paramatize the edges based on similarity using a function. This function could be a Gaussian Kernel or a Radial Basis Function.
Spectral Clustering Algorithm
Depending on how each step is implemented we will have a variation of results. Examples of spectral clustering:
This performs well in practice even when the data is noisy.
Extra Notes: The Geometry of Kernelised Spectral Clustering: http://arxiv.org/pdf/1404.7552.pdf
Quick Reminder: full course summary can be found on Big Data for Health Informatics Course
Hope you learned something.
-R