Spark

Rezana Dowra
Oct 14, 2022

The topics of concern are health care analytics and data mining: health care applications and health care data intersecting with data science and big data analytics, and an understanding of the algorithms used to process big data.

This article forms part of a series of articles under the topic Big Data for Health Informatics Course
You can visit the above link to understand this topic in the context of the full course. This article on Spark can be understood independently of the full course.

Note: This article requires knowledge of machine learning concepts.

Introduction

We have previously covered Map Reduce, a distributed, fault-tolerant system for processing large datasets. However, Map Reduce is neither efficient nor sufficient for the iterative workloads that many machine learning algorithms require.

Spark is another big data system that provides better performance by using distributed memory across many machines. This article will cover:

  1. Resilient Distributed Dataset (RDD)
  2. Spark and iterative algorithms
  3. Health care applications using Spark

Motivation

In the previous article we introduced Hadoop and Map Reduce. Hadoop is a big data system that works on an acyclic data flow from stable storage to stable storage.

That is, the input and output of Hadoop jobs are stable storage systems, for example hard disks in the distributed environment. The computing job follows the Map Reduce paradigm, which forms an acyclic data flow graph.

Hadoop system

The Hadoop system ensures fault tolerance, and the benefit of such a design is that at runtime we can decide where to run tasks and automatically recover from failures. So the Map and Reduce functions can run on different machines in a distributed environment.

The limitation of Hadoop is that it does not perform well when the workload involves cycles. Hadoop is inefficient for applications that repeatedly reuse a working set of data.

There are many applications that require this functionality such as applications that use iterative algorithms and interactive data mining tools.

Iterative algorithms include machine learning algorithms that often require repeated computation on the same data set, such as clustering or classification. Graph analysis uses graph algorithms such as spectral clustering or PageRank computation.

Interactive data mining tools such as R and Python are used to explore data, form hypotheses, validate those hypotheses, and repeat on the same data set.

For the above reasons and many more, it is necessary to provide a more efficient big data platform.

Iteration in Map Reduce

To illustrate the inefficiency of Hadoop, consider one concrete example. Assume we want to build a machine learning model on a training set. Typically an ML algorithm starts with some initial model. Then we run the algorithm for one iteration; for example, we could implement this as a Map Reduce job, and the resulting model will be stored on disk.

We repeat this process with the training data and continue for many iterations until convergence.

Iteration in Map Reduce

If you look at the pattern illustrated above, you will notice that we repeatedly reload the same data from disk to memory in order to perform the computation. We also have to repeatedly write the resulting model parameters back to disk.

These repeated reads and writes are the source of Hadoop's inefficiency.

Workload Illustration

Iterative algorithms

Iterative algorithms are implemented by applying some computation logic to the same input data repeatedly in order to generate a different result dataset at each iteration. These results often pertain to the model parameters.

Interactive computation

Interactive computation means performing some computation where, across iterations, the computation itself can be different.

The requirement is that each iteration be fast, so that we can perform this work in an interactive manner.

The key to supporting iterative algorithms and interactive computing is to keep the working data set in memory so that we can perform the required operations quickly.

Using in-memory storage to improve speed

For iterative algorithms we want to load the entire data set into distributed memory across many machines. By having all the data in memory we do not have to read from or write to disk.

For interactive computation the idea is very similar. We want to keep the intermediate results all in memory so that the next iteration can be performed immediately.
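
As a concrete illustration, here is a minimal PySpark sketch (not code from the article) of loading a dataset once and caching it in distributed memory so that repeated passes over it avoid re-reading from disk; the input path and the one-number-per-line format are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="CachingSketch")

# Hypothetical input path; parse each line into a float once, then cache.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

# Repeated passes over the cached RDD reuse the in-memory copy
# instead of re-reading from stable storage on every iteration.
for i in range(10):
    total = points.reduce(lambda a, b: a + b)
    print(i, total)

sc.stop()
```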

Challenge

The challenge with keeping everything in memory is that it is not fault tolerant. Building a distributed system that is both fault tolerant and efficient is hard. Hadoop guarantees fault tolerance by using disk, but memory is not stable storage; it is only efficient.

Many existing distributed storage systems depend on fine-grained updates, for example databases, key-value stores, and distributed shared memory.

In order to make this storage abstraction fault tolerant, we have to replicate the data, as we did with Hadoop: we replicate the data several times and store it on different machines, or we keep track of every operation using logs. The disadvantage of both approaches is the expense.

Challenge for systems

Solution RDD

To solve this challenge, Spark created the RDD, which strikes a balance between the granularity of the computation and the efficiency of enabling fault tolerance.

RDD stands for Resilient Distributed Dataset. RDDs provide an interface based on coarse-grained transformations of the entire dataset, such as map, group-by, or join. These are called coarse grained because the operation is applied to the entire data set, not to one data element at a time.

While working in Spark you may apply different transformations to RDDs. These transformations create a logical execution plan of all the tasks executed so far. This logical execution plan is also known as the lineage graph. It is like a tree that keeps track of all the transformations across different RDDs, which means that efficient fault recovery can be done using lineage.

In the process we may lose any RDD if a fault arises on a machine. Because the operations are recorded in the lineage graph, we can recover the lost dataset by reapplying the same computations on another node.

Hence, this process provides fault tolerance, or self-recovery.

This is all done in memory, so there is no cost if nothing fails. Thus it is an efficient mechanism for enabling fault tolerance.

RDD Recovery

How can an RDD recover from failure? Consider both iterative algorithms and interactive computation.

If an interactive step fails, we can go back to the previous successful step, reapply the operation, and obtain the state of the data from the previous RDD.
For iterative algorithms, if only part of the data in memory is lost, we can load the corresponding chunk from stable storage to refresh that memory. The results from each iteration are also stored in memory.

This is the main idea behind Spark, which is a big data system built on top of RDDs.

Spark Stack

Spark Core contains the basic functionality of Spark, including components for scheduling, memory management, fault recovery, interacting with storage systems, and more.

Spark Stack

Spark Core is also home to the API for building and manipulating RDDs. The RDD is the main programming abstraction, representing a collection of data items distributed across many compute nodes that can be manipulated in parallel.

Spark provides many libraries built on top of Spark Core, such as:

  • Spark SQL: a package for working with structured data. It allows querying data with SQL-like syntax.
  • Spark Streaming: enables processing of real-time streaming data.
  • MLlib: provides many machine learning algorithms designed to scale out across a cluster of computers.
  • GraphX: the graph processing engine for Spark; it can manipulate large graph structures.

Under the hood, Spark is designed to scale efficiently from one machine to many machines. To achieve this flexibility, Spark can run over a variety of cluster managers, such as the Standalone Scheduler on a single machine, Hadoop YARN, and Mesos.

Spark Programming Interface

The core interface is written in Scala; however, Python and Java are also supported. There are three main APIs:

Spark programming interface.
  1. Resilient Distributed Datasets (RDD): this API allows the creation and manipulation of RDDs (a minimal example is sketched after this list).
  2. Operations on RDDs: transformations that turn one RDD into another, and actions that operate on an RDD to compute some result.
  3. Restricted shared variables: this API allows sharing a global variable across machines, such as broadcast variables and accumulators.
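
As a rough illustration of the first two APIs (a sketch under assumed data, not code from the article), an RDD can be created from a local collection or from a file in distributed storage, transformed, and then reduced to a result; the file path is hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDInterfaceSketch")

# Create an RDD from a local Python collection...
numbers = sc.parallelize([1, 2, 3, 4, 5])

# ...or from a (hypothetical) file in distributed storage.
lines = sc.textFile("hdfs:///data/notes.txt")

# A transformation (map) produces a new RDD; an action (reduce) returns a value.
squares = numbers.map(lambda x: x * x)
print(squares.reduce(lambda a, b: a + b))  # 55

sc.stop()
```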

RDD Transformations

Examples of RDD transformations:

RDD example of map and flatMap function

Operations: map() vs flatmap()

In the above example we are given an RDD, RDD1, which holds a list of symptoms, e.g. "sprained ankle", etc.

We want to tokenise this RDD. To tokenise a string is to split it into its list of words. Above we can see that when we apply rdd1.map(tokenise) we get mappedRDD, which is a set of lists, one for each element in the RDD.

flatMap instead flattens all the elements into a single list. In the above example we apply tokenise to RDD1 using the flatMap function, and the result is flatMappedRDD, one complete list of words. A sketch of both operations follows below.
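
A minimal PySpark sketch of the difference between map and flatMap, using made-up symptom data rather than the article's exact figures:

```python
from pyspark import SparkContext

sc = SparkContext(appName="MapVsFlatMapSketch")

def tokenise(text):
    """Split a symptom description into its words."""
    return text.split()

rdd1 = sc.parallelize(["sprained ankle", "broken arm", "sprained wrist"])

mapped = rdd1.map(tokenise)           # one list per element
flat_mapped = rdd1.flatMap(tokenise)  # all words flattened into one list

print(mapped.collect())       # [['sprained', 'ankle'], ['broken', 'arm'], ['sprained', 'wrist']]
print(flat_mapped.collect())  # ['sprained', 'ankle', 'broken', 'arm', 'sprained', 'wrist']

sc.stop()
```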

RDD example of distinct function.

Operations: distinct()

The result is to collect only the unique words into a new list stored in the resulting RDD.

RDD Example of union function

Another popular operation is: union()

Here we combine RDD1 with RDD2 using the union function. This is a relatively cheap operation.

RDD Example of intersection function

This function gives us only what is in both RDD1 and RDD2. The operation is expensive because it involves sorting, getting the distinct values, and then finding what overlaps between the two RDDs.

RDD Example of subtraction function.

This function tells us what is in RDD1 that does not exist in RDD2. It is expensive for reasons similar to the intersection function. A sketch of these set-like operations follows below.
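
A small sketch of these set-like operations on word RDDs; the data here is illustrative, not the article's figures:

```python
from pyspark import SparkContext

sc = SparkContext(appName="SetOpsSketch")

rdd1 = sc.parallelize(["ankle", "arm", "ankle", "wrist"])
rdd2 = sc.parallelize(["arm", "knee"])

print(rdd1.distinct().collect())          # unique words only, e.g. ['ankle', 'arm', 'wrist']
print(rdd1.union(rdd2).collect())         # cheap: concatenates the partitions
print(rdd1.intersection(rdd2).collect())  # expensive: requires a shuffle, e.g. ['arm']
print(rdd1.subtract(rdd2).collect())      # elements of rdd1 that are not in rdd2

sc.stop()
```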

Spark Operations

Spark Operations

Spark provides many operations, which are grouped into two main categories:

  1. Transformations: create a new RDD from an existing RDD by applying the kinds of transformations shown above.
  2. Actions: return a result to the driver program. Think of this like the reduce phase in Map Reduce.

These are just a selected few of the operations; there are many more. The sketch below shows the distinction between the two categories in practice.
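
The practical point worth seeing in code is that transformations are lazy (they only build up the lineage), while actions trigger the computation and return a result to the driver. This is a minimal sketch with made-up event data, not code from the article.

```python
from pyspark import SparkContext

sc = SparkContext(appName="TransformationsVsActionsSketch")

events = sc.parallelize(["flu", "flu", "fracture", "asthma"])

# Transformations: lazily describe new RDDs, nothing runs yet.
pairs = events.map(lambda d: (d, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Actions: trigger execution and return results to the driver.
print(counts.collect())   # e.g. [('flu', 2), ('fracture', 1), ('asthma', 1)]
print(events.count())     # 4

sc.stop()
```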

Shared Variable

The way to create a shared variable is to create a broadcast variable, which allows us to send a large read-only value to all worker nodes. A minimal sketch follows below.
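
A minimal sketch of creating and using a broadcast variable in PySpark; the lookup table and disease codes are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="BroadcastSketch")

# A large read-only lookup table we want every worker to have (illustrative data).
code_to_disease = {"E11": "type 2 diabetes", "I10": "hypertension"}
lookup = sc.broadcast(code_to_disease)

codes = sc.parallelize(["E11", "I10", "E11"])

# Workers read the broadcast value instead of receiving a copy with every task.
diseases = codes.map(lambda c: lookup.value.get(c, "unknown"))
print(diseases.collect())

sc.stop()
```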

Fault Tolerance

RDDs track the lineage information of all the transformations applied to them, so lost partitions can be recomputed efficiently.

Lineage information

In the above example we first apply a filter to the data and then a map. What actually happens in the background is that the data is read from HDFS and the filter is applied, producing a new RDD. If anything fails, we can re-read the data.

Thereafter the map function is applied to the filtered RDD; if there is any failure at this step, we can use the lineage to get the filtered RDD back and recompute the map portion.
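
A sketch of this filter-then-map pipeline; the file path and parsing are assumptions, and toDebugString prints the lineage Spark records, from which a lost partition can be recomputed starting at the HDFS data.

```python
from pyspark import SparkContext

sc = SparkContext(appName="LineageSketch")

# Hypothetical log file in HDFS.
lines = sc.textFile("hdfs:///data/events.log")

# filter produces a new RDD; map produces another one on top of it.
errors = lines.filter(lambda line: "ERROR" in line)
messages = errors.map(lambda line: line.split("\t")[-1])

# The lineage graph records textFile -> filter -> map, so any lost
# partition of `messages` can be recomputed from the HDFS data.
print(messages.toDebugString().decode())

sc.stop()
```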

Example Logistic Regression

Logistic Regression is an iterative algorithm used for classification. The goal of LR is to find the best line that separates two sets of points.

LR starts with a random line as the initial model. Then, on each iteration, we compute the gradient over all the data points and update the line by moving along the gradient. These steps are applied iteratively until the line converges.

Since the input and output are all kept in memory, these updates can be done very quickly.

LR Example

For example, the above image shows how LR can be programmed using Spark.

First we read the data and cache it in the variable data. Then we initialise the model parameters w.

Each iteration performs a map and a reduce step to do gradient descent: we compute the gradient and update the model parameters accordingly. A sketch of this pattern is given below.
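
This is a hedged PySpark sketch of the pattern, closely following the classic Spark logistic regression example; the input path, feature dimension D, number of iterations, unit step size, and the ±1 label encoding are all assumptions rather than the article's exact code.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="LogisticRegressionSketch")

D = 10            # assumed number of features
ITERATIONS = 20   # assumed number of gradient steps

def parse_point(line):
    # Assumed format: label (+1/-1) followed by D feature values.
    values = np.array([float(x) for x in line.split()])
    return values[0], values[1:]

# Read the training data once and cache it in distributed memory.
data = sc.textFile("hdfs:///data/lr_training.txt").map(parse_point).cache()

w = np.random.rand(D)  # initial model parameters

for i in range(ITERATIONS):
    # Map: per-point gradient of the logistic loss; Reduce: sum over all points.
    gradient = data.map(
        lambda p: (1.0 / (1.0 + np.exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient  # gradient step (unit step size for simplicity)

print(w)
sc.stop()
```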

Performance for Hadoop vs Spark for LR

The performance of Spark vs Hadoop for LR is outlined in the above image. We can see that as the iterations progress, Spark is more efficient.
The first iteration for Spark is expensive because of the caching step; thereafter it remains far more efficient than Hadoop.

Example Disease Risk Prediction

Let us look at a health care example.

We have a matrix R in which every row corresponds to a patient and every column to a disease. The goal is to predict disease risk based on existing disease diagnoses.

Example of Disease Risk Predictor

This is really a collaborative filtering problem. We can approach it by modelling R, which has many missing values, as the product of two matrices: a patient feature matrix A and a disease feature matrix B, respectively.

We can achieve this through the Alternating Least Squares (ALS) algorithm. In the next section we will go over how to achieve this using Spark.

ALS algorithm

Serial ALS

First we read the data matrix into R. Then we initialise A and B randomly. Then we update A and B alternately.

Program for ALS

We loop through all the patients, and for each patient i we update A and B alternately.

For A we update the patients. Let's assume we have U patients; for each patient we update the corresponding row based on matrix B and the input matrix R. Let us say we have an updatePatient function, which is a linear regression (least squares) function that performs the update.

This updates patient i. Now we want to update A and B alternately. We can see in the Spark code that A has up to U patients and B has up to M diseases, so these are the ranges used in the update statements. A serial sketch of this loop is given below.
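
A serial sketch of the alternating loop in Python. The matrix sizes, the latent dimension k, the random stand-in for R, and the update_patient / update_disease regularised least-squares helpers are all illustrative assumptions, not the article's exact program.

```python
import numpy as np

def update_patient(i, B, R, lam=0.1):
    """Least-squares update of patient i's feature vector, holding B fixed."""
    observed = ~np.isnan(R[i, :])          # diseases with a recorded value
    B_obs, r_obs = B[observed, :], R[i, observed]
    k = B.shape[1]
    return np.linalg.solve(B_obs.T @ B_obs + lam * np.eye(k), B_obs.T @ r_obs)

def update_disease(j, A, R, lam=0.1):
    """Least-squares update of disease j's feature vector, holding A fixed."""
    observed = ~np.isnan(R[:, j])
    A_obs, r_obs = A[observed, :], R[observed, j]
    k = A.shape[1]
    return np.linalg.solve(A_obs.T @ A_obs + lam * np.eye(k), A_obs.T @ r_obs)

U, M, k = 100, 50, 10                      # patients, diseases, latent features
R = np.random.rand(U, M)                   # stand-in for the real rating matrix
A = np.random.rand(U, k)
B = np.random.rand(M, k)

for it in range(10):                       # in practice, alternate until convergence
    A = np.vstack([update_patient(i, B, R) for i in range(U)])
    B = np.vstack([update_disease(j, A, R) for j in range(M)])
```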

Naive Spark ALS

The ALS algorithm we have seen in the previous two sections can be run in parallel.

Efficient ALS Example

The results of updating A and B can be collected and passed back to the driver node. The spark.parallelize call gives us this functionality.

The issue with this is that the big data matrix R has to be sent to all the worker nodes in each iteration. This can be costly and inefficient if the data matrix is large. The solution is to use broadcast: we mark R as a broadcast variable, which creates a read-only copy shared by the workers.
When we want to use the matrix R we access it with R.value. A sketch of this version is given below.
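
A sketch of the parallel version with R broadcast to the workers. It assumes the hypothetical update_patient / update_disease helpers from the serial sketch above are defined, and it uses sc.parallelize, the SparkContext method behind the article's spark.parallelize call; the matrix sizes and random data remain illustrative.

```python
import numpy as np
from pyspark import SparkContext
# Assumes update_patient and update_disease are defined as in the serial sketch above.

sc = SparkContext(appName="ParallelALSSketch")

U, M, k = 100, 50, 10
A = np.random.rand(U, k)
B = np.random.rand(M, k)

# Broadcast the (large, read-only) rating matrix once instead of
# shipping it to the workers with every task in every iteration.
R = sc.broadcast(np.random.rand(U, M))

for it in range(10):
    # Update all patient rows in parallel, then collect the results on the driver.
    A = np.vstack(
        sc.parallelize(range(U))
          .map(lambda i: update_patient(i, B, R.value))
          .collect()
    )
    # Then update all disease rows in parallel using the new A.
    B = np.vstack(
        sc.parallelize(range(M))
          .map(lambda j: update_disease(j, A, R.value))
          .collect()
    )

sc.stop()
```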

Broadcasting the data in this way has been shown to improve performance.

Quick Reminder: full course summary can be found on Big Data for Health Informatics Course

Hope you learned something.

-R
