MapReduce

Rezana Dowra
10 min read · Aug 11, 2022


The topics of concern are health care analytics and data mining: health care applications and health care data intersecting with data science and big data analytics, and understanding algorithms for processing big data.

This article forms part of a series of articles under the topic Big Data for Health Informatics Course
You can visit the above link to understand this topic in the context of the full course. This article outlines MapReduce in a way that can be understood independently of the full course.

Note: This article requires knowledge of machine learning concepts

Introduction

We have previously covered Predictive Modelling on Big Data. MapReduce is a programming model for processing Big Data, and a powerful system that can be used to build predictive models on Big Data.

Key goals of this article

  • Understanding what Hadoop is
  • Understanding what MapReduce is and how the system works
  • Understanding how MapReduce takes care of fault tolerance in a distributed environment
  • Highlighting analytics that can be performed with MapReduce
  • Noting limitations to be aware of when using MapReduce

Hadoop MapReduce

What is Hadoop?

Hadoop is an ecosystem that provides a reliable, scalable environment ready for distributed computing.

  1. It is a programming model for developers to specify parallel computation algorithms.
  2. It is an execution environment. Hadoop is the Java implementation of MapReduce and the Hadoop Distributed File System (HDFS).
  3. It is also a software package.

What is MapReduce?

MapReduce is a submodule of Hadoop: a programming model used to process huge datasets that sit on the Hadoop Distributed File System (HDFS).

There are many software tools that ease the development effort for data science tasks on Hadoop, such as data processing, the extract-transform-load (ETL) process, statistics computation and analytic modelling.

Hadoop and MapReduce enable a powerful Big Data ecosystem by providing the combination of the following capabilities:

  1. Distributed computation through the MapReduce programming interface
  2. Distributed storage for large data sets through the Hadoop Distributed File System (HDFS)
  3. Fault tolerance in order to cope with the many possible points of failure that may occur

The Computational Process

A brief history: MapReduce was developed by Google in 2004 as proprietary software to support their search engine activities. Thereafter Apache Hadoop was developed as open source software that mimics Google's original MapReduce. Hadoop was written in Java for distributed storage and distributed processing of large datasets.


To program on Hadoop we have to use the MapReduce programming abstraction, which provides a very limited but powerful programming paradigm for parallel processing of large data sets.

All the algorithms running on MapReduce or Hadoop have to be specified as MapReduce programs. This constraint exists to support scalable, parallel and fault tolerant implementations of data processing algorithms.

To utilise Hadoop MapReduce for data analytics we have to understand and master common patterns of computation using MapReduce.

Machine Learning via Aggregation Statistics

The pattern for writing a data mining or machine learning algorithm using Hadoop is to express the algorithm as the computation of aggregation statistics. Assume we want to implement a machine learning algorithm for “identifying the most common risk factors in heart failure”.

We need to decompose the algorithm into a smaller set of computational units. We can achieve this by specifying a map function f, where f is applied to the heart failure patients in a large dataset.

Aggregating the results via a reduce function.

The results from the map function f are then aggregated by a reduce function, which performs the aggregation statistic on them.

MapReduce abstraction

Assume we have a large database of patients stored in the Hadoop Distributed File System. In HDFS we store each patient as a separate record, and each record contains the individual patient history data we want to analyse: disease diagnoses, medications, procedures, clinical notes and so on.

In order to obtain all diseases from a list of patients we can write a MapReduce program that searches each record and pulls out a list of diseases.

To write a MapReduce program we need to specify the map function f and the reduce function; the MapReduce abstraction is implemented in this two-step process.

Code for the map function f

In the first step the map function f searches each patient record and outputs a set of key-value pairs. In our case the format is <disease, 1>.
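As a rough sketch, a Hadoop mapper for this step could look like the following (the record layout, with one patient per line and the diseases stored as a comma-separated field, is an assumption made purely for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map step: for each patient record, emit one <disease, 1> pair per diagnosed disease.
    public class DiseaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text disease = new Text();

        @Override
        protected void map(LongWritable offset, Text patientRecord, Context context)
                throws IOException, InterruptedException {
            // Assumption: the record lists the patient's diseases as a comma-separated field.
            for (String d : patientRecord.toString().split(",")) {
                disease.set(d.trim());
                context.write(disease, ONE); // key-value pair <disease, 1>
            }
        }
    }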

All the output from the map function f is then processed internally by Hadoop: it is shuffled, and pairs with the same key are combined.

Shuffling and combination phase

In the above image the result of the shuffle and combine phase is shown on the right. This concludes the first step of the two-step process: the map step.

In the second step we perform the reduce step. The output from the map function f forms the input to the reduce function, which takes the disease key and a list of disease values (counts in our code example).

Code for the reduce function which sums the instances of a disease
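A minimal sketch of the corresponding Hadoop reducer could look like this:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce step: for one disease key, sum the 1s emitted by the mappers to get
    // the frequency of that disease across all patient records.
    public class DiseaseReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text disease, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(disease, new IntWritable(sum)); // e.g. <hypertension, 42>
        }
    }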

Running the reduce function creates new key-value pairs in which the key is the disease and the value is the frequency of that disease in the patient records. This concludes the two-step process.

Output of the reduce function after aggregating the number of times each disease appears in the patient records.

The two-step process outlined may seem trivial at first; however, it is crucial when dealing with large data sets that contain billions of records.

MapReduce System

We have looked at the high-level ideas on how to write MapReduce programs and touched on how they work. Here we will look into how MapReduce works and why it requires the two-step process outlined previously.

The reason for this process is that in reality we deal with large data sets which are too big to be processed on a single machine. The data itself is stored in partitions, and each partition can be processed in parallel on multiple machines. MapReduce systems have two components: mappers and reducers.

Each mapper handles one partition, a subset of the records in the entire data set, and this processing occurs in parallel. The mapper iterates through each record, and its output is combined and shuffled, creating intermediate results that are sent to the corresponding reducer.

Reducers divide the work by processing the intermediate results from the mappers. A reducer's responsibility is determined by its definition, and reducers can perform different functions in parallel.

Once a reducer receives the intermediate results from the mappers, it can start generating the final output by applying the reduce function.

MapReduce System outlining three distinct phases.

To summarise, the MapReduce system has three distinct phases.
1. Phase 1, the Map Phase, where we perform the map function together with the combine and pre-aggregation functions.

2. Phase 2, the Shuffle Phase, where the intermediate results are sent to the corresponding reducers.

3. Phase 3, the Reduce Phase, where the final output is generated.
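As a rough sketch of how these phases are wired together in Hadoop's Java API, a driver program registers the mapper, an optional combiner for the local pre-aggregation in the Map Phase, and the reducer (DiseaseMapper and DiseaseReducer are the illustrative classes sketched earlier):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DiseaseCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "disease count");
            job.setJarByClass(DiseaseCountDriver.class);

            job.setMapperClass(DiseaseMapper.class);    // Phase 1: map
            job.setCombinerClass(DiseaseReducer.class); // Phase 1: local pre-aggregation
            job.setReducerClass(DiseaseReducer.class);  // Phase 3: reduce
            // Phase 2 (shuffle) is handled by the framework between map and reduce.

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // patient records on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // disease counts output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }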

MapReduce Fault Recovery

One of the key functionalities provided by Hadoop is fault tolerance: in a large cluster environment there are multiple points of failure. A mapper can fail to combine and aggregate, or a reducer may fail, ultimately resulting in incorrect outcomes that directly impact our research.

In a large distributed environment, failure and recovery are expensive for us to debug and resolve, so it is important that the system is built to cater for the numerous points of failure. Hadoop can restart failed steps and recompute the desired results in a way that minimises re-computation.

Distributed File Systems

A crucial component of Hadoop is the Distributed File System. To illustrate the idea behind HDFS: given a large enough file, it is impossible to store it on a single machine.

We therefore split the file into partitions and store them on different machines called workers. The system stores multiple copies of the same partition on different workers; this is how a large file is distributed.

The key benefits of this design are:

  1. Multiple partitions of a single file can be accessed concurrently. The read speed is faster than accessing the whole file on a single machine, improving performance and scalability.
  2. The other important benefit is fault tolerance: if any of the workers fail we can still retrieve the large file, since the data is duplicated and distributed. This increases the reliability and availability of the data.

HDFS is the backend file system that stores all the data we want to process using MapReduce programs.

MapReduce Design

The key motivation in the MapReduce design is to minimise functionality and keep it simple, in order to make the system more reliable and more scalable. At the same time, this minimal functionality still enables powerful computational algorithms such as machine learning.

Limitations vs Benefits

Given a machine learning algorithm, there are some limitations to using MapReduce. These include:

  • Inability to directly access the data, due to the huge processing requirements
  • The need to specify computation as map and reduce functions, in a very restricted form: the aggregation query

Despite these limitations MapReduce is still a well-suited platform for implementing machine learning algorithms. Together with Hadoop, the benefits include:

  • A distributed file system and computation system.
  • Ensured fault tolerance.
  • Straggler mitigation, an extension of fault tolerance: the system identifies slow workers, restarts the same computation on a different worker and allows both workers to run; whichever worker finishes first has its result processed, and the other worker is stopped.

Analytics with MapReduce

Classification with KNN (K-Nearest Neighbour): how to use MapReduce to write a KNN classifier.
Assume we want to predict whether a patient will have heart failure.

How to create a KNN classifier with MapReduce

In the above diagram all patients are plotted on the graph, with the x-axis representing cholesterol and the y-axis representing blood pressure.
All points are marked with a 1 or a 0: patients marked with a 1 have been diagnosed with heart failure and those marked with a 0 have not.

Assume we want to determine whether a new patient (marked in red) will have heart failure or not.

Since we are using MapReduce to achieve this, we need to create the map and reduce functions. We know that in Hadoop the data is distributed, which means the data is partitioned. The green and brown colours over the patients indicate the partition in which each data point falls.

Assume we want to determine the KNN for k = 3. We can achieve this by writing a map function which produces the k nearest neighbours for each partition of data. The result of this map function is a local KNN result. This is shown on the diagram by the green and brown circles encompassing the 3 data points closest to the red query point in each of the green and brown partitions.

In order to obtain the global neighbours and the correct classification, which considers all the data in all partitions, we write a reduce function.
The reduce function takes the 3 closest points to the red query point from each partition of data and then produces a new set of the 3 closest points overall, which is the global result. This is depicted by the yellow circle.
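A simplified sketch of this map and reduce logic in plain Java (not a full Hadoop job; the two features, k = 3 and the majority vote follow the example above, while the Patient record and method shapes are assumptions for illustration):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of kNN over partitioned data: the map keeps the k points nearest to the
    // query within one partition, and the reduce merges these local results into the
    // global k nearest neighbours and classifies the query point by majority vote.
    public class KnnSketch {
        record Patient(double cholesterol, double bloodPressure, int heartFailure) {}

        static final int K = 3;

        static double distance(Patient p, double qChol, double qBp) {
            double dx = p.cholesterol() - qChol;
            double dy = p.bloodPressure() - qBp;
            return Math.sqrt(dx * dx + dy * dy);
        }

        // Shared helper: the K points in 'points' closest to the query (qChol, qBp).
        static List<Patient> nearest(List<Patient> points, double qChol, double qBp) {
            Comparator<Patient> byDistance =
                    Comparator.comparingDouble(p -> distance(p, qChol, qBp));
            List<Patient> sorted = new ArrayList<>(points);
            sorted.sort(byDistance);
            return sorted.subList(0, Math.min(K, sorted.size()));
        }

        // Map: local k nearest neighbours for one partition of the data.
        static List<Patient> map(List<Patient> partition, double qChol, double qBp) {
            return nearest(partition, qChol, qBp);
        }

        // Reduce: merge the local candidates from all partitions, keep the global k nearest
        // and predict 1 (heart failure) if the majority of them are labelled 1.
        static int reduce(List<List<Patient>> localResults, double qChol, double qBp) {
            List<Patient> candidates = new ArrayList<>();
            localResults.forEach(candidates::addAll);
            int positives = 0;
            for (Patient p : nearest(candidates, qChol, qBp)) {
                if (p.heartFailure() == 1) positives++;
            }
            return positives > K / 2 ? 1 : 0;
        }
    }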

Linear Regression: we can also use MapReduce to implement linear regression. As an example, we want to build a linear regression model that maps patient information to heart disease risk, and we want to find the coefficients for all of the patient features.

In the diagram below the brown block depicts the patient input matrix X, which is n by d: n is the number of patients and d is the number of features. The red block is the output target variable Y, which is n by 1; every row in the output target corresponds to the heart disease risk associated with that patient. Finally, the green block is a d by 1 matrix holding the coefficients (or weights) of the input matrix.

Linear Regression Model

How can we build this model using MapReduce? Statistically, we can solve the problem by computing the optimal coefficients using maximum likelihood estimation: the coefficient vector is (XᵀX)⁻¹XᵀY, the inverse of X transpose X multiplied by X transpose Y.

Using statistical method MLE to compute linear regression coefficients.

With large datasets this computation cannot be done on a single machine, so we can write it as a MapReduce program. To do this we can create two mappers:

  1. Mapper 1: computes the partial sums of XᵀX (the inverse is taken once the sums have been combined)
  2. Mapper 2: computes the partial sums of XᵀY

We can create a reduce function that sums the outputs of the mappers, and the same reduce function can be used for both.

Linear regression solution using MapReduce
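A simplified sketch of this logic in plain Java (again not a full Hadoop job; the row-block partitioning of X and Y and the final single-machine solve by Gaussian elimination are assumptions made for illustration):

    import java.util.List;

    // Sketch of linear regression via MapReduce: each map processes one partition of the
    // n x d patient matrix X and the matching slice of the n x 1 target Y, emitting the
    // partial sums X^T X (d x d) and X^T Y (d x 1); the reduce adds the partial sums, and
    // the small d x d system is then solved on a single machine.
    public class LinearRegressionSketch {

        // Map (per partition): accumulate X^T X over the rows of this partition.
        static double[][] partialXtX(double[][] xPart) {
            int d = xPart[0].length;
            double[][] xtx = new double[d][d];
            for (double[] row : xPart)
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++)
                        xtx[i][j] += row[i] * row[j];
            return xtx;
        }

        // Map (per partition): accumulate X^T Y over the rows of this partition.
        static double[] partialXtY(double[][] xPart, double[] yPart) {
            int d = xPart[0].length;
            double[] xty = new double[d];
            for (int r = 0; r < xPart.length; r++)
                for (int i = 0; i < d; i++)
                    xty[i] += xPart[r][i] * yPart[r];
            return xty;
        }

        // Reduce: element-wise sum of the partial d x d matrices from all partitions
        // (the same idea applies to the partial d x 1 vectors).
        static double[][] sum(List<double[][]> parts) {
            int d = parts.get(0).length;
            double[][] total = new double[d][d];
            for (double[][] p : parts)
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++)
                        total[i][j] += p[i][j];
            return total;
        }

        // Final step on one machine: solve (X^T X) theta = X^T Y by Gaussian elimination
        // with partial pivoting (d is small, so this is cheap). Mutates its arguments.
        static double[] solve(double[][] a, double[] b) {
            int d = b.length;
            for (int col = 0; col < d; col++) {
                int pivot = col;
                for (int r = col + 1; r < d; r++)
                    if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
                double[] rowTmp = a[col]; a[col] = a[pivot]; a[pivot] = rowTmp;
                double bTmp = b[col]; b[col] = b[pivot]; b[pivot] = bTmp;
                for (int r = col + 1; r < d; r++) {
                    double factor = a[r][col] / a[col][col];
                    for (int c = col; c < d; c++) a[r][c] -= factor * a[col][c];
                    b[r] -= factor * b[col];
                }
            }
            double[] theta = new double[d];
            for (int r = d - 1; r >= 0; r--) {
                double s = b[r];
                for (int c = r + 1; c < d; c++) s -= a[r][c] * theta[c];
                theta[r] = s / a[r][r];
            }
            return theta;
        }
    }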

Logistic Regression as a Limitation to MapReduce

Logistic regression requires us to find the optimal line to classify the data, using gradient descent.

When we write this as a MapReduce program, we have to specify a map function that computes the gradient for each patient and apply it to each data record. We then aggregate these gradients in the reduce function.
Then we update the parameters of the logistic regression and repeat. Every iteration requires us to load the data again, and MapReduce is not optimised for this kind of iterative, multi-stage computation.
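A sketch of this iterative structure in plain Java makes the limitation concrete (the learning rate, fixed iteration count and plain batch-gradient update are illustrative assumptions); in a real Hadoop setting, each pass of the outer loop would be a separate MapReduce job that re-reads the data from HDFS:

    import java.util.List;

    // Sketch of logistic regression trained by gradient descent over partitioned data.
    public class LogisticRegressionSketch {

        static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        // Map (per partition): sum the gradient contributions of this partition's records
        // for the current weights.
        static double[] partialGradient(double[][] xPart, double[] yPart, double[] w) {
            double[] grad = new double[w.length];
            for (int r = 0; r < xPart.length; r++) {
                double pred = sigmoid(dot(xPart[r], w));
                for (int j = 0; j < w.length; j++)
                    grad[j] += (pred - yPart[r]) * xPart[r][j];
            }
            return grad;
        }

        // Driver: every iteration needs a fresh map/reduce pass over all partitions,
        // which is the multi-pass pattern MapReduce is not optimised for.
        static double[] train(List<double[][]> xParts, List<double[]> yParts,
                              int d, int iterations, double learningRate) {
            double[] w = new double[d];
            for (int it = 0; it < iterations; it++) {
                double[] grad = new double[d];
                for (int p = 0; p < xParts.size(); p++) {                            // map
                    double[] part = partialGradient(xParts.get(p), yParts.get(p), w);
                    for (int j = 0; j < d; j++) grad[j] += part[j];                  // reduce: sum
                }
                for (int j = 0; j < d; j++) w[j] -= learningRate * grad[j];          // update
            }
            return w;
        }
    }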

MapReduce Summary

  1. MapReduce is best for single-pass computations such as computing histograms, but is not good for multi-pass, iterative algorithms such as logistic regression.
  2. It is good for uniformly distributed keys. If the key distribution is skewed, the work is overloaded onto one reducer rather than being uniformly distributed across many reducers.
  3. It is good when no synchronisation is required. During the map phase every task is processed independently and in parallel, and the same applies to the reduce phase.

Quick Reminder: the full course summary can be found at Big Data for Health Informatics Course

Hope you learned something.

-R
