Ensemble Methods
Topics of concern is health care analytics and data mining. Health care applications and health care data intersected with data science and big data analytics. Understanding algorithms for processing big data.
This article forms a part of a series of articles under the topic Big Data for Health Informatics Course
You can visit the above link to understand this topic in context of the full course. This article on ensemble methods can be understood independent of the full course.
Note: This article requires knowledge of machine learning concepts.
Introduction
The last time we covered metrics for classification and regression algorithms. In this article we will focus on how to develop these models using big data. We will cover:
- Algorithms used in big data analytics (like stochastic gradient descent)
- Ensemble methods as a powerful class of machine learning algorithms that lead to the best performing models
- Two criteria; bias and variance and the trade-off between them
- Bagging and boosting
Gradient Descent
This is a basic optimisation method that has been widely used in machine learning and data mining applications. How to apply gradient descent method for classification and regression problems?
The input to any classification or regression problem is a training dataset that consists of N pairs of data points X and the corresponding target Y and the final output of the model is usually some parameter say theta.
For example the theta in linear regression are the linear coefficients to each feature in the input data set.
Gradient Descent Method (GDM) for Classification
Gradient descent for classification at a high level consists of three steps.
The first step is to determine some likelihood function of the input data D given theta. The likelihood function is a fundamental concept in statistical inference. It indicates how likely a particular population is to produce an observed sample. This is the joint distribution of the data points given the model parameter.
It is easier to use log likelihood function instead of likelihood function. The log transformation is monotonically (in other words always) increasing and leads to the same optimal as the likelihood function but the resulting terms are easier to manipulate.
The second step is to find the derivative, also called the gradient of (log) likelihood function given theta. This is a crucial step and most of the computation often occurs here. Reminder the derivative function tells you the rate of change of f for any given x, which is equivalent to telling you the slope of the graph of y for any given x. When the derivative is positive, the function is increasing. When the derivative is negative, the function is decreasing.
The third step occurs after we have computed the gradient. We will update theta. We will do this by moving theta towards the direction of the gradient (in other words theta + derivative value will be the new theta)
These three steps are repeated until theta converges this means, the theta is no longer changing. This process is the high level illustration of gradient descent.
The algorithm itself contains extra tuning parameters for example the learning rate (alpha) which controls how much we update the theta based on the gradient. This can be learned through cross validation.
Gradient descent is the simplest gradient based optimisation algorithm
GDM for Linear Regression
To illustrate gradient descent for linear regression. Assume we are given a dataset D and every row n corresponding to a patient and every patient is characterised by a set of features X and the target outcome variable Y. Let us assume this is a cost the patient incurs.
The goal of this model is to map the input feature X to the the output cost Y.
The first step is to specify the log likelihood function for linear regression. This function is depicted in the diagram that follows. This data assumes a Gaussian distribution. The log likelihood function involve some constants which we subtract a sum of squared terms (in the diagram below). This summation term is the sum over all patients. For each patient i in the summation we will take the true cost y subtract what the model predicts (times beta, which is the coefficients)
The intuition is we want to learn a linear model that convert input features such as age, gender etc to the target and in this example the target is the cost this patient incurred.
Once we have the log likelihood function we will calculate the derivative to find the gradient of this function. This is the second step. The intuition here is that we have a set of parameters one for each feature and we want to know the gradient of each feature.
In the diagram above the derivative function defined is the summation of all patients in which we compute a linear term. This term is the difference between the target and the features multiplied by one of the features, we will perform this calculation for each feature.
The motivation for the computation is to understand the the importance or weight of a specific feature; example age, if we want to know how important a feature like age in this linear regression model. We traverse the entire dataset and compute the derivative of the age feature.
Once we have this gradient calculated we can move the the final step which is updating the coefficients by moving towards the gradient direction. Note in this diagram above the update function has the step size defined (n squared_. This is important in understanding how much we move towards the gradient by.
The above three steps are repeated until the beta converges; that is beta is no longer changing in the update model. Beta is the coefficients of the linear regression model.
Stochastic Gradient Descent (SGD) Method
This is a variant of gradient descent for handling big data sets. In the traditional gradient descent method we first compute the likelihood function, then the derivate (the gradient) and the perform the update function iteratively until this converges. This is done for the entire data set and many times.
This process is too expensive for big data sets and this is where SGD method can be used to handle this deficiency.
With SGD method instead of computing the likelihood function over the entire dataset, the idea is to compute the likelihood and gradient function on a random (stochastic) subset of the data. This is subset update is referred to as a mini batch update.
The data is distributed into many subsets (or partitions) and all can be computed in parallel since there is not dependency on the data. This concurrency makes SGD more favourable over GDM.
SGD for Linear Regression
How can we use SGD for linear regression. Assume we are using the same dataset as before. Every row is a patient, every patient has features such as age, gender etc and the target y is the cost they will incur.
This is the same process as mentioned before; first we compute the log likelihood followed by the gradient function and then the update function. The only difference here is that we do this process for one patient at a time and then once the three steps are completed we perform the same steps for another patient.
The key difference here is that we do not need to do each step for all the patients, this makes it computationally easier as well.
Ensemble Methods
This is a popular method for classification. Ensemble methods combine multiple methods to produce a better model.
Ensemble method is a group of base learning algorithms (such as random forests). An ensemble method more generally refers to combining models using different algorithms. They often outperform individual classification models.
Ensemble Method Steps
First we are given an input dataset which we derive new datasets for subsequent model training. These datasets can be easier generated independently or sequentially.
Second, each dataset will be used to train a separate model. Those models can be independently trained among the models or dependencies can exist.
Lastly we construct an aggregation function to combine the result from all the models. This aggregation function can be as simple as taking the average of the models or a weighted average of the models.
Since each step can be performed in more than one way, this results in varying combinations of ensemble methods that can be developed.
Bias and Variance
In order to explain how Ensemble methods work we need to understand Bias and Variance.
Bias is the prediction error for wrong modelling assumption. For example if the model assumes a linear relationship between the input and target, however in reality the data does not follow this linear trend then difference will be due to the bias of the model.
Variance refers to the error from the sensitivity to small fluctuation in the training data set.
Ideally we want a model to have both low bias and low variance. Illustrated in the following example.
The x axis shows high and low variance and the y axis shows the high and low bias. The red centre in each figure indicates the ground truth target and each blue dot represents the prediction of the model.
- High bias and low variance: the model prediction are clustered in one position but not on the target. In this case the model can be easily fixed by shifting all the prediction.
- High bias and high variance: the models prediction are scattered and away from the target. This is the worst possible situation.
- Low bias and low variance: all the model predictions are concentrated on the target. This is the optimal solution.
- Low bias and high variance: the models prediction are scatter around the centre.
Bias Variance Trade-off
What is the relationship between these two model performance outcomes? As the complexity of the model increases bias decreases. This is because a flexible or complex model can often fit the data better. However error from variance will increase as complexity increases. This is because a complex model is more susceptible to overfitting.
The total error is the error due to bias plus the error coming from the variance. Our model requires the optimal complexity that balance the right amount of bias and variance and lead to the minimal error.
This will change from each task and dataset. So we cannot find the best performing algorithm for every dataset in all cases. Therefore its important to understand the trade-offs and search for the best model for your task.
In order to understand model complexity better and the trade-off between variance and bias the above graph provides some key details when comparing these 4 linear regression models.
D is the most simple model, but it does not fit the data well as you can see there is a high variance.
A is also a simple model, but more complex than D. A is also a worse performing model compared to B and C
B is a more complex model, and the data seems to have low variance and low bias. Making B the best model
C overfits the data and has the most complex model.
Bagging
Bagging is a popular ensemble method. It is simple yet powerful ensemble strategy.
Given an input dataset the idea behind bagging is to take repeated bootstrap samples from input data to generate sample dataset. Here bootstrap sampling means sampling with replacement from the original dataset.
Based on the sample dataset we train separate models. Finally we classify a new data point by taking a majority vote or average of these models
Random Forest — a classical bagging algorithm
In a random forest the underlying models are Decision trees. How does random forest work?
Imagine we have a large database with patient data again n patients with x number of features. In our example patient features are demographic data etc.
The random forest will randomly sample subset of patient and a subset of their features to construct a sub-matrix. These sub-matrices will be used to build decision trees.
The decision trees are constructed by recursively selecting a feature from the sub-matrix and then split based on that feature until all the features are used.
This generated multiple trees, when we want to make a prediction based we process the patient data through each tree and then take the average of all trees.
Why does Bagging work
Bagging reduces variance of the final model without increasing bias.
Since the final model is to take the mean over multiple models thus the mean of the final model is the same as the mean of the individual models. This suggests that the bias is the same.
However the variance of the model can be reduced by averaging. This intuition comes from simple statistics.
The Var(X) variance of a random variable X can be reduced by a factor of T if we measure the variance of the mean over T independent identically distributed (IID) samples of x.
In bagging all the models are created from bootstrap samples which can be considered as IID therefore the variance of the final model can be greatly reduced by averaging
Boosting
Boosting involves incrementally building the model one at a time and emphasising the later model to the training instances that the previous model misclassified.
For example we start building model M1 with the first dataset D1. Then we test the performance of model M1 on dataset D2. Based on the misclassification that occurred on D2 we train another model M2 to correct these.
Then we repeat that process, by combining M1 and M2 together and testing against D3. Then we create M3 which will correct the mistakes of the combined model.
Ultimately incrementally improving the model as we train and test new datasets.
The final model becomes a weighted average of the trained models. In some cases boosting has shown to yield better accuracy than bagging. Boosting also is more likely to overfit the training data. The most popular boosting algorithm is Ada-Boost.
Comparing Bagging and Boosting
- Combining models
- Bagging: we use simple average to combine all the models because the models are developed independently
- Boosting: we use weighted average because there is a sequential dependency between all those models. This is because the order matters. - Parallel computing
- Bagging: This is easily achieved because the models are independent, we can use different machines or cores.
- Boosting: This is harder since the output of one model could be the input into another model and so we cannot take full advantage of concurrency. - Sensitivity to noise in the data:
- Bagging: less sensitive to noise, the models being combined are often much more diverse since they are built independently on separate samples.
- Boosting: is more sensitive to noise, because of the sequential dependency and the later models which try to correct the errors made by earlier models. This can result in overfitting the data. - Accuracy of models
- Bagging: often achieve good performance in all cases
- Boosting: can perform better in most cases, however it can perform much worse in some cases due to overfitting.
Advantages and Disadvantages of using Ensemble methods
Advantages include:
- Ensemble methods are often simple
- Almost no parameters (except T, how many models to be combined)
- Flexible as you can combine various algorithms in various ways
- Theres also theoretical guarantee why it works
Disadvantages include:
- This stems mostly from computational challenges because it is expensive to compute multiple models
- Both training and scoring need to deal with multiple models
Quick Reminder: full course summary can be found on Big Data for Health Informatics Course
Hope you learned something.
-R