Predictive modelling
Topics of concern are health care analytics and data mining: health care applications and health care data at the intersection of data science and big data analytics, and understanding the algorithms used to process big data.
This article forms part of a series of articles under the topic Big Data for Health Informatics Course.
You can visit the above link to understand this topic in the context of the full course; however, I will be discussing Predictive Modelling in a way that can be understood independently of the full course.
Note: This article requires knowledge of machine learning concepts.
Introduction
What is predictive modelling
Predictive modelling is the process of modelling historical data in order to predict future events. For example, we may want to use the EHR (Electronic Health Record) data we have available to build a model that predicts heart failure.
Key Goals of this article
How to develop a good predictive model.
We will use EHR data as a use case. The motivation for this is the rise in interest in EHR data as a major data source for clinical predictive modelling research; it is therefore important to learn how to develop a predictive model using EHR data.
The Predictive Modelling Pipeline
Predictive modelling is not a single algorithm but rather a computational pipeline that includes multiple steps:
- Prediction Target: At the first stage we determine the prediction we want to make, for example "How likely is a patient to develop lung cancer in the future?" Infinitely many targets exist, so we should select one that is both interesting and possible to answer.
- Cohort Construction: We then gather the relevant data; in our example we would need patient record data.
- Feature Construction: Next we define all the potentially relevant features for the study.
- Feature Selection: We then select only the relevant features that will help with the prediction target.
- Predictive Model: Now we can build the predictive model using various ML algorithms.
- Performance Evaluation: Finally, we evaluate the performance of the model.
This process is iterative and stops only when we are satisfied with the results.
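To make the pipeline concrete, here is a minimal sketch of the stages wired together in Python. Everything in it is a hypothetical placeholder for illustration, including the function names, the record fields and the trivial threshold "model":

```python
# Minimal sketch of the predictive modelling pipeline.
# All functions and fields are hypothetical placeholders, not a real library.

def construct_cohort(records):
    # Cohort construction: apply a toy inclusion criterion (age >= 40).
    return [r for r in records if r["age"] >= 40]

def construct_features(cohort):
    # Feature construction: derive candidate features per patient.
    return [[r["age"], len(r["visits"])] for r in cohort]

def select_features(features):
    # Feature selection: keep only the columns believed to be predictive
    # (here, arbitrarily, just age).
    return [[row[0]] for row in features]

def train_model(features, labels):
    # Predictive model: a trivial threshold "model" as a stand-in for a
    # real ML algorithm.
    threshold = sum(f[0] for f in features) / len(features)
    return lambda f: 1 if f[0] > threshold else 0

def evaluate(model, features, labels):
    # Performance evaluation: simple accuracy.
    correct = sum(model(f) == y for f, y in zip(features, labels))
    return correct / len(labels)

# Toy data: each record is one patient.
records = [
    {"age": 45, "visits": [1, 2], "heart_failure": 0},
    {"age": 70, "visits": [1, 2, 3, 4], "heart_failure": 1},
    {"age": 38, "visits": [1], "heart_failure": 0},  # excluded by cohort step
    {"age": 62, "visits": [1, 2, 3], "heart_failure": 1},
]

cohort = construct_cohort(records)
X = select_features(construct_features(cohort))
y = [r["heart_failure"] for r in cohort]
model = train_model(X, y)
print("accuracy:", evaluate(model, X, y))  # toy: evaluated on training data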
Heart Failure use case for Predictive Modelling
1. Defining the Prediction Target:
“Detecting heart failure”
The motivation for early detection of heart failure is that it is a complex disease. There is no widely accepted definition of the disease, and the complexity arises from its several etiologies, diverse clinical features and numerous clinical subsets.
If we can detect heart failure earlier, the short-term benefits include reducing patient hospitalisation, enabling early intervention and reducing mortality. In the long term, it can improve clinical guidelines for heart failure prevention.
2. Cohort Construction
Cohort construction is about defining the relevant subset of the patient population.
Within the entire target population there exists a subset of patients who are relevant to the study; these patients are called the study population. It is often impossible to obtain data for the whole study population, so we use a subset of it, referred to as the study dataset.
We can consider patients either prospectively or retrospectively, meaning patients who may experience heart failure in the future or those who already have.
Note: In a prospective study we identify patients first and then collect data about them going forward, whereas in a retrospective study we identify patients and then trace back through their historical records to collect data.
We can also design the study as either a cohort study or a case-control study. Together with the prospective/retrospective choice, these alternatives give us a 2×2 matrix of four possible study designs: prospective cohort, retrospective cohort, prospective case-control and retrospective case-control.
Note: A cohort refers to a group of people with shared characteristics. The key is to define inclusion and exclusion criteria.
A case-control study compares two groups of people: those with the disease under study (cases) and a very similar group of people who do not have the disease (controls).
For example, when we select patients for a cohort study we may target patients at risk of heart failure readmission. This dataset should contain both positive and negative examples.
For case-control studies we combine patients who have developed the disease with a control group of patients who have not.
To determine which patients form part of the control group, we create a set of matching criteria; this can include matching patients on age group, gender and location. In a case-control study we first identify the case patients and then find matching controls.
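As a rough illustration, here is a minimal sketch of 1:1 matching on gender and age; the record fields and the `match_controls` helper are hypothetical, not part of any standard library:

```python
# Hypothetical sketch: match each case to one control with the same
# gender and a similar age (within 5 years).

cases = [{"id": 1, "age": 67, "gender": "F"},
         {"id": 2, "age": 54, "gender": "M"}]
candidates = [{"id": 10, "age": 66, "gender": "F"},
              {"id": 11, "age": 52, "gender": "M"},
              {"id": 12, "age": 80, "gender": "M"}]

def match_controls(cases, candidates, max_age_gap=5):
    controls, used = [], set()
    for case in cases:
        for cand in candidates:
            if (cand["id"] not in used
                    and cand["gender"] == case["gender"]
                    and abs(cand["age"] - case["age"]) <= max_age_gap):
                controls.append((case["id"], cand["id"]))
                used.add(cand["id"])
                break  # one matched control per case (1:1 matching)
    return controls

print(match_controls(cases, candidates))  # [(1, 10), (2, 11)]
```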
3. Feature Construction
The goal of this step is to assemble potentially relevant features in order to predict the target outcome.
Raw patient data arrives as a sequence of events over time. Several key periods on this timeline are worth understanding in relation to feature construction:
- Diagnosis date: The date the target outcome occurred; in our case study group this is the date a patient was diagnosed with heart failure. Since control patients do not have a diagnosis date, we can assign each control the diagnosis date of its matched case.
- Prediction window: The period between the index date and the diagnosis date, i.e. how far into the future we are trying to predict the outcome.
- Index date: The date at which the predictive model makes its prediction about the target outcome; it lies before the diagnosis date.
- Observation window: A period of time prior to the index date during which we construct features. There are numerous features we may have access to; for example, we can gather lifestyle information, or gather patient clinical data and average it.
The lengths of the prediction window and observation window affect the model's performance. The most useful model is one that performs well with a large prediction window and a small observation window, because that means it can predict far into the future using relatively little historical data.
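To make the timeline concrete, here is a small sketch that keeps only the events falling inside the observation window relative to a chosen index date; the event codes, dates and window length are illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical event stream for one patient: (event_date, event_code).
events = [
    (date(2019, 1, 10), "bp_high"),
    (date(2019, 6, 2),  "diuretic_rx"),
    (date(2020, 2, 20), "echo_abnormal"),
]

index_date = date(2020, 3, 1)             # the prediction is made here
observation_window = timedelta(days=365)  # features come from the prior year

# Feature construction: only events inside [index_date - window, index_date)
# are eligible to become features; anything on or after the index date would
# leak information from the prediction window.
in_window = [(d, code) for d, code in events
             if index_date - observation_window <= d < index_date]

print(in_window)  # keeps the two events from the one-year window
```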
4. Feature Selection
In this step we look at the features constructed from raw data in the observation window. The goal of feature selection is to find the truly predictive features to include in the model; in other words, to select the subset of features that we believe are responsible for the target outcome. There are various feature types from which we can extract features in the observation period:
- Demographics
- Diagnosis
- Lab Results
- Vitals
- Medications
- Symptoms
However, not all of these feature types provide relevant data for the target, and the useful features may differ across target outcomes. Existing studies can assist in determining the valuable features; alternatively, since predictive modelling is an iterative process, you can determine which features yield the best results through trial and error.
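As one concrete option (by no means the only approach), scikit-learn's univariate selection can score candidate features against the target; a minimal sketch, assuming a toy feature matrix of diagnosis-code counts:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy feature matrix: rows are patients, columns are candidate features
# (e.g. counts of diagnosis codes seen in the observation window).
X = np.array([[3, 0, 1],
              [4, 1, 0],
              [0, 2, 5],
              [1, 3, 4]])
y = np.array([1, 1, 0, 0])  # 1 = developed heart failure, 0 = did not

# Keep the 2 features most associated with the target
# (univariate chi-squared test; requires non-negative features).
selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.get_support())       # boolean mask of selected columns
X_selected = selector.transform(X)  # reduced feature matrix
```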
5. Predictive Models
Building a predictive model means creating a function that maps the input features to the output target.
Based on the type of the target value, the model solves either a regression problem or a classification problem.
Regression problems are defined by a continuous target (y) value. Popular algorithms to solve these problems include linear regression and generalised additive models.
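For instance, here is a minimal linear regression sketch with synthetic data, where the continuous target might be something like length of hospital stay:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: columns could be age and number of prior visits;
# the target is continuous (e.g. length of stay in days).
X = np.array([[45, 2], [70, 8], [38, 1], [62, 6]])
y = np.array([2.0, 9.5, 1.0, 7.0])

reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[55, 4]])))  # predicted continuous value
```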
Classification problems are defined by the target (y) being categorical. Popular algorithms for solving these include logistic regression, support vector machines, decision trees, random forests, etc.
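Our heart failure use case has a binary target (develops heart failure or not), so it is a classification problem; here is a minimal logistic regression sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: columns could be e.g. age and visit count.
X_train = np.array([[45, 2], [70, 8], [38, 1], [62, 6], [55, 3], [75, 9]])
y_train = np.array([0, 1, 0, 1, 0, 1])  # 1 = heart failure, 0 = no heart failure

model = LogisticRegression().fit(X_train, y_train)

# Predict for a new patient.
new_patient = np.array([[68, 7]])
print(model.predict(new_patient))        # predicted class label
print(model.predict_proba(new_patient))  # [P(no HF), P(HF)]
```

Logistic regression is a common first choice in clinical settings because `predict_proba` yields a risk score rather than only a hard label.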
6. Performance Evaluation
This is the final step of the predictive modelling pipeline. To assess how good our model is, we train it on some samples and test it on unseen samples, ideally from future data.
How to measure the performance of a model:
- Training error: this is not a very useful measure of a predictive model's performance, as we can easily overfit the data with a complex model that does not generalise well to future samples.
- Testing error: this is the key metric, as it is a much better approximation of the model's true performance on future samples.
Cross validation (CV) uses the testing error to measure the performance of the model. CV iteratively splits the data into training and testing sets: we build the model on the training set and validate it on the testing set. This is performed iteratively, and finally the testing errors seen across iterations are averaged into a single performance metric. There are three common methods to perform cross validation:
- Leave-1-out CV: we use a single data entry as the test sample and the remaining data for training. We iterate through the data, making each entry the testing sample once, and average the testing error across all iterations.
- K-fold CV: this is similar to leave-1-out, except we divide the data into K folds, and each fold takes a turn as the testing set. For example, with 10 data entries and K=2 we create 2 folds of 5 samples each (10/2), then average the testing error over the folds/iterations.
- Randomised CV: we randomly split the data into training and testing data, and average the testing error over all the splits. The advantage of randomised CV over K-fold is that the proportion of training to testing data does not depend on the number of folds. The disadvantage is that some of the data may never be selected into a testing set. A sketch of both methods follows this list.
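Here is a minimal sketch of K-fold and randomised CV using scikit-learn, reusing the synthetic data from the classification sketch above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, ShuffleSplit

X = np.array([[45, 2], [70, 8], [38, 1], [62, 6], [55, 3], [75, 9]])
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression()

# K-fold CV: 3 folds, each used once as the test set; scores are averaged.
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("k-fold mean accuracy:", kfold_scores.mean())

# Randomised CV: repeated random train/test splits, with a test proportion
# chosen independently of any fold count.
shuffle_scores = cross_val_score(
    model, X, y, cv=ShuffleSplit(n_splits=5, test_size=0.33, random_state=0))
print("randomised mean accuracy:", shuffle_scores.mean())
```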
Note: some people use the terms validation set and testing set interchangeably.
Quick Reminder: the full course summary can be found at Big Data for Health Informatics Course.
Hope you learned something.
-R