Computational Phenotyping
Topics of concern are health care analytics and data mining: health care applications and health care data at the intersection of data science and big data analytics, and understanding algorithms for processing big data.
This article is part of a series of articles under the topic Big Data for Health Informatics Course.
You can visit the link above to understand this topic in the context of the full course. This article on computational phenotyping can also be understood independently of the full course.
Note: This article requires knowledge of machine learning concepts.
Introduction
We are going to present a healthcare application of clustering called phenotyping. A phenotype is a medical concept such as a disease or a condition. We know many phenotypes of patients from existing medical knowledge, such as major diseases; however, there are many more phenotypes and subtypes out there that have not yet been discovered.
Computational phenotyping is a way to use data available to us to discover those novel phenotypes.
Phenotypes aren't reserved for disease diagnosis. We can also use them for predicting health care costs and readmission risk, supporting genomic studies, and so on.
Computational Phenotyping
Computational phenotyping extracts phenotypes from Electronic Health Records (EHR).
It converts raw electronic health records, through phenotyping algorithms, into a set of meaningful medical concepts.
For example, a phenotype can be a disease such as type 2 diabetes. The raw data comes from many different sources, such as patient demographics, diagnosis codes, medication information etc.
There are many reasons why phenotypes are not represented consistently or reliably in the raw data.
- The data can be noisy and incomplete, and its main purpose is to support clinical administrative operations such as billing.
- The data is not designed to support research.
- There is overlapping and redundant data.
Phenotyping is the process of deriving research grade phenotypes.
Phenotyping algorithm for Type 2 Diabetes
The goal here is to determine whether a patient has type 2 diabetes from the EHR data.
The input into the algorithm is the EHR data of the patient. We analyse the EHR data for each step in the workflow. There are many paths that can result in a case for the disease. This decision flow is a phenotyping algorithm.
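To make the idea of a decision flow concrete, here is a minimal sketch of a rule-based phenotyping function. It is not the workflow described in this article: the `PatientRecord` fields, the medication list, the HbA1c threshold and the two paths are all illustrative assumptions, chosen only to show how several paths can lead to a case.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR data."""
    diagnosis_codes: set = field(default_factory=set)  # ICD-10 codes
    medications: set = field(default_factory=set)      # lower-case drug names
    hba1c_values: list = field(default_factory=list)   # lab results in percent


# Illustrative assumptions, not a validated clinical definition.
T2D_CODE_PREFIX = "E11"   # ICD-10 family for type 2 diabetes
T1D_CODE_PREFIX = "E10"   # ICD-10 family for type 1 diabetes
T2D_MEDICATIONS = {"metformin", "glipizide"}
HBA1C_THRESHOLD = 6.5     # percent


def is_type2_diabetes_case(record: PatientRecord) -> bool:
    """Walk a small decision flow and return True if the patient is a case."""
    codes = record.diagnosis_codes
    # Exclusion step: a type 1 diabetes code rules the patient out.
    if any(code.startswith(T1D_CODE_PREFIX) for code in codes):
        return False

    has_t2d_code = any(code.startswith(T2D_CODE_PREFIX) for code in codes)
    on_t2d_medication = bool(record.medications & T2D_MEDICATIONS)
    abnormal_lab = any(v >= HBA1C_THRESHOLD for v in record.hba1c_values)

    # Path 1: a diagnosis code supported by a medication or an abnormal lab.
    if has_t2d_code and (on_t2d_medication or abnormal_lab):
        return True
    # Path 2: no diagnosis code, but both medication and abnormal lab agree.
    return on_t2d_medication and abnormal_lab
```

A call such as `is_type2_diabetes_case(PatientRecord({"E11.9"}, {"metformin"}, []))` follows the first path. A real algorithm would have more steps and would be validated against chart review.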
Applications for phenotyping
- Genomic studies: studying the relationship between phenotypic and genotypic data.
- Clinical predictive modelling: building an accurate, robust and interpretable prediction model of disease onset and other related targets such as hospitalisation.
- Pragmatic clinical trials: comparing treatment effectiveness in the real-world clinical environment using observational data.
- Healthcare quality measurement: measuring the efficiency and quality of care across hospitals.
All these applications depend on phenotyping algorithms.
Genome-Wide Association Study (GWAS)
This is an approach that involves scanning biomarkers such as Single Nucleotide Polymorphisms (SNPs) from the DNA of many people in order to find genetic variations associated with a particular disease phenotype.
Once new genetic associations are identified, researchers can use that information to develop better strategies to detect, treat and prevent diseases.
How are these studies conducted?
- Begin with a population and identify the disease phenotypes in patients.
- Sort patients into two groups: cases and controls.
  - Cases are the group of patients with the disease phenotype.
  - Controls are the group of patients similar to the cases but without the disease phenotype.
- Obtain DNA samples from all the patients (cases and controls).
- Examine each participant's genome for genetic variations, which are called SNPs (Single-Nucleotide Polymorphisms).
If a certain genetic variation is found to be significantly more frequent in the cases (patients with the disease) than in the controls, then that genetic variation is said to be associated with the disease.
- Once we obtain the SNPs, we compute the frequency of each SNP variation in the cases and in the controls.
- Based on the frequencies, we calculate the odds ratio.
- Then we calculate the corresponding p-value for the odds ratio.
If the p-value is small, we can conclude that the association between the genetic variation and the disease is statistically significant. The associated genetic variations can serve as powerful pointers to the regions of the human genome that may cause the disease.
In the above diagram you can see we have identified a population (10,000) and created the two groups of cases (4,000) and controls (6,000). We have extracted SNPs from every patient's DNA.
For the first SNP (SNP1), the control group shows a frequency of 44.6% for the G variation and the cases show a frequency of 52.6%. When calculating the p-value we find a very low value, which indicates high significance.
We can conduct the same calculation on SNP2 and find the p-value is 0.33, which is not significant.
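The odds ratio and p-value steps can be sketched with the SNP1 frequencies quoted above. This is a simplified illustration, not necessarily how the study behind the diagram computed its numbers: it assumes each person contributes two alleles, that the quoted percentages are allele frequencies, and it uses a Pearson chi-square test on the 2x2 table as the significance test.

```python
import math


def allele_counts(n_people, g_frequency):
    """Return (G alleles, other alleles), assuming two alleles per person."""
    total = 2 * n_people
    g = round(total * g_frequency)
    return g, total - g


def odds_ratio(case_g, case_other, ctrl_g, ctrl_other):
    """Odds of carrying G in cases divided by the odds in controls."""
    return (case_g * ctrl_other) / (case_other * ctrl_g)


def chi_square_p_value(case_g, case_other, ctrl_g, ctrl_other):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table."""
    n = case_g + case_other + ctrl_g + ctrl_other
    numerator = n * (case_g * ctrl_other - case_other * ctrl_g) ** 2
    denominator = ((case_g + case_other) * (ctrl_g + ctrl_other)
                   * (case_g + ctrl_g) * (case_other + ctrl_other))
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# SNP1 from the example: 4,000 cases at 52.6% G, 6,000 controls at 44.6% G.
case_g, case_other = allele_counts(4000, 0.526)
ctrl_g, ctrl_other = allele_counts(6000, 0.446)

snp1_odds_ratio = odds_ratio(case_g, case_other, ctrl_g, ctrl_other)
snp1_p_value = chi_square_p_value(case_g, case_other, ctrl_g, ctrl_other)
print(f"odds ratio = {snp1_odds_ratio:.2f}, p-value = {snp1_p_value:.2e}")
```

An odds ratio above 1 means the G variation is more common among cases. In a real GWAS the p-value threshold is set very low, because the same test is repeated for a very large number of SNPs.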
For this study we need high-quality phenotypes for the cases and controls in order to perform this calculation, which is why the phenotyping algorithm is very important.
Why do we care about phenotyping algorithms in genomic studies?
We need rich and deep phenotypic data in order to analyse genomic data. As sequencing technology improves, the cost of generating genomic data is dropping fast. It is falling faster than the cost of computing, so Moore's law cannot keep up with the improvement in sequencing technology.
This means we will have more and more genomic data about many individuals in the future. However, because of its complexity, the cost of generating high-quality phenotypic data is actually increasing while the cost of genomic data is dropping.
We need to invent better phenotyping algorithms to decrease the cost of acquiring high-quality phenotypic data to support genomic studies.
Clinical Predictive Modelling
Phenotyping algorithms can also help with clinical predictive modelling. We have covered Predictive Modelling previously.
To recap, this starts with raw EHR data as input into a predictive modelling algorithm to produce a model.
There are many issues with using the raw data directly, so we first convert the raw data into phenotypes using a phenotyping algorithm. Once this is done, we can pass the phenotypes into the predictive modelling algorithm to produce a model.
Now we can see the use and benefits of phenotyping. We can remove noise, and we can combine data from various sources because the phenotyping algorithm standardises its output. We can also simplify the data as required.
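As a rough sketch of this conversion step, the function below turns a patient's raw events into a fixed-length phenotype feature vector that a predictive modelling algorithm could consume. The mapping table, the event format and the phenotype names are hypothetical; a real phenotyping algorithm would be far richer than a lookup.

```python
# Hypothetical mapping from raw (source, code) pairs to phenotype names.
RAW_TO_PHENOTYPE = {
    ("diagnosis", "e11.9"): "type_2_diabetes",
    ("medication", "metformin"): "type_2_diabetes",
    ("diagnosis", "i10"): "hypertension",
    ("medication", "lisinopril"): "hypertension",
}

# Fixed feature order, so every patient gets a vector of the same length.
PHENOTYPES = ["type_2_diabetes", "hypertension"]


def phenotype_features(raw_events):
    """Turn a patient's raw events into a 0/1 feature vector."""
    present = set()
    for source, code in raw_events:
        # Standardise the raw values before the lookup.
        key = (source.strip().lower(), code.strip().lower())
        phenotype = RAW_TO_PHENOTYPE.get(key)
        if phenotype is not None:  # unmapped events are dropped as noise
            present.add(phenotype)
    return [1 if name in present else 0 for name in PHENOTYPES]


raw_events = [
    ("diagnosis", "E11.9"),
    ("medication", "Metformin "),  # redundant evidence for the same phenotype
    ("billing", "XYZ"),            # irrelevant to the model
]
print(phenotype_features(raw_events))
```

The diagnosis code and the medication collapse into a single phenotype feature and the billing event is ignored, which is the noise removal and simplification described above.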
Pragmatic Clinical Trials
Another application of phenotyping algorithms is to support pragmatic clinical trials.
Clinical trials can be described as either traditional or pragmatic.
Traditional: these trials generally measure efficacy, that is, the benefit a treatment produces under ideal conditions. Characteristics of a traditional clinical trial include:
- One condition is measured at a time.
- One drug is tested at a time.
- Randomisation occurs: some patients are given the drug and some patients are given a placebo. This is important to help deal with bias in clinical research.
- Careful selection of a homogeneous population with very strict inclusion and exclusion criteria.
- Carefully controlled environment.
Pragmatic: these trials deal with real-world patients, who often have multiple coexisting conditions. The characteristics of a pragmatic clinical trial are almost the opposite of those of a traditional one:
- Patients with multiple conditions are selected.
- Patients can potentially take multiple drugs at the same time, as they may have preexisting conditions that need to be managed throughout the duration of the trial.
- No randomisation of the drugs given to patients, as this is not possible.
- Any patient can be selected; there is limited ability to set strict patient criteria.
- Real-world environment.
High-quality phenotyping algorithms are important for pragmatic trials. Both as a safety precaution and to obtain valid trial results, we need to know which disease conditions a patient has and which medications they are currently on, and all of these can be derived as phenotypes.
Healthcare quality measurement
It is important to compare healthcare quality measures across hospitals. One way to achieve this is to have hospitals send their raw data to a central repository.
This central service then has to aggregate all the raw information in order to compute the healthcare quality measures. This can be difficult, since each hospital can represent its raw data in any format and the central repository has to figure out how to process each hospital's data differently.
The most scalable way to deal with this problem is for each hospital to process its data through phenotyping first, obtain the high-quality phenotypic information, and then share it with the central repository.
Now the central repository can aggregate this information to compute the healthcare quality measures across hospitals. High-quality, consistent phenotypic data is crucial to enable this comparison of healthcare quality measures across hospitals.
Phenotyping methods
There are two main categories of phenotyping methods:
- Supervised Learning: We use labelled data, called a training set, to train a model to yield a desired output. The model is a function learned by function approximation, and we can apply it to unseen data to predict the outcome.
- Unsupervised Learning: We have input data that is neither classified nor labelled, and we want to identify patterns in it. We use machine learning algorithms to analyse and cluster the data.
Supervised Learning Phenotyping methods
- Expert-defined rules: This is the most widely adopted method. The type 2 diabetes flow chart above is an example. The algorithm is developed manually and often uses boolean logic, scoring thresholds or a decision tree based on domain expertise. The logic is then iteratively refined through validation against the EHR data.
  - An advantage of this approach is that it provides a human-interpretable algorithm.
  - Another advantage is that few revisions of the algorithm may be needed, since an expert can come up with a good algorithm to start with.
  - A disadvantage is the effort and time that goes into developing such an algorithm, which requires both clinical and informatics knowledge.
  - Another disadvantage is that this approach cannot be used to identify new phenotypes that are not well understood by clinical experts.
- Classification: We can use a supervised machine learning method, classification, to train a classifier to differentiate cases from controls.
  - A disadvantage of this approach is that the model can sometimes be difficult to interpret, and it requires a significant amount of training data.
  - Another disadvantage is that the data from various hospitals may not be in a workable format. This may make the model learn features that are unique to a specific hospital.
Unsupervised Learning: these methods provide approaches to cluster EHR data into patient groups corresponding to phenotypes or subtypes. Unsupervised learning does not require expert labels, which reduces the time spent on manual chart review. However, validating the resulting phenotypes can be challenging, since there is no ground truth about what these phenotypes are.
These methods often require very large amounts of training data, but they do not carry the cost of manually labelling individuals as cases or controls. Examples:
- Dimensionality Reduction: refers to techniques for reducing the number of input variables in training data. When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.
- Tensor factorisation: The goal of tensor decomposition is to obtain a compact representation of a given tensor. What is a tensor? A tensor is a multidimensional array with any number of dimensions (rows, columns and further axes), in which each entry has its own coordinate.
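To show what such a tensor could look like for EHR data, the sketch below builds a sparse three-dimensional count tensor with the axes patient, diagnosis and medication. The choice of axes and the events are assumptions made for illustration, and the code only constructs the tensor; it does not factorise it.

```python
from collections import Counter

# Hypothetical co-occurrence events: (patient, diagnosis, medication).
EVENTS = [
    ("patient_1", "type_2_diabetes", "metformin"),
    ("patient_1", "type_2_diabetes", "metformin"),
    ("patient_1", "hypertension", "lisinopril"),
    ("patient_2", "hypertension", "lisinopril"),
]


def build_tensor(events):
    """Return (axes, entries) for a sparse 3-dimensional count tensor.

    axes    -- three sorted lists of labels, one per dimension
    entries -- Counter mapping a coordinate (i, j, k) to a count
    """
    patients = sorted({p for p, _, _ in events})
    diagnoses = sorted({d for _, d, _ in events})
    medications = sorted({m for _, _, m in events})
    axes = (patients, diagnoses, medications)
    # One label-to-position lookup per dimension.
    index = [{label: i for i, label in enumerate(axis)} for axis in axes]
    entries = Counter(
        (index[0][p], index[1][d], index[2][m]) for p, d, m in events
    )
    return axes, entries


axes, entries = build_tensor(EVENTS)
shape = tuple(len(axis) for axis in axes)
print(shape, dict(entries))
```

Only the non-zero entries are stored, which matters because most patients have only a handful of the possible diagnosis and medication combinations. A factorisation method would then approximate this tensor with a small number of components, each a candidate phenotype.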
Quick Reminder: full course summary can be found on Big Data for Health Informatics Course
Hope you learned something.
-R