Big Data for Health Informatics Course outline

4 min read · Aug 4, 2022

The topics of concern are health care analytics and data mining: health care applications and health care data intersected with data science and big data analytics, and the algorithms for processing big data.

This article includes summaries from the course Big Data for Health Informatics at GaTech (this course is part of the Machine Learning Specialisation).
It is intended to be read as part of a series of articles. By the end of the series you should achieve the following learning goals:

understanding health care data
understanding different analytic algorithms
understanding big data systems

These learning goals will allow you to build models on health care data, for example models that predict individual disease risk, recommend treatments, cluster patients into groups with common characteristics, and find similar patients.

Introduction

Background on the health care industry in the US

The healthcare industry is huge: overall spending is 3.8 trillion USD.
This includes massive waste, estimated at 764 billion USD. Apart from the financial loss, there are serious problems with the quality of health care that result in loss of life.

The four Vs of big data for healthcare systems

Volume
Variety
Velocity: data arrives in real time
Veracity: a lot of noise, errors, missing data and false alarms

Big Data in healthcare

Healthcare generates huge amounts of data. For example, each human genome requires 200 GB of raw data, and a single fMRI scan is 300 GB. Medical data was estimated at 100 petabytes, and this continues to grow.
A lot of clinical administration data is generated as well, along with data from checkups and on-body sensors such as smart devices.

The huge variety of data makes it difficult for data scientists to find patterns in the data and help patients.

The Data Scientist

The skills a data scientist needs:

Maths and statistics
Domain knowledge and skills
Programming and Databases
Communication and Visualisation

Course Overview

Topics include big data applications, the algorithms that are used, and the software systems that are built.

Healthcare applications

Predictive modelling: using historic data to predict future outcomes
Computational phenotyping: turning messy electronic health records into meaningful clinical concepts
Patient similarity: using health data to cluster and group patients

Predictive modelling: the challenges faced

We have data on millions of patients, and for each of them diagnosis information, medication information, …
There are many models to be built. Predictive modelling is not a single algorithm but a sequence of computational tasks: a pipeline with many options at each step, which spawn many other pipelines to be compared.
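The idea that each choice in a pipeline spawns more pipelines to compare can be sketched in plain Python. Everything below is a made-up illustration rather than course material: the toy patient records, the two feature builders and the list of thresholds are all assumptions chosen only to show the combinatorics.

```python
from itertools import product

# Toy patient records, invented for illustration:
# (age, number_of_diagnoses, had_event)
PATIENTS = [
    (72, 5, 1), (65, 4, 1), (58, 1, 0), (40, 0, 0),
    (81, 6, 1), (35, 1, 0), (69, 2, 1), (50, 3, 0),
]

# Step 1 options: how to turn a record into a single risk score.
def age_only(age, n_dx):
    return age / 100.0

def age_and_diagnoses(age, n_dx):
    return age / 100.0 + n_dx / 10.0

FEATURE_BUILDERS = {"age_only": age_only, "age_and_dx": age_and_diagnoses}

# Step 2 options: decision thresholds applied to the score.
THRESHOLDS = [0.5, 0.6, 0.9]

def accuracy(builder, threshold, patients):
    """Fraction of patients whose thresholded score matches the label."""
    correct = 0
    for age, n_dx, label in patients:
        predicted = 1 if builder(age, n_dx) >= threshold else 0
        correct += predicted == label
    return correct / len(patients)

def compare_pipelines(patients):
    """Evaluate every combination of feature builder and threshold."""
    results = {}
    for (name, builder), threshold in product(FEATURE_BUILDERS.items(), THRESHOLDS):
        results[(name, threshold)] = accuracy(builder, threshold, patients)
    return results

if __name__ == "__main__":
    ranked = sorted(compare_pipelines(PATIENTS).items(), key=lambda kv: -kv[1])
    for config, acc in ranked:
        print(config, round(acc, 2))
```

Two options at one step and three at the next already give six pipelines. Note that this sketch scores each pipeline on the same data it was tuned on; a real comparison would use held-out data.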

Computational phenotyping starts from raw patient data, which consists of:

Demographic information
Diagnosis
Medication
Clinical notes
Procedures
Lab tests
…. patient medical history

Phenotyping is when we convert the above raw patient data into medical concepts (phenotypes)

An example of how this is done is a phenotyping algorithm for type 2 diabetes.

EHR: electronic health record of a patient

Logical workflow for diagnosing a patient with type 2 diabetes

When you follow the above flow, you may ask why there are so many checks on the patient record. Why can't we just query whether the patient has type 2 diabetes? The reason for this extensive and complicated workflow is the poor quality of the data in the patient record. These checks cater for errors in the data.
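To make the need for cross-checks concrete, here is a minimal rule-based sketch. It is not the workflow from the course: the record layout, the medication list and the rules themselves are hypothetical, chosen only to show how a diagnosis code can be corroborated by other parts of the record before a patient is counted as a case. It assumes ICD-10 style codes (E11 for type 2 diabetes, E10 for type 1) and HbA1c values in percent.

```python
from dataclasses import dataclass, field

@dataclass
class PatientRecord:
    """Hypothetical, heavily simplified view of an EHR record."""
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    hba1c_values: list = field(default_factory=list)  # percent

T2DM_CODE_PREFIX = "E11"          # ICD-10 family for type 2 diabetes
T1DM_CODE_PREFIX = "E10"          # ICD-10 family for type 1 diabetes
T2DM_MEDICATIONS = {"metformin"}  # illustrative, far from exhaustive
HBA1C_THRESHOLD = 6.5             # assumed cut-off for an abnormal result

def has_code(record, prefix):
    """True if any diagnosis code on the record starts with the prefix."""
    return any(code.startswith(prefix) for code in record.diagnosis_codes)

def is_type2_diabetes_case(record):
    """Illustrative rule set, NOT the algorithm taught in the course."""
    # A type 1 code conflicts with a type 2 label, so exclude the patient.
    if has_code(record, T1DM_CODE_PREFIX):
        return False
    coded = has_code(record, T2DM_CODE_PREFIX)
    medicated = bool(record.medications & T2DM_MEDICATIONS)
    abnormal_lab = any(v >= HBA1C_THRESHOLD for v in record.hba1c_values)
    if coded:
        # A code alone may be a billing or data-entry error: require support.
        return medicated or abnormal_lab
    # Without a code, require both independent signals.
    return medicated and abnormal_lab
```

A record with only an `E11.9` code is rejected here, while the same code plus a metformin prescription is accepted. That is the kind of redundancy the checks provide when any single field may be wrong.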

Patient similarity: as a recap, this is grouping patients with similar characteristics.

This is case-based reasoning, where the doctor looks at previous patients and groups them accordingly.
If doctors do this manually, each doctor only has a view of their own patients. It would be better to add the patient to a global database and expand the group to patients seen by any doctor.
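One simple way to look up similar patients in a pooled database is to represent each patient as a feature vector and rank by cosine similarity. This is a sketch under assumptions: the source does not say which similarity measure is used, and the patient ids and binary features below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a vector with no features is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, database, k=2):
    """Return the ids of the k patients most similar to the query vector."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [patient_id for patient_id, _ in ranked[:k]]

# Hypothetical pooled database spanning two doctors. Features are
# [has_diabetes, has_hypertension, has_asthma, on_insulin].
POOLED = {
    "dr_a/p1": [1, 1, 0, 1],
    "dr_a/p2": [0, 0, 1, 0],
    "dr_b/p3": [1, 1, 0, 0],
    "dr_b/p4": [0, 1, 1, 0],
}

if __name__ == "__main__":
    new_patient = [1, 1, 0, 1]
    print(most_similar(new_patient, POOLED, k=2))
```

With the pooled data, the second-closest match for the new patient comes from a different doctor's panel, which a doctor working only from their own patients would never see.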

Big Data Algorithms

Classification: labelling data based on its features
Clustering: grouping data with similar features
Dimensionality reduction: reducing the feature set to the features that are important for the predictions
Graph analysis: creating a network of patients and diseases and how they relate to each other

Big Data Systems

We need big data systems to handle big data:

Hadoop: a distributed, disk-based big data system
Spark: a distributed, in-memory big data system
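Hadoop's processing model is MapReduce, and its three phases (map, shuffle, reduce) can be imitated in a single Python process to see the shape of the computation. The sketch below does not use the Hadoop or Spark APIs; the function names, the partition layout and the patient records are hypothetical, and the diagnosis strings are just example ICD-10 style codes.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (diagnosis_code, 1) pair for every code in one patient record."""
    return [(code, 1) for code in record["diagnoses"]]

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def count_diagnoses(partitions):
    """Count how often each diagnosis code occurs across all partitions."""
    pairs = []
    for partition in partitions:  # in a real cluster, one partition per machine
        for record in partition:
            pairs.extend(map_phase(record))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical data split into two partitions.
PARTITIONS = [
    [{"id": 1, "diagnoses": ["E11", "I10"]}, {"id": 2, "diagnoses": ["I10"]}],
    [{"id": 3, "diagnoses": ["E11", "J45"]}],
]

if __name__ == "__main__":
    print(count_diagnoses(PARTITIONS))
```

In a disk-based system each phase writes its intermediate results to disk, whereas an in-memory system keeps them in memory between steps. This single-process sketch only illustrates the data flow.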

Course note summaries for each topic covered in the lessons:

Hope you learned something.

-R