AI Ethics — The BS of big data and Stats 101
This article serves as my personal notes from the course AI Ethics at Gatech.
Overview
This module will cover some underlying components of big data and basic statistics concepts. These stats 101 techniques will include descriptive and inferential statistics.
We will also cover issues that can occur when working with big data from sampling bias to causation vs correlation.
Why Statistics?
The importance of statistics comes from how much data we produce and the need to make sense of it all.
Statistical techniques help us make sense of all this data. Statistics is the science of organising, presenting, analysing and interpreting data to assist in making effective decisions.
Statistical analysis is used to manipulate and summarise data to help when making decisions.
Unfortunately, statistics, if used incorrectly, can also be used to lie to us, to warp the truth. Bad statistics can be used to make bad decisions. Statistics can be computed and interpreted in many different ways.
Being literate in statistics means you are able to understand massive amounts of data. Being good at understanding statistics can also help you develop and design good algorithms, as well as understand the complexity of the data you work with and its limitations.
How to mislead through poor sampling
In order to analyse and interpret data we must first collect it. Some definitions of terms you will see being used later:
- Sample: the data that is collected.
- Population: where the sample is collected from.
- Sampling bias: a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others.
How to mislead through poor analysis
Poor analysis can also be used to mislead. Data analysis is a process of gathering, modelling and transforming data with the goal of highlighting useful information to support decision making.
Being intentional about what data we share with our algorithms is important. In the example below we see a case of misleading through poor analysis.
In the image above, Graph B seems to perform far worse than Graph A, which through visual analysis appears fairly steady. However, this analysis is not correct, as both graphs represent the same data.
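A minimal matplotlib sketch of this effect, using made-up numbers (not data from the course): both panels plot exactly the same series, and only the y-axis limits change.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly values; both panels plot exactly the same data.
values = [100, 102, 101, 103, 102, 104]

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(8, 3))

# "Graph A": a wide y-axis makes the series look flat and steady.
ax_a.plot(values)
ax_a.set_ylim(0, 200)
ax_a.set_title("Graph A: wide scale")

# "Graph B": a zoomed y-axis makes the same series look volatile.
ax_b.plot(values)
ax_b.set_ylim(99, 105)
ax_b.set_title("Graph B: zoomed scale")

plt.tight_layout()
plt.show()
```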
How to mislead through poor interpretation
Interpreting data involves displaying data in some useful way like a graph or chart.
A graphical representation of data can be easily manipulated as we have seen in the previous image. Here are some examples:
In the example above, a study shows that German workers are more motivated and work more hours. In the graph on the left this seems to be visually exaggerated when compared to France or Italy.
When we introduce the 0 axis to the graph to create the image on the right, we can see that the differences between the countries are actually quite small and that the first image portrayed a different conclusion.
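A similar sketch for the bar-chart case. The hours-worked figures below are invented for illustration only, but the truncated axis on the left exaggerates the gaps in exactly the way described above.

```python
import matplotlib.pyplot as plt

# Hypothetical "hours worked" figures, for illustration only.
countries = ["Germany", "France", "Italy"]
hours = [41.0, 40.2, 39.8]

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Left: a truncated baseline exaggerates small differences.
ax_left.bar(countries, hours)
ax_left.set_ylim(39.5, 41.2)
ax_left.set_title("Truncated axis")

# Right: starting the axis at 0 shows the differences are small.
ax_right.bar(countries, hours)
ax_right.set_ylim(0, 45)
ax_right.set_title("Zero baseline")

plt.tight_layout()
plt.show()
```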
Python and Stats 101
Defining data analytics
Statistics and data analytics will be used interchangeably. Statistics can be subdivided into two categories.
- Descriptive Analytics: used to present the data in a summarised way.
- Inferential Analytics: used to make an inference on the population.
What is the difference between big data and data analytics?
Data analytics is focused on gaining meaningful insights regardless of the size of the data.
Big data focuses on the same things; however, the size of the data requires special consideration and different methods to process and deal with it.
AI, ML and Deep Learning are all different concepts, and in this discussion they will be represented as one. AI is the ability of a computer to replicate or imitate a human. Machine Learning is a subset of AI: ML is the process a computer can undergo to improve its processing by incorporating new data into an existing statistical model. Deep Learning is a subset of ML where artificial neural networks learn from large amounts of data.
Collectively, AI/ML and DL represent intelligence, while data analytics is for insight. In recent times ML is used more often as a data analytics tool to solve complex problems.
Both AI and data analytics tools require quality data.
All about the data
The most important part of statistics, beyond the algorithms, is the data.
A key fact about data is that we do not have all the data available to us to use. We often rely on samples, which are a subset of the entire population, perform some statistics on them and then use the outcome to make an inference about the entire population.
Data is the facts and figures collected, summarised and interpreted. Data can be represented as:
- Quantitative or qualitative values
- Continuous: numbers that fall within a range
- Ordinal/Rank: represent data in order
- Categories/ Discrete: indivisible categories
How to spot a misleading graph
- Distorting the scale allows the viewer to be easily misled. This is common in bar graphs.
- Cherry picking is another way to manipulate a graph — this is when part of the time range is intentionally omitted to exclude its impact.
- Selecting only some data points to be represented could distort the truth.
- Even if a graph is correct leaving out relevant data can provide a misleading impression.
- A graph cannot tell you much if you do not know the full significance of what is being presented.
The next time you see a graph, look at the LABELS, VALUES, SCALE, AXES and the CONTEXT.
Descriptive Statistics
Types of Studies and Sampling errors
Descriptive analytics are the methods for organising, summarising and presenting data in an informative way. They include frequency tables, histograms, means and variances.
Inferential analytics are the methods and techniques used to determine something about a population on the basis of a sample. This can be used to model trends. Inferential analytics can be used for different types of studies.
- Experimental Studies: this is where one variable is manipulated and a second variable is observed
- Correlation Studies: Determine whether a relationship between two variables exist.
- Quasi-Experimental Studies: compare groups against variables that differentiate them (e.g. male/female)
Sometimes studies raise ethical issues because of their objectives.
Samples are a subset of a population. The sampling scheme determines the samples from which you will be deriving your statistics. This can introduce various biases if the sample does not represent the population. If the sample represents only a specific group of the population, then your analysis could be completely wrong.
The difference between the population statistic and the sample statistic is called the sampling error.
Median, Mean and Mode
Centre measurement is a summary measure of the overall dataset (the average). There are numerous ways to represent this; for example the mean, mode, median, geometric mean, etc.
The mean can be calculated by summing all the values and dividing by the total number of values. The median is calculated by ordering the values and then taking the middle value.
Do you know when to use the mean and when to use the median?
The mean is used for symmetric data. The median is less sensitive to outliers compared to the mean. The median is thus better for highly skewed distributions (for example family income, housing prices).
Another measure of average is the mode, which is the most frequently occurring value.
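A short Python illustration, using hypothetical income figures, of how a single outlier pulls the mean up while barely moving the median:

```python
from statistics import mean, median, mode

# Hypothetical family incomes (in thousands); one outlier household.
incomes = [30, 32, 35, 36, 38, 40, 41, 45, 500]

print(mean(incomes))    # ~88.6 — pulled up by the single outlier
print(median(incomes))  # 38 — barely affected by the outlier

# The mode is simply the most frequently occurring value.
print(mode([1, 2, 2, 3, 2, 4]))  # 2
```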
How to mislead with averages
The example above shows the incomes of a population in an area. Depending on how this data is communicated, the audience will come away with very different understandings of the "average".
Frequency Distribution
Another descriptive stat that is useful is the frequency distribution. This allows you to tally up the number of times a specific data item occurs. Think of it as a popularity stat.
The cumulative frequency distribution is the running total of the frequency distribution. It is useful to know this statistic as the cumulative frequency can tell you the total number of items at different stages in the dataset. Think of it as a moving frequency distribution value.
One can use this stat to manipulate data, as the cumulative frequency distribution folds the previous value into the current value, making it confusing for the receiver to understand the actual growth difference.
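A small sketch of both statistics using Python's standard library; the survey responses below are made up for illustration:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical survey responses (e.g. preferred product category).
responses = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

# Frequency distribution: how many times each item occurs.
freq = Counter(responses)
print(freq)  # Counter({'A': 5, 'B': 3, 'C': 2})

# Cumulative frequency: running total across the ordered categories.
items = sorted(freq)
cumulative = list(accumulate(freq[item] for item in items))
print(list(zip(items, cumulative)))  # [('A', 5), ('B', 8), ('C', 10)]
```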
Variability
This is also known as dispersion and is used to measure the amount of scatter in a dataset, in order to understand how well an average can be used to characterise a set of observations.
There are many ways to compute variability (a short sketch follows this list):
- Range: the difference between min and max. This is a crude measure of variability
- Variance: computes the average of the squares of the deviations of the observations from the mean.
- Standard deviation: the square root of the variance
- Interquartile range: quartiles divide the ordered data into four equal parts; the interquartile range is the difference between the third and first quartiles (Q3 - Q1). The quartiles give us the five-number summary:
- The minimum (the smallest value)
- The first quartile, Q1 (25th percentile)
- The median, Q2 (50th percentile)
- The third quartile, Q3 (75th percentile)
- The maximum (the largest value)
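A quick numpy sketch of these measures on a made-up dataset:

```python
import numpy as np

# Hypothetical observations.
data = np.array([4, 8, 15, 16, 23, 42])

print(data.max() - data.min())          # range (crude measure of spread)
print(data.var())                       # variance (population form)
print(data.std())                       # standard deviation
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q3 - q1)                          # interquartile range

# Five-number summary: min, Q1, median, Q3, max
print(data.min(), q1, q2, q3, data.max())
```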
Inferential Statistics: Sampling bias
Inferential Statistics Introduction
Inferential statistics is the practice of drawing inferences about an individual based on data from a similar group of people.
Probability is the numerical indication of how likely it is that a given event will occur. Statistical probability is the odds that what we observed in the sample did not occur because of errors (random and/or systematic).
In other words, the probability associated with a statistic is the level of confidence we have that the group we measured actually represents the total population.
This means that anytime you see some prediction, remember that there is a probability that this prediction is not true.
Issues that arise when using inference are often due to the samples being chosen in a way that does not truly represent the population.
Simpson’s Paradox
This is a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
It is another way in which statistics can be abused to apparently "establish" the truth while at the same time misleading us.
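A small simulation of the paradox, using toy numbers in the style of the classic kidney-stone treatment example (not data from the course): treatment A has the higher success rate within each group, yet treatment B looks better when the groups are combined.

```python
# Toy numbers chosen to illustrate Simpson's paradox (not real course data).
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},  # (successes, trials)
    "severe": {"A": (192, 263), "B": (55, 80)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for group, treatments in groups.items():
    for name, (wins, n) in treatments.items():
        print(f"{group:>6} {name}: {wins / n:.2%}")   # A wins in each group
        totals[name][0] += wins
        totals[name][1] += n

for name, (wins, n) in totals.items():
    print(f"overall {name}: {wins / n:.2%}")          # B wins overall
```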
Biased Sampling
A major source of error in algorithms is the selection of the data/sample used.
Bias = (mean of means) - (true mean)
The true mean is the real mean, of the entire population. This is sometimes hard to establish especially when working with big data.
The statistical definition of bias is: if the mean of the means is equal to the true mean, then the bias is zero and we say that the estimator is unbiased.
The mean of means is called the expected value of the estimator.
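A rough simulation of this definition, assuming a made-up, skewed "income" population: the mean of means from random samples sits close to the true mean, while a deliberately biased sampling scheme does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population": 100,000 incomes with a long right tail.
population = rng.lognormal(mean=10, sigma=0.75, size=100_000)
true_mean = population.mean()

# Repeatedly draw random samples and average the sample means
# (the "mean of means", i.e. the expected value of the estimator).
random_means = [rng.choice(population, size=200).mean() for _ in range(2_000)]
print("bias (random sampling):", np.mean(random_means) - true_mean)   # small relative to the mean

# A biased scheme: only ever sample from the lower half of the population.
lower_half = np.sort(population)[: len(population) // 2]
biased_means = [rng.choice(lower_half, size=200).mean() for _ in range(2_000)]
print("bias (biased sampling):", np.mean(biased_means) - true_mean)   # large and negative
```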
In reality we cannot collect all the data, thus bias will exist and vary depending on how we sample data. This is called sampling bias. Based on how the data is collected there are various types of sampling biases:
- Area bias: this is a bias that is introduced when we conduct a study in a specific area that does not include a representative sampling of the population being studied.
- Selection bias: bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomisation is not achieved. That is, samples are not given an equal opportunity to be included in the study, ensuring that the sample obtained is not representative of the population.
- Self-selection bias: when participants self-select to take part in a study, which may correlate with traits that affect the study, again causing representativeness problems. This is biased sampling with non-probability sampling.
- Leading questions bias: this is where the study asks a question that gives respondents a clue to the desired answer or leads them to answer in a certain way. This may cause participants to agree with the direction of the leading question.
- Social desirability bias: this is a type of response bias, where participants respond in a manner that makes them seem more favourable to others.
Biased Sampling Example
In the above image we have 4 graphs each representing varying degrees of variance and bias.
a. Graph (a) shows a high variance and high bias — look at the grouping of bars that appear closer to the left hand side of the centre arrow as well as the distribution
b. Graph (b) shows a low variance and low bias — look at the bar positions and heights relative to the centre arrow. This is the most desirable.
c. Graph (c) shows a high variance and low bias — as we can see, the distribution of bars is roughly centred on the arrow, hence the low bias, but the bars spread widely towards the left and right sides of the graph, reflecting the high variance.
d. Graph (d) shows a low variance and high bias — all data leans towards the far right, indicating high bias.
Graph (b) is the most desirable graph since a good sampling method has both low variance and low bias. This means that true mean is very close to the mean of means.
However, in this example we can see that graph (b) looks like a Gaussian population distribution, which we know is often not the case in reality. Sometimes theory does not match real-world phenomena.
Types of Randomised Sampling
Due to all the biases that may arise from sampling, there are tools that we can employ to decrease them.
Randomisation: this is a process that makes sure that on average the sample looks like the rest of the population set.
Randomising enables us to make rigorous probabilistic statements concerning the possible error in the sample.
If done right, it should help us minimise the biases, or the difference between the inferential statistics associated with the sample set and the inferential statistics associated with the population. That is: what is true for the sample vs what is true for the population.
There are various types of randomisation methods, namely:
- Random Sampling
- Systematic sampling and Systematic random sampling
- Stratified Random Sampling
- Cluster Random Sampling
- Non probability Sampling
Random Sampling: this is like picking a name out of a hat. That is a sample is randomly selected in such a way that every possible sample of the same size is equally likely to be chosen. This is one of the simplest methods of sampling.
Systematic Sampling: we sequentially order all the data and select every nth piece of data.
Systematic Random Sampling: similar to systematic sampling, but this time starting the sampling from a randomly selected individual. This may not always be well suited to every type of sampling.
Stratified Random Sampling: this is where data is divided into subgroups or strata, where each stratum shares specific characteristics. Random sampling is then applied within each stratum rather than across the entire group.
Cluster Random Sampling : Sometimes stratifying isn't possible. Maybe the data does not have the resolution needed for labelling or it may be too costly to develop a complete list of desired population characteristics.
We can achieve a result similar to stratifying without the high costs by splitting the population into clusters of similar parts that can make sampling more practical.
Each cluster should be a miniature version of the entire population. Then we could select one or a few clusters at random and select a simple random sample from each chosen cluster.
If each cluster fairly represents the full population, cluster random sampling should produce an unbiased sample.
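A minimal sketch of these four schemes in Python, assuming a toy population of 100 numbered individuals and arbitrary strata/cluster boundaries chosen purely for illustration:

```python
import random

population = list(range(1, 101))          # hypothetical IDs 1..100
random.seed(42)

# Simple random sampling: every sample of the same size is equally likely.
simple = random.sample(population, k=10)

# Systematic (random) sampling: random start, then every nth element.
n = 10
start = random.randrange(n)
systematic = population[start::n]

# Stratified random sampling: sample within each stratum separately.
strata = {"A": population[:50], "B": population[50:]}   # hypothetical strata
stratified = [x for group in strata.values() for x in random.sample(group, k=5)]

# Cluster random sampling: split into clusters, pick a few clusters at
# random, then take a simple random sample within each chosen cluster.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
chosen_clusters = random.sample(clusters, k=2)
cluster_sample = [x for cluster in chosen_clusters for x in random.sample(cluster, k=5)]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```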
—
Selecting samples that accurately represent a population or group is challenging. Unfortunately, this is the data that drives companies' inferences.
The high cost of obtaining data has companies looking on the internet for data.
Non-probability sampling: participants are chosen such that their chance of being selected is not known. No one yet has figured out how to select a representative sample of internet users. This is the same data that is used to feed our algorithms.
Inferential Statistics: Causation and Correlation
Correlation vs Causation
Correlation tells us that two variables are related. For example, the more time you spend exercising, the more calories you will burn. There is also the concept of negative correlation, where an increase in one variable is associated with a decrease in another variable. For example, the more time a student spends on social media during class discussion, the lower their grade may be.
There are two main types of relationships reflected in correlation:
- X causes Y or Y causes X: Causal Relationship
- X and Y are caused by a third variable Z: Spurious relationship
To determine causation you need to perform a randomisation test. This is where you vary the variable (X) assumed to cause the change and then measure the change in the other variable (Y).
Correlation does not imply causation.
The correlation coefficient summarises the associations between two variables. Based on its value you can determine whether the association between two variables is strong or weak.
Correlation coefficient = 0 : this indicates no linear relationship.
Correlation coefficient = +1: indicates a perfect positive linear relationship. That is, as one variable increases, the other variable increases exactly along a straight line.
Correlation coefficient = -1: indicates a perfect negative linear relationship. That is, as one variable increases along the line, the other variable decreases.
Correlation coefficient in [0, 0.39]: typically indicates a weak to very weak positive linear relationship.
Correlation coefficient in [0.4, 0.6]: indicates a moderate positive linear relationship.
Correlation coefficient in [0.7, 1]: indicates a strong to very strong positive linear relationship.
and the inverse is true for the negative values.
Correlation tells us two variables are related; however, it does not tell us why.
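A quick numpy check of the coefficient on simulated data; the exercise/calories relationship and all coefficients below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: minutes of exercise vs calories burned.
exercise = rng.uniform(10, 60, size=100)
calories = 8 * exercise + rng.normal(0, 20, size=100)   # positively related

r = np.corrcoef(exercise, calories)[0, 1]
print(round(r, 2))   # close to +1: strong positive linear relationship

# An unrelated variable should give a coefficient near 0.
noise = rng.normal(size=100)
print(round(np.corrcoef(exercise, noise)[0, 1], 2))
```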
Relationships
A strong relationship between two variables does not mean that changes in one variable cause the changes in the other.
The relationship between two variables is often influenced by other variables lurking in the background. A lurking variable is one that is either unrecorded or unused in the analysis.
There are two main relationships that may be mistaken for causation:
- Common response: this refers to the possibility that a change in a lurking variable is causing a change in both our explanatory variable and our response variable. For example, say we have three variables: X, Y and a lurking variable Z. We establish that X and Y have a relationship; when we analyse this further, we could determine that either changes in X cause Y or changes in Y cause X. However, it could be that Z, the lurking variable, is changing, and this is impacting both X and Y.
- Confounding: refers to the possibility that either the change in our explanatory variable is causing changes in the response variable OR that a change in the lurking variable is causing changes in the response variable. Using the same example, we have X (the explanatory variable), Y (the response variable) and Z (our lurking variable). Confounding refers to us attributing changes in Y to changes in X, when in reality Y could simply be responding to changes in Z. Another way to remember this: confounding is like confusion, in that there is confusion about what is responsible for the change in Y.
To repeat: correlation does not imply causation.
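A small simulation of the common-response case: a lurking variable Z drives both X and Y, producing a strong correlation between them even though neither causes the other. The numbers and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Lurking variable Z drives both X and Y; X and Y never affect each other.
z = rng.normal(size=1_000)
x = 2 * z + rng.normal(scale=0.5, size=1_000)
y = -3 * z + rng.normal(scale=0.5, size=1_000)

# X and Y look strongly (negatively) correlated purely because of Z.
print(round(np.corrcoef(x, y)[0, 1], 2))

# Controlling for Z (here: looking only at cases where Z is near a fixed
# value) makes the apparent relationship largely disappear.
mask = np.abs(z) < 0.1
print(round(np.corrcoef(x[mask], y[mask])[0, 1], 2))
```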
Inferential Statistics: Confidence
Empirical Rule
The bell curve is used to describe a mathematical concept called a normal or Gaussian distribution. Bell curve refers to the shape that is created when a line is plotted over this distribution.
Many tools used in statistics examine data under the assumption that the population variables are normally distributed. If the sample size is large enough, then the sampling distribution will also be nearly normally distributed. If this is the case, then the sampling distribution can be determined by two values: the mean and the standard deviation.
The Gaussian bell curve is one of the fundamental concepts on which many of our statistical approaches are based. Most applications of statistics in the real world rely on the assumption that the data being analysed is distributed in the shape of a bell curve.
In order to validate this assumption we have a concept called an empirical rule.
The empirical rule is used to estimate the probability of an event occurring. The empirical rule states that for a normal distribution, nearly all of the data will fall within three standard deviations of the mean.
The basic point of this rule is easy to grasp: 68% of data points for a normal distribution will fall within one standard deviation of the mean, 95% of the data will lie within 2 standard deviations, and 99.7% of the data will be within 3 standard deviations.
For the empirical rule to work, data must follow a normal distribution. The curse of the bell curve comes from the fact that we often use it in situations that bear no resemblance to a normal distribution. Many real-life phenomena do not follow the curve, and yet assuming the simplicity of the bell curve is highly tempting.
For example, wealth distribution: the assumption that wealth is normally distributed would not allow the world's wealthiest individuals to exist.
A normal distribution is more of an exception than the norm.
Let us try to use the empirical rule to make assumptions about our data; for example, predicting IQ. IQ is a measure of people's cognitive abilities in relation to their age group. IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.
How would you use the empirical rule based on the mean and standard deviation of the population to show that 95% have an IQ score between 70 and 130?
Note: 95% of the population fall within 2 standard deviation according to the empirical rule.
Empirical Rule in action:
The upper bound of the range: 100 + 2(15) = 130
The lower bound of the range: 100 - 2(15) = 70
Most times you will have to calculate the mean and standard deviation of the data and then use the empirical rule.
The sample mean should give us an unbiased representation of the true mean. Since we very rarely have access to all the data, the sample mean is the closest to this.
Empirical Rule:
Upper range: Mean + x(Standard_Deviation)
Lower range: Mean - x(Standard_Deviation)
Where x is the number of standard deviations you are interested in (within the empirical rule). For example, if the data's standard deviation is 15 and you want to know the range (the upper and lower bounds) covering 95% of the population, then x is 2 and 15 is used as the standard deviation value in the formula above.
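A short simulation, assuming the IQ figures above (mean 100, standard deviation 15), that checks the 68/95/99.7 coverage directly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated IQ scores: normally distributed, mean 100, std 15 (per the notes).
iq = rng.normal(loc=100, scale=15, size=100_000)

mean, std = iq.mean(), iq.std()
for x in (1, 2, 3):                       # 1, 2 and 3 standard deviations
    lower, upper = mean - x * std, mean + x * std
    share = np.mean((iq >= lower) & (iq <= upper))
    print(f"within {x} sd: [{lower:.1f}, {upper:.1f}] -> {share:.1%}")
# Expect roughly 68%, 95% and 99.7%, with the 2-sd range near [70, 130].
```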
Population Proportions and Margins of Error
Another concept relates to the empirical rule: we can say, with a certain degree of confidence, that if the sample is large enough and representative, then the proportion of the sample will be approximately the same as the proportion of the population.
How confident we are can be expressed by a percentage.
We have just seen that approximately 95% of the area of a normal curve lies within +/- 2 standard deviations of the mean.
This means that we are 95% certain that the population proportion is within +/- 2 standard deviations of the sample proportion; +/- 2 standard deviations is our margin of error.
The percentage margin of error that this represents will depend on the sample size. For a survey of roughly 1,000 people, at 95% confidence the margin of error works out to about 3%.
In other words, the confidence level is 95%, plus or minus 3%.
Let's take the example of a company that surveys its customers, where the results show that 50% of respondents say its customer service is very good. Assuming the same level of confidence and margin of error, if we conducted this survey 1,000 times, the percentage who say the service is very good would range between 47% and 53% most of the time (95% of the time).
We can also estimate a population proportion using a confidence interval. If we build the margin of error around the true value, it will capture 95% of all the samples; but we can instead look at it from the point of view of a sample set.
If we build the margin of error around the sample statistic, we then have a 95% chance of capturing the true value.
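A minimal sketch of the proportion calculation, assuming a hypothetical survey of 1,000 respondents with 50% answering "very good" (roughly matching the example above) and using ~2 standard errors for 95% confidence:

```python
import math

# Hypothetical survey: 500 of 1,000 respondents rate the service "very good".
n, successes = 1_000, 500
p_hat = successes / n

# Standard error of a proportion; ~2 standard errors gives ~95% confidence.
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = 2 * se

print(f"sample proportion: {p_hat:.1%}")
print(f"margin of error:   +/- {margin_of_error:.1%}")   # ~3.2%
print(f"95% interval:      [{p_hat - margin_of_error:.1%}, {p_hat + margin_of_error:.1%}]")
```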
Sample Size vs Margin of Error
When it comes to the margin of error, the size of the population does not matter.
The margin of error measures how accurately the results of a poll reflect the true feelings of the population.
As sample size increases, the margin of error decreases. However, there is a cost associated with this. The sketch below shows how the margin of error shrinks with larger sample sizes.
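A rough sketch of that relationship, using the common approximation 1.96 * sqrt(0.25 / n) for the 95% margin of error of a proportion near 50% (sample sizes chosen arbitrarily):

```python
import math

# Approximate 95% margin of error for a proportion near 50%.
# Note the population size does not appear anywhere in the formula.
for n in (100, 400, 1_000, 2_000, 10_000):
    moe = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n:>6}: margin of error ~ +/- {moe:.1%}")
```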
Hope you learned something.
-R