Deep Neural Networks
Topics of concern are health care analytics and data mining: health care applications and health care data at the intersection of data science and big data analytics, and the algorithms used to process big data.
This article forms part of a series of articles for the Big Data for Health Informatics Course
You can visit the above link to understand this topic in the context of the full course. This article on Deep Neural Networks can be understood independently of the full course.
Note: This article requires knowledge of machine learning concepts.
Introduction
A neuron
To describe deep neural networks we start by understanding the most straightforward form of a neural network, which is made up of a single neuron.
A neuron is a computational unit that takes n input values [x1, …, xn] and their associated weights [w1, …, wn], as well as a bias term b, applies a computation (an activation function), and then produces an output y.
The computational process
This computational process involves a linear combination followed by a non-linear activation. More specifically, the linear combination produces an intermediate output z:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

where the w_i are the weights, the x_i are the input data, and b is the bias. We then pass z through a non-linear activation function g(z), which produces the final output y.
Depending on the task of the neural network, y can be either binary for classification problems or numerical for regression problems.
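To make this concrete, here is a minimal sketch of a single neuron in Python; the function names and the choice of a sigmoid activation are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: maps any real z into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Linear combination: z = sum_i w_i * x_i + b
    z = np.dot(w, x) + b
    # Non-linear activation produces the output y = g(z)
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])  # example inputs x1..x3
w = np.array([0.1, 0.4, -0.2])  # example weights w1..w3
b = 0.05                        # bias term
y = neuron(x, w, b)             # a value in (0, 1)
```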
To learn a model for this single neuron we need to specify:
- the input
- the non-linear activation function g(z)
in order to learn the weights w1, …, wn and the bias term b from the data.
Activation Functions
The activation function applies a non-linear transformation to the weighted sum of the inputs plus the bias, and decides whether (and how strongly) the neuron is activated.
This non-linear transformation needs to be specified by the modeller; it is not learned from the data.
For activation functions there are a few preset popular choices:
- Sigmoid Function
- Tanh
- Rectified linear (ReLU)
Sigmoid Function
The input can be any arbitrary real value and the output lies in the range [0, 1]; specifically, σ(z) = 1 / (1 + e^(−z)).
This can be naturally interpreted as the probability of an event, for example the probability of having heart disease. As a result the sigmoid function is a popular choice for classification tasks.
The sigmoid function has a vanishing gradient problem. Neural network learning is based on gradient-based optimisation: if the gradient is too close to zero, the optimisation process cannot make progress. This is called the vanishing gradient problem.
For example, if we look at the graph above, we can see that the gradient at both ends of the sigmoid function is close to zero; the curve flattens out when x is very small or very large.
Tanh Function
Tanh is another popular activation function. The output of the tanh function is centred around zero and bounded between [-1, 1].
It is a rescaled and shifted version of the sigmoid. It has larger gradients, but still suffers from the vanishing gradient problem when x is far from zero.
Rectified Linear Function
This is a simpler, more modern activation function, also called ReLU. It is defined as max(0, x), the maximum of zero and the input x. The output of ReLU lies in [0, ∞), between zero and infinity; unlike sigmoid and tanh, it is not bounded above.
Visually it is a linearly increasing curve when x is greater than zero, with the threshold set at zero. It does not suffer from the vanishing gradient problem for positive inputs.
Activation Functions Summary
Activation functions are crucial building blocks for neural networks. There are a few popular choices, which we just covered:
- Sigmoid
- Tanh
- ReLU
The above graph shows their relative relationships. We notice that sigmoid and tanh are bounded in a small range, while ReLU has no upper bound, only a lower bound of zero.
To train a neural network we need to select an activation function as part of the neural network architecture. Which activation function to choose is highly dependent on the application.
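As a minimal sketch, the three activation functions (and the derivatives that matter for the vanishing gradient discussion above) can be written as follows; the function names are my own:

```python
import numpy as np

def sigmoid(z):
    # Bounded in (0, 1); the gradient vanishes for large |z|.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

def tanh(z):
    # A rescaled, shifted sigmoid; bounded in (-1, 1), centred at zero.
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # larger gradients, but still vanishes far from zero

def relu(z):
    # max(0, z); unbounded above, so no vanishing gradient for z > 0.
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)
```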
Train a Single Neuron
Now we will consider the simplest neural network, with a single neuron. Our goal is to figure out how to learn the model parameters. In particular we want to learn:
- the weights w1, …, wn
- and the bias term b
We break down the computation of the neuron into two steps.
- The linear combination
- the non-linear transformation
The linear combination computes the weighted sum w1 to wn times the input x1 to xn plus the bias term b. We call this intermediate result z.
The non-linear transformation applies the activation function g to z to produce the output y.
In a supervised setting we want y to be close to the target value t.
In order to measure the quality of an output we need to specify a loss function. The loss function quantifies the difference between the output y and the target t.
For example we can use the squared loss L = ½(y − t)², the squared Euclidean distance, as in the simple neural network defined above.
The main goal is to minimise the loss function on the training data by adjusting the weight of the neural network.
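For instance, the squared loss just mentioned might be written as follows; the ½ factor is an assumption that makes its derivative with respect to y simply y − t, which matches the gradients derived later:

```python
def squared_loss(y, t):
    # L = 0.5 * (y - t)^2, so dL/dy = (y - t)
    return 0.5 * (y - t) ** 2
```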
Stochastic Gradient Descent
In order to use a neural network as a model we need to learn the parameters of the neuron. These learned parameters are:
- the weights w1 to wn
- and the bias b
The goal is to adjust the weights and bias such that the output gets closer to the target.
One computationally efficient way to achieve this is the stochastic gradient descent (SGD) algorithm.
SGD takes as input the training dataset and the learning rate. It first initialises all the weights and the bias term to small random values. We then iterate over each training example, with x as the input and t as the target label.
For each instance:
- first we compute the gradient vector with respect to the weights w and the bias term b
- we then update the weight vector w in the opposite direction of the gradient, scaled by the learning rate (we repeat this until the updates no longer change)
- we also update the bias term b in the opposite direction of its gradient.
The key question in neural network learning is how to perform these gradient calculations efficiently.
The SGD algorithm is covered in more depth in a previous lecture.
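A sketch of the SGD loop just described might look like the following; grad_fn stands in for the derivative computation covered in the next two sections, and the names and defaults are assumptions:

```python
import numpy as np

def sgd(data, grad_fn, eta=0.1, n_epochs=100):
    # data: list of (x, t) training pairs; eta: the learning rate.
    n = len(data[0][0])
    w = np.random.randn(n) * 0.01  # small random initial weights
    b = np.random.randn() * 0.01   # small random initial bias
    for _ in range(n_epochs):      # iterate until the updates converge
        for x, t in data:
            grad_w, grad_b = grad_fn(x, t, w, b)  # dL/dw, dL/db
            w -= eta * grad_w      # move against the gradient
            b -= eta * grad_b
    return w, b
```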
Forward Computation for a Neuron
To update a neuron based on training data we need to perform two passes over the network.
- One forward pass to compute the output y and the loss
- One backward pass to compute the gradient for each parameter.
In this case we first compute the linear combination z and then the activation g(z). In the above example g is the sigmoid function. Note that these two terms have derivatives defined to the right of each equation.
These will be used in the backward pass.
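As a sketch (reusing the sigmoid function from earlier), the forward pass computes and stores these quantities; the derivative noted in each comment is the one used later in the backward pass:

```python
import numpy as np

def forward(x, w, b, t):
    z = np.dot(w, x) + b       # linear combination: dz/dw_i = x_i, dz/db = 1
    y = sigmoid(z)             # activation: dy/dz = y * (1 - y)
    loss = 0.5 * (y - t) ** 2  # squared loss: dL/dy = y - t
    return z, y, loss
```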
Backward Computation for a Neuron
After the forward pass we know the output y and the loss. Now we can perform a backward pass to determine all the derivatives with respect to the parameters.
Mathematically, we apply the chain rule for derivatives from the output back to the input. For example, to calculate the derivative of the loss function L with respect to the weight w_i, we take the product of three derivatives:
- The derivative of the loss function L wrt the output y
- The derivative of the output y wrt the linear combination z
- The derivative of the linear combination z wrt the weight w_i
As we can see in the image above, the final result is

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial w_i} = (y - t)\,y\,(1 - y)\,x_i$$

This gives us the derivative for each weight w_i.
Similarly, using the chain rule, the derivative of the loss function L with respect to the bias term b is the product of three derivatives as well:
- The derivative of the loss function L wrt the output y
- The derivative of the output y wrt the linear combination z
- The derivative of the linear combination z wrt the bias b
As you can see in the image above, the calculation is similar to that for the weights:

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z}\,\frac{\partial z}{\partial b} = (y - t)\,y\,(1 - y)$$
Now that we have these derivatives for both the bias and the weights, we can use them in any gradient-based optimisation algorithm, such as the SGD algorithm covered earlier, to find the optimum weights and bias.
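Putting the derived formulas together, here is a hedged sketch of the gradient computation for this single sigmoid neuron with squared loss, reusing the sigmoid and sgd sketches from earlier:

```python
import numpy as np

def neuron_grad(x, t, w, b):
    # Forward pass: z = w·x + b, y = sigmoid(z)
    y = sigmoid(np.dot(w, x) + b)
    # Backward pass, via the chain rule derived above:
    #   dL/dw_i = (y - t) * y * (1 - y) * x_i
    #   dL/db   = (y - t) * y * (1 - y)
    delta = (y - t) * y * (1.0 - y)
    return delta * x, delta

# Hypothetical usage with the sgd sketch from earlier:
data = [(np.array([0.0, 1.0]), 1.0), (np.array([1.0, 0.0]), 0.0)]
w, b = sgd(data, neuron_grad, eta=0.5, n_epochs=200)
```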
Multilayer Neural Network
The structure of a neural network, its important components and their roles, numerically labelled in the below image:
- Input layer, input units/data: This is the feature data
- Input layer, input bias: This is the bias term associated with the input layer.
- Hidden layer, hidden units: The values of this layer are not observed in the training data; they are hidden. There can be numerous hidden layers in a single neural network.
- Hidden layer, hidden bias: bias factor for the hidden layer.
- Output layer, output unit/data: This is the prediction result produced when the network performs a forward run during training/testing. y is written with a superscript indicating its layer; we will later see that this is a common convention.
- Weights: These are network parameters. Weights carry subscripts i and j, denoting the weight between input node i and hidden node j, and a superscript denoting the layer.
In the image above the example weight has i, j and the layer superscript all equal to 1, meaning it is the weight from the first input node to the first hidden node, in the first layer of weights.
- Bias: The notation for the bias is similar to that of the weights, except the subscript is just the index of the hidden node and the superscript is the layer.
- Activation function: This is denoted g with a superscript 2, showing that it belongs to the second layer. It is applied to the intermediate linear combinations of the associated weights and input data. In the above example the inputs to the activation function come from the first layer.
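Reading off this notation, the computation of hidden unit j in the second layer can be written as follows (a reconstruction from the description above):

$$z_j^{(2)} = \sum_i w_{ij}^{(1)} x_i + b_j^{(1)}, \qquad a_j^{(2)} = g^{(2)}\big(z_j^{(2)}\big)$$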
Train a multilayer Neural Network
To train such a multilayer neural network we use stochastic gradient descent. A recap of SGD is found in this article under the title Stochastic Gradient Descent.
The input to the SGD algorithm is the training data (x) and the learning rate (eta). We first initialise the weights and biases to small random values. We then iterate until we see convergence of the weights and biases.
During this iteration we:
- compute the derivative of the loss function L with respect to w
- and compute the derivative of the loss function L with respect to b
- Finally, for each weight and bias we update the old values using the newly computed derivatives, moving by eta (the learning rate) in the opposite direction of the gradient.
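In symbols, each update step is the standard SGD step, with η (eta) the learning rate:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b}$$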
The key is how to compute these derivatives for an arbitrary deep neural network.
Forward Computation
The forward pass of a neural network is important for scoring a new data point to get the output value y. This step is also a crucial building block for learning the network.
Here we are going to illustrate the forward computation step by step using an example.
Assume we are given 3 input values [x1, x2, x3]. We need to compute 3 linear combinations, one for each of the three hidden units [h1, h2, h3].
In particular, z1 denotes the linear combination for unit h1. It is specified as the sum of the weights [w11, w21, w31] times the inputs [x1, x2, x3], plus the bias b1.
After the linear combination z1 is computed, a1 applies the activation function to z1. We do this for each hidden unit [h1, h2, h3], calculating [z1, z2, z3], which form the inputs to [a1, a2, a3].
Finally we calculate the output y by summing the activation results [a1, a2, a3] multiplied by the weights [w1, w2, w3] between the hidden layer and the output layer, plus the corresponding bias term.
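Written out, using the superscript convention from the previous section (a reconstruction of the equations in the figure):

$$z_j^{(2)} = \sum_{i=1}^{3} w_{ij}^{(1)} x_i + b_j^{(1)}, \qquad a_j^{(2)} = g\big(z_j^{(2)}\big), \qquad y = \sum_{j=1}^{3} w_j^{(2)} a_j^{(2)} + b^{(2)}$$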
In the above example we see a fully connected neural network: all units in adjacent layers are connected. For example x1, x2, x3 are all connected with h1, h2, h3, which in turn are all connected with y.
Each unit first computes the linear combination z, then applies the non-linear activation g. As we can see there is a lot of symmetry here: many neurons each perform simple operations, and collectively they learn a complex mapping from input to output.
Forward Computation: Vector Form
This same forward computation can be represented in a more compact vector notation. The weights in the first layer become a matrix, and the input becomes a vector. The bias will also be a vector. The linear combinations for all those weights then become a single matrix-vector multiplication.
We extend the activation function to apply to vectors in an element-wise fashion. Likewise we have a linear combination for the output layer.
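A minimal numpy sketch of the vector form for this 3-3-1 example, reusing the sigmoid function from earlier (the variable names are assumptions):

```python
import numpy as np

def forward_vector(x, W1, b1, W2, b2):
    # Hidden layer: one matrix-vector product replaces three separate sums.
    z2 = W1 @ x + b1  # shape (3,)
    a2 = sigmoid(z2)  # element-wise activation
    # Output layer: linear combination of the hidden activations.
    return W2 @ a2 + b2

x  = np.array([0.5, -1.0, 2.0])
W1 = np.random.randn(3, 3) * 0.01  # weights between input and hidden layer
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.01  # weights between hidden layer and output
b2 = np.zeros(1)
y  = forward_vector(x, W1, b1, W2, b2)
```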
Forward Computation: General Form
So far we have seen a simple architecture for the forward computation; however, the forward computation is not limited to this simple architecture.
A more general architecture, from layer l to layer l+1, can be expressed with the equation below.
In the above image we can see:
- z is calculated from the input (activation) vector a, the weight matrix, and the bias vector
- similarly, the activation applies to each element of z
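In code, the general layer-to-layer step can be a single reusable function (a sketch; g is whatever activation the layer uses):

```python
def layer_forward(a_prev, W, b, g):
    # z^(l+1) = W^(l) a^(l) + b^(l), with the activation applied element-wise.
    z = W @ a_prev + b
    a = g(z)
    return z, a  # z is kept because its derivative is needed in the backward pass
```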
Forward Computation Summary
The advantage of the vector form is efficient processing: a whole layer is computed with optimised matrix and vector operations.
Gradient Descent for Neural Networks
As a quick recap.
We initialise the weights and the bias to small random numbers, and then for each training example we compute the corresponding derivatives with respect to the weights w and the bias b; thereafter we do a simple update based on the gradients for w and b.
The key is to efficiently compute these two derivatives (for the weights w and the bias b). This is where the backpropagation algorithm comes in.
Backward Propagation
Backward propagation is a widely used algorithm for training feedforward artificial neural networks.
We use back propagation to compute the gradient of the loss with respect to the weight.
First we will look at the derivative of Loss with respect to the weight.
This results in a derivative which is the product of two terms:
- The first term is the input from the previous layer, a
- The second term is more complicated: it is the derivative of the loss with respect to the linear combination z of a hidden node. In particular, it measures how much a hidden node h is responsible for the loss
We then use the chain rule to determine the derivative of the loss with respect to each weight per layer.
Second we will look at the derivative of the loss function L with respect to the bias.
We will use the chain rule to break down the derivative:
- The first term is delta, the same quantity as in the previous derivation for w: delta_j is the derivative of the loss with respect to the linear combination z_j.
- The second term, the derivative of z_j with respect to the bias, is by the same logic simply the constant 1.
The result seen in the below image is just delta_j.
Now we know how to compute these two derivative terms. Next we want to efficiently compute the delta_j term for arbitrary neural networks.
In summary, we now have two derivative terms:
To recap: the derivative of the loss with respect to a weight, seen on the left of the above image, involves the input from the previous layer (a) and the delta term.
In the bias case, seen in the right equation, it is just the delta value.
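Reconstructing the two equations from the image:

$$\frac{\partial L}{\partial w_{ij}} = a_i\,\delta_j, \qquad \frac{\partial L}{\partial b_j} = \delta_j, \qquad \text{where } \delta_j = \frac{\partial L}{\partial z_j}$$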
We need to figure out how to compute delta_j efficiently. We can compute it in a backward fashion, from the output layer to the input layer.
Here we see how to compute the delta term in a backward fashion. In the above image we can see that at the very end of the network we have delta k (that is the last layer).
We want to calculate the derivative of the loss function with respect to the linear combination z for the 4th layer.
In this case this results in the difference between the predicted and the target value, y − t.
Now that we have this delta for layer 4, we want to compute the delta for layer 3. The image above depicts the next step.
We want to compute the previous layer's derivative of the loss function with respect to the linear combination z of the 3rd layer:
- the first term will be delta for the 4th layer
- the second term is the derivative of the activation function, because the linear combination z3 is the input to the activation function.
The final result is:

$$\delta^{(3)} = -(t - y)\, g'\big(z^{(3)}\big)$$
Now we will step further back to the second layer.
We want to calculate delta for layer 2. In the second layer we have multiple hidden units [h1, h2, h3], which means we will calculate 3 deltas, one per hidden unit in the second layer.
Here we will look at a general way of computing the delta for each hidden unit in the second layer. Again we apply the chain rule. The equation above has two parts:
- The first part is the partial derivative of the loss function with respect to z3 (the linear combination for the 3rd layer). This results in delta 3, which we just computed above.
- And second, the derivative of the linear combination z3 with respect to the linear combination z2. Applying the chain rule, since z3 is the sum over the hidden units of the weights times the activations g(z2) plus a bias, this derivative is the weight multiplied by the derivative of the activation function with respect to z2.
This results in:

$$\delta_j^{(2)} = \delta^{(3)}\, w_j^{(2)}\, g'\big(z_j^{(2)}\big)$$
Backward Propagation Summary
In summary, delta_j is the summation over the neighbouring (next) layer of the weights multiplied by the deltas of that layer, times the derivative of the activation function with respect to the linear combination at the same layer as j.
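As an equation, this is the standard backpropagation recursion, reconstructed from the description above:

$$\delta_j^{(l)} = g'\big(z_j^{(l)}\big) \sum_k w_{jk}^{(l)}\, \delta_k^{(l+1)}$$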
This generic example is for one layer; we can extend it by adding more layers and applying the same rule.
The important thing is that we can compute all of these derivatives with a single backward pass. Therefore we don't have to go through the network multiple times: one pass backwards computes all the different deltas.
Then, by multiplying each delta with the corresponding input, we can compute the derivatives for the weights w and the biases b.
Backward Propagation Algorithm
To summarise, the backward propagation algorithm has two parts:
- Part 1: We have the forward pass:
Starting with the input x, go forward to the output layer, computing and storing the intermediate layer variables.
- Part 2: We have the backward pass to adjust the weights:
First we compute the delta for the output unit.
Then, going backwards, we compute the rest of the deltas, using them to compute the derivatives of the loss function with respect to both the weights and the biases.
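Putting both parts together, here is a compact sketch of the full algorithm for the 3-3-1 example network, assuming sigmoid hidden units, a linear output unit, and squared loss (all names and these architectural choices are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, t, W1, b1, W2, b2):
    # Part 1: forward pass -- compute and store the intermediate variables.
    z2 = W1 @ x + b1   # hidden linear combinations
    a2 = sigmoid(z2)   # hidden activations
    y  = W2 @ a2 + b2  # linear output unit
    # Part 2: backward pass -- deltas from the output back towards the input.
    delta_out = y - t  # dL/dz at the output, for L = 0.5 * (y - t)^2
    delta_hid = (W2.T @ delta_out) * a2 * (1 - a2)  # recursion: w * delta * g'(z)
    # Each gradient is the delta times the corresponding input.
    grads = {
        "W2": np.outer(delta_out, a2), "b2": delta_out,
        "W1": np.outer(delta_hid, x),  "b1": delta_hid,
    }
    return y, grads

# Hypothetical usage: one SGD step on a single training example.
W1, b1 = np.random.randn(3, 3) * 0.01, np.zeros(3)
W2, b2 = np.random.randn(1, 3) * 0.01, np.zeros(1)
x, t = np.array([0.5, -1.0, 2.0]), np.array([1.0])
y, g = backprop(x, t, W1, b1, W2, b2)
W1 -= 0.1 * g["W1"]; b1 -= 0.1 * g["b1"]
W2 -= 0.1 * g["W2"]; b2 -= 0.1 * g["b2"]
```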
Quick reminder: the full course summary can be found on Big Data for Health Informatics Course
Hope you learned something.
-R