Gradient Descent

Rezana Dowra
Sep 15, 2022


The goal of this article is to provide you with an understanding of gradient descent.

This will be achieved by showing you the geometry and intuition behind the algorithm, so that its implementation becomes obvious once you understand it.

We will make the general algorithm concrete by following a linear regression example.

So let's get right into it…

The below figure shows a succession of lines that are an increasingly good fit for a collection of points on the plane.
Linear = line and regression = return to a former state, so linear regression simply means computing the line that best fits the data.
To compute the best line, we use gradient descent.

Performing gradient descent to solve a linear regression problem

The goal is to compute the best-fit to the given data points. Each iteration of gradient descent successively computes a better line.

How to think about gradient descent

This is a really simple algorithm to understand, and as we progress you should see why. The goal of gradient descent is to minimise the error between the predicted and actual y, in other words, to minimise the cost function.

Gradient descent solves unconstrained optimisation problems: given a real-valued function f defined on n-dimensional space, we want to find a point w ∈ Rⁿ that minimises f(w).

A small warm up

Let us assume that n = 1, that is, the dimension of the space is one. Here we are looking at a function of a single variable.

This should give you an easy enough understanding to develop an intuition for gradient descent. Note that this case generalises to any number of dimensions, as the same principle applies.

In the image below we have a parabola. What would it mean to minimise f via a greedy local search? If you start at point x₀ and move a little to the right, we notice f increases. If we move in the opposite direction, a little to the left, we see f decrease.

Parabola example

The aim of gradient descent is to minimise f. Now, what do you think happens if we start at x₁? The opposite occurs: moving a little to the right decreases f and moving a little to the left increases f. In both cases the algorithm terminates at the bottom, which is both the global and local minimum in our example.

Put more formally, the case n = 1 follows:

while f’(x) != 0 :

(i) if f’(x) > 0 ; move x a little to the left

(ii) if f’(x) < 0 ; move x a little to the right

What does this mean exactly? While the gradient at x is not 0, we are not yet at a minimum, so we perform one of the two rules:
(i) if the gradient is positive, move x a little to the left; (ii) if the gradient is negative, move x a little to the right.

The gradient is the derivative and at each step we use the gradient to decide which direction to move in.
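The n = 1 rule above can be sketched in a few lines of code. This is a minimal illustration, not a production implementation: the function name `gradient_descent_1d`, the fixed step size, and the example parabola f(x) = (x − 3)² are all choices made here for demonstration.

```python
# Minimal sketch of the n = 1 case: repeatedly step against the sign
# of the derivative until it is (numerically) zero.
def gradient_descent_1d(f_prime, x, step=0.1, tol=1e-6, max_iters=10_000):
    for _ in range(max_iters):
        g = f_prime(x)
        if abs(g) <= tol:        # f'(x) ≈ 0: we are at a (local) minimum
            break
        x -= step * g            # g > 0 → move left, g < 0 → move right
    return x

# Example: f(x) = (x - 3)^2, so f'(x) = 2(x - 3); the minimum is at x = 3.
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0)
```

Note that subtracting `step * g` encodes both rules at once: a positive derivative pushes x left, a negative derivative pushes x right.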

So far we have seen a simple and easy-to-follow example. However, in reality not all functions are as simple as our example. Assume we now have to minimise f for the graph below.

Not a parabola

Now if we start at x₀ and follow the previous rules, we check the derivative: moving right increases f, while moving left decreases f, so left is the direction we move to minimise f.
We notice that starting at x₀ brings us to the basin on the left. Visually we can verify that this is in fact the global minimum, so we have successfully minimised f.

What do you think will happen if we start at x₁, following the same rules applied earlier? We will end up at the basin on the right.

The two examples shown to you, the parabola and the non-parabola, exhibit two obvious differences:

  1. In the parabola, no matter which point you start at, you end up at the same termination point. In the other graph (not a parabola) we can reach different termination points depending on where we begin.
  2. In the parabola we are guaranteed to reach the global minimum, whereas in the second example we could get stuck at a local minimum, a point from which no small step improves f.
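The second graph's behaviour is easy to reproduce numerically. As a sketch (the function g(x) = x⁴ − 2x² is an assumption chosen here because it has two basins, with minima at x = −1 and x = +1), the same local-search rule lands in different basins depending on the start:

```python
# The same greedy rule applied to a non-convex function
# g(x) = x^4 - 2x^2, whose minima sit at x = -1 and x = +1.
def descend(x, step=0.01, iters=5000):
    for _ in range(iters):
        grad = 4 * x**3 - 4 * x   # g'(x)
        x -= step * grad
    return x

left = descend(-0.5)   # starts in the left basin,  ends near x = -1
right = descend(0.5)   # starts in the right basin, ends near x = +1
```

Two starting points, two different termination points: exactly the difference between the parabola and the non-parabola above.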

Why do these two graphs produce varying results? This is due to the property of convexity. The parabola is a convex graph: every chord between two points on the graph lies on or above the graph itself. Mathematically, a function is convex if and only if:

f(1/2 w + 1/2 z) ≤ 1/2 f(w) + 1/2 f(z) for all w and z

That means if we take the average of two points w and z and then apply f, we get a number no larger than if we first apply f to w and z and then average the results.

A stronger condition is

f(λw + (1 − λ)z) ≤ λf(w) + (1 − λ)f(z) for all w, z ∈ Rⁿ and λ ∈ [0,1]

In words: the graph of f (the left-hand side of the inequality) always lies on or below any chord of the graph.

To see why this matters, suppose x is sub-optimal and x* is optimal. As λ moves from 0 to 1, the expression λf(x*) + (1 − λ)f(x) changes linearly from f(x) to f(x*). The inequality implies that moving from x toward x* can only decrease f. Thus gradient descent will not get stuck at x.

The above definitions apply for any dimension.
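The convexity inequality is easy to spot-check numerically. Below is a small sketch (the helper name `is_chord_above` and the two test functions are assumptions made for illustration) that samples λ over [0, 1] and verifies the inequality for the convex f(x) = x² and refutes it for the non-convex g(x) = x⁴ − 2x²:

```python
# Numerically spot-check f(λw + (1−λ)z) ≤ λf(w) + (1−λ)f(z)
# by sampling λ at evenly spaced points in [0, 1].
def is_chord_above(f, w, z, samples=11):
    return all(
        f(lam * w + (1 - lam) * z) <= lam * f(w) + (1 - lam) * f(z) + 1e-12
        for lam in (i / (samples - 1) for i in range(samples))
    )

convex_ok = is_chord_above(lambda x: x * x, -2.0, 3.0)            # holds
non_convex = is_chord_above(lambda x: x**4 - 2 * x**2, -1.0, 1.0)  # fails
```

For g, the chord between the two minima at x = ±1 sits at height −1, yet g(0) = 0 rises above it, so the inequality fails at λ = 1/2.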

Linear Functions

In almost all applications of gradient descent the number of dimensions is greater than 1. Already with n = 2 we see an immediate complication: from a point w ∈ Rⁿ, there are infinitely many directions in which we could move, not just two. To develop our intuition, we first consider the rather silly case of linear functions, meaning functions of the form

f(w) = cᵀw + b

where c ∈ Rⁿ is an n-vector and b ∈ R is a scalar.

Unconstrained minimisation of a linear function is a trivial problem, because (assuming c != 0) it is possible to make the objective function arbitrarily negative.
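To see the unboundedness concretely, here is a tiny sketch (the specific values of c and b are assumptions for illustration): walking along the direction −c drives f down without limit.

```python
# A nonzero linear f(w) = cᵀw + b can be made arbitrarily negative
# by walking in the direction -c.
c = [2.0, -1.0]
b = 5.0

def f(w):
    return sum(ci * wi for ci, wi in zip(c, w)) + b

# Evaluate f at w = -t·c for growing t: f(-t·c) = b - t·||c||².
values = [f([-t * ci for ci in c]) for t in (0, 1, 10, 100)]
# values = [5.0, 0.0, -45.0, -495.0], strictly decreasing without bound
```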

Gradient Descent: The general case

The algorithm has three parameters:

  1. w₀ — The starting point: where you begin the traversal of the graph, which can be chosen randomly. For a convex function f, gradient descent always converges toward the global minimum; as we have seen, for non-convex functions this is not true. Where you select w₀ affects the number of iterations needed, so it should be your best guess.
  2. ϵ — The stopping rule. Since ϵ > 0, gradient descent generally does not halt at an exact local minimum. Smaller values of ϵ mean more iterations before stopping but a higher-quality solution at termination. Think of this as how much more we should explore.
  3. α — The step size. Gradient descent is flexible enough to allow different α values in different iterations.

while ||∇f(w)||₂ > ϵ do (perform the update rule)

w := w − α ∇f(w)

If we want to see what the update rule looks like in the jth coordinate:

wⱼ := wⱼ − α ∂f(w)/∂wⱼ

The update rule can be thought of as performing n updates in parallel (one per coordinate j).
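The general algorithm with its three parameters can be sketched as follows. The function name `gradient_descent` and the example objective f(w) = (w₁ − 1)² + (w₂ + 2)² are assumptions made for illustration; a real implementation would likely use NumPy vectors.

```python
import math

# General gradient descent: starting point w0, step size alpha,
# stopping tolerance eps on the gradient norm.
def gradient_descent(grad_f, w0, alpha=0.1, eps=1e-6, max_iters=100_000):
    w = list(w0)
    for _ in range(max_iters):
        g = grad_f(w)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:  # ||∇f(w)||₂ ≤ ϵ
            break
        # n coordinate updates performed "in parallel"
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

# Example: f(w) = (w₁ - 1)² + (w₂ + 2)², whose minimum is at (1, -2).
w_star = gradient_descent(lambda w: [2 * (w[0] - 1), 2 * (w[1] + 2)], [0.0, 0.0])
```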

Application: Linear Regression

In linear regression, the input is m data points x¹, x², …, xᵐ, each an n-dimensional vector, together with a real-valued label y¹, y², …, yᵐ for each data point.

Each data point i could correspond to a 5th-grade student, and the ith label yⁱ could correspond to the test score earned by that student. The values in the vector xⁱ could be features that describe student i, such as the average household income in the student's neighbourhood.

The goal is to compute the best linear function h such that h(xⁱ) ≈ yⁱ for every i. Every linear function h can be written as:

h(x) = w₀ + w₁x₁ + … + wₙxₙ, a linear function with real-valued coefficients w₀, w₁, …, wₙ

We can further simplify the notation by giving every data point a dummy zeroth coordinate equal to 1. Then the coefficient of the dummy coordinate plays the role of the intercept w₀. From now on, we assume that the data points have been preprocessed in this way and that the coordinates are named {1, 2, 3, …, n}. We can now associate each w ∈ Rⁿ with the linear function:
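The preprocessing step is a one-liner. A minimal sketch (the helper name `add_dummy_coordinate` is an assumption for illustration):

```python
# Prepend a dummy coordinate equal to 1 to every data point so the
# intercept w₀ becomes an ordinary weight on that coordinate.
def add_dummy_coordinate(X):
    return [[1.0] + list(x) for x in X]

X = [[2.0, 3.0], [5.0, 1.0]]
X_aug = add_dummy_coordinate(X)   # [[1.0, 2.0, 3.0], [1.0, 5.0, 1.0]]
```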

h_w(x) = w₁x₁ + … + wₙxₙ = wᵀx

The two most common motivations for computing a best-fit linear function are prediction and data analysis. In the first scenario one uses the given "labeled data" to identify the linear function h, with the hope that this function generalises. In the second scenario the goal is to understand the relationships between each feature of the data points and the labels, as well as the relationships between the different features. As a simple example, it's interesting to know whether one of the n features is more strongly correlated with the label yⁱ than the others.

Mean Square Errors

To complete this article: how will one know if they have achieved the best fit? A standard choice is to minimise the mean squared error (MSE) of a linear function.
The mean squared error is the average of the squared differences between the predictions and the actual y values.

MSE(w) = (1/m) Σᵢ (h_w(xⁱ) − yⁱ)²

There are a few reasons to choose MSE as an objective function:

  1. The measure has several nice mathematical properties
  2. Minimising the MSE is equivalent to maximising the likelihood of the data under a linear model. The Bayesian justification: each label yⁱ is assumed to be produced from xⁱ by applying h_w and then adding independent Gaussian noise.
  3. One nice property of the MSE is that it is a convex function of its variables w. The rough argument: each error function Eᵢ(w) = h_w(xⁱ) − yⁱ is linear in w, and linear functions are convex; squaring only makes these functions "more convex"; and a sum of convex functions is again convex. In particular, the only local minimum of the MSE function is the global minimum.
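Putting the pieces together, gradient descent on the MSE yields a linear regression solver. This is a minimal plain-Python sketch (the function name `fit_linear`, the step size, iteration count, and the toy data generated by y = 1 + 2x are all assumptions for illustration); it uses the MSE gradient (2/m) Σᵢ (wᵀxⁱ − yⁱ) xⁱ.

```python
# Fit h_w(x) = wᵀx by gradient descent on the mean squared error.
def fit_linear(X, y, alpha=0.1, iters=5000):
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(n):
                grad[j] += 2 * err * xi[j] / m   # ∂MSE/∂wⱼ contribution
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w

# Toy data from y = 1 + 2x, with the dummy coordinate already prepended.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_linear(X, y)   # ≈ [1.0, 2.0]: intercept 1, slope 2
```

Because the MSE is convex, the recovered w does not depend on the starting point, matching the convexity discussion above.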

In Summary

  1. The goal of gradient descent is to minimise a function via greedy local search.
  2. Gradient descent scales well to large data sets.
  3. Gradient descent provably solves many convex problems.

Quick Reminder: full course summary can be found on Big Data for Health Informatics Course

Hope you learned something.

-R

Reference link:

https://timroughgarden.org/s16/l/l5.pdf
