
Gradient Descent
The Practical Way

gradient decent.png

In this chapter, we will explore the basics of the Gradient Descent algorithm and how you can create it yourself using Python.

1.1 - What is Gradient Descent?

Machine Learning is the autonomous process by which a machine can "learn" to accomplish a task rather than a programmer providing a step-by-step solution. This process is useful when solving a problem with no obvious or rule-based answer.

Gradient Descent is one of many machine learning algorithms used to optimise an AI model. 

gradient decent labeled.png


The Aim
2.1 - The Aim of Gradient Descent


A critical component of optimising an AI model is establishing the error—by how much do the model's predictions differ from the target values in our real-world training data?

The aim of gradient descent is to tune the model's parameters to reduce this error. Treating this as a minimisation problem, we can attempt to find the combination of model parameter values that will result in the lowest error achievable.

gradient decent.png

Each change in the model parameters will lead to a change in the model's accuracy. For a simple problem, such as the example above, we can visualise these results as an "error surface," otherwise referred to as a "cost surface." This netted shape represents the calculated error values for all of the individual combinations of the variables at play, in this case, the inputs named "Weight Value" and "Bias Value".


At some point on this surface, there will be a region where the error value is at its lowest, known as the "global minimum." Unlike other low points, referred to as "local minima", the global minimum reveals the optimum parameter values for the model. Ultimately, finding this global minimum is the main goal of the gradient descent algorithm.

2.2 - The Approach


In reality, searching every combination to find the global minimum is not an efficient use of time or computer resources. Additionally, most real-world problems require many more variables than the 2D and 3D examples used on this page. This is important to remember, considering that most graphs can only represent up to 3 dimensions plus colour.


However, using gradient descent, we can begin with a random combination of parameters and iteratively "descend" towards the minimum point, vastly reducing the work required. A common analogy for this process is to imagine placing a ball on a hill and letting it roll down to the lowest point it can reach.


To accomplish this task, we will be taking the following approach:

  1. Initialise model parameters

  2. Derive the gradient for the error at this point

  3. Adjust the model parameters based on the gradient

  4. Repeat steps 2 and 3 until the training criteria have been satisfied
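As a preview, the four steps above can be sketched as a short Python loop. This is only an illustrative sketch: it assumes the toy function y = x^2 + 2 used throughout this tutorial, whose gradient (2 * x) is derived in the background section.

```python
# A minimal sketch of the gradient descent loop for y = x^2 + 2
# (its gradient, 2 * x, is derived later in this tutorial)

def gradient(x):
    return 2 * x

x = -4.0                    # 1. Initialise the model parameter
step_size = 0.1             # an illustrative step size

for _ in range(100):        # 4. Repeat until the criteria are satisfied
    g = gradient(x)         # 2. Derive the gradient at this point
    x = x - step_size * g   # 3. Adjust the parameter using the gradient

print(abs(x))               # prints a value very close to 0, the minimum
```

Each pass of the loop nudges x a little closer to the bottom of the curve; the rest of this tutorial unpacks why each step works.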

The Walkthrough
3.1 - Establishing the Model


Ultimately, an AI model is a complex set of functions produced to compute a solution to a given problem. Even when including an error function on top to assess the model, we can still represent the overall problem as one large equation.


To demonstrate gradient descent more easily, we can substitute a complex AI model and its error function with a simple quadratic equation. Therefore, if we find the minimum of our placeholder function, we have "learned" a solution for our imaginary model. This will enable us to experiment with gradient descent without needing a problem to solve or gathering data for that problem.


A quadratic equation is one in which an input term is raised to the power of 2, giving the line a curved, "bowl"-like shape. The equation y = x^2 + 2 was selected for three reasons. Firstly, the equation only has one variable to tune, simplifying the problem. Secondly, the x^2 term produces a simple shape with a single global minimum. And lastly, the + 2 raises the line above the x-axis, allowing it to be seen more easily at its lowest point.

3.2 - Initialise the Model Parameters

The overall model is represented by the equation:

y = x^2 + 2

The dependent variable "y" acts as the output, with the independent variable "x" representing the input. The other parameters of the model, such as the ^2 and the + 2, are fixed and cannot be changed.

The goal is to find the minimum of the function, or in other words, uncover what value of "x" provides the lowest result in "y".

start point.png

To improve the model, we first need a starting point. For this scenario, we don't need a complicated method of choosing a starting point, so we can select any point at random, such as x = -4. After this, we can slowly nudge the value of x so that the y output reaches the lowest point, as signified by the arrow.

3.3 - Derive the Gradient


After selecting a starting point, we can begin to move towards the optimum point. In order to do this, we need to establish two important values for each variable we will adjust.

  • How much does the value need to be changed?

  • In what direction (+/-) does the value need to be changed?

These values can be calculated using the gradient at this point. A gradient represents a rate of change. For example, the gradient can be applied to a slope to measure steepness. Moving along a steep slope will cause a large change in height, whereas a slight slope will lead to a smaller change in height.

The value of the gradient reveals how drastic of a change is needed. For example, if the rate of change is high, then the model is far from the optimum point. However, if the gradient is low, then less change is needed. Therefore, a gradient close to or equal to zero would suggest that the model has reached optimum and no further changes are needed.

The direction of the gradient indicates the direction in which the variable needs to be altered. For example, a negative gradient (pointing down) shows that the value needs to be increased. However, a positive gradient (pointing up) reveals that the parameter needs to be decreased.

increase decrease.png

In short, two important rules can be followed:

  • The closer the gradient is to zero, the smaller the change needed.

  • The direction to adjust a parameter is opposite to the direction of the gradient.

It is also important to note that although the gradient indicates a "slope," it represents specifically a rate of change. The point is not to draw a line on a graph but to uncover how quickly the value of one variable is changing with respect to another.
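To put concrete numbers to the second rule, the snippet below assumes the toy gradient 2 * x (derived later in this tutorial) and checks which direction each point should move:

```python
# Sign of the gradient vs. direction of adjustment for y = x^2 + 2,
# whose gradient (2 * x) is derived later in this tutorial
for x in (-3, 3):
    g = 2 * x
    direction = "increase x" if g < 0 else "decrease x"
    print(x, g, direction)
# x = -3 -> gradient -6 (negative), so increase x
# x =  3 -> gradient  6 (positive), so decrease x
```

In both cases, moving against the sign of the gradient carries the point towards the bottom of the bowl.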

3.4 - Making a Change


The next stage of gradient descent is to update the model parameters. This is thanks to one equation in particular, at the heart of this machine learning algorithm:

New_Value = Old_Value - (Step_Size * Gradient)


When updating model parameters, it is important to make changes in controlled amounts. Too large of a change can lead to "overshooting", making it impossible to get to the true optimum point. Alternatively, making too small of a change may be more accurate but will take too long. Therefore, we can introduce a "step size", an appropriately small number chosen to nudge the parameter by the right amount.


This step size is multiplied by the gradient to further control the intensity of the change. For example, when the gradient is high, we can make a larger change, and when the gradient is low, we can make a much finer change.

The direction of the gradient is the opposite of the direction in which the parameter needs to move. To account for this, we subtract the scaled gradient from the current value, so the parameter always moves against the slope.
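Putting numbers to the update equation, here is a single illustrative step from the starting point x = -4, using an assumed step size of 0.1 and the gradient 2 * x derived later in this tutorial:

```python
# One update step: New_Value = Old_Value - (Step_Size * Gradient)
old_value = -4.0
step_size = 0.1                   # an illustrative choice
gradient = 2 * old_value          # -8.0 for this toy function
new_value = old_value - (step_size * gradient)
print(round(new_value, 2))        # -4 - (0.1 * -8) = -3.2
```

Note how subtracting a negative gradient increases the value, moving x towards the minimum at 0.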

gradient decent labeled.png

The general characteristic of the algorithm is displayed in the image above.


At the start, there are large distances between updated points as the algorithm quickly approaches the general region of the solution. Near the end, gradually smaller distances are visible when more precise changes are made. Once at the global minimum, the changes are so small that nothing seems to change, and the training can be stopped.

Overall, we can update the parameter by taking its current value and subtracting a controlled amount based on the gradient.

The training is done over many iterations, gradually improving the model's accuracy. This loop can be stopped once the training criteria are met, such as when the error score is deemed low enough or when the gradient gets close to zero.
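One possible stopping criterion, the gradient getting close to zero, can be coded as a simple loop condition (again a sketch assuming the toy gradient 2 * x):

```python
x = -4.0           # random starting point
step_size = 0.1

# Stop once the gradient is deemed close enough to zero
while abs(2 * x) > 1e-6:
    x = x - step_size * (2 * x)

print(abs(x) < 1e-6)   # True: x has settled at the minimum
```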

Quick-Fire Quiz

Before we progress into the next section, let's test your knowledge so far with some quick-fire questions!


Q1. Using the image below, select the answer that best describes the following gradient.
gradient pos neg.png
The Background
4.1 - What is the Gradient


The gradient is a crucial component of the gradient descent algorithm, as the name suggests. Therefore, it is important to understand how exactly the gradient is calculated.


The gradient, also known as the slope, represents a rate of change. As the gradient increases, the output changes more rapidly with respect to the input. Take the equation for a straight line:

y = mx + c


y = (gradient * x) + bias

On a typical straight-line graph, this is demonstrated by a greater gradient creating a steeper line, causing values to rise more rapidly. Alternatively, a lesser gradient would lead to a shallower line.


The bias, or "c", controls the y-intercept, where the line cuts through the y-axis. The bias can be said to raise or lower a line, but as we are purely demonstrating gradient, the bias has been ignored for the examples below.


The gradient is typically controlled by a multiplier on one or more input variables. For example, as shown in the image above, the slope of the line changes in proportion to the number multiplying the x term in the equation. For functions with multiple input values, each dimension can have its own gradient.
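A quick numerical check of this idea: feeding the same x inputs through y = m * x with a larger gradient m produces faster-rising outputs (a small illustrative snippet, with the bias omitted as in the examples above):

```python
# y = m * x for a few gradients m; a larger m rises faster
xs = [0, 1, 2, 3]
for m in (0.5, 1, 2):
    ys = [m * x for x in xs]
    print(m, ys)
```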

4.2 - Finding the Gradient


In many cases, the gradient may not be given. There are a variety of equations available that can calculate, or at least very closely estimate, the gradient. The simplest and most common method is the rise-over-run equation.

rise over run equation.PNG

The rise-over-run equation calculates the rate of change between two points. This is achieved by dividing the resultant change in the y-axis (Δy) by the measured change in the x-axis (Δx). Applied to a straight-line graph, any two points can be chosen, as the gradient of a straight line is constant.

4.3 - The Rise-Over-Run Approach


The rise-over-run equation can look quite daunting at first, so let's work through the following example.

We have been provided with a mystery line, the output of an unknown straight-line equation with an unknown gradient and unknown bias. By applying rise-over-run, we can make short work of recovering these values.

The process to take is as follows:

  1. Choose any two points (points with whole-number values are easiest)

  2. Calculate the difference in y (y of point 2 - the y of point 1)

  3. Calculate the difference in x (x of point 2 - the x of point 1)

  4. Divide the change in y by the change in x

rise over run.png

In the example above, I have chosen two points that conveniently cross on axis lines. 

  • Point 1 = (2,2)

  • Point 2 = (6,4)

This gives us the x and y values of:

  • X1: 2, Y1: 2

  • X2: 6, Y2: 4

4.4 - Calculating the Gradient


With two points selected, the next stage is to calculate the differences. The first step is to measure the difference in y values by subtracting the first point's y from the second point's y.

Δy = y2 - y1


This provides the result of:

Δy = 4 - 2 = 2

The second measurement is the difference in the x values, found by subtracting the first point's x from the second point's x.

Δx = x2 - x1


This provides the result of:

Δx = 6 - 2 = 4


From these results, we can conclude that a difference of 4 on the x-axis will lead to a difference of 2 on the y-axis.


Finally, dividing these differences gives us the overall rate of change, revealing the missing gradient value.

rise over run equation 2.PNG

We can conclude that the gradient of the mystery line is equal to 0.5. In other words, the change in y is half that of a change in x.
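The worked example can be confirmed with a few lines of Python:

```python
# Rise-over-run for the two chosen points on the mystery line
x1, y1 = 2, 2    # Point 1
x2, y2 = 6, 4    # Point 2

delta_y = y2 - y1            # 4 - 2 = 2
delta_x = x2 - x1            # 6 - 2 = 4
gradient = delta_y / delta_x
print(gradient)              # 0.5
```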


The previous graph shows that the y-intercept, where the line crosses the y-axis, is at the value y = 1. Taking this intercept value into consideration, we can choose any point on the line, subtract 1 from the y value at this point, and see that the reduced y value is exactly half (0.5 times) that of the x value that produces it.

4.5 - Non-Linear Gradients


The rise-over-run method is a great way to calculate the gradient. However, it does have some crucial drawbacks that affect its suitability for use within gradient descent.

For example, as stated previously, the gradient of a straight line is constant. That means that anywhere on the line, or for any two points we choose, the gradient will be the same. However, as displayed by the image below, the gradient is different for each point we examine for our scenario's non-linear equation.

diferent gradients.png

This effectively makes rise-over-run incompatible with our specific use case. Given that any scenario we cover in these tutorials will produce a non-linear equation to minimise, we need to investigate an alternative.

4.6 - Introducing the Derivative


Rather than generalising a gradient across the entire function's output, we need to adopt a method of deriving the gradient for a specific point we need to examine.

As gradient descent involves improving a given model, we can use the characteristics of the model's known equation as a foundation to formulate a way of calculating the gradient. Though the solution may not be known, the model itself should be.

The common way of achieving this is by calculating the derivative of the equation, a new equation that will provide the gradient for any value we plug in.

The derivative has a deep theoretical background, so we will only cover what we need to know. The idea is primarily based on the rise-over-run method, but it calculates the gradient using "two points" with an almost infinitely small distance between them. This effectively gives the gradient at a single point, which is exactly what we need.

infinate small points.png
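This idea can be demonstrated numerically: applying rise-over-run across a tiny distance h around a point gives a very accurate gradient at that single point. The snippet below is an illustrative sketch using our quadratic, with a hypothetical helper named approx_gradient:

```python
def f(x):
    return x ** 2 + 2

# Rise-over-run between two points a tiny distance h apart
def approx_gradient(x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(approx_gradient(3.0), 4))   # 6.0, the gradient of the curve at x = 3
```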

Fortunately, there is a range of popular mathematical "rules" that allow us to simplify this process by bypassing most of this maths.

4.7 - The Derivative Notation


The derivative is used to measure an instantaneous rate of change, such as the gradient at a specific point. It can also be focused on a specific input variable, as we will demonstrate in later tutorials.

The following equation calculates the derivative of the target equation with respect to the input variable "x".

derivative question.PNG

In other words, this equation asks the question:

If I make a small change in x, how much will this change y?

In the case of derivatives, the symbol signifying "a change in ..." is the letter "d".

4.8 - Introducing the Power Rule


The derivative can be calculated by utilising a selection of tools called the Derivative Rules. These rules act as shortcuts to simplify the mathematics in particular scenarios.


Take our quadratic equation:

y = x^2 + 2

The rule we will be using in this scenario is the power rule.

The power rule.png

The power rule is a shortcut specifically designed to differentiate x^n, where n can represent any number the x term is raised to the power of.

The process is as follows:

  1. Select an x term in the form x^n

  2. Next, take that power n and multiply it by the term’s coefficient (the number before x).

  3. Lastly, we reduce the original power by 1.

  4. After these steps, repeat for any remaining terms in the equation in that form.

This will produce the new derivative equation that can calculate the rate of change at any given point for our non-linear equation.
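The numbered steps can be sketched as a tiny helper that differentiates a single term of the form c * x^n (power_rule is an illustrative name, not a library function):

```python
# Power rule for one term: d/dx (c * x^n) = (c * n) * x^(n - 1)
def power_rule(coefficient, power):
    new_coefficient = coefficient * power   # step 2: multiply by the power
    new_power = power - 1                   # step 3: reduce the power by 1
    return new_coefficient, new_power

print(power_rule(1, 2))   # x^2   -> (2, 1), i.e. 2x
print(power_rule(3, 4))   # 3x^4  -> (12, 3), i.e. 12x^3
```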

4.9 - Additional Useful Rules


Many rules can be combined to compute or simplify the derivative. Below is a range of other must-know rules to add to your differentiation toolkit.

addditional rules.png

We now have an established toolset that can be applied to our original equation to get our full derivative with respect to x.

4.10 - Using the Power Rule


Below is the start-to-finish workflow for finding the derivative of our quadratic equation, concluding the background section of this tutorial.

final derivative.png

In other words, the gradient at a specific point would be twice that of the x value at that point.

4.11 - Demonstrating the Derivative


Now that we have the derivative, we can plot this equation as an additional function against the original line. This will allow us to perform some visual tests to ensure the gradient is working as we expect.

gradient line.png

Selecting some points on the grid lines allows us to read the values on the graph more accurately without printing the values in a table. From these preliminary tests, it can be seen that the gradient is, in fact, double the x value and that the gradient line looks even across and proportional to the example line.

More importantly, the gradient line is aligned with the centre of the example line, and we see a negative gradient on the left-hand side and a positive gradient on the right-hand side of the centre. Moreover, the further from the middle the x value is, the greater the rate of change.


The characteristics of the gradient match that of the example line, which suggests we have successfully found the derivative and captured the essence of the example line.

Quick Fire Quiz #2

Before we progress into the next section, let's test your knowledge so far with some quick-fire questions!


Q1. A mistake was made whilst calculating the derivative. On which line did the error occur?
derivative 1.png
The Code Along
5.1 - Setting up the Environment


In this tutorial section, we will progress through a code-along walkthrough of the gradient descent process.

To get started, we need to initiate a suitable programming environment to begin coding. I have selected Jupyter Notebook, a free-to-use web-based environment you can install and run locally on your computer. In addition to being free, I have chosen this tool because it permits programs to be created using "cells", small segments of live code that can be run and altered independently, perfect for experimenting with making changes to a larger program.

With a new project made, the next step is to import libraries. Libraries are a selection of pre-made programs and functions that we can include in our programs to make our job easier. The libraries I have chosen are:

  • Matplotlib - A free visualisation tool that allows us to create graphs effortlessly.

  • NumPy - A free mathematics library that contains many optimised mathematics tools.

Python 3

# Import Essential Libraries
# --------------------------

# Import graph library
from matplotlib import pyplot as plt

# Import mathematics library
import numpy as np

Libraries can be imported as a broader package or further specified to include specific tools. A tool imported into a program can also be provided with an alias using the "as" keyword. An alias acts as a nickname that you can use in your program. For example, if I want to use the "pyplot" graphing tool within the "matplotlib" package, I can refer to it using the alias "plt" to make my code shorter and easier to read.

5.2 - Declaring the Function


The first stage of the process is to establish the model we intend to optimise.


We can add the function we intend to use by declaring it as a function. A function has two main sections: the function header and the function body. The function header defines the function's name, along with any inputs we need to provide. The function body is where we place the actual behaviour of our function, in this case, returning the result of a calculation.

example function 2d.PNG

Python 3

# An example function to optimise
def example_function(x_in):
    # Represents the equation: y = x^2 + 2
    return (x_in ** 2) + 2

The quadratic equation we have shown before only takes in one input, the x value, which has been provided as an input parameter named "x_in". The "x_in" variable is known as a local variable, a variable only available within our function; the function effectively receives its own copy of the value we give it. This way, the function is free to work with this variable without affecting any other code outside the function's scope.

The equation is then entered in the function body as a return statement, providing us with the answer once it has been calculated. It is worth noting that raising a value to a power is accomplished using double asterisks.
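Before plotting, we can sanity-check the function by calling it with a couple of hand-checked inputs (the definition is repeated here so the snippet is self-contained):

```python
# An example function to optimise
def example_function(x_in):
    # Represents the equation: y = x^2 + 2
    return (x_in ** 2) + 2

print(example_function(0))    # 0^2 + 2 = 2
print(example_function(-4))   # (-4)^2 + 2 = 18
```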

5.3 - Testing the Function


A simple 2D function such as this can be tested by plotting it on a graph. This will allow us to visually check that the function is performing the way we expect it to before we attempt to optimise it. Though not a required step, catching issues early will limit confusion if problems occur later.

To test the function, we will need to do four things:

  1. Create input values for the x-axis

  2. Run the function to generate values for the y-axis

  3. Plot the function on a graph

  4. Display the graph

test the line.PNG

Python 3

# Create x-axis values
x = np.linspace(-5, 5, 20)

# Generate the y-values by calling the function
function_test = example_function(x)

# Plot the test line
plt.plot(x, function_test, c='b')

# Label the line on the graph
plt.legend(["y = x^2 + 2"])

# Show the graph
plt.show()


bottom of page